Description
What is the problem this feature will solve?
Modern sky surveys produce catalogs that fundamentally exceed available RAM on a typical workstation:
- Gaia DR3: 1.8 billion rows, ~60 GB (main source catalog)
- Rubin/LSST: ~15 TB/night, ~10 billion detections/year
- JWST deep field catalogs: 1–50 GB per field
Any user attempting:

```python
from astropy.table import Table
t = Table.read('gaia_dr3_source.fits')  # MemoryError — needs ~60 GB RAM
```

either crashes immediately or is forced to abandon astropy entirely for vaex, polars, or raw pandas. All of these alternatives silently drop astropy-native types: units (Quantity), coordinates (SkyCoord), and time (Time) columns.
This is not a new pain point. Two previous issues were opened but never resolved:
- #7748 ("Integrate with dask") — opened Aug 2018, closed by the stale bot with no implementation
- #8227 ("Experiment with interfacing with Dask") — opened Dec 2018, still open after 6+ years, labeled only Experimental with no concrete deliverable
Those issues identified specific blockers (Quantity eagerness, repr densification, SkyCoord as ndarray subclass) but no scoped implementation plan was ever proposed. This issue aims to fix that.
astropy is used by 55,000+ downstream packages including LSST pipelines, JWST reduction tools, and eROSITA analysis software. All of them hit this wall.
Describe the desired outcome
Add an optional lazy=True keyword to Table.read(), QTable.read(), and TimeSeries.read() that returns a LazyTable — a thin Table subclass backed by dask.array columns — without loading any data into RAM until explicitly requested.
Proposed API
```python
from astropy.table import Table

# No data loaded — only FITS header / ECSV metadata parsed eagerly
t = Table.read('gaia_dr3_source.fits', lazy=True)

print(t.colnames)    # ['source_id', 'ra', 'dec', ...] — instant, no I/O
print(t['ra'].unit)  # deg — from FITS header, no data loaded
print(len(t))        # 1_811_709_771 — from FITS NAXIS2, no data loaded

# Build a lazy filtered view
bright = t[t['phot_g_mean_mag'] < 10.0]

# Materialize only when needed
result = bright.compute()  # returns a normal astropy Table

# Stream through large files without ever loading everything
for chunk in t.iterchunks(chunk_size=100_000):
    process(chunk)  # each chunk is a normal Table of ~100k rows
```

Phase 1 Scope
In scope:
- `LazyTable` and `LazyColumn` classes with a repr that does not trigger data reads
- Lazy FITS BinaryTable reader via `dask.array.from_delayed` per column
- Lazy ECSV reader
- Column selection and scalar/numeric boolean row filtering (lazy)
- `.compute()` to materialize to a normal `Table`
- `.iterchunks(chunk_size=N)` for streaming pipelines
- Unit metadata preserved on `LazyColumn` without triggering computation
- `QTable.read(..., lazy=True)` — `Quantity` constructed only on `.compute()`
- `TimeSeries.read(..., lazy=True)` with `time_column` preserved in metadata
- Raises `ImportError` with a helpful message if dask is not installed
Out of scope (explicit non-goals for Phase 1):
- Full lazy `SkyCoord` column support (blocked by the Quantity/ndarray subclass refactor discussed in #8227)
- Full lazy `Time` column support (same blocker)
- Dask distributed / multi-node scheduler support
- HDF5 lazy support
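To make the "lazy reader via `dask.array.from_delayed` per column" item concrete, here is a hedged sketch, not astropy's implementation: `read_column_chunk` is a hypothetical stand-in for a real per-chunk FITS read, and the data is fabricated so the sketch runs standalone. The key point is that no chunk is read until `.compute()`.

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def read_column_chunk(start, stop):
    # Stand-in for a real per-chunk FITS column read; fabricated data
    # keeps the sketch self-contained and runnable.
    return np.arange(start, stop, dtype="f8")

def lazy_column(n_rows, chunk_size, dtype="f8"):
    # One delayed read per chunk, stitched into a single lazy dask array.
    chunks = []
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        chunks.append(da.from_delayed(read_column_chunk(start, stop),
                                      shape=(stop - start,), dtype=dtype))
    return da.concatenate(chunks)

col = lazy_column(1_000, 256)  # builds the graph; nothing is read yet
print(col.shape, col.dtype)    # (1000,) float64 — metadata only
print(col[:5].compute())       # [0. 1. 2. 3. 4.] — only now is data read
```

Slicing before `.compute()` means dask only evaluates the delayed reads that the slice actually touches, which is exactly the behavior `iterchunks` would rely on.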
Key Design Decisions
Handling the Quantity eagerness blocker (identified in #8227): Rather than modifying Quantity globally, LazyColumn stores unit as metadata and constructs Quantity only during .compute(). No changes to astropy.units needed.
Handling repr densification (identified in #8227): LazyTable.__repr__ explicitly avoids accessing column data, reporting only metadata (column names, dtypes, length) that is available without any I/O.
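A hedged sketch of what such a repr could look like (class and format are illustrative, not the proposed astropy API): dask arrays expose `shape` and `dtype` without computing, so the repr can describe the table without ever reading a chunk.

```python
import dask.array as da

class LazyTable:
    """Illustrative lazy table: a mapping of column name -> dask array."""

    def __init__(self, columns):
        self.columns = columns

    def __repr__(self):
        # Uses only lazy metadata (shape, dtype); no chunk is ever read.
        n = next(iter(self.columns.values())).shape[0]
        cols = ", ".join(f"{k}[{v.dtype}]" for k, v in self.columns.items())
        return f"<LazyTable length={n} columns: {cols}>"

t = LazyTable({"ra": da.zeros(100, dtype="f8"),
               "dec": da.zeros(100, dtype="f8")})
print(repr(t))  # <LazyTable length=100 columns: ra[float64], dec[float64]>
```

Because nothing here touches `.compute()`, printing a `LazyTable` backed by a 60 GB catalog stays instant.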
Additional context
No response