ENH: Lazy/Chunked Table and TimeSeries I/O with Dask Backend for Out-of-Core Astronomical Catalogs #19345

@aditya-pandey-dev

Description

What is the problem this feature will solve?

Modern sky surveys produce catalogs that fundamentally exceed available RAM on a typical workstation:

  • Gaia DR3: 1.8 billion rows, ~60 GB (main source catalog)
  • Rubin/LSST: ~15 TB/night, ~10 billion detections/year
  • JWST deep field catalogs: 1–50 GB per field

Any user attempting:

from astropy.table import Table
t = Table.read('gaia_dr3_source.fits')  # MemoryError — needs ~60 GB RAM

...either crashes immediately or is forced to abandon astropy entirely for vaex, polars, or raw pandas. All of these alternatives silently drop astropy-native types: units (Quantity), coordinates (SkyCoord), and time (Time columns).

This is not a new pain point. Two previous issues covered the same ground but were never resolved. They identified specific blockers (Quantity eagerness, repr densification, SkyCoord as an ndarray subclass), but no scoped implementation plan was ever proposed. This issue aims to fix that.

astropy is used by 55,000+ downstream packages including LSST pipelines, JWST reduction tools, and eROSITA analysis software. All of them hit this wall.

Describe the desired outcome

Add an optional lazy=True keyword to Table.read(), QTable.read(), and TimeSeries.read() that returns a LazyTable — a thin Table subclass backed by dask.array columns — without loading any data into RAM until explicitly requested.

Proposed API

from astropy.table import Table

# No data loaded — only FITS header / ECSV metadata parsed eagerly
t = Table.read('gaia_dr3_source.fits', lazy=True)

print(t.colnames)         # ['source_id', 'ra', 'dec', ...] — instant, no I/O
print(t['ra'].unit)       # deg — from FITS header, no data loaded
print(len(t))             # 1_811_709_771 — from FITS NAXIS2, no data loaded

# Build a lazy filtered view
bright = t[t['phot_g_mean_mag'] < 10.0]

# Materialize only when needed
result = bright.compute()   # returns a normal astropy Table

# Stream through large files without ever loading everything
for chunk in t.iterchunks(chunk_size=100_000):
    process(chunk)          # each chunk is a normal Table of ~100k rows
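The streaming pattern above can be sketched independently of the FITS machinery. The following is a minimal, hypothetical illustration of the `iterchunks` semantics (the `iterchunks` helper and the in-memory "loader" here are stand-ins, not the proposed implementation): a generator yields fixed-size row slices, and the underlying reader is invoked once per chunk, so no more than one chunk is resident at a time.

```python
import numpy as np

def iterchunks(loader, nrows, chunk_size):
    """Yield row slices of at most `chunk_size`, calling `loader` per chunk.

    `loader(start, stop)` is a stand-in for a per-chunk file read
    (e.g. a FITS row-range read); nothing is loaded up front.
    """
    for start in range(0, nrows, chunk_size):
        stop = min(start + chunk_size, nrows)
        yield loader(start, stop)

# Stand-in "file": in the real proposal each chunk would come from disk.
data = np.arange(1_000_003)
chunks = list(iterchunks(lambda a, b: data[a:b], len(data), 100_000))
print(len(chunks))           # 11 chunks: 10 full + 1 partial
print(chunks[-1].shape[0])   # final partial chunk has 3 rows
```

The same loop shape also applies to the dask-backed version, where `loader` would wrap a `dask.delayed` read per column block.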

Phase 1 Scope

In scope:

  • LazyTable and LazyColumn classes with a repr that does not trigger data reads
  • Lazy FITS binary table reader via dask.array.from_delayed per column
  • Lazy ECSV reader
  • Column selection and scalar/numeric boolean row filtering (lazy)
  • .compute() to materialize to normal Table
  • .iterchunks(chunk_size=N) for streaming pipelines
  • Unit metadata preserved on LazyColumn without triggering computation
  • QTable.read(..., lazy=True) with Quantity constructed only on .compute()
  • TimeSeries.read(..., lazy=True) with time_column preserved in metadata
  • Raises ImportError with a helpful message if dask is not installed

Out of scope (explicit non-goals for Phase 1):

  • Full lazy SkyCoord column support (blocked by the Quantity/ndarray subclass refactor discussed in #8227, "Experiment with interfacing with Dask")
  • Full lazy Time column support (same blocker)
  • Dask distributed / multi-node scheduler support
  • HDF5 lazy support

Key Design Decisions

Handling the Quantity eagerness blocker (identified in #8227): Rather than modifying Quantity globally, LazyColumn stores unit as metadata and constructs Quantity only during .compute(). No changes to astropy.units needed.
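A minimal sketch of this design decision, with hypothetical names (`LazyColumn` here is illustrative; `quantity_factory` stands in for `u.Quantity` in the real implementation): the column holds only a loader callable plus a unit string, and the unit is attached to actual data only at compute time.

```python
import numpy as np

class LazyColumn:
    """Sketch: unit kept as plain metadata; data loaded only on compute()."""

    def __init__(self, loader, unit=None, dtype=None, length=None):
        self._loader = loader   # callable returning the raw ndarray
        self.unit = unit        # plain string metadata -- no Quantity yet
        self.dtype = dtype
        self.length = length

    def compute(self, quantity_factory=lambda data, unit: (data, unit)):
        # In astropy, quantity_factory would be u.Quantity, so the
        # Quantity is constructed only here, after the data is read.
        data = self._loader()
        return quantity_factory(data, self.unit) if self.unit else data

ra = LazyColumn(lambda: np.array([10.68, 83.82]), unit='deg',
                dtype='f8', length=2)
print(ra.unit)               # 'deg' -- metadata only, no I/O
values, unit = ra.compute()  # loader runs here, unit attached here
print(values, unit)
```

Because the unit lives outside the data container until `.compute()`, no change to `astropy.units` is required.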

Handling repr densification (identified in #8227): LazyTable.__repr__ explicitly avoids accessing column data, rendering the table summary from header metadata alone.
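A sketch of the idea (illustrative names, not the proposed implementation): the repr is built entirely from header-derived metadata, and wiring in a loader that raises on access demonstrates that printing the table touches no data.

```python
class LazyTable:
    """Sketch: __repr__ built from header metadata only, never from data."""

    def __init__(self, colnames, dtypes, nrows, loader):
        self.colnames = colnames
        self.dtypes = dtypes
        self.nrows = nrows
        self._loader = loader   # only called by compute(), never by repr

    def __repr__(self):
        cols = ', '.join(f'{n}[{d}]'
                         for n, d in zip(self.colnames, self.dtypes))
        return f'<LazyTable length={self.nrows} cols=({cols})>'

def forbidden_load():
    raise RuntimeError('repr must not trigger I/O')

t = LazyTable(['source_id', 'ra', 'dec'], ['i8', 'f8', 'f8'],
              1_811_709_771, forbidden_load)
print(repr(t))   # renders from metadata; forbidden_load is never called
```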

Additional context

No response
