So, how does this work?
Overview¶
zarrwhals bridges DataFrame libraries with Zarr storage. It handles serialization and deserialization while delegating computation to your preferred DataFrame engine.
Narwhals Integration¶
zarrwhals uses Narwhals for DataFrame-agnostic operations, specifically it converts your DataFrame to a Narwhals DataFrame, then this intermediate representation gets written out to Zarr.
pandas / polars / dask → Narwhals → zarrwhals → Zarr Storage
Data Flow¶
Writing DataFrames¶
- Call
zw.to_zarr(df, "store.zarr") - Convert to Narwhals DataFrame
- For each column: encode data + metadata → create Zarr array (or group for categoricals)
- Write group metadata
- Zarr Store created
Reading DataFrames¶
- Call
zw.from_zarr("store.zarr", backend="polars") - Create an Internal Narwhals complient
ZarrFrame, then convert to the user's specifiedbackend. - Return DataFrame
Storage Format¶
zarrwhals follows the anndata on-disk specification for DataFrame storage, enabling interoperability with the scientific Python ecosystem.
DataFrames are stored as a Zarr group:
store.zarr/
├── zarr.json # Group metadata
├── _index/ # Row index array
├── column_a/ # Regular column (array)
│ └── zarr.json # Array metadata + narwhals_dtype
├── column_b/ # Categorical column (group)
│ ├── codes/ # Integer codes array
│ └── categories/ # Category labels array
└── ...
Group Attributes¶
{
"encoding-type": "dataframe",
"encoding-version": "0.0.1",
"column-order": ["column_a", "column_b"],
"_index": "_index"
}
Column Attributes¶
Each column stores its Narwhals dtype for accurate round-trip:
{
"encoding-type": "array",
"narwhals_dtype": "Int64"
}
Type System¶
zarrwhals implements custom Zarr dtypes for types not natively supported:
| Source Type | Zarr Encoding |
|---|---|
| int8/16/32/64 | Native |
| float32/64 | Native |
| bool | Native |
| String | VLen UTF-8 |
| Categorical | Codes + Categories arrays |
| Datetime | int64 + unit metadata |
| Duration | int64 + unit metadata |
| Object | Pickle bytes |
Categorical Encoding¶
Categoricals use code/category separation:
One Zarr Array gets written for the codes, another for the categories. During read time, the codes get mapped to the original categories reconstructing the categorical column.
Input: ['A', 'B', 'A', 'C', 'B']
↓
codes: [0, 1, 0, 2, 1] (int array)
categories: ['A', 'B', 'C'] (string array)
ordered: false
General Zarr Features in zarrwhals¶
Chunking¶
Control how data is split across storage:
# Default: Zarr auto-selects chunk size
zw.to_zarr(df, "store.zarr")
# Custom: 10,000 rows per chunk
zw.to_zarr(df, "store.zarr", chunks=10_000)
Sharding¶
Group multiple chunks into single files for large datasets:
# 1,000 rows per chunk, 100,000 rows per shard
zw.to_zarr(df, "store.zarr", chunks=1_000, shards=100_000)
Compression¶
Choose from multiple codecs:
from zarr.codecs import ZstdCodec, LZ4Codec, GzipCodec
# Auto (Zarr default)
zw.to_zarr(df, "store.zarr", compressors="auto")
# Custom codec with options
zw.to_zarr(df, "store.zarr", compressors=ZstdCodec(level=5))
# No compression
zw.to_zarr(df, "store.zarr", compressors=None)
Next Steps¶
- See the API Reference for read/write function documentation.
- Check out Contributing to get involved