While DataStore is highly compatible with pandas, there are important differences to understand.
## Summary Table

| Aspect | pandas | DataStore |
|---|---|---|
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved (automatic); not guaranteed in performance mode |
| `inplace` | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |
## Lazy vs Eager Execution

### pandas (Eager)

Operations execute immediately:

```python
import pandas as pd

df = pd.read_csv("data.csv")                        # Loads entire file NOW
result = df[df['age'] > 25]                         # Filters NOW
grouped = result.groupby('city')['salary'].mean()   # Aggregates NOW
```

### DataStore (Lazy)

Operations are deferred until results are needed:

```python
from chdb import datastore as pd

ds = pd.read_csv("data.csv")                        # Just records the source
result = ds[ds['age'] > 25]                         # Just records the filter
grouped = result.groupby('city')['salary'].mean()   # Just records the aggregation

# Execution happens here:
print(grouped)          # Executes when displaying
df = grouped.to_df()    # Or when converting to pandas
```
### Why It Matters

Lazy execution enables:

- **Query optimization**: multiple operations compile into a single SQL query
- **Column pruning**: only the columns you need are read
- **Filter pushdown**: filters are applied at the data source
- **Memory efficiency**: data you don't need is never loaded
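The idea behind lazy query building can be sketched in a few lines of plain Python. This is an illustrative toy, not DataStore's actual implementation: each method only records state, and all recorded steps compile into one SQL string when results are requested.

```python
# Toy sketch of lazy query building: operations only record state,
# and a single SQL string is compiled when results are requested.
# Illustrative only -- not DataStore's real implementation.

class LazyQuery:
    def __init__(self, source):
        self.source = source
        self.filters = []       # recorded WHERE clauses
        self.columns = ["*"]    # recorded projection

    def filter(self, condition):
        q = LazyQuery(self.source)
        q.filters = self.filters + [condition]
        q.columns = list(self.columns)
        return q                # nothing executes yet

    def select(self, *cols):
        q = LazyQuery(self.source)
        q.filters = list(self.filters)
        q.columns = list(cols)
        return q                # still nothing executes

    def to_sql(self):
        # All recorded steps compile into ONE query.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.source}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

q = LazyQuery("file('data.csv')").filter("age > 25").select("city", "salary")
print(q.to_sql())
# SELECT city, salary FROM file('data.csv') WHERE age > 25
```

Because nothing runs until `to_sql()` is called, the engine sees the whole pipeline at once, which is what makes column pruning and filter pushdown possible.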
## Return Types

### pandas

```python
df['col']           # Returns pd.Series
df[['a', 'b']]      # Returns pd.DataFrame
df[df['x'] > 10]    # Returns pd.DataFrame
df.groupby('x')     # Returns DataFrameGroupBy
```

### DataStore

```python
ds['col']           # Returns ColumnExpr (lazy)
ds[['a', 'b']]      # Returns DataStore (lazy)
ds[ds['x'] > 10]    # Returns DataStore (lazy)
ds.groupby('x')     # Returns LazyGroupBy
```

### Converting to pandas Types

```python
# Get a pandas DataFrame
df = ds.to_df()
df = ds.to_pandas()

# Get a pandas Series from a column
series = ds['col'].to_pandas()

# Or trigger execution implicitly
print(ds)   # Automatically converts for display
```
## Execution Triggers

DataStore executes when you need actual values:

| Trigger | Example | Notes |
|---|---|---|
| `print()` / `repr()` | `print(ds)` | Display needs data |
| `len()` | `len(ds)` | Needs row count |
| `.columns` | `ds.columns` | Needs column names |
| `.dtypes` | `ds.dtypes` | Needs type info |
| `.shape` | `ds.shape` | Needs dimensions |
| `.values` | `ds.values` | Needs actual data |
| `.index` | `ds.index` | Needs index |
| `to_df()` | `ds.to_df()` | Explicit conversion |
| Iteration | `for row in ds` | Needs to iterate |
| `equals()` | `ds.equals(other)` | Needs comparison |
### Operations That Stay Lazy

| Operation | Returns |
|---|---|
| `filter()` | DataStore |
| `select()` | DataStore |
| `sort()` | DataStore |
| `groupby()` | LazyGroupBy |
| `join()` | DataStore |
| `ds['col']` | ColumnExpr |
| `ds[['a', 'b']]` | DataStore |
| `ds[condition]` | DataStore |
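The trigger mechanism above can be mimicked with Python's dunder methods: value-producing calls such as `len()` and `repr()` force the deferred pipeline to run, while everything before that stays a description. This is a toy sketch, not DataStore's real mechanism:

```python
# Toy sketch of execution triggers: a lazy wrapper whose pipeline only
# runs when a value-producing method (len, repr) is called.
# Illustrative only -- not DataStore's real mechanism.

class LazyResult:
    def __init__(self, compute):
        self._compute = compute     # deferred callable
        self._cache = None
        self.executions = 0         # count real executions

    def _materialize(self):
        if self._cache is None:
            self._cache = self._compute()
            self.executions += 1
        return self._cache

    def __len__(self):              # len(lazy) triggers execution
        return len(self._materialize())

    def __repr__(self):             # print(lazy) triggers execution
        return repr(self._materialize())

lazy = LazyResult(lambda: [n for n in range(100) if n % 2 == 0])
assert lazy.executions == 0     # building the object ran nothing
print(len(lazy))                # 50 -- the first trigger runs the pipeline
print(lazy.executions)          # 1  -- subsequent triggers reuse the cache
```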
## Row Order

### pandas

Row order is always preserved:

```python
df = pd.read_csv("data.csv")
print(df.head())    # Always the same order as the file
```

### DataStore

Row order is automatically preserved for most operations:

```python
ds = pd.read_csv("data.csv")
print(ds.head())    # Matches file order

# Filters preserve order
ds_filtered = ds[ds['age'] > 25]    # Same order as pandas
```

DataStore automatically tracks original row positions internally (using `rowNumberInAllBlocks()`) to ensure order consistency with pandas.
### When Order Is Preserved

- File sources (CSV, Parquet, JSON, etc.)
- pandas DataFrame sources
- Filter operations
- Column selection
- After an explicit `sort()` or `sort_values()`
- Operations that define their own order (`nlargest()`, `nsmallest()`, `head()`, `tail()`)

### When Order May Differ

- After `groupby()` aggregations (use `sort_values()` to ensure a consistent order)
- After `merge()` / `join()` with certain join types
- In performance mode (`config.use_performance_mode()`): row order is not guaranteed for any operation. See Performance Mode.
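The `sort_values()` remedy works identically in plain pandas, so it is a portable habit. The sketch below uses only pandas to show how sorting after an aggregation pins down an order that different engines might otherwise report differently:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["b", "a", "b", "a", "c"],
    "salary": [10, 20, 30, 40, 50],
})

# Aggregation output order can differ between engines, so sort
# explicitly whenever a deterministic row order matters:
means = (
    df.groupby("city")["salary"].mean()
      .reset_index()
      .sort_values("city")
      .reset_index(drop=True)
)
print(means["city"].tolist())   # ['a', 'b', 'c']
```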
## No `inplace` Parameter

### pandas

```python
df.drop(columns=['col'], inplace=True)              # Modifies df
df.fillna(0, inplace=True)                          # Modifies df
df.rename(columns={'old': 'new'}, inplace=True)     # Modifies df
```

### DataStore

`inplace=True` is not supported. Always assign the result:

```python
ds = ds.drop(columns=['col'])               # Returns a new DataStore
ds = ds.fillna(0)                           # Returns a new DataStore
ds = ds.rename(columns={'old': 'new'})      # Returns a new DataStore
```

### Why No `inplace`?

DataStore uses immutable operations to enable:

- Query building (lazy evaluation)
- Thread safety
- Easier debugging
- Cleaner code
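The assign-back style required by DataStore is also perfectly valid pandas, so writing code this way keeps it portable between the two. A small pandas-only example of chaining immutable operations:

```python
import pandas as pd

df = pd.DataFrame({"old": [1.0, None, 3.0], "drop_me": [1, 2, 3]})

# Each call returns a new DataFrame; the chain replaces three
# separate inplace=True calls with one readable pipeline.
df = (
    df.drop(columns=["drop_me"])
      .fillna(0)
      .rename(columns={"old": "new"})
)
print(df["new"].tolist())   # [1.0, 0.0, 3.0]
```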
## Index Support

### pandas

Full index support:

```python
df = df.set_index('id')
df.loc['user123']           # Label-based access
df.loc['a':'z']             # Label-based slicing
df.reset_index()
df.index.name = 'user_id'
```

### DataStore

Simplified index support:

```python
# Basic operations work
ds.loc[0:10]    # Integer position
ds.iloc[0:10]   # Same as loc for DataStore

# For pandas-style index operations, convert first
df = ds.to_df()
df = df.set_index('id')
df.loc['user123']
```

### DataStore Source Matters

- **DataFrame source**: preserves the pandas index
- **File source**: uses a simple integer index
## Comparison Behavior

### Comparing with pandas

pandas doesn't recognize DataStore objects:

```python
import pandas as pd
from chdb import datastore as ds

pdf = pd.DataFrame({'a': [1, 2, 3]})
dsf = ds.DataFrame({'a': [1, 2, 3]})

# This doesn't work as expected
pdf == dsf                      # pandas doesn't know DataStore

# Solution: convert the DataStore to pandas first
pdf.equals(dsf.to_pandas())     # True
```

### Using equals()

```python
# DataStore.equals() also works
dsf.equals(pdf)     # Compares with a pandas DataFrame
```
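The reason `equals()` is the right tool is visible in plain pandas: `==` compares element-wise and returns a DataFrame of booleans, while `equals()` collapses the comparison to a single `bool`. That is why converting a DataStore to pandas and then calling `equals()` gives a clean yes/no answer:

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3]})
b = pd.DataFrame({"a": [1, 2, 3]})

# == is element-wise: a DataFrame of booleans, not a single answer
print((a == b).all().all())     # True, but only after reducing twice

# equals() returns one bool, checking values and dtypes together
print(a.equals(b))              # True
```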
## Type Inference

### pandas

Uses numpy/pandas types:

```python
df['col'].dtype     # int64, float64, object, datetime64, etc.
```

### DataStore

May use ClickHouse types:

```python
ds['col'].dtype     # Int64, Float64, String, DateTime, etc.

# Types are converted when going to pandas
df = ds.to_df()
df['col'].dtype     # Now a pandas type
```

### Explicit Casting

```python
# Force a specific type
ds['col'] = ds['col'].astype('int64')
```
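Note that pandas itself already distinguishes lowercase numpy dtypes (`int64`) from capitalized nullable extension dtypes (`Int64`), which look similar to the ClickHouse-style names DataStore may report. A pandas-only illustration of casting between them:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)          # int64 (numpy dtype)

# pandas' nullable extension dtype uses a capitalized name,
# similar in spelling to ClickHouse-style type names:
s2 = s.astype("Int64")
print(s2.dtype)         # Int64

# Casting back to the plain numpy dtype:
s3 = s2.astype("int64")
print(s3.dtype)         # int64
```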
## Memory Model

### pandas

All data lives in memory:

```python
df = pd.read_csv("huge.csv")    # 10 GB in memory!
```

### DataStore

Data stays at the source until needed:

```python
ds = pd.read_csv("huge.csv")            # Just metadata
ds = ds.filter(ds['year'] == 2024)      # Still just metadata

# Only the filtered result is loaded
df = ds.to_df()                         # Perhaps only 1 GB now
```
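You can approximate this by hand in pandas with `usecols` (column pruning) and `chunksize` (incremental loading); DataStore's lazy model simply does the equivalent for you. A small self-contained sketch using an in-memory CSV:

```python
import io
import pandas as pd

csv = io.StringIO("year,city,salary\n2023,a,10\n2024,b,20\n2024,c,30\n")

# Manual column pruning (usecols) plus filtered chunked loading:
# only two of the three columns are parsed, and unwanted rows are
# dropped chunk by chunk instead of being held all at once.
chunks = pd.read_csv(csv, usecols=["year", "salary"], chunksize=2)
filtered = pd.concat(chunk[chunk["year"] == 2024] for chunk in chunks)
print(len(filtered))    # 2
```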
## Error Messages

### Different Error Sources

- **pandas errors**: raised by the pandas library
- **DataStore errors**: raised by chDB or ClickHouse

```python
# You may see ClickHouse-style errors:
# "Code: 62. DB::Exception: Syntax error..."
```

### Debugging Tips

```python
# View the generated SQL
print(ds.to_sql())

# See the execution plan
ds.explain()

# Enable debug logging
from chdb.datastore.config import config
config.enable_debug()
```
## Migration Checklist

When migrating from pandas:

- Replace `inplace=True` calls with explicit assignment (`ds = ds.drop(...)`)
- Convert with `to_df()` before label-based index operations (`set_index`, `loc['label']`)
- Add `sort_values()` after `groupby()` aggregations where row order matters
- Convert with `to_pandas()` before comparing against pandas objects
- Remember that results materialize only at execution triggers (`print`, `len`, `to_df()`)
## Quick Reference

| pandas | DataStore |
|---|---|
| `df[condition]` | Same (returns DataStore) |
| `df.groupby()` | Same (returns LazyGroupBy) |
| `df.drop(inplace=True)` | `ds = ds.drop()` |
| `df.equals(other)` | `ds.to_pandas().equals(other)` |
| `df.loc['label']` | `ds.to_df().loc['label']` |
| `print(df)` | Same (triggers execution) |
| `len(df)` | Same (triggers execution) |