Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-test-mutation-observers.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
DataStore provides comprehensive aggregation and window function support, leveraging ClickHouse’s powerful SQL aggregation capabilities.
Basic Aggregations
Built-in Methods
| Method | SQL Equivalent | Description |
|---|
sum() | SUM() | Sum of values |
mean() | AVG() | Average/mean |
count() | COUNT() | Count non-null values |
min() | MIN() | Minimum value |
max() | MAX() | Maximum value |
median() | MEDIAN() | Median value |
std() | stddevPop() | Standard deviation |
var() | varPop() | Variance |
nunique() | COUNT(DISTINCT) | Count unique values |
Examples:
from pathlib import Path
Path("sales.csv").write_text("""\
region,product,category,amount,quantity,price,date,order_id
East,Widget,Electronics,5200,10,120,2024-01-15,1001
West,Gadget,Electronics,800,5,160,2024-02-20,1002
East,Gizmo,Home,6500,3,100,2024-03-10,1003
North,Widget,Electronics,4500,6,150,2024-06-18,1004
West,Gadget,Electronics,2000,8,250,2024-09-14,1005
""")
from chdb import datastore as pd
ds = pd.read_csv("sales.csv")
# Single column aggregation
total = ds['amount'].sum()
average = ds['amount'].mean()
count = ds['amount'].count()
# All aggregations
print(ds['amount'].sum()) # Total
print(ds['amount'].mean()) # Average
print(ds['amount'].std()) # Standard deviation
print(ds['amount'].median()) # Median
print(ds['amount'].nunique()) # Unique count
GroupBy Aggregations
Single Aggregation
# Group by and aggregate
result = ds.groupby('category')['amount'].sum()
result = ds.groupby('region')['sales'].mean()
Multiple Aggregations
# Dictionary syntax
result = ds.groupby('category').agg({
'amount': 'sum',
'quantity': 'mean',
'order_id': 'count'
})
# List of aggregations per column
result = ds.groupby('category').agg({
'amount': ['sum', 'mean', 'max'],
'quantity': ['sum', 'count']
})
Named Aggregations
# Named aggregation (pandas style)
result = ds.groupby('region').agg(
total_amount=('amount', 'sum'),
avg_quantity=('quantity', 'mean'),
order_count=('order_id', 'count'),
max_price=('price', 'max')
)
Multiple GroupBy Keys
# Group by multiple columns
result = ds.groupby(['region', 'category']).agg({
'amount': 'sum',
'quantity': 'sum'
})
Statistical Aggregations
| Method | SQL Equivalent | Description |
|---|
quantile(q) | quantile(q) | q-th quantile (0-1) |
skew() | skewPop() | Skewness |
kurt() | kurtPop() | Kurtosis |
corr() | corr() | Correlation |
cov() | covar() | Covariance |
sem() | - | Standard error of mean |
Examples:
# Quantiles
q50 = ds['amount'].quantile(0.5) # Median
q95 = ds['amount'].quantile(0.95) # 95th percentile
# Multiple quantiles
quantiles = ds['amount'].quantile([0.25, 0.5, 0.75])
# Correlation between columns
correlation = ds[['sales', 'marketing_spend']].corr()
Conditional Aggregations
ClickHouse-specific conditional aggregation functions.
| Function | ClickHouse | Description |
|---|
sum_if(cond) | sumIf() | Sum where condition |
count_if(cond) | countIf() | Count where condition |
avg_if(cond) | avgIf() | Average where condition |
min_if(cond) | minIf() | Min where condition |
max_if(cond) | maxIf() | Max where condition |
Examples:
from chdb.datastore import F, Field
# Sum only high value orders
high_value_sum = F.sum_if(Field('amount'), Field('amount') > 1000)
# Count active users
active_count = F.count_if(Field('status') == 'active')
# In groupby context
result = ds.groupby('region').agg({
'total': ('amount', 'sum'),
'high_value': ('amount', F.sum_if(Field('amount') > 1000)),
})
Collection Aggregations
ClickHouse-specific functions that collect values.
| Function | ClickHouse | Description |
|---|
group_array() | groupArray() | Collect into array |
group_uniq_array() | groupUniqArray() | Collect unique into array |
group_concat(sep) | groupConcat() | Concatenate strings |
top_k(n) | topK(n) | Top K frequent values |
any() | any() | Any value |
any_last() | anyLast() | Last value |
first_value() | first_value() | First value in order |
last_value() | last_value() | Last value in order |
Examples:
from chdb.datastore import F, Field
# Collect all tags per category
result = ds.groupby('category').agg({
'all_tags': ('tag', F.group_array()),
'unique_tags': ('tag', F.group_uniq_array())
})
# Get top 5 products per region
result = ds.groupby('region').agg({
'top_products': ('product_id', F.top_k(5))
})
Window Functions
Ranking Functions
| Function | SQL | Description |
|---|
row_number() | ROW_NUMBER() | Sequential row number |
rank() | RANK() | Rank with gaps |
dense_rank() | DENSE_RANK() | Rank without gaps |
ntile(n) | NTILE(n) | Divide into n buckets |
percent_rank() | PERCENT_RANK() | Percentile rank (0-1) |
cume_dist() | CUME_DIST() | Cumulative distribution |
Examples:
from chdb.datastore import F, Field
# Add row number
ds['row_num'] = F.row_number().over(order_by='date')
# Rank within groups
ds['rank'] = F.rank().over(
partition_by='category',
order_by='sales'
)
# Dense rank (no gaps)
ds['dense_rank'] = F.dense_rank().over(
partition_by='region',
order_by=('revenue', 'desc')
)
# Divide into quartiles
ds['quartile'] = F.ntile(4).over(order_by='score')
Value Functions
| Function | SQL | Description |
|---|
lag(n) | LAG(col, n) | Previous row value |
lead(n) | LEAD(col, n) | Next row value |
first_value() | FIRST_VALUE() | First value in window |
last_value() | LAST_VALUE() | Last value in window |
nth_value(n) | NTH_VALUE(col, n) | Nth value in window |
Examples:
# Previous and next value
ds['prev_price'] = F.lag('price', 1).over(order_by='date')
ds['next_price'] = F.lead('price', 1).over(order_by='date')
# First and last in partition
ds['first_order'] = F.first_value('amount').over(
partition_by='customer_id',
order_by='date'
)
Cumulative Functions
| Method | Description |
|---|
cumsum() | Cumulative sum |
cummax() | Cumulative maximum |
cummin() | Cumulative minimum |
cumprod() | Cumulative product |
diff(n) | Difference from n rows back |
pct_change(n) | Percent change from n rows back |
Examples:
# Cumulative calculations
ds['running_total'] = ds['amount'].cumsum()
ds['running_max'] = ds['amount'].cummax()
# With grouping
ds['group_cumsum'] = ds.groupby('category')['amount'].cumsum()
# Period over period
ds['daily_diff'] = ds['sales'].diff(1)
ds['pct_change'] = ds['sales'].pct_change(1)
Rolling Windows
# Rolling window aggregations
ds['rolling_avg'] = ds['price'].rolling(window=7).mean()
ds['rolling_sum'] = ds['amount'].rolling(window=30).sum()
ds['rolling_std'] = ds['value'].rolling(window=10).std()
# Expanding windows
ds['expanding_max'] = ds['price'].expanding().max()
ds['expanding_sum'] = ds['amount'].expanding().sum()
F Namespace
The F namespace provides access to ClickHouse functions.
Import
from chdb.datastore import F, Field
Using F Functions
# Aggregations
F.sum(Field('amount'))
F.avg(Field('price'))
F.count(Field('id'))
# Statistical
F.quantile(Field('value'), 0.95)
F.stddev_pop(Field('score'))
F.corr(Field('x'), Field('y'))
# Conditional
F.sum_if(Field('amount'), Field('status') == 'completed')
F.count_if(Field('is_active'))
# String
F.length(Field('name'))
F.upper(Field('text'))
# Date/Time
F.to_year(Field('date'))
F.date_diff('day', Field('start'), Field('end'))
# Array
F.array_sum(Field('values'))
F.array_avg(Field('scores'))
# Math
F.abs(Field('delta'))
F.round(Field('price'), 2)
F.floor(Field('value'))
F.ceil(Field('value'))
F with Window Functions
# Define window frame
window = F.window(
partition_by='category',
order_by='date',
rows_between=(-7, 0) # Current row and 7 preceding
)
ds['rolling_avg'] = F.avg(Field('price')).over(window)
Common Aggregation Patterns
Top N per Group
# Top 3 products per category by sales
result = (ds
.assign(rank=F.row_number().over(
partition_by='category',
order_by=('sales', 'desc')
))
.filter(ds['rank'] <= 3)
)
Running Total
# Running total of sales
ds['running_total'] = F.sum('amount').over(
order_by='date',
rows_between=(None, 0) # All rows up to current
)
Moving Average
# 7-day moving average
ds['ma_7'] = F.avg('price').over(
order_by='date',
rows_between=(-6, 0)
)
Year-over-Year Comparison
# YoY comparison
ds['prev_year_sales'] = F.lag('sales', 12).over(
partition_by='product_id',
order_by='month'
)
ds['yoy_growth'] = (ds['sales'] - ds['prev_year_sales']) / ds['prev_year_sales']
Percentile Ranking
# Rank customers by total spend
ds['spend_percentile'] = F.percent_rank().over(order_by='total_spend')
Aggregation Methods Summary
| Category | Methods |
|---|
| Basic | sum, mean, count, min, max, median |
| Statistical | std, var, quantile, skew, kurt, corr, cov |
| Conditional | sum_if, count_if, avg_if, min_if, max_if |
| Collection | group_array, group_uniq_array, group_concat, top_k |
| Ranking | row_number, rank, dense_rank, ntile, percent_rank |
| Value | lag, lead, first_value, last_value, nth_value |
| Cumulative | cumsum, cummax, cummin, cumprod, diff, pct_change |
| Rolling | rolling().mean/sum/std/..., expanding().mean/sum/... |