title: WayyDB API
emoji: ⚡
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860

WayyDB

High-performance columnar time-series database for quantitative finance

kdb+ functionality • Pythonic API • Zero-copy NumPy • SIMD-accelerated

WayyDB is a C++ time-series database with Python bindings, designed for quantitative research and trading systems. It provides kdb+-like temporal join operations with a modern, accessible API—no q language required.

Why WayyDB?

| Challenge | WayyDB Solution |
|---|---|
| kdb+ costs $100K+/year | Open source, free forever |
| q language learning curve | Pythonic API you already know |
| Pandas/Polars offer as-of joins (`merge_asof`, `join_asof`) but no window join | Native `aj()` and `wj()` primitives |
| Memory copies kill performance | Zero-copy NumPy via mmap |
| Slow aggregations | AVX2/AVX-512 SIMD acceleration |

Features

As-of Join (aj) — For each trade, find the most recent quote. O(n log m) via binary search on sorted indices
Window Join (wj) — Get all quotes within a time window around each trade
Zero-copy NumPy — Columns are memory-mapped; to_numpy() returns views, not copies
SIMD Aggregations — Sum, avg, min, max accelerated with AVX2 intrinsics
Window Functions — Moving average, EMA, rolling std with O(n) complexity
Persistent Storage — Tables saved as memory-mapped files for instant loading

Installation

pip install wayy-db

Or build from source:

git clone https://github.com/wayy-research/wayydb.git
cd wayydb
pip install -e .

Quick Start

Create Tables from NumPy Arrays

import wayy_db as wdb
import numpy as np

# Create trades table
trades = wdb.from_dict({
    "timestamp": np.array([1000, 2000, 3000, 4000, 5000], dtype=np.int64),
    "symbol": np.array([0, 1, 0, 1, 0], dtype=np.uint32),  # AAPL=0, MSFT=1
    "price": np.array([150.25, 380.50, 151.00, 381.25, 152.00]),
    "size": np.array([100, 200, 150, 250, 100], dtype=np.int64),
}, name="trades", sorted_by="timestamp")

# Create quotes table
quotes = wdb.from_dict({
    "timestamp": np.array([500, 900, 1500, 2500, 3500], dtype=np.int64),
    "symbol": np.array([0, 1, 0, 1, 0], dtype=np.uint32),
    "bid": np.array([149.50, 379.50, 150.50, 380.50, 151.50]),
    "ask": np.array([150.00, 380.00, 151.00, 381.00, 152.00]),
}, name="quotes", sorted_by="timestamp")

As-of Join: Match Trades to Quotes

# For each trade, get the most recent quote for that symbol
result = wdb.ops.aj(trades, quotes, on=["symbol"], as_of="timestamp")

# Result contains trade columns + quote columns (bid, ask)
print(result["bid"].to_numpy())  # [149.5, 379.5, 150.5, 380.5, 151.5]
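To make the as-of semantics concrete, here is a minimal pure-Python reference that reproduces the expected `bid` values above using binary search. It is a sketch, not WayyDB's implementation: the function name `asof_join_reference` is hypothetical, and it assumes that a quote with a timestamp equal to the trade's counts as a match and that a trade with no earlier quote yields `None`.

```python
from bisect import bisect_right
from collections import defaultdict

def asof_join_reference(left_ts, left_key, right_ts, right_key, right_val):
    """For each left row, take the latest right row with the same key
    and right_ts <= left_ts (None if there is none)."""
    # Group right rows by key; right_ts is assumed sorted ascending,
    # so each per-key list stays sorted as well.
    ts_by_key = defaultdict(list)
    val_by_key = defaultdict(list)
    for ts, key, val in zip(right_ts, right_key, right_val):
        ts_by_key[key].append(ts)
        val_by_key[key].append(val)

    out = []
    for ts, key in zip(left_ts, left_key):
        # bisect_right returns how many right timestamps are <= ts
        pos = bisect_right(ts_by_key[key], ts)
        out.append(val_by_key[key][pos - 1] if pos > 0 else None)
    return out

trade_ts = [1000, 2000, 3000, 4000, 5000]
trade_sym = [0, 1, 0, 1, 0]
quote_ts = [500, 900, 1500, 2500, 3500]
quote_sym = [0, 1, 0, 1, 0]
quote_bid = [149.50, 379.50, 150.50, 380.50, 151.50]

bids = asof_join_reference(trade_ts, trade_sym, quote_ts, quote_sym, quote_bid)
print(bids)  # [149.5, 379.5, 150.5, 380.5, 151.5]
```

Each left row costs one binary search over the quotes for its symbol, which is where the logarithmic factor in the join's complexity comes from.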

Aggregations and Window Functions

# SIMD-accelerated aggregations
total_volume = wdb.ops.sum(trades["size"])
avg_price = wdb.ops.avg(trades["price"])
price_std = wdb.ops.std(trades["price"])

# Window functions
mavg_20 = wdb.ops.mavg(trades["price"], window=20)
ema = wdb.ops.ema(trades["price"], alpha=0.1)
rolling_std = wdb.ops.mstd(trades["price"], window=10)

# Returns and changes
returns = wdb.ops.pct_change(trades["price"])
price_diff = wdb.ops.diff(trades["price"])
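For readers who want to sanity-check results, the following pure-Python sketch shows one common definition of `ema` and `pct_change`. The boundary conventions are assumptions, not documented WayyDB behavior: the EMA is seeded with the first value, and the first `periods` entries of the percent change are NaN. The function names are hypothetical.

```python
import math

def ema_reference(values, alpha):
    """EMA seeded with the first value: e[t] = alpha*x[t] + (1-alpha)*e[t-1]."""
    out = []
    for x in values:
        prev = out[-1] if out else x
        out.append(alpha * x + (1.0 - alpha) * prev)
    return out

def pct_change_reference(values, periods=1):
    """(x[t] - x[t-periods]) / x[t-periods]; the first `periods` entries are NaN."""
    out = [math.nan] * min(periods, len(values))
    for t in range(periods, len(values)):
        out.append((values[t] - values[t - periods]) / values[t - periods])
    return out

prices = [150.25, 380.50, 151.00, 381.25, 152.00]
print(ema_reference(prices, alpha=0.1))
print(pct_change_reference(prices, periods=2))
```

Note that the sample `trades` table alternates between two symbols, so `periods=2` is used here to compare each price with the previous price of the same symbol.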

Persistent Database

# Create persistent database
db = wdb.Database("/data/markets")

# Add table (automatically saved)
db.add_table(trades)

# Later: reload with zero-copy mmap
db2 = wdb.Database("/data/markets")
trades = db2["trades"]  # Instant load via memory mapping

# Access data without copying
prices = trades["price"].to_numpy()  # Zero-copy view into mmap'd file
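The zero-copy idea can be illustrated with NumPy alone. The sketch below uses `np.memmap` on a raw binary file as a stand-in for a saved column; WayyDB's actual on-disk layout is not described in this README, so the file name and format here are purely illustrative.

```python
import os
import tempfile
import numpy as np

# Write a float64 column to a raw binary file (illustrative format only).
path = os.path.join(tempfile.mkdtemp(), "price.bin")
np.array([150.25, 380.50, 151.00, 381.25, 152.00]).tofile(path)

# Map the file instead of reading it: pages are loaded lazily on first access.
col = np.memmap(path, dtype=np.float64, mode="r")

# Slicing yields a view onto the same mapping, not a copy.
tail = col[2:]
print(np.shares_memory(col, tail))  # True
print(tail.flags.writeable)         # False, because the file is mapped read-only
print(float(col.sum()))             # 1215.0
```

Because nothing is deserialized, opening the mapping takes roughly the same time regardless of file size; the cost of reading is paid only for the pages that are actually touched.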

Pandas/Polars Interop

import pandas as pd
import polars as pl

# From pandas
df = pd.DataFrame({"timestamp": [...], "price": [...]})
table = wdb.from_pandas(df, name="from_pandas", sorted_by="timestamp")

# From polars
df = pl.DataFrame({"timestamp": [...], "price": [...]})
table = wdb.from_polars(df, name="from_polars", sorted_by="timestamp")

# To dict (for conversion back)
data = table.to_dict()  # {"timestamp": np.array, "price": np.array, ...}

API Reference

Core Classes

| Class | Description |
|---|---|
| `Database(path="")` | Container for tables. Empty path = in-memory |
| `Table(name="")` | Columnar table with optional sorted index |
| `Column` | Typed column with zero-copy NumPy access |

Table Methods

table.num_rows          # Number of rows
table.num_columns       # Number of columns
table.column_names()    # List of column names
table.sorted_by         # Column used for temporal ordering (or None)
table["col"]            # Get column by name
table.to_dict()         # Export as {name: np.array} dict
table.save(path)        # Save to directory
Table.load(path)        # Load from directory (copies data)
Table.mmap(path)        # Memory-map from directory (zero-copy)

Operations (wayy_db.ops)

Aggregations

| Function | Description |
|---|---|
| `sum(col)` | Sum of values (SIMD) |
| `avg(col)` | Mean of values |
| `min(col)` | Minimum value |
| `max(col)` | Maximum value |
| `std(col)` | Standard deviation |

Temporal Joins

| Function | Description |
|---|---|
| `aj(left, right, on, as_of)` | As-of join: most recent right row for each left row |
| `wj(left, right, on, as_of, before, after)` | Window join: all right rows within time window |
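The Quick Start has no window-join example, so here is a pure-Python sketch of the semantics. It is not WayyDB's implementation: the function name is hypothetical, and it assumes `before` and `after` are offsets in the same units as the timestamp column with both window bounds inclusive.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def window_join_reference(left_ts, left_key, right_ts, right_key, right_val,
                          before, after):
    """For each left row, collect right values with the same key and
    left_ts - before <= right_ts <= left_ts + after (bounds assumed inclusive)."""
    ts_by_key = defaultdict(list)
    val_by_key = defaultdict(list)
    # right_ts is assumed sorted ascending, so per-key lists stay sorted.
    for ts, key, val in zip(right_ts, right_key, right_val):
        ts_by_key[key].append(ts)
        val_by_key[key].append(val)

    out = []
    for ts, key in zip(left_ts, left_key):
        lo = bisect_left(ts_by_key[key], ts - before)   # first right_ts >= ts - before
        hi = bisect_right(ts_by_key[key], ts + after)   # one past last right_ts <= ts + after
        out.append(val_by_key[key][lo:hi])
    return out

matches = window_join_reference(
    left_ts=[1000, 2000, 3000], left_key=[0, 1, 0],
    right_ts=[500, 900, 1500, 2500, 3500], right_key=[0, 1, 0, 1, 0],
    right_val=[149.50, 379.50, 150.50, 380.50, 151.50],
    before=600, after=600,
)
print(matches)  # [[149.5, 150.5], [380.5], [151.5]]
```

Two binary searches per left row plus the cost of emitting the matched slice gives the O(n log m + matches) behavior one would expect from a sorted-index window join.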

Window Functions

| Function | Description |
|---|---|
| `mavg(col, window)` | Moving average |
| `msum(col, window)` | Moving sum |
| `mstd(col, window)` | Moving standard deviation |
| `mmin(col, window)` | Moving minimum (O(n) via monotonic deque) |
| `mmax(col, window)` | Moving maximum (O(n) via monotonic deque) |
| `ema(col, alpha)` | Exponential moving average |
| `diff(col, periods=1)` | Difference from n periods ago |
| `pct_change(col, periods=1)` | Percent change from n periods ago |
| `shift(col, n)` | Shift values by n positions |
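The monotonic-deque technique mentioned for `mmin` and `mmax` can be sketched in a few lines of Python. This is an illustration of the general algorithm, not WayyDB's C++ code; the function name is hypothetical, and the treatment of the first `window - 1` positions (using whatever values have been seen so far) is an assumption.

```python
from collections import deque

def moving_max(values, window):
    """Moving maximum over the trailing `window` values in O(n) total time."""
    out = []
    candidates = deque()  # indices whose values are in strictly decreasing order
    for i, x in enumerate(values):
        # Drop candidates that can never be the maximum again.
        while candidates and values[candidates[-1]] <= x:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it has left the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        out.append(values[candidates[0]])
    return out

print(moving_max([3, 1, 4, 1, 5, 9, 2, 6], window=3))
# [3, 3, 4, 4, 5, 9, 9, 9]
```

Each index is pushed and popped at most once, so the total work is linear in the input length regardless of the window size. A moving minimum follows by flipping the comparison.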

Type System

| Type | Python | C++ | Size | Use Case |
|---|---|---|---|---|
| Int64 | `np.int64` | `int64_t` | 8B | Quantities, IDs |
| Float64 | `np.float64` | `double` | 8B | Prices, returns |
| Timestamp | `np.int64` | `int64_t` | 8B | Nanoseconds since epoch |
| Symbol | `np.uint32` | `uint32_t` | 4B | Interned strings (tickers) |
| Bool | `np.uint8` | `uint8_t` | 1B | Flags |
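Since Symbol columns hold `uint32` codes and Timestamp columns hold `int64` nanoseconds, input data usually needs a small conversion step first. The sketch below shows one way to do this with plain NumPy; it assumes you manage the ticker-to-code mapping yourself, which may differ from any interning helper WayyDB provides.

```python
import numpy as np

tickers = np.array(["AAPL", "MSFT", "AAPL", "MSFT", "AAPL"])

# np.unique returns the sorted distinct tickers plus, for each row,
# the index of its ticker in that sorted list.
names, codes = np.unique(tickers, return_inverse=True)
symbol = codes.astype(np.uint32)
print(names.tolist())   # ['AAPL', 'MSFT']
print(symbol.tolist())  # [0, 1, 0, 1, 0]

# Decode codes back to tickers by indexing into the name table.
decoded = names[symbol]

# Wall-clock times to int64 nanoseconds since the Unix epoch.
times = np.array(["2024-01-02T14:30:00", "2024-01-02T14:30:01"],
                 dtype="datetime64[ns]")
timestamp = times.astype(np.int64)
print(int(timestamp[1] - timestamp[0]))  # 1000000000
```

The resulting `symbol` and `timestamp` arrays have the dtypes listed in the table above.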

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Python Interface                        │
│         wayy_db.Database | Table | Column | ops              │
├─────────────────────────────────────────────────────────────┤
│                    pybind11 Bindings                         │
│         Zero-copy NumPy arrays via buffer protocol           │
├─────────────────────────────────────────────────────────────┤
│                       C++ Core Engine                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Storage   │  │   Compute   │  │      Joins          │  │
│  │  • mmap I/O │  │  • AVX2 agg │  │  • O(n log m) aj    │  │
│  │  • columnar │  │  • windows  │  │  • O(n) wj          │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                  Memory-Mapped File Storage                  │
│              Zero-copy | Lazy loading | Shared               │
└─────────────────────────────────────────────────────────────┘

Performance

Complexity

| Operation | Complexity | Notes |
|---|---|---|
| As-of join | O(n log(m/k)) | n=left rows, m=right rows, k=unique keys |
| Window join | O(n log m + matches) | Plus output size |
| Aggregations | O(n) | SIMD 4x speedup for sum |
| Window functions | O(n) | Single pass with O(1) update |
| Point lookup | O(log n) | Binary search on sorted index |
| Load from disk | O(1) | Memory mapping, no deserialization |
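The "single pass with O(1) update" entry for window functions can be illustrated with a trailing moving average that maintains a running sum. This is a sketch of the general technique, with a hypothetical function name and an assumed boundary rule (early positions average the values seen so far); WayyDB's own handling of the first `window - 1` rows is not specified here.

```python
def moving_average(values, window):
    """Trailing moving average in one pass, updating a running sum in O(1)."""
    out = []
    running = 0.0
    for i, x in enumerate(values):
        running += x                       # value entering the window
        if i >= window:
            running -= values[i - window]  # value leaving the window
        out.append(running / min(i + 1, window))
    return out

print(moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=2))
# [1.0, 1.5, 2.5, 3.5, 4.5]
```

The work per element is constant, so the total cost does not depend on the window size. On very long floating-point series a running sum can accumulate rounding error, which is one reason production implementations sometimes use compensated summation.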

Design Targets

| Metric | Target |
|---|---|
| As-of join (1M x 1M rows) | < 150ms |
| Simple aggregation (1B rows) | < 80ms |
| Binary size | < 5 MB |
| Memory overhead | < 1% beyond data |
Memory overhead	< 1% beyond data
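As a back-of-envelope reading of the aggregation target, the arithmetic below converts it into a required read bandwidth. It assumes a single pass over one 8-byte column (Float64 or Int64) with nothing skipped or precomputed; the variable names are illustrative.

```python
rows = 1_000_000_000
bytes_per_value = 8      # Float64 / Int64 column width
target_ms = 80           # "< 80ms" design target

column_bytes = rows * bytes_per_value
bytes_per_ms = column_bytes / target_ms
required_gb_per_s = bytes_per_ms * 1000 / 1e9

print(column_bytes / 1e9)   # 8.0   (GB in the column)
print(required_gb_per_s)    # 100.0 (GB/s of sustained reads)
```

Under these assumptions the target corresponds to scanning about 100 GB/s, so whether it is reachable depends heavily on the machine's memory bandwidth and on how many cores take part in the scan.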

Building from Source

Requirements

CMake >= 3.20
C++20 compiler (GCC 11+, Clang 14+, MSVC 2022+)
Python >= 3.9

Build

git clone https://github.com/wayy-research/wayydb.git
cd wayydb

# Option 1: pip install (recommended)
pip install -e .

# Option 2: CMake directly
mkdir build && cd build
cmake .. -DWAYY_BUILD_PYTHON=ON -DWAYY_BUILD_TESTS=ON
make -j$(nproc)

Run Tests

# C++ tests (31 tests)
cd build && ctest --output-on-failure

# Python tests (17 tests)
PYTHONPATH=python pytest tests/python -v

Comparison with Alternatives

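| Feature | WayyDB | kdb+ | DuckDB | Polars |
|---|---|---|---|---|
| As-of join | Native | Native | Native (`ASOF JOIN`) | Native (`join_asof`) |
| Window join | Native | Native | None | None |
| Zero-copy Python | Yes | No | No | Limited |
| Sorted index optimization | Yes | Yes | No | No |
| License | MIT | Commercial | MIT | MIT |
| Learning curve | Low | High (q) | Low | Low |
| Persistence | mmap | Native | Native | None |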

Roadmap

String column type with dictionary encoding
LZ4 compression for columns
Parallel aggregations
More join types (inner, left, full)
Query optimizer
Streaming ingestion API

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please read our contributing guidelines and submit PRs to the develop branch.

Built with C++20 and Python by Wayy Research