---
title: WayyDB API
emoji:
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
<p align="center">
<h1 align="center">WayyDB</h1>
<p align="center">
<strong>High-performance columnar time-series database for quantitative finance</strong>
</p>
<p align="center">
kdb+ functionality &bull; Pythonic API &bull; Zero-copy NumPy &bull; SIMD-accelerated
</p>
</p>
---
WayyDB is a C++ time-series database with Python bindings, designed for quantitative research and trading systems. It provides **kdb+-like temporal join operations** through a modern, accessible API; no q language is required.
## Why WayyDB?
| Challenge | WayyDB Solution |
|-----------|-----------------|
| kdb+ costs $100K+/year | Open source, free forever |
| q language learning curve | Pythonic API you already know |
| Pandas/Polars have as-of joins (`merge_asof`, `join_asof`) but no dedicated window join | Native `aj()` and `wj()` primitives |
| Memory copies kill performance | Zero-copy NumPy via mmap |
| Slow aggregations | AVX2/AVX-512 SIMD acceleration |
## Features
- **As-of Join (aj)** — For each trade, find the most recent quote. O(n log m) via binary search on sorted indices
- **Window Join (wj)** — Get all quotes within a time window around each trade
- **Zero-copy NumPy** — Columns are memory-mapped; `to_numpy()` returns views, not copies
- **SIMD Aggregations** — Sum, avg, min, max accelerated with AVX2 intrinsics
- **Window Functions** — Moving average, EMA, rolling std with O(n) complexity
- **Persistent Storage** — Tables saved as memory-mapped files for instant loading
## Installation
```bash
pip install wayy-db
```
Or build from source:
```bash
git clone https://github.com/wayy-research/wayydb.git
cd wayydb
pip install -e .
```
## Quick Start
### Create Tables from NumPy Arrays
```python
import wayy_db as wdb
import numpy as np
# Create trades table
trades = wdb.from_dict({
"timestamp": np.array([1000, 2000, 3000, 4000, 5000], dtype=np.int64),
"symbol": np.array([0, 1, 0, 1, 0], dtype=np.uint32), # AAPL=0, MSFT=1
"price": np.array([150.25, 380.50, 151.00, 381.25, 152.00]),
"size": np.array([100, 200, 150, 250, 100], dtype=np.int64),
}, name="trades", sorted_by="timestamp")
# Create quotes table
quotes = wdb.from_dict({
"timestamp": np.array([500, 900, 1500, 2500, 3500], dtype=np.int64),
"symbol": np.array([0, 1, 0, 1, 0], dtype=np.uint32),
"bid": np.array([149.50, 379.50, 150.50, 380.50, 151.50]),
"ask": np.array([150.00, 380.00, 151.00, 381.00, 152.00]),
}, name="quotes", sorted_by="timestamp")
```
### As-of Join: Match Trades to Quotes
```python
# For each trade, get the most recent quote for that symbol
result = wdb.ops.aj(trades, quotes, on=["symbol"], as_of="timestamp")
# Result contains trade columns + quote columns (bid, ask)
print(result["bid"].to_numpy()) # [149.5, 379.5, 150.5, 380.5, 151.5]
```
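To make the matching rule concrete, here is a minimal pure-Python sketch of an as-of join built on binary search. It is an illustration, not WayyDB's C++ implementation: the function `asof_join` and its parameters are hypothetical, and it assumes a quote whose timestamp equals the trade timestamp counts as a match.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(left_ts, left_key, right_ts, right_key, right_val):
    """For each left row, return the value of the latest right row that has
    the same key and a timestamp <= the left timestamp (None if none exists)."""
    # Group right rows by key. right_ts is sorted, so each group stays sorted.
    ts_by_key = defaultdict(list)
    val_by_key = defaultdict(list)
    for ts, key, val in zip(right_ts, right_key, right_val):
        ts_by_key[key].append(ts)
        val_by_key[key].append(val)

    out = []
    for ts, key in zip(left_ts, left_key):
        # bisect_right counts the right rows for this key with timestamp <= ts.
        i = bisect_right(ts_by_key[key], ts)
        out.append(val_by_key[key][i - 1] if i > 0 else None)
    return out


# Same sample data as the trades and quotes tables above.
bids = asof_join(
    left_ts=[1000, 2000, 3000, 4000, 5000],
    left_key=[0, 1, 0, 1, 0],
    right_ts=[500, 900, 1500, 2500, 3500],
    right_key=[0, 1, 0, 1, 0],
    right_val=[149.50, 379.50, 150.50, 380.50, 151.50],
)
print(bids)  # [149.5, 379.5, 150.5, 380.5, 151.5]
```

Each lookup is one binary search over the quotes for a single symbol, which is where the logarithmic factor in the join cost comes from.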
### Aggregations and Window Functions
```python
# SIMD-accelerated aggregations
total_volume = wdb.ops.sum(trades["size"])
avg_price = wdb.ops.avg(trades["price"])
price_std = wdb.ops.std(trades["price"])
# Window functions
mavg_3 = wdb.ops.mavg(trades["price"], window=3)  # window fits the 5-row sample table
ema = wdb.ops.ema(trades["price"], alpha=0.1)
rolling_std = wdb.ops.mstd(trades["price"], window=3)
# Returns and changes
returns = wdb.ops.pct_change(trades["price"])
price_diff = wdb.ops.diff(trades["price"])
```
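For readers unfamiliar with these operations, the sketch below spells out the usual definitions of `ema`, `diff`, and `pct_change` in plain Python. The seeding of the EMA with the first value and the use of `None` for positions without a predecessor are assumptions made for this example; WayyDB's handling of those edge positions may differ.

```python
def ema(values, alpha):
    """y[0] = x[0]; y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    out = []
    for x in values:
        out.append(x if not out else alpha * x + (1 - alpha) * out[-1])
    return out


def diff(values, periods=1):
    """x[t] - x[t - periods]; leading positions have no predecessor."""
    pad = [None] * min(periods, len(values))
    return pad + [b - a for a, b in zip(values, values[periods:])]


def pct_change(values, periods=1):
    """x[t] / x[t - periods] - 1; leading positions have no predecessor."""
    pad = [None] * min(periods, len(values))
    return pad + [b / a - 1.0 for a, b in zip(values, values[periods:])]


prices = [100.0, 110.0, 99.0]
print(ema(prices, alpha=0.5))  # [100.0, 105.0, 102.0]
print(diff(prices))            # [None, 10.0, -11.0]
print(pct_change(prices))
```

All three run in a single pass, since each output depends only on the current value and one earlier value or state.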
### Persistent Database
```python
# Create persistent database
db = wdb.Database("/data/markets")
# Add table (automatically saved)
db.add_table(trades)
# Later: reload with zero-copy mmap
db2 = wdb.Database("/data/markets")
trades = db2["trades"] # Instant load via memory mapping
# Access data without copying
prices = trades["price"].to_numpy() # Zero-copy view into mmap'd file
```
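The zero-copy idea can be demonstrated with the standard library alone. This sketch assumes a hypothetical file layout in which a column is stored as raw native-endian float64 values; it is not WayyDB's on-disk format. It maps the file and views the bytes as doubles without copying them.

```python
import mmap
import os
import tempfile
from array import array

path = os.path.join(tempfile.mkdtemp(), "price.col")

# Write a float64 column as raw bytes (hypothetical layout for this sketch).
with open(path, "wb") as f:
    array("d", [150.25, 380.50, 151.00]).tofile(f)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as buf:          # map the whole file
        # View the mapped bytes as float64 values; no data is copied.
        with memoryview(buf) as raw, raw.cast("d") as prices:
            first = prices[0]
            prices[1] = 999.0                      # writes through to the mapping

# Re-read the file to confirm the view and the file shared the same bytes.
with open(path, "rb") as f:
    on_disk = array("d", f.read())

print(first, on_disk.tolist())
```

Because the view aliases the mapping, opening a large column costs roughly the same as opening a small one; pages are read from disk only when they are touched.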
### Pandas/Polars Interop
```python
import pandas as pd
import polars as pl
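# From pandas (sample values for illustration)
df = pd.DataFrame({"timestamp": [1000, 2000, 3000], "price": [150.25, 151.00, 152.00]})
table = wdb.from_pandas(df, name="from_pandas", sorted_by="timestamp")
# From polars (sample values for illustration)
df = pl.DataFrame({"timestamp": [1000, 2000, 3000], "price": [150.25, 151.00, 152.00]})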
table = wdb.from_polars(df, name="from_polars", sorted_by="timestamp")
# To dict (for conversion back)
data = table.to_dict() # {"timestamp": np.array, "price": np.array, ...}
```
## API Reference
### Core Classes
| Class | Description |
|-------|-------------|
| `Database(path="")` | Container for tables. Empty path = in-memory |
| `Table(name="")` | Columnar table with optional sorted index |
| `Column` | Typed column with zero-copy NumPy access |
### Table Methods
```python
table.num_rows # Number of rows
table.num_columns # Number of columns
table.column_names() # List of column names
table.sorted_by # Column used for temporal ordering (or None)
table["col"] # Get column by name
table.to_dict() # Export as {name: np.array} dict
table.save(path) # Save to directory
Table.load(path) # Load from directory (copies data)
Table.mmap(path) # Memory-map from directory (zero-copy)
```
### Operations (wayy_db.ops)
#### Aggregations
| Function | Description |
|----------|-------------|
| `sum(col)` | Sum of values (SIMD) |
| `avg(col)` | Mean of values |
| `min(col)` | Minimum value |
| `max(col)` | Maximum value |
| `std(col)` | Standard deviation |
#### Temporal Joins
| Function | Description |
|----------|-------------|
| `aj(left, right, on, as_of)` | As-of join: most recent right row for each left row |
| `wj(left, right, on, as_of, before, after)` | Window join: all right rows within time window |
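
As an illustration of what a window join computes, here is a minimal pure-Python sketch for a single key. The function `window_join` is hypothetical, and it assumes both window bounds are inclusive; WayyDB's exact boundary semantics are not stated here.

```python
from bisect import bisect_left, bisect_right


def window_join(left_ts, right_ts, right_val, before, after):
    """For each left timestamp t, collect the right values whose timestamp
    lies in [t - before, t + after]. right_ts must be sorted ascending."""
    out = []
    for t in left_ts:
        lo = bisect_left(right_ts, t - before)   # first right row inside the window
        hi = bisect_right(right_ts, t + after)   # one past the last row inside
        out.append(right_val[lo:hi])
    return out


quote_ts = [500, 900, 1500, 2500, 3500]
quote_bid = [149.50, 379.50, 150.50, 380.50, 151.50]
matches = window_join([1000, 3000], quote_ts, quote_bid, before=600, after=600)
print(matches)  # [[149.5, 379.5, 150.5], [380.5, 151.5]]
```

Two binary searches locate each window, and the slice copies only the matching rows, so the cost grows with the number of left rows and the size of the output.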
#### Window Functions
| Function | Description |
|----------|-------------|
| `mavg(col, window)` | Moving average |
| `msum(col, window)` | Moving sum |
| `mstd(col, window)` | Moving standard deviation |
| `mmin(col, window)` | Moving minimum (O(n) via monotonic deque) |
| `mmax(col, window)` | Moving maximum (O(n) via monotonic deque) |
| `ema(col, alpha)` | Exponential moving average |
| `diff(col, periods=1)` | Difference from n periods ago |
| `pct_change(col, periods=1)` | Percent change from n periods ago |
| `shift(col, n)` | Shift values by n positions |
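
The monotonic-deque technique mentioned for `mmin` and `mmax` can be sketched in a few lines of Python. This is a generic illustration, not the library's code; in particular, returning the maximum of a shorter window for the first `window - 1` positions is an assumption of this example.

```python
from collections import deque


def moving_max(values, window):
    """Maximum over the trailing `window` values (fewer at the start)."""
    out = []
    candidates = deque()  # indices whose values are strictly decreasing
    for i, x in enumerate(values):
        # Older candidates that are <= x can never be the maximum again.
        while candidates and values[candidates[-1]] <= x:
            candidates.pop()
        candidates.append(i)
        # Drop the front once it has slid out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        out.append(values[candidates[0]])
    return out


print(moving_max([3, 1, 4, 1, 5, 9, 2, 6], window=3))  # [3, 3, 4, 4, 5, 9, 9, 9]
```

Every index is pushed and popped at most once, so the whole pass is O(n) regardless of the window size.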
## Type System
| Type | Python | C++ | Size | Use Case |
|------|--------|-----|------|----------|
| Int64 | `np.int64` | `int64_t` | 8B | Quantities, IDs |
| Float64 | `np.float64` | `double` | 8B | Prices, returns |
| Timestamp | `np.int64` | `int64_t` | 8B | Nanoseconds since epoch |
| Symbol | `np.uint32` | `uint32_t` | 4B | Interned strings (tickers) |
| Bool | `np.uint8` | `uint8_t` | 1B | Flags |
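
Symbol interning replaces each distinct string with a small integer code, so a ticker column can be stored and compared as integers. The sketch below shows the idea with a hypothetical `SymbolTable` helper; it is not part of the WayyDB API.

```python
class SymbolTable:
    """Maps strings to dense integer codes and back (hypothetical helper)."""

    def __init__(self):
        self._code_of = {}   # string -> code
        self._names = []     # code -> string

    def intern(self, name):
        code = self._code_of.get(name)
        if code is None:
            # First time this string is seen: assign the next free code.
            code = len(self._names)
            self._code_of[name] = code
            self._names.append(name)
        return code

    def name(self, code):
        return self._names[code]


symbols = SymbolTable()
codes = [symbols.intern(s) for s in ["AAPL", "MSFT", "AAPL", "MSFT", "AAPL"]]
print(codes)            # [0, 1, 0, 1, 0]
print(symbols.name(1))  # MSFT
```

The resulting codes match the `symbol` arrays used in the Quick Start tables, where AAPL is 0 and MSFT is 1.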
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Python Interface │
│ wayy_db.Database | Table | Column | ops │
├─────────────────────────────────────────────────────────────┤
│ pybind11 Bindings │
│ Zero-copy NumPy arrays via buffer protocol │
├─────────────────────────────────────────────────────────────┤
│ C++ Core Engine │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Storage │ │ Compute │ │ Joins │ │
│ │ • mmap I/O │ │ • AVX2 agg │ │ • O(n log m) aj │ │
│ │ • columnar │ │ • windows │ │ • O(n) wj │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Memory-Mapped File Storage │
│ Zero-copy | Lazy loading | Shared │
└─────────────────────────────────────────────────────────────┘
```
## Performance
### Complexity
| Operation | Complexity | Notes |
|-----------|------------|-------|
| As-of join | O(n log(m/k)) | n=left rows, m=right rows, k=unique keys |
| Window join | O(n log m + matches) | Plus output size |
| Aggregations | O(n) | Up to ~4x for `sum` with AVX2 (four float64 lanes per 256-bit register) |
| Window functions | O(n) | Single pass with O(1) update |
| Point lookup | O(log n) | Binary search on sorted index |
| Load from disk | O(1) | Memory mapping, no deserialization |
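
The "single pass with O(1) update" entry for window functions can be illustrated with a moving average that keeps a running sum. This is a sketch under stated assumptions, not the library's implementation: the function name is hypothetical, and averaging over a shorter window at the start is a choice made for this example.

```python
def moving_average(values, window):
    """Mean over the trailing `window` values (fewer at the start)."""
    out = []
    running = 0.0
    for i, x in enumerate(values):
        running += x                       # add the value entering the window
        if i >= window:
            running -= values[i - window]  # remove the value leaving it
        out.append(running / min(i + 1, window))
    return out


print(moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=2))  # [1.0, 1.5, 2.5, 3.5, 4.5]
```

Each step does one addition and at most one subtraction, so the total work is linear in the column length and independent of the window size. A production version would also need to manage floating-point drift in the running sum over long columns.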
### Design Targets
| Metric | Target |
|--------|--------|
| As-of join (1M x 1M rows) | < 150ms |
| Simple aggregation (1B rows) | < 80ms |
| Binary size | < 5 MB |
| Memory overhead | < 1% beyond data |
## Building from Source
### Requirements
- CMake >= 3.20
- C++20 compiler (GCC 11+, Clang 14+, MSVC 2022+)
- Python >= 3.9
### Build
```bash
git clone https://github.com/wayy-research/wayydb.git
cd wayydb
# Option 1: pip install (recommended)
pip install -e .
# Option 2: CMake directly
mkdir build && cd build
cmake .. -DWAYY_BUILD_PYTHON=ON -DWAYY_BUILD_TESTS=ON
make -j$(nproc)
```
### Run Tests
```bash
# C++ tests (31 tests)
cd build && ctest --output-on-failure
# Python tests (17 tests)
PYTHONPATH=python pytest tests/python -v
```
## Comparison with Alternatives
| Feature | WayyDB | kdb+ | DuckDB | Polars |
|---------|--------|------|--------|--------|
| As-of join | Native | Native | Native (`ASOF JOIN`) | `join_asof` |
| Window join | Native | Native | None | None |
| Zero-copy Python | Yes | No | No | Limited |
| Sorted index optimization | Yes | Yes | No | No |
| License | MIT | Commercial | MIT | MIT |
| Learning curve | Low | High (q) | Low | Low |
| Persistence | mmap | Native | Native | None |
## Roadmap
- [ ] String column type with dictionary encoding
- [ ] LZ4 compression for columns
- [ ] Parallel aggregations
- [ ] More join types (inner, left, full)
- [ ] Query optimizer
- [ ] Streaming ingestion API
## License
MIT License - see [LICENSE](LICENSE) for details.
## Contributing
Contributions welcome! Please read our contributing guidelines and submit PRs to the `develop` branch.
---
<p align="center">
Built with C++20 and Python by <a href="https://wayy.io">Wayy Research</a>
</p>