	---
	title: WayyDB API
	emoji: ⚡
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	---

	<p align="center">
	<h1 align="center">WayyDB</h1>
	<p align="center">
	<strong>High-performance columnar time-series database for quantitative finance</strong>
	</p>
	<p align="center">
	kdb+ functionality • Pythonic API • Zero-copy NumPy • SIMD-accelerated
	</p>
	</p>

	---

	WayyDB is a C++ time-series database with Python bindings, designed for quantitative research and trading systems. It provides kdb+-like temporal join operations with a modern, accessible API—no q language required.

	## Why WayyDB?

	\| Challenge \| WayyDB Solution \|
	\|-----------\|-----------------\|
	\| kdb+ costs $100K+/year \| Open source, free forever \|
	\| q language learning curve \| Pythonic API you already know \|
	\| Pandas and Polars offer as-of joins (`merge_asof`, `join_asof`) but no window join \| Native `aj()` and `wj()` primitives \|
	\| Memory copies kill performance \| Zero-copy NumPy via mmap \|
	\| Slow aggregations \| AVX2/AVX-512 SIMD acceleration \|

	## Features

	- As-of Join (aj) — For each trade, find the most recent quote. O(n log m) via binary search on sorted indices
	- Window Join (wj) — Get all quotes within a time window around each trade
	- Zero-copy NumPy — Columns are memory-mapped; `to_numpy()` returns views, not copies
	- SIMD Aggregations — Sum, avg, min, max accelerated with AVX2 intrinsics
	- Window Functions — Moving average, EMA, rolling std with O(n) complexity
	- Persistent Storage — Tables saved as memory-mapped files for instant loading

	## Installation

	```bash
	pip install wayy-db
	```

	Or build from source:

	```bash
	git clone https://github.com/wayy-research/wayydb.git
	cd wayydb
	pip install -e .
	```

	## Quick Start

	### Create Tables from NumPy Arrays

	```python
	import wayy_db as wdb
	import numpy as np

	# Create trades table
	trades = wdb.from_dict({
	"timestamp": np.array([1000, 2000, 3000, 4000, 5000], dtype=np.int64),
	"symbol": np.array([0, 1, 0, 1, 0], dtype=np.uint32), # AAPL=0, MSFT=1
	"price": np.array([150.25, 380.50, 151.00, 381.25, 152.00]),
	"size": np.array([100, 200, 150, 250, 100], dtype=np.int64),
	}, name="trades", sorted_by="timestamp")

	# Create quotes table
	quotes = wdb.from_dict({
	"timestamp": np.array([500, 900, 1500, 2500, 3500], dtype=np.int64),
	"symbol": np.array([0, 1, 0, 1, 0], dtype=np.uint32),
	"bid": np.array([149.50, 379.50, 150.50, 380.50, 151.50]),
	"ask": np.array([150.00, 380.00, 151.00, 381.00, 152.00]),
	}, name="quotes", sorted_by="timestamp")
	```

	### As-of Join: Match Trades to Quotes

	```python
	# For each trade, get the most recent quote for that symbol
	result = wdb.ops.aj(trades, quotes, on=["symbol"], as_of="timestamp")

	# Result contains trade columns + quote columns (bid, ask)
	print(result["bid"].to_numpy()) # [149.5, 379.5, 150.5, 380.5, 151.5]
	```
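
To check the semantics of the as-of join by hand, the same result can be reproduced in plain Python. This is a reference sketch only, not WayyDB's implementation: the helper name `asof_join_reference` is hypothetical, and it assumes that a quote whose timestamp equals the trade's timestamp counts as a match.

```python
from bisect import bisect_right
from collections import defaultdict

def asof_join_reference(left_ts, left_key, right_ts, right_key, right_val):
    """For each left row, return the latest right value with the same key
    and a timestamp <= the left timestamp (None when there is no match).

    Hypothetical helper for illustration; right_ts must be sorted ascending.
    """
    # Partition right rows by key; each partition stays sorted by time.
    ts_by_key = defaultdict(list)
    val_by_key = defaultdict(list)
    for ts, key, val in zip(right_ts, right_key, right_val):
        ts_by_key[key].append(ts)
        val_by_key[key].append(val)

    out = []
    for ts, key in zip(left_ts, left_key):
        # Binary search inside the key's partition only.
        pos = bisect_right(ts_by_key[key], ts)
        out.append(val_by_key[key][pos - 1] if pos > 0 else None)
    return out

# Same data as the trades and quotes tables above.
trade_ts = [1000, 2000, 3000, 4000, 5000]
trade_sym = [0, 1, 0, 1, 0]
quote_ts = [500, 900, 1500, 2500, 3500]
quote_sym = [0, 1, 0, 1, 0]
quote_bid = [149.50, 379.50, 150.50, 380.50, 151.50]

bids = asof_join_reference(trade_ts, trade_sym, quote_ts, quote_sym, quote_bid)
print(bids)  # [149.5, 379.5, 150.5, 380.5, 151.5]
```

Searching within one key's partition rather than the whole right table is what turns an O(log m) lookup into O(log(m/k)) when the m right rows are spread evenly over k keys.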

	### Aggregations and Window Functions

	```python
	# SIMD-accelerated aggregations
	total_volume = wdb.ops.sum(trades["size"])
	avg_price = wdb.ops.avg(trades["price"])
	price_std = wdb.ops.std(trades["price"])

	# Window functions
	mavg_20 = wdb.ops.mavg(trades["price"], window=20)
	ema = wdb.ops.ema(trades["price"], alpha=0.1)
	rolling_std = wdb.ops.mstd(trades["price"], window=10)

	# Returns and changes
	returns = wdb.ops.pct_change(trades["price"])
	price_diff = wdb.ops.diff(trades["price"])
	```
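
The window functions can be understood as single-pass recurrences. The sketch below shows one way to compute a moving average and an EMA in plain Python. The function names are hypothetical, and the edge handling (partial windows at the start, EMA seeded with the first value) is an assumption that may differ from what `wdb.ops.mavg` and `wdb.ops.ema` actually do.

```python
def moving_average(values, window):
    """Simple moving average in one pass using a running sum.

    Assumption: positions with fewer than `window` observations average
    over the values seen so far.
    """
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

def exponential_moving_average(values, alpha):
    """EMA via ema[t] = alpha * x[t] + (1 - alpha) * ema[t-1].

    Assumption: the series is seeded with the first observation.
    """
    out = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out

prices = [150.25, 380.50, 151.00, 381.25, 152.00]
print(moving_average(prices, window=2))
print(exponential_moving_average(prices, alpha=0.1))
```

Each output element costs a constant number of operations, so the whole pass is O(n) regardless of the window size.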

	### Persistent Database

	```python
	# Create persistent database
	db = wdb.Database("/data/markets")

	# Add table (automatically saved)
	db.add_table(trades)

	# Later: reload with zero-copy mmap
	db2 = wdb.Database("/data/markets")
	trades = db2["trades"] # Instant load via memory mapping

	# Access data without copying
	prices = trades["price"].to_numpy() # Zero-copy view into mmap'd file
	```
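
The idea behind zero-copy access can be illustrated with the Python standard library alone. The sketch below maps a flat file of float64 values and reads and writes it through a typed view without copying bytes into a separate buffer. The file name and the raw float64 layout are assumptions for illustration; this README does not describe WayyDB's on-disk format.

```python
import mmap
import os
import tempfile
from array import array

# Write a small column of float64 values to a flat binary file.
path = os.path.join(tempfile.mkdtemp(), "price.col")  # hypothetical file name
with open(path, "wb") as f:
    array("d", [150.25, 380.50, 151.00]).tofile(f)

with open(path, "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0)  # map the whole file
    raw = memoryview(mapped)           # byte view over the mapping
    column = raw.cast("d")             # typed float64 view, no bytes copied

    first = column[0]
    column[1] = 381.00                 # the write goes to the mapped pages
    mapped.flush()

    # Views must be released before the mapping can be closed.
    column.release()
    raw.release()
    mapped.close()

# Reopen the file to confirm the in-place update was persisted.
with open(path, "rb") as f:
    reloaded = array("d")
    reloaded.fromfile(f, 3)
print(first, list(reloaded))  # 150.25 [150.25, 381.0, 151.0]
```

Because the operating system pages data in on demand, opening the mapping takes roughly the same time whatever the file size, which is the property the "instant load" comment above relies on.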

	### Pandas/Polars Interop

	```python
	import pandas as pd
	import polars as pl

	# From pandas
	df = pd.DataFrame({"timestamp": [...], "price": [...]})
	table = wdb.from_pandas(df, name="from_pandas", sorted_by="timestamp")

	# From polars
	df = pl.DataFrame({"timestamp": [...], "price": [...]})
	table = wdb.from_polars(df, name="from_polars", sorted_by="timestamp")

	# To dict (for conversion back)
	data = table.to_dict() # {"timestamp": np.array, "price": np.array, ...}
	```

	## API Reference

	### Core Classes

	\| Class \| Description \|
	\|-------\|-------------\|
	\| `Database(path="")` \| Container for tables. Empty path = in-memory \|
	\| `Table(name="")` \| Columnar table with optional sorted index \|
	\| `Column` \| Typed column with zero-copy NumPy access \|

	### Table Methods

	```python
	table.num_rows # Number of rows
	table.num_columns # Number of columns
	table.column_names() # List of column names
	table.sorted_by # Column used for temporal ordering (or None)
	table["col"] # Get column by name
	table.to_dict() # Export as {name: np.array} dict
	table.save(path) # Save to directory
	Table.load(path) # Load from directory (copies data)
	Table.mmap(path) # Memory-map from directory (zero-copy)
	```

	### Operations (wayy_db.ops)

	#### Aggregations
	\| Function \| Description \|
	\|----------\|-------------\|
	\| `sum(col)` \| Sum of values (SIMD) \|
	\| `avg(col)` \| Mean of values \|
	\| `min(col)` \| Minimum value \|
	\| `max(col)` \| Maximum value \|
	\| `std(col)` \| Standard deviation \|

	#### Temporal Joins
	\| Function \| Description \|
	\|----------\|-------------\|
	\| `aj(left, right, on, as_of)` \| As-of join: most recent right row for each left row \|
	\| `wj(left, right, on, as_of, before, after)` \| Window join: all right rows within time window \|
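
As a rough illustration of the window join, the sketch below collects, for each left timestamp, every right row inside a time window around it. The helper name is hypothetical, key matching (the `on` argument) is left out for brevity, and treating the window as closed on both ends is an assumption, not documented WayyDB behavior.

```python
from bisect import bisect_left, bisect_right

def window_join_reference(left_ts, right_ts, right_val, before, after):
    """For each left timestamp t, collect right values whose timestamps fall
    in the closed window [t - before, t + after].

    Hypothetical helper; right_ts must be sorted ascending.
    """
    out = []
    for t in left_ts:
        lo = bisect_left(right_ts, t - before)   # first row inside the window
        hi = bisect_right(right_ts, t + after)   # one past the last row inside
        out.append(right_val[lo:hi])
    return out

quote_ts = [500, 900, 1500, 2500, 3500]
quote_bid = [149.50, 379.50, 150.50, 380.50, 151.50]
windows = window_join_reference(
    [1000, 3000], quote_ts, quote_bid, before=600, after=500
)
print(windows)  # [[149.5, 379.5, 150.5], [380.5, 151.5]]
```

Two binary searches per left row plus the cost of emitting the matches gives the O(n log m + matches) shape.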

	#### Window Functions
	\| Function \| Description \|
	\|----------\|-------------\|
	\| `mavg(col, window)` \| Moving average \|
	\| `msum(col, window)` \| Moving sum \|
	\| `mstd(col, window)` \| Moving standard deviation \|
	\| `mmin(col, window)` \| Moving minimum (O(n) via monotonic deque) \|
	\| `mmax(col, window)` \| Moving maximum (O(n) via monotonic deque) \|
	\| `ema(col, alpha)` \| Exponential moving average \|
	\| `diff(col, periods=1)` \| Difference from n periods ago \|
	\| `pct_change(col, periods=1)` \| Percent change from n periods ago \|
	\| `shift(col, n)` \| Shift values by n positions \|
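
The monotonic deque technique named for `mmin` and `mmax` can be sketched in a few lines of plain Python. This is the generic textbook algorithm, not WayyDB's source. The function name is hypothetical, and letting early positions use a partial window is an assumption.

```python
from collections import deque

def moving_max(values, window):
    """Moving maximum over the trailing `window` values in O(n) total time.

    Assumption: early positions use however many values are available.
    """
    out = []
    candidates = deque()  # indices whose values are strictly decreasing
    for i, v in enumerate(values):
        # Values behind v that are <= v can never be a maximum again.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it falls out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        out.append(values[candidates[0]])
    return out

print(moving_max([3, 1, 4, 1, 5, 9, 2, 6], window=3))
# [3, 3, 4, 4, 5, 9, 9, 9]
```

Every index is pushed and popped at most once, so the total work is linear even though a single step may pop several entries. A moving minimum follows by flipping the comparison.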

	## Type System

	\| Type \| Python \| C++ \| Size \| Use Case \|
	\|------\|--------\|-----\|------\|----------\|
	\| Int64 \| `np.int64` \| `int64_t` \| 8B \| Quantities, IDs \|
	\| Float64 \| `np.float64` \| `double` \| 8B \| Prices, returns \|
	\| Timestamp \| `np.int64` \| `int64_t` \| 8B \| Nanoseconds since epoch \|
	\| Symbol \| `np.uint32` \| `uint32_t` \| 4B \| Interned strings (tickers) \|
	\| Bool \| `np.uint8` \| `uint8_t` \| 1B \| Flags \|
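
Since Symbol columns hold integer codes rather than strings, callers need some mapping between tickers and codes. A minimal sketch of such an interning table is shown below. The `SymbolTable` class is hypothetical and kept in user code, since this README does not show an interning API.

```python
class SymbolTable:
    """Maps ticker strings to dense integer codes and back.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self._code_of = {}   # ticker -> code
        self._name_of = []   # code -> ticker

    def intern(self, name):
        code = self._code_of.get(name)
        if code is None:
            code = len(self._name_of)  # next unused code
            self._code_of[name] = code
            self._name_of.append(name)
        return code

    def lookup(self, code):
        return self._name_of[code]

symbols = SymbolTable()
codes = [symbols.intern(t) for t in ["AAPL", "MSFT", "AAPL", "MSFT", "AAPL"]]
print(codes)              # [0, 1, 0, 1, 0]
print(symbols.lookup(1))  # MSFT
```

The resulting codes can be wrapped as `np.array(codes, dtype=np.uint32)` to match the Symbol column type used in the Quick Start tables.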

	## Architecture

	```
	┌─────────────────────────────────────────────────────────────┐
	│ Python Interface │
	│ wayy_db.Database \| Table \| Column \| ops │
	├─────────────────────────────────────────────────────────────┤
	│ pybind11 Bindings │
	│ Zero-copy NumPy arrays via buffer protocol │
	├─────────────────────────────────────────────────────────────┤
	│ C++ Core Engine │
	│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
	│ │ Storage │ │ Compute │ │ Joins │ │
	│ │ • mmap I/O │ │ • AVX2 agg │ │ • O(n log m) aj │ │
	│ │ • columnar │ │ • windows │ │ • O(n) wj │ │
	│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
	├─────────────────────────────────────────────────────────────┤
	│ Memory-Mapped File Storage │
	│ Zero-copy \| Lazy loading \| Shared │
	└─────────────────────────────────────────────────────────────┘
	```

	## Performance

	### Complexity

	\| Operation \| Complexity \| Notes \|
	\|-----------\|------------\|-------\|
	\| As-of join \| O(n log(m/k)) \| n=left rows, m=right rows, k=unique keys \|
	\| Window join \| O(n log m + matches) \| Plus output size \|
	\| Aggregations \| O(n) \| SIMD 4x speedup for sum \|
	\| Window functions \| O(n) \| Single pass with O(1) update \|
	\| Point lookup \| O(log n) \| Binary search on sorted index \|
	\| Load from disk \| O(1) \| Memory mapping, no deserialization \|
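
The O(log n) point lookup in the table is ordinary binary search over the sorted column. A minimal sketch using the standard library follows. The function name is hypothetical and it returns -1 for a missing timestamp purely as a convention for this example.

```python
from bisect import bisect_left

def point_lookup(sorted_ts, target):
    """Return the row index of `target` in a sorted timestamp column,
    or -1 when it is absent."""
    pos = bisect_left(sorted_ts, target)
    if pos < len(sorted_ts) and sorted_ts[pos] == target:
        return pos
    return -1

timestamps = [1000, 2000, 3000, 4000, 5000]
print(point_lookup(timestamps, 3000))  # 2
print(point_lookup(timestamps, 3500))  # -1
```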

	### Design Targets

	\| Metric \| Target \|
	\|--------\|--------\|
	\| As-of join (1M x 1M rows) \| < 150ms \|
	\| Simple aggregation (1B rows) \| < 80ms \|
	\| Binary size \| < 5 MB \|
	\| Memory overhead \| < 1% beyond data \|
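
To see what these targets imply in hardware terms, the arithmetic below converts them into memory bandwidth and per-row time. It assumes the aggregation scans a single column of 8-byte values exactly once; the variable names are illustrative.

```python
BYTES_PER_VALUE = 8  # Int64 and Float64 columns are 8 bytes wide

# Simple aggregation target: 1B rows in under 80 ms.
agg_rows = 1_000_000_000
agg_seconds = 0.080
agg_bandwidth_gb_s = agg_rows * BYTES_PER_VALUE / agg_seconds / 1e9

# As-of join target: 1M left rows against 1M right rows in under 150 ms.
join_rows = 1_000_000
join_seconds = 0.150
ns_per_left_row = join_seconds / join_rows * 1e9

print(f"{agg_bandwidth_gb_s:.0f} GB/s")          # 100 GB/s
print(f"{ns_per_left_row:.0f} ns per left row")  # 150 ns per left row
```

Under that assumption, the aggregation target corresponds to about 100 GB/s of sustained read bandwidth, and the join target to about 150 ns per left row.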

	## Building from Source

	### Requirements

	- CMake >= 3.20
	- C++20 compiler (GCC 11+, Clang 14+, MSVC 2022+)
	- Python >= 3.9

	### Build

	```bash
	git clone https://github.com/wayy-research/wayydb.git
	cd wayydb

	# Option 1: pip install (recommended)
	pip install -e .

	# Option 2: CMake directly
	mkdir build && cd build
	cmake .. -DWAYY_BUILD_PYTHON=ON -DWAYY_BUILD_TESTS=ON
	make -j$(nproc)
	```

	### Run Tests

	```bash
	# C++ tests (31 tests)
	cd build && ctest --output-on-failure

	# Python tests (17 tests)
	PYTHONPATH=python pytest tests/python -v
	```

	## Comparison with Alternatives

	\| Feature \| WayyDB \| kdb+ \| DuckDB \| Polars \|
	\|---------\|--------\|------\|--------\|--------\|
	\| As-of join \| Native \| Native \| Native (`ASOF JOIN`) \| Native (`join_asof`) \|
	\| Window join \| Native \| Native \| None \| None \|
	\| Zero-copy Python \| Yes \| No \| No \| Limited \|
	\| Sorted index optimization \| Yes \| Yes \| No \| No \|
	\| License \| MIT \| Commercial \| MIT \| MIT \|
	\| Learning curve \| Low \| High (q) \| Low \| Low \|
	\| Persistence \| mmap \| Native \| Native \| None \|

	## Roadmap

	- [ ] String column type with dictionary encoding
	- [ ] LZ4 compression for columns
	- [ ] Parallel aggregations
	- [ ] More join types (inner, left, full)
	- [ ] Query optimizer
	- [ ] Streaming ingestion API

	## License

	MIT License - see [LICENSE](LICENSE) for details.

	## Contributing

	Contributions welcome! Please read our contributing guidelines and submit PRs to the `develop` branch.

	---

	<p align="center">
	Built with C++20 and Python by <a href="https://wayy.io">Wayy Research</a>
	</p>