
OMCHOKSI108/my-cloud-data-lake: Dataset Context & API Reference

Source: https://huggingface.co/datasets/OMCHOKSI108/my-cloud-data-lake
Format: Apache Parquet
Total Size: ~4.65 GB
Generated: February 2026


1. High-Level Summary

| Metric | Value |
| --- | --- |
| Total Parquet Files | 793 |
| Total Data Rows | 276,427,113 (~276 M) |
| Unique Instruments | 111 |
| Timeframes | 7 (1min, 5min, 15min, 30min, 1hr, 4hr, 1day) |
| Top-Level Folders | 3 (ALL_TIME_DATA, data, reports) |
| Column Schema (OHLCV) | ts, open, high, low, close, volume |
| Row Count Range | 106 to 9,823,126 per file |

2. Folder Structure

my-cloud-data-lake/                          (4.65 GB)
├── ALL_TIME_DATA/                           756 files │ 275,232,406 rows
│   ├── 1min_time/                           108 files │ 182,411,731 rows
│   ├── 5min_time/                           108 files │  51,658,834 rows
│   ├── 15min_time/                          108 files │  22,286,150 rows
│   ├── 30min_time/                          108 files │  11,251,532 rows
│   ├── 1hr_time/                            108 files │   5,739,404 rows
│   ├── 4hr_time/                            108 files │   1,526,167 rows
│   └── 1day_time/                           108 files │     358,588 rows
├── data/                                     35 files │   1,185,910 rows
│   ├── 1min_time/                             5 files │     500,093 rows
│   ├── 5min_time/                             5 files │     422,691 rows
│   ├── 15min_time/                            5 files │     144,273 rows
│   ├── 30min_time/                            5 files │      72,193 rows
│   ├── 1hr_time/                              5 files │      36,099 rows
│   ├── 4hr_time/                              5 files │       9,039 rows
│   └── 1day_time/                             5 files │       1,522 rows
├── reports/                                   2 files │       8,797 rows
│   ├── portfolio_equity.parquet               1 file  │       5,000 rows
│   └── portfolio_trades.parquet               1 file  │       3,797 rows
├── .gitattributes
└── README.md
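As a quick consistency check, the per-folder figures in the tree can be summed and compared with the totals in the summary table. The sketch below only re-adds the numbers printed above; it does not read the dataset, and the variable and function names are illustrative.

```python
# (files, rows) per subfolder, copied from the folder tree above.
ALL_TIME_DATA = {
    "1min_time": (108, 182_411_731),
    "5min_time": (108, 51_658_834),
    "15min_time": (108, 22_286_150),
    "30min_time": (108, 11_251_532),
    "1hr_time": (108, 5_739_404),
    "4hr_time": (108, 1_526_167),
    "1day_time": (108, 358_588),
}
DATA = {
    "1min_time": (5, 500_093),
    "5min_time": (5, 422_691),
    "15min_time": (5, 144_273),
    "30min_time": (5, 72_193),
    "1hr_time": (5, 36_099),
    "4hr_time": (5, 9_039),
    "1day_time": (5, 1_522),
}
REPORTS = {
    "portfolio_equity.parquet": (1, 5_000),
    "portfolio_trades.parquet": (1, 3_797),
}


def totals(folder):
    """Return (total_files, total_rows) for one top-level folder."""
    files = sum(f for f, _ in folder.values())
    rows = sum(r for _, r in folder.values())
    return files, rows


all_time = totals(ALL_TIME_DATA)
data = totals(DATA)
reports = totals(REPORTS)
grand_files = all_time[0] + data[0] + reports[0]
grand_rows = all_time[1] + data[1] + reports[1]

print("ALL_TIME_DATA:", all_time)
print("data:", data)
print("reports:", reports)
print("grand total:", grand_files, "files,", grand_rows, "rows")
```

The subtotals agree with the tree, and the grand totals agree with the 793 files and 276,427,113 rows reported in the summary.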

3. Data Schema

3.1 OHLCV Files (791 files: ALL_TIME_DATA + data)

Every OHLCV parquet file has 6 columns with a uniform schema:

| Column | Type | Description |
| --- | --- | --- |
| ts | datetime/string | Timestamp of the candle bar |
| open | float | Opening price |
| high | float | Highest price in the period |
| low | float | Lowest price in the period |
| close | float | Closing price |
| volume | int/float | Trade volume during the period |

Note: Some files in data/ have 7 columns. The extra first column appears to be an artifact of a raw TSV conversion, in which a malformed header row was baked into the schema. The six core OHLCV columns are unchanged.
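One way to handle the 7-column files is to select the six OHLCV columns by name and ignore anything else. The helper below is a minimal sketch that works on a list of column names; it assumes the six columns keep exactly the names listed in the schema table, which has not been verified for every file. The function name `split_columns` is hypothetical.

```python
OHLCV_COLUMNS = ["ts", "open", "high", "low", "close", "volume"]


def split_columns(columns):
    """Split a file's column names into (ohlcv, extras).

    `ohlcv` is always in canonical order. Raises ValueError if one of the
    six expected columns is missing, so a badly converted file fails loudly
    instead of being silently truncated.
    """
    present = set(columns)
    missing = [c for c in OHLCV_COLUMNS if c not in present]
    if missing:
        raise ValueError(f"missing OHLCV columns: {missing}")
    extras = [c for c in columns if c not in OHLCV_COLUMNS]
    return list(OHLCV_COLUMNS), extras


# Example with a made-up artifact column name in first position.
keep, extras = split_columns(
    ["artifact_col", "ts", "open", "high", "low", "close", "volume"]
)
print(keep)    # the six OHLCV names in canonical order
print(extras)  # whatever else the file carried
```

With pandas, `df[keep]` would then give a frame with the uniform six-column layout regardless of which folder the file came from.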

3.2 Reports Files

reports/portfolio_equity.parquet (5,000 rows, 2 columns):

| Column | Description |
| --- | --- |
| step | Simulation/backtest step index |
| equity | Portfolio equity value at that step |

reports/portfolio_trades.parquet (3,797 rows, 8 columns):

| Column | Description |
| --- | --- |
| pair | Trading pair/instrument |
| type | Order type |
| side | Buy or Sell |
| price | Execution price |
| size | Position size |
| time | Trade timestamp |
| score | Signal/strategy score |
| pnl | Profit & Loss |

4. Instruments Catalog (111 Unique)

4.1 Forex Pairs (28)

Major, minor, and cross pairs:

| Pair | Pair | Pair | Pair |
| --- | --- | --- | --- |
| AUDCAD# | AUDCHF# | AUDJPY# | AUDNZD# |
| AUDUSD# | CADCHF# | CADJPY# | CHFJPY# |
| EURAUD# | EURCAD# | EURCHF# | EURGBP# |
| EURJPY# | EURNZD# | EURUSD# | GBPAUD# |
| GBPCAD# | GBPCHF# | GBPJPY# | GBPNZD# |
| GBPUSD# | NZDCAD# | NZDCHF# | NZDJPY# |
| NZDUSD# | USDCAD# | USDCHF# | USDJPY# |

Additional forex in data/ folder without # suffix: EURUSD, GBPUSD, USDJPY

4.2 Cryptocurrencies (58)

| Symbol | Symbol | Symbol | Symbol |
| --- | --- | --- | --- |
| 1INCHUSD# | AAVEUSD# | ADAUSD# | ALGOUSD# |
| APEUSD# | APTUSD# | ARBUSD# | ATOMUSD# |
| AVAXUSD# | AXSUSD# | BATUSD# | BCHUSD# |
| BTCEUR# | BTCGBP# | BTCJPY# | BTCUSD# |
| BTGUSD# | CHZUSD# | COMPUSD# | CRVUSD# |
| DASHUSD# | DOGEUSD# | DOTUSD# | EGLDUSD# |
| ENJUSD# | ETCUSD# | ETHBTC# | ETHEUR# |
| ETHGBP# | ETHUSD# | FILUSD# | FLOWUSD# |
| GRTUSD# | ICPUSD# | IMXUSD# | LDOUSD# |
| LINKUSD# | LRCUSD# | LTCUSD# | MANAUSD# |
| MATICUSD# | NEARUSD# | OPUSD# | SANDUSD# |
| SHIBUSD# | SNXUSD# | SOLUSD# | STORJUSD# |
| STXUSD# | SUSHIUSD# | UMAUSD# | UNIUSD# |
| VAULTAUSD# | XLMUSD# | XRPUSD# | XTZUSD# |
| ZECUSD# | ZRXUSD# | | |

4.3 Commodities / Precious Metals (9)

| Symbol | Description |
| --- | --- |
| GOLD.i# | Gold (USD) |
| SILVER.i# | Silver (USD) |
| XAUCNH.i# | Gold / Chinese Yuan |
| XAUEUR.i# | Gold / Euro |
| XAUJPY.i# | Gold / Japanese Yen |
| GAUCNH.i# | Gold (alternate CNH) |
| GAUUSD.i# | Gold (alternate USD) |
| XPDUSD.i# | Palladium / USD |
| XPTUSD.i# | Platinum / USD |

4.4 Stocks / Equities (13)

| Symbol | Company |
| --- | --- |
| Amazon | Amazon.com Inc. |
| BancoBradesco | Banco Bradesco S.A. |
| DraftKings | DraftKings Inc. |
| Ford | Ford Motor Company |
| Gerdau | Gerdau S.A. |
| Intel | Intel Corporation |
| Nu Holdings | Nu Holdings Ltd. |
| Nvidia | NVIDIA Corporation |
| Pinterest | Pinterest Inc. |
| PlugPower | Plug Power Inc. |
| Rivian | Rivian Automotive Inc. |
| Tesla | Tesla Inc. |
| Transocean | Transocean Ltd. |

5. Timeframes

| Timeframe | Label in Path | Files (ALL_TIME_DATA) | Rows (ALL_TIME_DATA) | Typical Row Count per Instrument |
| --- | --- | --- | --- | --- |
| 1 Minute | 1min_time | 108 | 182,411,731 | 42K to 9.8M |
| 5 Minutes | 5min_time | 108 | 51,658,834 | 8K to 2.0M |
| 15 Minutes | 15min_time | 108 | 22,286,150 | 9K to 679K |
| 30 Minutes | 30min_time | 108 | 11,251,532 | 4K to 343K |
| 1 Hour | 1hr_time | 108 | 5,739,404 | 2K to 175K |
| 4 Hours | 4hr_time | 108 | 1,526,167 | 628 to 49K |
| 1 Day | 1day_time | 108 | 358,588 | 106 to 14K |
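To see how the archive's rows are distributed across timeframes, the row counts in the table can be converted to shares. This is a small arithmetic sketch over the ALL_TIME_DATA figures only; the dictionary name is illustrative.

```python
# Rows per timeframe in ALL_TIME_DATA, copied from the table above.
ROWS_BY_TIMEFRAME = {
    "1min": 182_411_731,
    "5min": 51_658_834,
    "15min": 22_286_150,
    "30min": 11_251_532,
    "1hr": 5_739_404,
    "4hr": 1_526_167,
    "1day": 358_588,
}

total = sum(ROWS_BY_TIMEFRAME.values())
shares = {tf: rows / total for tf, rows in ROWS_BY_TIMEFRAME.items()}

for tf, share in shares.items():
    print(f"{tf:>5}: {share:6.2%}")
```

The 1-minute bars account for roughly two thirds of the archive, so storage and query cost are dominated by that one timeframe.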

6. Data Partitioning: ALL_TIME_DATA vs data

| Property | ALL_TIME_DATA/ | data/ |
| --- | --- | --- |
| Purpose | Full historical archive | Recent/sampled subset |
| File Count | 756 | 35 |
| Row Count | 275,232,406 | 1,185,910 |
| Instruments | 108 (all) | 5 (BTCUSD#, ETHUSD#, EURUSD, GBPUSD, USDJPY) |
| Timeframes | All 7 | All 7 |
| Schema Notes | Clean 6-col OHLCV | Some files have 7 cols (legacy header artifact) |

7. File Naming Convention

{InstrumentSymbol}_{Timeframe}.parquet

Examples:

  • BTCUSD#_1min.parquet → Bitcoin/USD, 1-minute bars
  • EURUSD#_1day.parquet → EUR/USD, daily bars
  • Tesla_15min.parquet → Tesla stock, 15-minute bars
  • GOLD.i#_4hr.parquet → Gold, 4-hour bars

Symbol suffix meanings:

  • # → CFD/derivative instrument
  • .i# → Index/commodity CFD
  • No suffix → Spot or direct instrument
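The naming convention and suffix rules above can be captured in a few small helpers. This is a sketch, not part of the dataset tooling: the function names are hypothetical, and it assumes the `{timeframe}_time/` subfolder layout shown in the folder structure applies to both ALL_TIME_DATA and data.

```python
TIMEFRAMES = ("1min", "5min", "15min", "30min", "1hr", "4hr", "1day")


def parquet_path(symbol, timeframe, folder="ALL_TIME_DATA"):
    """Build the in-repo path for one instrument/timeframe file."""
    if timeframe not in TIMEFRAMES:
        raise ValueError(f"unknown timeframe: {timeframe!r}")
    return f"{folder}/{timeframe}_time/{symbol}_{timeframe}.parquet"


def parse_filename(filename):
    """Split '{InstrumentSymbol}_{Timeframe}.parquet' into (symbol, timeframe)."""
    if not filename.endswith(".parquet"):
        raise ValueError(f"not a parquet file: {filename!r}")
    stem = filename[: -len(".parquet")]
    # Split on the last underscore, so a symbol containing '_' stays intact.
    symbol, _, timeframe = stem.rpartition("_")
    if not symbol or timeframe not in TIMEFRAMES:
        raise ValueError(f"unexpected file name: {filename!r}")
    return symbol, timeframe


def suffix_kind(symbol):
    """Classify a symbol by its suffix; '.i#' must be checked before '#'."""
    if symbol.endswith(".i#"):
        return "index/commodity CFD"
    if symbol.endswith("#"):
        return "CFD/derivative"
    return "spot or direct"


print(parquet_path("BTCUSD#", "1hr"))
print(parse_filename("GOLD.i#_4hr.parquet"))
print(suffix_kind("Tesla"))
```

Checking the longer `.i#` suffix first matters, because every `.i#` symbol also ends in `#`.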

8. API Design Reference

8.1 Recommended API Endpoints

GET /api/v1/instruments
    → List all 111 available instruments with metadata

GET /api/v1/instruments/{symbol}
    → Instrument details (asset class, available timeframes, row counts)

GET /api/v1/ohlcv/{symbol}
    ?timeframe=1min|5min|15min|30min|1hr|4hr|1day
    &start=2024-01-01T00:00:00Z
    &end=2025-12-31T23:59:59Z
    &limit=1000
    &offset=0
    → OHLCV candle data with pagination

GET /api/v1/reports/equity
    → Portfolio equity curve (5,000 steps)

GET /api/v1/reports/trades
    ?pair=BTCUSD
    &side=buy|sell
    → Portfolio trade history (3,797 trades)

GET /api/v1/metadata
    → Dataset-level metadata (total files, rows, timeframes, etc.)

GET /api/v1/search
    ?q=BTC&asset_class=crypto
    → Search instruments by name/class
8.2 Data Access Pattern (HuggingFace)

# Direct parquet read from HuggingFace
from huggingface_hub import hf_hub_url
import pandas as pd

url = hf_hub_url(
    repo_id="OMCHOKSI108/my-cloud-data-lake",
    filename="ALL_TIME_DATA/1hr_time/BTCUSD#_1hr.parquet",
    repo_type="dataset"
)
df = pd.read_parquet(url)

8.3 Query Parameters for API

| Parameter | Type | Description |
| --- | --- | --- |
| symbol | string | Instrument symbol (e.g., BTCUSD#, Tesla) |
| timeframe | enum | 1min, 5min, 15min, 30min, 1hr, 4hr, 1day |
| start | ISO datetime | Start of date range filter |
| end | ISO datetime | End of date range filter |
| limit | int | Max rows returned (default: 1000, max: 10000) |
| offset | int | Pagination offset |
| source | enum | all_time or recent (maps to folder) |
| format | enum | json, csv, parquet |
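The parameter table translates fairly directly into a validated query object. The following is a minimal sketch using a standard-library dataclass. Several choices are assumptions not stated in the table: the defaults for `source` and `format`, rejecting (rather than clamping) an out-of-range `limit`, and the names `OhlcvQuery` and `parse_iso`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

TIMEFRAMES = ("1min", "5min", "15min", "30min", "1hr", "4hr", "1day")
SOURCE_FOLDERS = {"all_time": "ALL_TIME_DATA", "recent": "data"}
FORMATS = ("json", "csv", "parquet")
DEFAULT_LIMIT = 1000
MAX_LIMIT = 10000


def parse_iso(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat() only accepts 'Z' from Python 3.11 onwards.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass(frozen=True)
class OhlcvQuery:
    symbol: str
    timeframe: str
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    limit: int = DEFAULT_LIMIT
    offset: int = 0
    source: str = "all_time"  # assumed default
    format: str = "json"      # assumed default

    def __post_init__(self):
        if self.timeframe not in TIMEFRAMES:
            raise ValueError(f"timeframe must be one of {TIMEFRAMES}")
        if self.source not in SOURCE_FOLDERS:
            raise ValueError(f"source must be one of {tuple(SOURCE_FOLDERS)}")
        if self.format not in FORMATS:
            raise ValueError(f"format must be one of {FORMATS}")
        if not 1 <= self.limit <= MAX_LIMIT:
            raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
        if self.offset < 0:
            raise ValueError("offset must be non-negative")
        if self.start and self.end and self.start > self.end:
            raise ValueError("start must not be after end")

    @property
    def folder(self):
        """Top-level dataset folder that `source` maps to."""
        return SOURCE_FOLDERS[self.source]


query = OhlcvQuery(
    symbol="BTCUSD#",
    timeframe="1hr",
    start=parse_iso("2024-01-01T00:00:00Z"),
    end=parse_iso("2025-12-31T23:59:59Z"),
)
print(query.folder, query.limit, query.format)
```

Validating once at construction keeps the downstream file lookup and filtering code free of repeated parameter checks.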

8.4 Response Schema

{
  "symbol": "BTCUSD#",
  "timeframe": "1hr",
  "total_rows": 66829,
  "returned": 1000,
  "data": [
    {
      "ts": "2024-01-01T00:00:00Z",
      "open": 42150.50,
      "high": 42280.00,
      "low": 42100.00,
      "close": 42230.75,
      "volume": 1234
    }
  ]
}
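A response in this shape could be assembled from an already filtered list of rows as sketched below. It assumes `total_rows` counts the rows matching the query before pagination and `returned` counts the rows in the current page, which is how the sample reads but is not stated explicitly. The `build_response` name and the sample rows are made up for illustration.

```python
import json


def build_response(symbol, timeframe, rows, limit=1000, offset=0):
    """Paginate `rows` and wrap the page in the response envelope."""
    page = rows[offset : offset + limit]
    return {
        "symbol": symbol,
        "timeframe": timeframe,
        "total_rows": len(rows),  # matches before pagination (assumed meaning)
        "returned": len(page),    # rows actually included in this page
        "data": page,
    }


# Placeholder rows, not real market data.
rows = [
    {
        "ts": f"2024-01-01T{hour:02d}:00:00Z",
        "open": 1.0,
        "high": 2.0,
        "low": 0.5,
        "close": 1.5,
        "volume": 10,
    }
    for hour in range(5)
]

first_page = build_response("BTCUSD#", "1hr", rows, limit=2, offset=0)
last_page = build_response("BTCUSD#", "1hr", rows, limit=2, offset=4)
print(json.dumps(first_page, indent=2))
```

A client can keep requesting pages, advancing `offset` by `returned`, until `offset` reaches `total_rows`.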

9. Asset Classification Map

Use this mapping to categorize instruments in the API:

{
  "forex_major": ["EURUSD#", "GBPUSD#", "USDJPY#", "USDCHF#", "AUDUSD#", "NZDUSD#", "USDCAD#"],
  "forex_cross": ["EURJPY#", "GBPJPY#", "EURGBP#", "AUDCAD#", "AUDCHF#", "AUDJPY#", "AUDNZD#", "CADCHF#", "CADJPY#", "CHFJPY#", "EURAUD#", "EURCAD#", "EURCHF#", "EURNZD#", "GBPAUD#", "GBPCAD#", "GBPCHF#", "GBPNZD#", "NZDCAD#", "NZDCHF#", "NZDJPY#"],
  "crypto_major": ["BTCUSD#", "ETHUSD#", "LTCUSD#", "XRPUSD#", "BCHUSD#"],
  "crypto_alt": ["ADAUSD#", "SOLUSD#", "DOTUSD#", "LINKUSD#", "AVAXUSD#", "DOGEUSD#", "SHIBUSD#", "MATICUSD#", "UNIUSD#", "AAVEUSD#", "...and 40+ more"],
  "crypto_cross": ["BTCEUR#", "BTCGBP#", "BTCJPY#", "ETHBTC#", "ETHEUR#", "ETHGBP#"],
  "commodities": ["GOLD.i#", "SILVER.i#", "XPDUSD.i#", "XPTUSD.i#", "XAUEUR.i#", "XAUJPY.i#", "XAUCNH.i#", "GAUCNH.i#", "GAUUSD.i#"],
  "stocks": ["Amazon", "Tesla", "Nvidia", "Intel", "Ford", "Rivian", "Pinterest", "PlugPower", "DraftKings", "Nu Holdings", "Gerdau", "BancoBradesco", "Transocean"]
}
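For lookups in the API, the map is easier to use when inverted into a symbol-to-class index. The sketch below copies the lists above but leaves out the "...and 40+ more" placeholder, so crypto symbols beyond the ten listed under `crypto_alt` fall back to `"unclassified"` here; that fallback label and the function name are assumptions.

```python
ASSET_CLASSES = {
    "forex_major": "EURUSD# GBPUSD# USDJPY# USDCHF# AUDUSD# NZDUSD# USDCAD#".split(),
    "forex_cross": (
        "EURJPY# GBPJPY# EURGBP# AUDCAD# AUDCHF# AUDJPY# AUDNZD# "
        "CADCHF# CADJPY# CHFJPY# EURAUD# EURCAD# EURCHF# EURNZD# "
        "GBPAUD# GBPCAD# GBPCHF# GBPNZD# NZDCAD# NZDCHF# NZDJPY#"
    ).split(),
    "crypto_major": "BTCUSD# ETHUSD# LTCUSD# XRPUSD# BCHUSD#".split(),
    # Only the ten symbols listed explicitly in the map; the rest are elided there.
    "crypto_alt": (
        "ADAUSD# SOLUSD# DOTUSD# LINKUSD# AVAXUSD# "
        "DOGEUSD# SHIBUSD# MATICUSD# UNIUSD# AAVEUSD#"
    ).split(),
    "crypto_cross": "BTCEUR# BTCGBP# BTCJPY# ETHBTC# ETHEUR# ETHGBP#".split(),
    "commodities": (
        "GOLD.i# SILVER.i# XPDUSD.i# XPTUSD.i# XAUEUR.i# "
        "XAUJPY.i# XAUCNH.i# GAUCNH.i# GAUUSD.i#"
    ).split(),
    # Written as a list because "Nu Holdings" contains a space.
    "stocks": [
        "Amazon", "Tesla", "Nvidia", "Intel", "Ford", "Rivian", "Pinterest",
        "PlugPower", "DraftKings", "Nu Holdings", "Gerdau", "BancoBradesco",
        "Transocean",
    ],
}

# Invert the map: one dictionary lookup per symbol instead of scanning lists.
SYMBOL_TO_CLASS = {
    symbol: asset_class
    for asset_class, symbols in ASSET_CLASSES.items()
    for symbol in symbols
}


def classify(symbol):
    """Return the asset class for a symbol, or 'unclassified' if not mapped."""
    return SYMBOL_TO_CLASS.get(symbol, "unclassified")


print(classify("BTCUSD#"), classify("Nu Holdings"), classify("ZRXUSD#"))
```

Building the index once at startup also makes it easy to detect a symbol listed under two classes, since the index would then be smaller than the combined list lengths.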

10. Data Volume by Asset Class (ALL_TIME_DATA)

| Asset Class | Instruments | Files | Est. Rows |
| --- | --- | --- | --- |
| Forex | ~28 | 196 | ~80M+ |
| Crypto | ~58 | 406 | ~150M+ |
| Commodities | ~9 | 63 | ~20M+ |
| Stocks | ~13 | 91 | ~25M+ |
| Reports | 2 | 2 | 8,797 |

11. Key Observations & Notes

  1. Uniform OHLCV schema across market data files, with no schema conflicts between asset classes (the only exception is the 7-column files in data/ noted in section 3.1)
  2. Highest granularity data is in 1-minute bars, accounting for 66% of all rows (182M rows)
  3. Longest history instruments: Major forex pairs (EURUSD, GBPUSD, USDJPY, USDCHF) have up to 14K daily bars (55+ years of data) and 9.8M 1-minute bars
  4. Shortest history instruments: GAUCNH.i#, GAUUSD.i#, XAUCNH.i#, XAUJPY.i# have only ~106 daily bars
  5. data/ folder contains a focused subset of 5 key instruments (BTCUSD#, ETHUSD#, EURUSD, GBPUSD, USDJPY), likely used for development/testing
  6. reports/ folder contains backtesting/simulation results: an equity curve and a trade log
  7. Data format is Apache Parquet: columnar, compressed, ideal for analytical queries
  8. No bid/ask spread data, only mid-price OHLCV
  9. No fundamental data, purely technical/price data

12. Potential Use Cases

  • Trading Strategy Backtesting: multi-asset, multi-timeframe
  • ML/AI Price Prediction Models: 276M row training dataset
  • Technical Analysis API: serve OHLCV with on-the-fly indicator calculation
  • Cross-Asset Correlation Analysis: forex, crypto, commodities, stocks in one lake
  • Portfolio Simulation: reports data already includes equity curves and trade logs
  • Real-Time Dashboard: serve historical + stream live data via WebSocket
  • Market Data Microservice: HuggingFace as cold storage, API serves hot queries via Redis/DuckDB cache