OMCHOKSI108/my-cloud-data-lake: Dataset Context & API Reference
Source: https://huggingface.co/datasets/OMCHOKSI108/my-cloud-data-lake
Format: Apache Parquet
Total Size: ~4.65 GB
Generated: February 2026
1. High-Level Summary
| Metric | Value |
|---|---|
| Total Parquet Files | 793 |
| Total Data Rows | 276,427,113 (~276 M) |
| Unique Instruments | 111 |
| Timeframes | 7 (1min, 5min, 15min, 30min, 1hr, 4hr, 1day) |
| Top-Level Folders | 3 (ALL_TIME_DATA, data, reports) |
| Column Schema (OHLCV) | ts, open, high, low, close, volume |
| Row Count Range | 106 to 9,823,126 per file |
2. Folder Structure
my-cloud-data-lake/ (4.65 GB)
├── ALL_TIME_DATA/                  756 files, 275,232,406 rows
│   ├── 1min_time/                  108 files, 182,411,731 rows
│   ├── 5min_time/                  108 files, 51,658,834 rows
│   ├── 15min_time/                 108 files, 22,286,150 rows
│   ├── 30min_time/                 108 files, 11,251,532 rows
│   ├── 1hr_time/                   108 files, 5,739,404 rows
│   ├── 4hr_time/                   108 files, 1,526,167 rows
│   └── 1day_time/                  108 files, 358,588 rows
├── data/                           35 files, 1,185,910 rows
│   ├── 1min_time/                  5 files, 500,093 rows
│   ├── 5min_time/                  5 files, 422,691 rows
│   ├── 15min_time/                 5 files, 144,273 rows
│   ├── 30min_time/                 5 files, 72,193 rows
│   ├── 1hr_time/                   5 files, 36,099 rows
│   ├── 4hr_time/                   5 files, 9,039 rows
│   └── 1day_time/                  5 files, 1,522 rows
├── reports/                        2 files, 8,797 rows
│   ├── portfolio_equity.parquet    1 file, 5,000 rows
│   └── portfolio_trades.parquet    1 file, 3,797 rows
├── .gitattributes
└── README.md
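The folder-level counts can be cross-checked against the dataset totals with simple arithmetic. The sketch below uses only figures quoted in this document (one file per instrument per timeframe, 108 instruments in ALL_TIME_DATA and 5 in data/); it does not read the dataset itself.

```python
TIMEFRAMES = 7

# File counts: one file per instrument per timeframe.
all_time_files = 108 * TIMEFRAMES   # ALL_TIME_DATA/
data_files = 5 * TIMEFRAMES         # data/
report_files = 2                    # reports/
total_files = all_time_files + data_files + report_files

# Row counts per timeframe folder, copied from the folder tree.
all_time_rows = sum([182_411_731, 51_658_834, 22_286_150, 11_251_532,
                     5_739_404, 1_526_167, 358_588])
data_rows = sum([500_093, 422_691, 144_273, 72_193, 36_099, 9_039, 1_522])
report_rows = 5_000 + 3_797
total_rows = all_time_rows + data_rows + report_rows

print(total_files, total_rows)
```

Both totals agree with the High-Level Summary (793 files, 276,427,113 rows).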
3. Data Schema
3.1 OHLCV Files (791 files: ALL_TIME_DATA + data)
OHLCV Parquet files share a uniform 6-column schema (the note below describes an exception in data/):
| Column | Type | Description |
|---|---|---|
| ts | datetime/string | Timestamp of the candle bar |
| open | float | Opening price |
| high | float | Highest price in the period |
| low | float | Lowest price in the period |
| close | float | Closing price |
| volume | int/float | Trade volume during the period |
Note: Some files in data/ have 7 columns. The extra first column appears to be a malformed header row baked into the schema, left over from a raw TSV conversion. The six core OHLCV columns are unchanged.
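Because of the 7-column files, a loader may want to check column names before trusting a file. The following is a minimal sketch, assuming the six OHLCV columns always carry the names listed above; the helper `split_columns` and the artifact column name `raw_header` are hypothetical, since the source does not say what the extra column is called.

```python
OHLCV_COLUMNS = ["ts", "open", "high", "low", "close", "volume"]

def split_columns(columns):
    """Return (ohlcv, extras) for a file's column names.

    Raises ValueError if any of the six OHLCV columns is missing.
    """
    missing = [c for c in OHLCV_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"missing OHLCV columns: {missing}")
    extras = [c for c in columns if c not in OHLCV_COLUMNS]
    return list(OHLCV_COLUMNS), extras

# Hypothetical 7-column file; the artifact column's real name is not documented.
ohlcv, extras = split_columns(
    ["raw_header", "ts", "open", "high", "low", "close", "volume"]
)
print(ohlcv, extras)
```

With pandas, selecting `df[ohlcv]` after this check would keep only the six documented fields and drop any artifact column.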
3.2 Reports Files
reports/portfolio_equity.parquet (5,000 rows, 2 columns):
| Column | Description |
|---|---|
| step | Simulation/backtest step index |
| equity | Portfolio equity value at that step |
reports/portfolio_trades.parquet (3,797 rows, 8 columns):
| Column | Description |
|---|---|
| pair | Trading pair/instrument |
| type | Order type |
| side | Buy or Sell |
| price | Execution price |
| size | Position size |
| time | Trade timestamp |
| score | Signal/strategy score |
| pnl | Profit & Loss |
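As one illustration of what the equity file supports, maximum drawdown can be computed from the `equity` column once it is ordered by `step`. This is a sketch on synthetic values, not on the dataset's actual equity curve, and `max_drawdown` is a hypothetical helper name.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline as a fraction of the peak (0.0 if none)."""
    peak = None
    worst = 0.0
    for value in equity:
        if peak is None or value > peak:
            peak = value
        if peak > 0:
            worst = max(worst, (peak - value) / peak)
    return worst

# Synthetic equity values, not taken from the dataset.
sample = [100.0, 110.0, 99.0, 105.0, 120.0, 108.0]
print(round(max_drawdown(sample), 4))
```

The function takes any iterable of numbers, so a column read from portfolio_equity.parquet could be passed in directly.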
4. Instruments Catalog (111 Unique)
4.1 Forex Pairs (28)
Major, minor, and cross pairs:
| Pair | Pair | Pair | Pair |
|---|---|---|---|
| AUDCAD# | AUDCHF# | AUDJPY# | AUDNZD# |
| AUDUSD# | CADCHF# | CADJPY# | CHFJPY# |
| EURAUD# | EURCAD# | EURCHF# | EURGBP# |
| EURJPY# | EURNZD# | EURUSD# | GBPAUD# |
| GBPCAD# | GBPCHF# | GBPJPY# | GBPNZD# |
| GBPUSD# | NZDCAD# | NZDCHF# | NZDJPY# |
| NZDUSD# | USDCAD# | USDCHF# | USDJPY# |
Additional forex in data/ folder without # suffix: EURUSD, GBPUSD, USDJPY
4.2 Cryptocurrencies (58)
| Symbol | Symbol | Symbol | Symbol |
|---|---|---|---|
| 1INCHUSD# | AAVEUSD# | ADAUSD# | ALGOUSD# |
| APEUSD# | APTUSD# | ARBUSD# | ATOMUSD# |
| AVAXUSD# | AXSUSD# | BATUSD# | BCHUSD# |
| BTCEUR# | BTCGBP# | BTCJPY# | BTCUSD# |
| BTGUSD# | CHZUSD# | COMPUSD# | CRVUSD# |
| DASHUSD# | DOGEUSD# | DOTUSD# | EGLDUSD# |
| ENJUSD# | ETCUSD# | ETHBTC# | ETHEUR# |
| ETHGBP# | ETHUSD# | FILUSD# | FLOWUSD# |
| GRTUSD# | ICPUSD# | IMXUSD# | LDOUSD# |
| LINKUSD# | LRCUSD# | LTCUSD# | MANAUSD# |
| MATICUSD# | NEARUSD# | OPUSD# | SANDUSD# |
| SHIBUSD# | SNXUSD# | SOLUSD# | STORJUSD# |
| STXUSD# | SUSHIUSD# | UMAUSD# | UNIUSD# |
| VAULTAUSD# | XLMUSD# | XRPUSD# | XTZUSD# |
| ZECUSD# | ZRXUSD# | | |
4.3 Commodities / Precious Metals (9)
| Symbol | Description |
|---|---|
| GOLD.i# | Gold (USD) |
| SILVER.i# | Silver (USD) |
| XAUCNH.i# | Gold / Chinese Yuan |
| XAUEUR.i# | Gold / Euro |
| XAUJPY.i# | Gold / Japanese Yen |
| GAUCNH.i# | Gold (alternate CNH) |
| GAUUSD.i# | Gold (alternate USD) |
| XPDUSD.i# | Palladium / USD |
| XPTUSD.i# | Platinum / USD |
4.4 Stocks / Equities (13)
| Symbol | Company |
|---|---|
| Amazon | Amazon.com Inc. |
| BancoBradesco | Banco Bradesco S.A. |
| DraftKings | DraftKings Inc. |
| Ford | Ford Motor Company |
| Gerdau | Gerdau S.A. |
| Intel | Intel Corporation |
| Nu Holdings | Nu Holdings Ltd. |
| Nvidia | NVIDIA Corporation |
| Pinterest | Pinterest Inc. |
| PlugPower | Plug Power Inc. |
| Rivian | Rivian Automotive Inc. |
| Tesla | Tesla Inc. |
| Transocean | Transocean Ltd. |
5. Timeframes
| Timeframe | Label in Path | Files (ALL_TIME_DATA) | Rows (ALL_TIME_DATA) | Typical Row Count per Instrument |
|---|---|---|---|---|
| 1 Minute | 1min_time | 108 | 182,411,731 | 42K to 9.8M |
| 5 Minutes | 5min_time | 108 | 51,658,834 | 8K to 2.0M |
| 15 Minutes | 15min_time | 108 | 22,286,150 | 9K to 679K |
| 30 Minutes | 30min_time | 108 | 11,251,532 | 4K to 343K |
| 1 Hour | 1hr_time | 108 | 5,739,404 | 2K to 175K |
| 4 Hours | 4hr_time | 108 | 1,526,167 | 628 to 49K |
| 1 Day | 1day_time | 108 | 358,588 | 106 to 14K |
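The source does not say how the coarser timeframes were produced, so the following is only a sketch of the conventional OHLCV roll-up rule (first open, max high, min low, last close, summed volume), not a description of how this dataset was built. It assumes bars are sorted by `ts` and that buckets align to multiples of the bar width in UTC; `aggregate_bars` is a hypothetical helper, and the input bars are synthetic.

```python
from datetime import datetime, timedelta, timezone

def aggregate_bars(bars, minutes):
    """Roll sorted 1-minute bars up into `minutes`-wide bars.

    Each bar is a dict with ts (aware datetime), open, high, low, close, volume.
    Buckets are aligned to multiples of `minutes` since the Unix epoch (UTC).
    """
    width = minutes * 60
    out = []
    for bar in bars:
        epoch = int(bar["ts"].timestamp())
        start = datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)
        if out and out[-1]["ts"] == start:
            cur = out[-1]
            cur["high"] = max(cur["high"], bar["high"])
            cur["low"] = min(cur["low"], bar["low"])
            cur["close"] = bar["close"]       # last bar in the bucket wins
            cur["volume"] += bar["volume"]
        else:
            # First bar of a new bucket: copy it and stamp the bucket start.
            out.append({**bar, "ts": start})
    return out

# Synthetic 1-minute bars starting at 00:00 UTC.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
one_min = [
    {"ts": t0 + timedelta(minutes=i), "open": 100 + i, "high": 101 + i,
     "low": 99 + i, "close": 100.5 + i, "volume": 10}
    for i in range(10)
]
five_min = aggregate_bars(one_min, 5)
print(len(five_min), five_min[0])
```

Comparing such a roll-up against the stored 5min or 1hr files would be one way to test whether the timeframes in the lake are mutually consistent.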
6. Data Partitioning: ALL_TIME_DATA vs data
| Property | ALL_TIME_DATA/ | data/ |
|---|---|---|
| Purpose | Full historical archive | Recent/sampled subset |
| File Count | 756 | 35 |
| Row Count | 275,232,406 | 1,185,910 |
| Instruments | 108 (every cataloged symbol except the unsuffixed EURUSD, GBPUSD, USDJPY) | 5 (BTCUSD#, ETHUSD#, EURUSD, GBPUSD, USDJPY) |
| Timeframes | All 7 | All 7 |
| Schema Notes | Clean 6-col OHLCV | Some files have 7 cols (legacy header artifact) |
7. File Naming Convention
{InstrumentSymbol}_{Timeframe}.parquet
Examples:
- BTCUSD#_1min.parquet: Bitcoin/USD, 1-minute bars
- EURUSD#_1day.parquet: EUR/USD, daily bars
- Tesla_15min.parquet: Tesla stock, 15-minute bars
- GOLD.i#_4hr.parquet: Gold, 4-hour bars
Symbol suffix meanings:
- #: CFD/derivative instrument
- .i#: Index/commodity CFD
- No suffix: Spot or direct instrument
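The naming convention and folder layout can be combined into small helpers. This is a sketch: the function names are hypothetical, and it assumes data/ follows the same `{timeframe}_time/{symbol}_{timeframe}.parquet` pattern that the documented ALL_TIME_DATA example uses.

```python
TIMEFRAMES = ("1min", "5min", "15min", "30min", "1hr", "4hr", "1day")
FOLDERS = ("ALL_TIME_DATA", "data")

def parquet_path(symbol, timeframe, folder="ALL_TIME_DATA"):
    """Build the repo-relative path for one OHLCV file."""
    if timeframe not in TIMEFRAMES:
        raise ValueError(f"unknown timeframe: {timeframe!r}")
    if folder not in FOLDERS:
        raise ValueError(f"unknown folder: {folder!r}")
    return f"{folder}/{timeframe}_time/{symbol}_{timeframe}.parquet"

def parse_filename(filename):
    """Split '{symbol}_{timeframe}.parquet' into (symbol, timeframe)."""
    stem = filename.removesuffix(".parquet")
    # Split on the last underscore so symbols may contain underscores.
    symbol, timeframe = stem.rsplit("_", 1)
    return symbol, timeframe

def suffix_kind(symbol):
    """Map the symbol suffix to the meaning listed above."""
    if symbol.endswith(".i#"):      # check the longer suffix first
        return "index/commodity CFD"
    if symbol.endswith("#"):
        return "CFD/derivative"
    return "spot or direct"

print(parquet_path("BTCUSD#", "1hr"))
print(parse_filename("GOLD.i#_4hr.parquet"), suffix_kind("GOLD.i#"))
```

`str.removesuffix` requires Python 3.9 or later.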
8. API Design Reference
8.1 Recommended API Endpoints
GET /api/v1/instruments
    → List all 111 available instruments with metadata
GET /api/v1/instruments/{symbol}
    → Instrument details (asset class, available timeframes, row counts)
GET /api/v1/ohlcv/{symbol}
    ?timeframe=1min|5min|15min|30min|1hr|4hr|1day
    &start=2024-01-01T00:00:00Z
    &end=2025-12-31T23:59:59Z
    &limit=1000
    &offset=0
    → OHLCV candle data with pagination
GET /api/v1/reports/equity
    → Portfolio equity curve (5,000 steps)
GET /api/v1/reports/trades
    ?pair=BTCUSD
    &side=buy|sell
    → Portfolio trade history (3,797 trades)
GET /api/v1/metadata
    → Dataset-level metadata (total files, rows, timeframes, etc.)
GET /api/v1/search
    ?q=BTC&asset_class=crypto
    → Search instruments by name/class
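One practical point for the `{symbol}` path parameter: most symbols end in `#`, which starts the fragment in a URL, and `Nu Holdings` contains a space. Clients would need to percent-encode the symbol. A minimal sketch follows; the host `api.example.com` and the helper `ohlcv_url` are hypothetical, since the source names no host.

```python
from urllib.parse import quote, urlencode

# Hypothetical host; the source does not name one.
BASE_URL = "https://api.example.com/api/v1"

def ohlcv_url(symbol, **params):
    """Build an /ohlcv/{symbol} URL with the symbol percent-encoded."""
    # safe='' forces encoding of every reserved character, including '#'.
    path = f"{BASE_URL}/ohlcv/{quote(symbol, safe='')}"
    return f"{path}?{urlencode(params)}" if params else path

print(ohlcv_url("BTCUSD#", timeframe="1hr", limit=1000))
print(ohlcv_url("Nu Holdings"))
```

Without the encoding, everything from `#` onward is treated as a fragment and is not sent to the server.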
8.2 Data Access Pattern (HuggingFace)
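from huggingface_hub import hf_hub_url
import pandas as pd

# Build the direct download URL for one Parquet file in the dataset repo.
url = hf_hub_url(
    repo_id="OMCHOKSI108/my-cloud-data-lake",
    filename="ALL_TIME_DATA/1hr_time/BTCUSD#_1hr.parquet",
    repo_type="dataset",
)

# pandas fetches the file over HTTPS; a Parquet engine such as pyarrow must be installed.
df = pd.read_parquet(url)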
8.3 Query Parameters for API
| Parameter | Type | Description |
|---|---|---|
| symbol | string | Instrument symbol (e.g., BTCUSD#, Tesla) |
| timeframe | enum | 1min, 5min, 15min, 30min, 1hr, 4hr, 1day |
| start | ISO datetime | Start of date range filter |
| end | ISO datetime | End of date range filter |
| limit | int | Max rows returned (default: 1000, max: 10000) |
| offset | int | Pagination offset |
| source | enum | all_time or recent (maps to folder) |
| format | enum | json, csv, parquet |
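A server implementing these parameters would need to validate them before touching any file. The sketch below is one possible shape, under several assumptions that are not in the source: `source` maps `all_time` to ALL_TIME_DATA and `recent` to data, the defaults for `source` and `format` are `all_time` and `json`, and out-of-range values are rejected instead of clamped. `OhlcvQuery` and `parse_query` are hypothetical names.

```python
from dataclasses import dataclass

TIMEFRAMES = {"1min", "5min", "15min", "30min", "1hr", "4hr", "1day"}
# Assumed mapping from the `source` enum to top-level folders.
SOURCE_FOLDERS = {"all_time": "ALL_TIME_DATA", "recent": "data"}
FORMATS = {"json", "csv", "parquet"}
DEFAULT_LIMIT, MAX_LIMIT = 1000, 10000

@dataclass(frozen=True)
class OhlcvQuery:
    symbol: str
    timeframe: str
    folder: str
    limit: int
    offset: int
    fmt: str

def parse_query(symbol, timeframe, source="all_time", limit=DEFAULT_LIMIT,
                offset=0, fmt="json"):
    """Validate raw parameters and return a normalized OhlcvQuery."""
    if timeframe not in TIMEFRAMES:
        raise ValueError(f"invalid timeframe: {timeframe!r}")
    if source not in SOURCE_FOLDERS:
        raise ValueError(f"invalid source: {source!r}")
    if fmt not in FORMATS:
        raise ValueError(f"invalid format: {fmt!r}")
    # Query-string values arrive as text, so coerce before range checks.
    limit, offset = int(limit), int(offset)
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return OhlcvQuery(symbol, timeframe, SOURCE_FOLDERS[source], limit, offset, fmt)

print(parse_query("BTCUSD#", "1hr", source="recent", limit="500"))
```

Rejecting an oversized `limit` makes client mistakes visible; silently clamping to 10000 would be an equally defensible choice.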
8.4 Response Schema
{
  "symbol": "BTCUSD#",
  "timeframe": "1hr",
  "total_rows": 66829,
  "returned": 1000,
  "data": [
    {
      "ts": "2024-01-01T00:00:00Z",
      "open": 42150.50,
      "high": 42280.00,
      "low": 42100.00,
      "close": 42230.75,
      "volume": 1234
    }
  ]
}
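The envelope above pairs naturally with `limit`/`offset` pagination. A minimal sketch of building it from an in-memory list of rows, assuming `total_rows` counts all rows matching the query and `returned` counts the rows in the current page; `build_response` is a hypothetical helper and the rows are synthetic.

```python
def build_response(symbol, timeframe, rows, limit=1000, offset=0):
    """Slice `rows` and wrap the page in the response envelope."""
    page = rows[offset:offset + limit]
    return {
        "symbol": symbol,
        "timeframe": timeframe,
        "total_rows": len(rows),    # rows matching the query
        "returned": len(page),      # rows in this page
        "data": page,
    }

# Synthetic rows; real rows would carry ts/open/high/low/close/volume.
rows = [{"ts": i, "close": 100.0 + i} for i in range(2500)]
last_page = build_response("BTCUSD#", "1hr", rows, limit=1000, offset=2000)
print(last_page["total_rows"], last_page["returned"])
```

A client can stop paging once `offset + returned` reaches `total_rows`.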
9. Asset Classification Map
Use this mapping to categorize instruments in the API:
{
  "forex_major": ["EURUSD#", "GBPUSD#", "USDJPY#", "USDCHF#", "AUDUSD#", "NZDUSD#", "USDCAD#"],
  "forex_cross": ["EURJPY#", "GBPJPY#", "EURGBP#", "AUDCAD#", "AUDCHF#", "AUDJPY#", "AUDNZD#", "CADCHF#", "CADJPY#", "CHFJPY#", "EURAUD#", "EURCAD#", "EURCHF#", "EURNZD#", "GBPAUD#", "GBPCAD#", "GBPCHF#", "GBPNZD#", "NZDCAD#", "NZDCHF#", "NZDJPY#"],
  "crypto_major": ["BTCUSD#", "ETHUSD#", "LTCUSD#", "XRPUSD#", "BCHUSD#"],
  "crypto_alt": ["ADAUSD#", "SOLUSD#", "DOTUSD#", "LINKUSD#", "AVAXUSD#", "DOGEUSD#", "SHIBUSD#", "MATICUSD#", "UNIUSD#", "AAVEUSD#", "...and 37 more"],
  "crypto_cross": ["BTCEUR#", "BTCGBP#", "BTCJPY#", "ETHBTC#", "ETHEUR#", "ETHGBP#"],
  "commodities": ["GOLD.i#", "SILVER.i#", "XPDUSD.i#", "XPTUSD.i#", "XAUEUR.i#", "XAUJPY.i#", "XAUCNH.i#", "GAUCNH.i#", "GAUUSD.i#"],
  "stocks": ["Amazon", "Tesla", "Nvidia", "Intel", "Ford", "Rivian", "Pinterest", "PlugPower", "DraftKings", "Nu Holdings", "Gerdau", "BancoBradesco", "Transocean"]
}
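For per-symbol lookups, the class-to-symbols map is more convenient inverted. The sketch below uses an abbreviated copy of the map for brevity; `build_lookup` and `classify` are hypothetical helpers, and returning `"unknown"` for unlisted symbols is an assumption about the desired behavior.

```python
# Abbreviated copy of the classification map, for illustration only.
ASSET_CLASSES = {
    "forex_major": ["EURUSD#", "GBPUSD#", "USDJPY#"],
    "crypto_major": ["BTCUSD#", "ETHUSD#"],
    "crypto_cross": ["BTCEUR#", "ETHBTC#"],
    "commodities": ["GOLD.i#", "SILVER.i#"],
    "stocks": ["Amazon", "Tesla", "Nu Holdings"],
}

def build_lookup(class_map):
    """Invert {class: [symbols]} into {symbol: class}, rejecting duplicates."""
    lookup = {}
    for asset_class, symbols in class_map.items():
        for symbol in symbols:
            if symbol in lookup:
                raise ValueError(
                    f"{symbol} listed in both {lookup[symbol]} and {asset_class}"
                )
            lookup[symbol] = asset_class
    return lookup

def classify(symbol, lookup):
    """Return the asset class for a symbol, or 'unknown' if it is not mapped."""
    return lookup.get(symbol, "unknown")

LOOKUP = build_lookup(ASSET_CLASSES)
print(classify("GOLD.i#", LOOKUP), classify("EURUSD", LOOKUP))
```

Note that the unsuffixed data/ symbols (EURUSD, GBPUSD, USDJPY) do not appear in the map as written, so they would fall through to the default unless added.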
10. Data Volume by Asset Class (ALL_TIME_DATA)
| Asset Class | Instruments | Files | Est. Rows |
|---|---|---|---|
| Forex | 28 | 196 | ~80M+ |
| Crypto | 58 | 406 | ~150M+ |
| Commodities | 9 | 63 | ~20M+ |
| Stocks | 13 | 91 | ~25M+ |
| Reports | 2 | 2 | 8,797 |
11. Key Observations & Notes
- Uniform OHLCV schema across market data files, so there are no schema conflicts between asset classes (apart from the 7-column files in data/ noted in section 3.1)
- Highest-granularity data is in 1-minute bars, which account for 66% of all rows (182M rows)
- Longest-history instruments: major forex pairs (EURUSD, GBPUSD, USDJPY, USDCHF) have up to 14K daily bars (55+ years of data) and 9.8M 1-minute bars
- Shortest-history instruments: GAUCNH.i#, GAUUSD.i#, XAUCNH.i#, XAUJPY.i# have only ~106 daily bars
- The data/ folder contains a focused subset of 5 key instruments (BTCUSD#, ETHUSD#, EURUSD, GBPUSD, USDJPY), likely used for development/testing
- The reports/ folder contains backtesting/simulation results: an equity curve and a trade log
- Data format is Apache Parquet: columnar, compressed, and well suited to analytical queries
- No bid/ask spread data, only mid-price OHLCV
- No fundamental data, purely technical/price data
12. Potential Use Cases
- Trading Strategy Backtesting: multi-asset, multi-timeframe
- ML/AI Price Prediction Models: 276M-row training dataset
- Technical Analysis API: serve OHLCV with on-the-fly indicator calculation
- Cross-Asset Correlation Analysis: forex, crypto, commodities, stocks in one lake
- Portfolio Simulation: reports data already includes equity curves and trade logs
- Real-Time Dashboard: serve historical + stream live data via WebSocket
- Market Data Microservice: HuggingFace as cold storage, API serves hot queries via Redis/DuckDB cache