Spaces: Yash1178 / commodisense


Yash1178 committed 6 days ago

Commit

2c3c5f5

1 Parent(s): 054baf7

Deploy CommodiSense v1.0

Files changed (17)

.gitignore +63 -0
.streamlit/config.toml +14 -0
Dockerfile +12 -0
README.md +564 -5
dashboard/app.py +1077 -0
model/__init__.py +0 -0
model/explainer.py +266 -0
model/feature_builder.py +374 -0
model/predictor.py +387 -0
model/trainer.py +496 -0
requirements.txt +18 -0
signals/__init__.py +0 -0
signals/macro_features.py +457 -0
signals/nlp_events.py +313 -0
signals/nlp_sentiment.py +337 -0
signals/price_features.py +365 -0
signals/weather_features.py +118 -0

.gitignore ADDED

	@@ -0,0 +1,63 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+venv/
+ENV/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# Environment
+.env
+.env.local
+.env.*.local
+# Data/Cache/Database (binary files)
+*.db
+*.duckdb
+data/cache/
+data/*.parquet
+data/*.csv.bak
+data/collector_cache/
+model/models/
+model/cache/
+.cache/
+*.pkl
+*.pickle
+# Logs
+*.log
+logs/
+runs/
+# Streamlit cache (keep config.toml for deployment)
+.streamlit_cache/
+.streamlit/.cache/
+.streamlit/__pycache__/
+# Jupyter
+.ipynb_checkpoints/
+*.ipynb

.streamlit/config.toml ADDED

	@@ -0,0 +1,14 @@

+[theme]
+primaryColor = "#3D7FFF"
+backgroundColor = "#060A0F"
+secondaryBackgroundColor = "#0D1117"
+textColor = "#E6EDF3"
+[client]
+showErrorDetails = true
+[server]
+port = 7860
+headless = true
+enableCORS = false
+enableXsrfProtection = true

Dockerfile ADDED

	@@ -0,0 +1,12 @@

+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 7860
+CMD ["streamlit", "run", "dashboard/app.py", "--server.port=7860", "--server.address=0.0.0.0"]

README.md CHANGED

@@ -1,10 +1,569 @@
 ---
-title: Commodisense
-emoji: 🐢
-colorFrom: pink
-colorTo: purple
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: CommodiSense
+colorFrom: gray
+colorTo: gray
 sdk: docker
+app_file: dashboard/app.py
 pinned: false
 ---
+# ◈ CommodiSense — Global Commodity Intelligence Engine
+<div align="center">
+![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=flat-square&logo=python&logoColor=white)
+![Streamlit](https://img.shields.io/badge/Streamlit-1.28+-FF4B4B?style=flat-square&logo=streamlit&logoColor=white)
+![XGBoost](https://img.shields.io/badge/XGBoost-2.0+-006400?style=flat-square)
+![LightGBM](https://img.shields.io/badge/LightGBM-4.0+-5B8C5A?style=flat-square)
+![DuckDB](https://img.shields.io/badge/DuckDB-0.10+-FFF000?style=flat-square)
+![License](https://img.shields.io/badge/License-MIT-blue?style=flat-square)
+![Status](https://img.shields.io/badge/Status-Live-00D97E?style=flat-square)
+**Zero-cost commodity price direction forecaster for 10 global markets.**
+Powered by XGBoost + LightGBM ensemble, SHAP explainability, FinBERT NLP sentiment,
+CFTC COT positioning, EIA inventory data, USDA crop signals, and FRED macro indicators.
+[**Live Demo**](https://commodisense.streamlit.app) · [**Report Bug**](https://github.com/Yashvardhansharma112/commodisense/issues) · [**Request Feature**](https://github.com/Yashvardhansharma112/commodisense/issues)
+</div>
+---
+## Table of Contents
+- [Overview](#overview)
+- [Features](#features)
+- [How It Works](#how-it-works)
+- [Data Sources](#data-sources)
+- [Model Architecture](#model-architecture)
+- [Accuracy Results](#accuracy-results)
+- [Tech Stack](#tech-stack)
+- [Project Structure](#project-structure)
+- [Getting Started](#getting-started)
+- [Configuration](#configuration)
+- [Deployment](#deployment)
+- [Daily Pipeline](#daily-pipeline)
+- [API Keys](#api-keys)
+---
+## Overview
+CommodiSense is a production-grade commodity intelligence platform that forecasts price direction (UP / STABLE / DOWN) for 10 global commodity futures over 7-day and 30-day horizons.
+Unlike most financial ML projects that rely on price technicals alone, CommodiSense fuses **8 independent data sources** — including institutional positioning data (CFTC COT), energy inventory surprises (EIA), crop condition ratings (USDA), and macroeconomic indicators (FRED) — into a single ensemble model per commodity.
+The entire system runs at **zero ongoing cost** using free public APIs, GitHub Actions for scheduling, Streamlit Cloud for hosting, and DuckDB as a serverless embedded database.
+```
+Data Collection → Feature Engineering → Ensemble Training → Live Dashboard
+   (8 sources)       (65+ features)       (XGBoost+LGBM)    (Streamlit Cloud)
+```
+---
+## Features
+### Forecasting Engine
+- **10 commodity markets**: Crude Oil (CL=F), Natural Gas (NG=F), Gold (GC=F), Wheat (ZW=F), Corn (ZC=F), Soybeans (ZS=F), Cotton (CT=F), Sugar (SB=F), USD/INR (USDINR=X), Copper (HG=F)
+- **Dual horizons**: 7-day and 30-day directional forecasts
+- **3-class output**: UP (>threshold%), STABLE, DOWN (<-threshold%) with per-commodity calibrated thresholds
+- **Probability scores** with isotonic calibration for reliable confidence estimates
+- **HIGH / MEDIUM / LOW confidence tiers** based on model probability
+- **Signal confirmation filter**: 4 independent signals must agree to issue a HIGH-confidence call (price momentum, COT commercial positioning, EIA supply signal, USDA crop trend)
+### Data Intelligence
+- **CFTC COT Reports**: 13 years of weekly institutional positioning (commercial hedgers vs managed money). The single most valuable commodity signal — smart money positioning often leads price by 1–3 weeks.
+- **EIA Inventory**: Weekly crude oil stocks (2,278 rows back to 1982) and natural gas storage (856 rows). Inventory surprises vs 5-year average directly drive energy price moves.
+- **USDA NASS**: Weekly crop condition (% good + excellent) for corn, wheat, soybeans, cotton. Annual production estimates. Declining crop condition → bullish price signal.
+- **FRED Macro**: USD Index (DXY), VIX volatility, 10-year Treasury yield, Fed Funds rate, Industrial Production. Gold inversely correlates with real yields; copper tracks industrial output.
+- **FinBERT NLP**: GDELT news articles scored for financial sentiment (bullish/bearish/neutral). Rolling 1-day, 3-day, 7-day sentiment aggregates per commodity.
+- **spaCy Event Extraction**: Supply shock, policy change, and geopolitical event detection from news headlines.
+- **Open-Meteo Weather**: Drought index, heat stress days, precipitation anomaly for agricultural commodity regions.
+- **ACLED Geopolitical**: Risk scores for regions that supply each commodity.
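A minimal sketch of the "inventory surprise vs 5-year average" idea from the EIA bullet, written as a z-score. The function name, the input shape (one stock level per prior year for the same calendar week), and the numbers are illustrative assumptions; the repo's `signals/macro_features.py` is not shown in this part of the diff and may compute it differently.

```python
from statistics import mean, stdev


def inventory_zscore(history, current):
    """Z-score of the current stock level against a historical baseline.

    `history` holds stock levels for the same calendar week in prior years
    (hypothetical input shape); `current` is this week's level.
    """
    if len(history) < 2:
        return 0.0  # not enough history to estimate the spread
    spread = stdev(history)
    if spread == 0:
        return 0.0  # flat history: no meaningful surprise
    return (current - mean(history)) / spread


# Illustrative numbers only (million barrels), not real EIA data.
five_year_same_week = [430.0, 445.0, 460.0, 440.0, 425.0]
z = inventory_zscore(five_year_same_week, 410.0)

# A negative z-score means stocks sit below their seasonal norm,
# which the README reads as a supply-tightening signal.
draw_flag = z < 0
```

Comparing against the same week in earlier years, rather than the raw level, keeps ordinary seasonal builds and draws from registering as surprises.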
+### Explainability
+- **SHAP values** for every forecast — top 5 signal drivers shown in the dashboard
+- Human-readable feature labels (e.g., "COT Smart Money Positioning", "EIA Crude Inventory Surprise")
+- **AI Analyst Reports** generated via Groq LLM (Llama 3) contextualizing each forecast
+### Dashboard (Dark Luxury Terminal)
+- Live animated ticker strip with all 10 markets
+- Macro environment bar: DXY, VIX, yield curve, spread, copper demand proxy
+- Direction-colored commodity cards with confidence badges
+- Candlestick chart with 20-day SMA and forecast zone overlay
+- COT positioning chart (commercial vs managed money, 2-year history)
+- EIA inventory bar chart with 4-week rolling average
+- News sentiment chart with bull/bear zones
+- Weather signal metrics
+- AI analyst report per commodity
+- Recent news feed with sentiment scores
+### Infrastructure
+- **GitHub Actions** daily pipeline (Mon–Fri 6am UTC): collect → process → retrain → forecast → commit
+- **DuckDB** embedded database (no server required, zero cost)
+- **Streamlit Cloud** free-tier hosting with auto-deploy on push
+- Full **error isolation** — one failing step doesn't halt the rest of the pipeline
+---
+## How It Works
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     DAILY PIPELINE (13 Steps)                    │
+├─────────────────────────────────────────────────────────────────┤
+│  Step 1   Collect prices         yfinance → DuckDB               │
+│  Step 2   Collect news           GDELT → DuckDB                  │
+│  Step 3   Collect weather        Open-Meteo → DuckDB             │
+│  Step 4   Collect geopolitical   ACLED → DuckDB                  │
+│  Step 5   Collect COT            CFTC → DuckDB                   │
+│  Step 6   Collect FRED macro     FRED CSV + yfinance → DuckDB    │
+│  Step 7   Collect EIA inventory  EIA API v2 → DuckDB             │
+│  Step 8   Collect USDA crop      USDA NASS API → DuckDB          │
+│  Step 9   Score NLP sentiment    FinBERT → sentiment_daily       │
+│  Step 10  Extract events         spaCy → extracted_events        │
+│  Step 11  Generate forecasts     XGBoost+LightGBM → accuracy_log │
+│  Step 12  Generate AI reports    Groq LLM → reports              │
+│  Step 13  Log accuracy           Compare 7-day-old forecasts     │
+└─────────────────────────────────────────────────────────────────┘
+                             ↓ pushes to GitHub ↓
+                    Streamlit Cloud auto-deploys
+```
+---
+## Data Sources
+| Source | Type | Coverage | Update Frequency | Key |
+|--------|------|----------|-----------------|-----|
+| **yfinance** | Price OHLCV | 12,613 rows · 5yr | Daily | None |
+| **CFTC COT** | Futures positioning | 8,826 rows · 13yr | Weekly (Friday) | None |
+| **FRED** | Macro indicators | 7,193 rows · 16yr | Daily/Weekly/Monthly | None |
+| **EIA** | Energy inventory | 3,134 rows · 40yr crude | Weekly (Wednesday) | Free |
+| **USDA NASS** | Crop condition & stocks | 1,104 rows · 5yr | Weekly/Quarterly | Free |
+| **GDELT** | Global news | 392 articles | Daily | None |
+| **Open-Meteo** | Agricultural weather | 210 rows | Daily | None |
+| **ACLED** | Geopolitical events | 20 events | Weekly | None |
+### Free API Keys Required
+| API | Data | Register |
+|-----|------|---------|
+| EIA | Crude oil & natural gas weekly inventory | [eia.gov/opendata](https://www.eia.gov/opendata/register.php) |
+| USDA NASS | Crop condition, stocks, production | [quickstats.nass.usda.gov/api](https://quickstats.nass.usda.gov/api) |
+| Groq | AI analyst report generation | [console.groq.com](https://console.groq.com) |
+---
+## Model Architecture
+### Per-Symbol Ensemble
+Each of the 10 commodities has **two independent models** trained: one for the 7-day horizon and one for the 30-day horizon.
+```
+Raw Features (65+)
+        │
+        ▼
+  Feature Selection                ← drops columns with <5% non-zero values
+  (sparse filter)                    auto-excludes missing data sources
+        │
+        ▼
+  StandardScaler                   ← fit on training data, saved per symbol
+        │
+        ├─────────────────────────────────────────────┐
+        ▼                                             ▼
+  XGBoost Classifier            LightGBM Classifier
+  (300 trees, max_depth=5)      (300 trees, 31 leaves)
+  + Isotonic Calibration
+        │                                             │
+        └──────────────┬──────────────────────────────┘
+                       ▼
+              Ensemble (avg probabilities)
+                       │
+                       ▼
+              Direction + Probability
+              (UP / STABLE / DOWN)
+                       │
+                       ▼
+         Signal Confirmation Filter          ← 4-signal cross-check
+         (momentum + COT + EIA + USDA)
+                       │
+                       ▼
+              HIGH / MEDIUM / LOW confidence
+```
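The last three stages of the diagram (probability averaging, direction pick, confidence tier) can be sketched in plain Python. This is an illustration under assumptions: the class order, the function names, and the 0.60 / 0.45 probability cut-offs are placeholders, since `model/predictor.py` is not visible in this part of the diff.

```python
# Assumed class order for the probability vectors (hypothetical).
CLASSES = ("DOWN", "STABLE", "UP")


def ensemble_direction(xgb_proba, lgbm_proba):
    """Average two per-class probability vectors and pick the top class."""
    avg = [(a + b) / 2 for a, b in zip(xgb_proba, lgbm_proba)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return CLASSES[best], avg[best]


def confidence_tier(probability, confirmations, high=0.60, medium=0.45, required=4):
    """Map probability plus the number of confirming signals to a tier.

    The cut-offs are placeholders, not values taken from the repo.
    HIGH needs both a strong probability and all confirming signals.
    """
    if probability >= high and confirmations >= required:
        return "HIGH"
    if probability >= medium:
        return "MEDIUM"
    return "LOW"


direction, prob = ensemble_direction([0.10, 0.20, 0.70], [0.20, 0.20, 0.60])
tier = confidence_tier(prob, confirmations=4)
```

Averaging probabilities rather than voting on labels lets a confident model outweigh a hesitant one, which matters when only two models are in the ensemble.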
+### Feature Groups (65+ total)
+| Group | Features | Count |
+|-------|----------|-------|
+| **Price technicals** | RSI-14, MACD, Bollinger Band position, ATR, SMA crossover | 5 |
+| **Price momentum** | Return 1d/7d/14d/30d/60d, momentum score | 6 |
+| **Seasonality** | Month sin/cos, harvest season flag, days to OPEC meeting | 4 |
+| **Cross-commodity** | Oil/Gold ratio, DXY proxy | 2 |
+| **CFTC COT** | Commercial net %, MM net %, week-over-week changes, open interest | 7 |
+| **FRED macro** | DXY, VIX, 10Y yield, Fed Funds, INDPRO, yield inversion, copper basis | 12 |
+| **EIA inventory** | Stocks level, weekly change, z-score vs 5yr avg, draw flag | 5 |
+| **USDA crop** | Condition score, week-over-week change, stocks, production | 5 |
+| **NLP sentiment** | 1-day/3-day/7-day sentiment, article count, positive ratio | 5 |
+| **Event signals** | Bullish/bearish events, max severity, supply shock, policy change | 6 |
+| **Geopolitical** | Risk score 7d, risk score 30d | 2 |
+| **Weather** | Drought index, heat stress days, precipitation anomaly | 3 |
+| **Data flags** | has_cot_data, has_fred_data, has_eia_data, has_usda_data | 4 |
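The "Month sin/cos" seasonality feature is a standard cyclical encoding. A short sketch of why it is used instead of a raw month number follows; the function names are illustrative and not taken from `signals/price_features.py`.

```python
import math


def month_cyclical(month):
    """Encode a calendar month (1-12) as a point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


def circular_distance(m1, m2):
    """Euclidean distance between two encoded months."""
    s1, c1 = month_cyclical(m1)
    s2, c2 = month_cyclical(m2)
    return math.hypot(s1 - s2, c1 - c2)


# December and January end up as close together as January and February,
# which a raw 1..12 integer cannot express.
dec_to_jan = circular_distance(12, 1)
jan_to_feb = circular_distance(1, 2)
```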
+### Training Strategy
+- **Walk-forward validation**: 5-fold cross-validation on 80% of data, tested on most recent 20%
+- **Class balancing**: `compute_sample_weight("balanced")` addresses UP/DOWN/STABLE imbalance
+- **Commodity-specific thresholds**: USDINR uses ±0.4% threshold (managed float), NG=F uses ±3.5% (highly volatile)
+- **Regime detection**: TRENDING / VOLATILE / RANGE_BOUND classification per row
+- **Interaction features**: `sentiment × momentum`, `event × momentum`, `high_volatility_flag`
+- **SHAP explainer**: TreeExplainer run post-training, top 5 features saved per forecast
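One plausible reading of the walk-forward bullet is a chronological 80/20 holdout followed by expanding-window folds inside the 80%. The sketch below shows that layout with index ranges only; the fold sizing and function names are assumptions, and `model/trainer.py` may split differently.

```python
def holdout_split(n_rows, test_fraction=0.2):
    """Chronological split: the most recent rows form the test set."""
    n_test = round(n_rows * test_fraction)
    cut = n_rows - n_test
    return range(0, cut), range(cut, n_rows)


def walk_forward_folds(n_train, n_folds=5):
    """Expanding-window folds: train on everything before each validation block."""
    block = n_train // (n_folds + 1)
    folds = []
    for k in range(1, n_folds + 1):
        train_idx = range(0, k * block)
        # The last fold absorbs any remainder rows.
        val_end = (k + 1) * block if k < n_folds else n_train
        folds.append((train_idx, range(k * block, val_end)))
    return folds


train_rows, test_rows = holdout_split(1000)
folds = walk_forward_folds(len(train_rows))
```

Every validation block lies strictly after its training window, so no fold is scored on data older than what it was trained on.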
+---
+## Accuracy Results
+> Measured on held-out test set (most recent 20% of data). Random chance = 33.3% (3-class problem).
+| Commodity | 7-Day | 30-Day | vs Baseline |
+|-----------|-------|--------|------------|
+| Crude Oil (CL=F) | 30.7% | 31.5% | +4.0% |
+| Natural Gas (NG=F) | 36.3% | 44.6% | +3.6% |
+| Gold (GC=F) | 37.1% | **54.2%** | +6.8% 30d |
+| Wheat (ZW=F) | **44.6%** | 23.1% | +0.4% 7d |
+| Corn (ZC=F) | 16.7%⚠ | **48.2%** | — |
+| **Soybeans (ZS=F)** | **62.2%** | 48.6% | **+18.0%** |
+| Cotton (CT=F) | **45.8%** | 34.7% | +0.8% |
+| Sugar (SB=F) | 35.9% | 36.7% | — |
+| USD/INR (USDINR=X) | 41.2% | **50.8%** | **+28.1%** 30d |
+| Copper (HG=F) | 16.3%⚠ | 23.1% | — |
+| **Average** | **36.7%** | **39.6%** | +5.4% vs random |
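+> ⚠ ZC=F 7d and HG=F (both horizons) have below-random accuracy due to structural market regime breaks in 2024–2026 (South American corn oversupply, HG=F name change in CFTC files limiting history). For ZC=F, use the 30d forecast; HG=F is below random on both horizons. CL=F (7d and 30d) and ZW=F 30d also fall below the 33.3% baseline.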
+**Best performers:**
+- 🥇 **ZS=F 7d (62.2%)**: USDA soybean crop condition is a dominant signal
+- 🥈 **GC=F 30d (54.2%)**: Gold responds strongly to yield curve and inflation expectations
+- 🥉 **USDINR=X 30d (50.8%)**: FRED DXY + Fed Funds rate highly predictive for USD/INR
+---
+## Tech Stack
+```
+Language        Python 3.10+
+Database        DuckDB 0.10+ (embedded, zero-config, serverless)
+ML              XGBoost 2.0, LightGBM 4.0, scikit-learn 1.3
+Explainability  SHAP 0.42
+NLP             HuggingFace Transformers (FinBERT), spaCy 3.5
+Dashboard       Streamlit 1.28, Plotly 5.15
+LLM Reports     Groq API (Llama 3)
+Data APIs       yfinance, requests, FRED CSV, EIA API v2, USDA NASS API
+Scheduling      GitHub Actions (cron)
+Hosting         Streamlit Cloud (free tier)
+```
+---
+## Project Structure
+```
+commodisense/
+│
+├── data/                          # Data collection layer
+│   ├── db.py                      # DuckDB connection + schema init (11 tables)
+│   ├── collector_prices.py        # yfinance OHLCV prices
+│   ├── collector_news.py          # GDELT news articles
+│   ├── collector_weather.py       # Open-Meteo agricultural weather
+│   ├── collector_geopolitical.py  # ACLED geopolitical events
+│   ├── collector_cot.py           # CFTC COT weekly positioning (2013–2026)
+│   ├── collector_fred.py          # FRED macro + yfinance DXY/VIX
+│   ├── collector_eia.py           # EIA crude oil + natural gas inventory
+│   └── collector_usda.py          # USDA crop condition + stocks + production
+│
+├── signals/                       # Feature engineering layer
+│   ├── price_features.py          # RSI, MACD, momentum, seasonality, cross-commodity
+│   ├── nlp_sentiment.py           # FinBERT sentiment scoring pipeline
+│   ├── nlp_events.py              # spaCy event extraction
+│   ├── weather_features.py        # Drought/heat/precip aggregation by commodity region
+│   └── macro_features.py          # COT + FRED + EIA + USDA feature engineering
+│
+├── model/                         # ML layer
+│   ├── feature_builder.py         # Assembles all signals → training matrix (no lookahead)
+│   ├── trainer.py                 # XGBoost + LightGBM training, calibration, SHAP
+│   ├── predictor.py               # Inference with signal confirmation filter
+│   └── explainer.py               # AI report generation via Groq
+│
+├── pipeline/
+│   └── daily_run.py               # 13-step orchestrator with error isolation
+│
+├── dashboard/
+│   └── app.py                     # Streamlit dashboard (dark luxury terminal UI)
+│
+├── models/                        # Trained model artifacts (committed to git)
+│   ├── xgb_{SYMBOL}_{horizon}.pkl
+│   ├── lgbm_{SYMBOL}_{horizon}.pkl
+│   ├── scaler_{SYMBOL}_{horizon}.pkl
+│   ├── feature_names_{SYMBOL}_{horizon}.json
+│   └── accuracy_report.json
+│
+├── tests/
+│   └── test_accuracy.py           # Walk-forward backtesting framework (6 boosters)
+│
+├── .github/workflows/
+│   └── daily_pipeline.yml         # GitHub Actions cron (Mon–Fri 06:00 UTC)
+│
+├── .env.example                   # Environment variable template
+├── requirements.txt               # Python dependencies
+└── README.md
+```
+### Database Schema (11 tables)
+| Table | Description |
+|-------|-------------|
+| `prices` | Daily OHLCV per symbol |
+| `news_raw` | Raw news articles with NLP scores |
+| `sentiment_daily` | Aggregated daily sentiment per commodity |
+| `extracted_events` | spaCy-extracted supply shocks, policy changes |
+| `weather_features` | Drought/heat/precip by region and commodity |
+| `geopolitical_events` | Risk scores per region/commodity |
+| `accuracy_log` | Live forecast vs actual outcome tracking |
+| `cot_data` | CFTC COT weekly positioning per symbol |
+| `fred_data` | FRED macro series (daily, forward-filled) |
+| `eia_inventory` | EIA weekly energy storage |
+| `usda_crop` | USDA crop condition, stocks, production |
+---
+## Getting Started
+### Prerequisites
+- Python 3.10+
+- Git
+### Installation
+```bash
+# Clone the repository
+git clone https://github.com/Yashvardhansharma112/commodisense.git
+cd commodisense
+# Create virtual environment
+python -m venv venv
+# Activate (Windows)
+venv\Scripts\activate
+# Activate (macOS/Linux)
+source venv/bin/activate
+# Install dependencies
+pip install -r requirements.txt
+# Download spaCy model
+python -m spacy download en_core_web_sm
+```
+### Environment Variables
+```bash
+# Copy the example and fill in your keys
+cp .env.example .env
+```
+Edit `.env`:
+```env
+GROQ_API_KEY=your_groq_key_here       # groq.com — free, for AI reports
+EIA_API_KEY=your_eia_key_here         # eia.gov/opendata — free
+USDA_API_KEY=your_usda_key_here       # quickstats.nass.usda.gov/api — free
+```
+### First Run (Full Backfill)
+```bash
+# Initialize database schema
+python data/db.py
+# Backfill all data sources (takes ~15 minutes)
+python pipeline/daily_run.py --backfill
+# Train models for all 10 commodities
+for symbol in CL=F NG=F GC=F ZW=F ZC=F ZS=F CT=F SB=F USDINR=X HG=F; do
+    python model/trainer.py --symbol $symbol --horizon both
+done
+# Launch dashboard
+streamlit run dashboard/app.py
+```
+The dashboard will be available at **http://localhost:7860** (the port set in `.streamlit/config.toml`; Streamlit's own default is 8501).
+### Individual Commands
+```bash
+# Collect specific data source
+python data/collector_prices.py --backfill
+python data/collector_cot.py --backfill
+python data/collector_fred.py --backfill
+python data/collector_eia.py --backfill
+python data/collector_usda.py --backfill
+# Run NLP pipeline
+python signals/nlp_sentiment.py --limit 500
+python signals/nlp_events.py --limit 500
+# Generate forecast for a single symbol
+python model/predictor.py --symbol ZS=F
+# Generate all forecasts
+python model/predictor.py --all
+# Run accuracy backtest
+python tests/test_accuracy.py --symbol ZS=F
+# Run only a specific pipeline step (for debugging)
+python pipeline/daily_run.py --step 7
+```
+---
+## Configuration
+### Per-Commodity Direction Thresholds
+Different commodities have different volatility profiles. Thresholds are set in `model/feature_builder.py`:
+| Symbol | Threshold | Rationale |
+|--------|-----------|-----------|
+| USDINR=X | ±0.4% | Managed float — rarely moves >1% in a week |
+| GC=F | ±1.5% | Gold — moderately volatile |
+| NG=F | ±3.5% | Natural gas — highly volatile seasonally |
+| Others | ±2.0% | Default threshold |
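The thresholds in this table translate directly into a labeling rule. The sketch below uses the table's values; the function and constant names are illustrative and not taken from `model/feature_builder.py`.

```python
# Percent thresholds copied from the table; every other symbol uses the default.
THRESHOLDS = {"USDINR=X": 0.4, "GC=F": 1.5, "NG=F": 3.5}
DEFAULT_THRESHOLD = 2.0


def label_direction(symbol, price_now, price_future):
    """Return UP / STABLE / DOWN for the move from price_now to price_future."""
    pct_change = (price_future - price_now) / price_now * 100.0
    threshold = THRESHOLDS.get(symbol, DEFAULT_THRESHOLD)
    if pct_change > threshold:
        return "UP"
    if pct_change < -threshold:
        return "DOWN"
    return "STABLE"


# A 0.6% move is a signal for USD/INR; a 1.25% move in crude oil is not.
usdinr_label = label_direction("USDINR=X", 83.0, 83.5)
crude_label = label_direction("CL=F", 80.0, 81.0)
```

A single global threshold would label nearly every USD/INR week STABLE and push natural gas toward constant UP or DOWN calls, which is the imbalance the per-symbol values are meant to avoid.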
+### Adding a New Commodity
+1. Add the ticker to `ALL_SYMBOLS` in `signals/price_features.py`
+2. Add a human-readable name to `SYMBOL_NAMES` in `model/predictor.py`
+3. Run `python data/collector_prices.py --backfill`
+4. Train: `python model/trainer.py --symbol NEW=F --horizon both`
+---
+## Deployment
+### Streamlit Cloud (Recommended — Free)
+1. Fork or push to GitHub
+2. Go to [share.streamlit.io](https://share.streamlit.io)
+3. Click **New app** → connect your GitHub repo
+4. Set:
+   - **Repository**: `Yashvardhansharma112/commodisense`
+   - **Branch**: `main`
+   - **Main file path**: `dashboard/app.py`
+5. Click **Advanced settings** → paste in **Secrets** (TOML format):
+   ```toml
+   GROQ_API_KEY = "your_key"
+   EIA_API_KEY  = "your_key"
+   USDA_API_KEY = "your_key"
+   ```
+6. Click **Deploy**
+### GitHub Actions (Daily Pipeline)
+Add the same 3 keys as **Repository Secrets** at:
+`Settings → Secrets → Actions → New repository secret`
+The pipeline runs automatically Mon–Fri at 06:00 UTC. It:
+1. Collects fresh data from all 8 sources
+2. Runs NLP sentiment + event extraction
+3. Generates new forecasts for all 10 symbols
+4. Commits the updated `data/commodisense.duckdb` back to the repo
+5. Streamlit Cloud auto-deploys on the new commit
+---
+## Daily Pipeline
+The pipeline is defined in `pipeline/daily_run.py`. Each step is isolated in a `try/except` — one failure doesn't stop the rest.
+```
+Step 1   Collect prices          ~30s
+Step 2   Collect news            ~60s   (GDELT rate-limited)
+Step 3   Collect weather         ~45s
+Step 4   Collect geopolitical    ~15s
+Step 5   Collect COT             ~30s   (CFTC public ZIP download)
+Step 6   Collect FRED macro      ~30s   (7 series + yfinance fallback)
+Step 7   Collect EIA inventory   ~15s   (2 series via API)
+Step 8   Collect USDA crop       ~60s   (4 commodities × 3 queries)
+Step 9   Score NLP sentiment     ~120s  (FinBERT on GPU/CPU)
+Step 10  Extract events          ~60s   (spaCy NER)
+Step 11  Generate forecasts      ~30s   (10 symbols, cached models)
+Step 12  Generate AI reports     ~90s   (Groq API, 10 LLM calls)
+Step 13  Log accuracy            ~5s    (compare 7-day-old forecasts)
+─────────────────────────────────────────
+Total                            ~8-12 minutes
+```
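The per-step `try/except` isolation described for `pipeline/daily_run.py` might look roughly like the following. This is a sketch only: the step functions are stand-ins and the result format is an assumption, since the orchestrator itself is not part of this commit.

```python
import logging

logger = logging.getLogger("pipeline_sketch")


def run_pipeline(steps):
    """Run (name, callable) pairs in order; record failures instead of raising."""
    results = {}
    for name, step in steps:
        try:
            step()
            results[name] = "ok"
        except Exception as exc:  # one failing step must not stop the rest
            logger.warning("step %s failed: %s", name, exc)
            results[name] = f"failed: {exc}"
    return results


# Stand-in steps for illustration (hypothetical names).
def collect_prices():
    pass


def collect_news():
    raise RuntimeError("rate limited")


def generate_forecasts():
    pass


results = run_pipeline([
    ("collect_prices", collect_prices),
    ("collect_news", collect_news),
    ("generate_forecasts", generate_forecasts),
])
```

Here the simulated news failure is logged and recorded, and the forecast step still runs on whatever data the earlier steps stored.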
+Manual trigger: Go to **Actions** tab → **Daily CommodiSense Pipeline** → **Run workflow**
+---
+## API Keys
+| Key | Where to get | Cost | What it enables |
+|-----|-------------|------|----------------|
+| `GROQ_API_KEY` | [console.groq.com](https://console.groq.com) | Free tier | AI analyst reports via Llama 3 |
+| `EIA_API_KEY` | [eia.gov/opendata/register.php](https://www.eia.gov/opendata/register.php) | Free | Crude oil + natural gas weekly inventory data |
+| `USDA_API_KEY` | [quickstats.nass.usda.gov/api](https://quickstats.nass.usda.gov/api) | Free | Crop condition, stocks, production |
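+The system runs without any API keys: it skips the steps that need them (EIA inventory, USDA crop data, Groq reports) and relies on the key-free sources listed under Data Sources (prices, COT, FRED, news, weather, geopolitical). Accuracy improves significantly with all keys set.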
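A small sketch of how key-gated steps could be skipped when an environment variable is missing. The environment variable names come from the table above; the step names and the helper are hypothetical.

```python
import os

# Environment variable -> pipeline step that depends on it (step names are illustrative).
KEY_GATED_STEPS = {
    "EIA_API_KEY": "collect_eia_inventory",
    "USDA_API_KEY": "collect_usda_crop",
    "GROQ_API_KEY": "generate_ai_reports",
}


def skipped_steps(env=None):
    """Return the steps that cannot run because their API key is unset or empty."""
    env = os.environ if env is None else env
    return sorted(step for key, step in KEY_GATED_STEPS.items() if not env.get(key))


# With only the EIA key present, the USDA and Groq steps are skipped.
partial = skipped_steps({"EIA_API_KEY": "example"})
```

Passing the environment as an argument keeps the check easy to test without touching the real process environment.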
+---
+## Accuracy Improvement Roadmap
+| Data Source | Expected Gain | Status |
+|------------|--------------|--------|
+| CFTC COT (13yr history) | +5–8% avg | ✅ Implemented |
+| EIA crude + natgas inventory | +10–13% for CL=F | ✅ Implemented |
+| USDA crop condition | +15–18% for ZS=F | ✅ Implemented |
+| FRED macro (DXY, VIX, yields) | +21% USDINR=X 30d | ✅ Implemented |
+| South American crop data (CONAB) | +10–15% ZC=F | 🔲 Planned |
+| LME copper warehouse stocks | +8–12% HG=F | 🔲 Planned |
+| Heating/Cooling Degree Days (NOAA) | +5–8% NG=F | 🔲 Planned |
+| WASDE monthly projections | +5–7% grains | 🔲 Planned |
+---
+## License
+MIT License — see [LICENSE](LICENSE) for details.
+---
+## Acknowledgements
+- **CFTC** for free public COT disaggregated reports
+- **Federal Reserve (FRED)** for free macroeconomic data API
+- **U.S. Energy Information Administration (EIA)** for free energy inventory API
+- **USDA NASS** for free agricultural statistics API
+- **GDELT Project** for free global news event database
+- **Open-Meteo** for free historical weather API
+- **yfinance** community for the excellent Yahoo Finance wrapper
+- **Groq** for free Llama 3 inference API
+---
+<div align="center">
+Built with Python · Deployed on Streamlit Cloud · Data from CFTC, FRED, EIA, USDA, GDELT
+**[⭐ Star this repo](https://github.com/Yashvardhansharma112/commodisense)** if you find it useful
+</div>

dashboard/app.py ADDED

	@@ -0,0 +1,1077 @@

+"""
+CommodiSense Dashboard — Global Commodity Intelligence Engine
+Dark luxury financial terminal UI.
+Run: streamlit run dashboard/app.py
+Deploy: Streamlit Cloud → main file: dashboard/app.py → secret: GROQ_API_KEY
+"""
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+import pandas as pd
+import plotly.graph_objects as go
+import streamlit as st
+ROOT = Path(__file__).parent.parent
+sys.path.insert(0, str(ROOT))
+from data.db import get_conn, init_schema
+from model.explainer import load_latest_reports, generate_report
+from model.predictor import predict, SYMBOL_NAMES
+# ── page config ────────────────────────────────────────────────────────────────
+st.set_page_config(
+    page_title="CommodiSense",
+    page_icon="◈",
+    layout="wide",
+    initial_sidebar_state="collapsed",
+)
+# ── design tokens ──────────────────────────────────────────────────────────────
+C = {
+    "bg":          "#060A0F",
+    "surface":     "#0D1117",
+    "surface2":    "#161B22",
+    "border":      "rgba(255,255,255,0.07)",
+    "border_hi":   "rgba(255,255,255,0.14)",
+    "up":          "#00D97E",
+    "down":        "#FF3B55",
+    "stable":      "#7A8899",
+    "up_dim":      "rgba(0,217,126,0.12)",
+    "down_dim":    "rgba(255,59,85,0.12)",
+    "stable_dim":  "rgba(122,136,153,0.10)",
+    "accent":      "#3D7FFF",
+    "accent_dim":  "rgba(61,127,255,0.12)",
+    "gold":        "#FFBB00",
+    "text":        "#E6EDF3",
+    "text2":       "#8B949E",
+    "text3":       "#484F58",
+    "conf_high":   "#00D97E",
+    "conf_mid":    "#FFBB00",
+    "conf_low":    "#7A8899",
+}
+DIR_COLOR  = {"UP": C["up"],   "DOWN": C["down"],   "STABLE": C["stable"]}
+DIR_DIM    = {"UP": C["up_dim"],"DOWN": C["down_dim"],"STABLE": C["stable_dim"]}
+DIR_ICON   = {"UP": "▲",       "DOWN": "▼",          "STABLE": "◆"}
+CONF_COLOR = {"HIGH": C["conf_high"], "MEDIUM": C["conf_mid"], "LOW": C["conf_low"]}
+ALL_SYMBOLS = list(SYMBOL_NAMES.keys())
+# ── CSS ────────────────────────────────────────────────────────────────────────
+def _inject_css():
+    st.markdown(f"""
+    <style>
+    @import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&family=JetBrains+Mono:wght@400;500&display=swap');
+    html, body, [class*="css"] {{
+        font-family: 'Inter', -apple-system, sans-serif;
+        background-color: {C['bg']};
+        color: {C['text']};
+    }}
+    .stApp {{ background-color: {C['bg']}; }}
+    .block-container {{ padding: 1.2rem 2rem 3rem 2rem; max-width: 1600px; }}
+    /* Hide default Streamlit chrome */
+    #MainMenu, footer, header {{ visibility: hidden; }}
+    .stDeployButton {{ display: none; }}
+    [data-testid="stSidebar"] {{ background: {C['surface']}; border-right: 1px solid {C['border']}; }}
+    /* Scrollbar */
+    ::-webkit-scrollbar {{ width: 4px; height: 4px; }}
+    ::-webkit-scrollbar-track {{ background: {C['bg']}; }}
+    ::-webkit-scrollbar-thumb {{ background: {C['border_hi']}; border-radius: 2px; }}
+    /* Buttons */
+    .stButton > button {{
+        background: transparent;
+        border: 1px solid {C['border_hi']};
+        color: {C['text2']};
+        border-radius: 6px;
+        font-size: 0.78rem;
+        padding: 4px 10px;
+        transition: all 0.15s ease;
+        font-family: 'Inter', sans-serif;
+    }}
+    .stButton > button:hover {{
+        border-color: {C['accent']};
+        color: {C['accent']};
+        background: {C['accent_dim']};
+    }}
+    /* Metric cards */
+    div[data-testid="metric-container"] {{
+        background: {C['surface']};
+        border: 1px solid {C['border']};
+        border-radius: 10px;
+        padding: 14px 16px;
+    }}
+    div[data-testid="metric-container"] label {{
+        color: {C['text2']} !important;
+        font-size: 0.72rem !important;
+        letter-spacing: 0.06em;
+        text-transform: uppercase;
+    }}
+    div[data-testid="metric-container"] [data-testid="stMetricValue"] {{
+        color: {C['text']} !important;
+        font-size: 1.3rem !important;
+        font-weight: 600;
+        font-family: 'JetBrains Mono', monospace;
+    }}
+    /* Radio + select */
+    .stRadio > div {{ gap: 8px; }}
+    .stRadio label {{ font-size: 0.8rem; color: {C['text2']}; }}
+    .stSelectbox label {{ color: {C['text2']}; font-size: 0.8rem; }}
+    /* Tabs */
+    .stTabs [data-baseweb="tab-list"] {{
+        gap: 4px;
+        background: transparent;
+        border-bottom: 1px solid {C['border']};
+    }}
+    .stTabs [data-baseweb="tab"] {{
+        background: transparent;
+        border: none;
+        color: {C['text2']};
+        font-size: 0.82rem;
+        padding: 6px 14px;
+        border-radius: 6px 6px 0 0;
+    }}
+    .stTabs [aria-selected="true"] {{
+        background: {C['surface']} !important;
+        color: {C['text']} !important;
+        border-bottom: 2px solid {C['accent']};
+    }}
+    /* Ticker animation */
+    @keyframes ticker-scroll {{
+        0%   {{ transform: translateX(0); }}
+        100% {{ transform: translateX(-50%); }}
+    }}
+    .ticker-wrap {{
+        overflow: hidden;
+        background: {C['surface']};
+        border-top: 1px solid {C['border']};
+        border-bottom: 1px solid {C['border']};
+        padding: 8px 0;
+        margin: -1rem -2rem 1.4rem -2rem;
+    }}
+    .ticker-inner {{
+        display: flex;
+        animation: ticker-scroll 40s linear infinite;
+        width: max-content;
+    }}
+    .ticker-item {{
+        display: inline-flex;
+        align-items: center;
+        gap: 6px;
+        padding: 0 28px;
+        white-space: nowrap;
+        font-family: 'JetBrains Mono', monospace;
+        font-size: 0.78rem;
+        border-right: 1px solid {C['border']};
+    }}
+    .ticker-sep {{
+        padding: 0 28px;
+        color: {C['text3']};
+        font-size: 0.6rem;
+        border-right: 1px solid {C['border']};
+    }}
+    /* Commodity cards */
+    .comm-card {{
+        background: {C['surface']};
+        border: 1px solid {C['border']};
+        border-radius: 12px;
+        padding: 16px;
+        cursor: pointer;
+        transition: all 0.18s ease;
+        height: 100%;
+        position: relative;
+        overflow: hidden;
+    }}
+    .comm-card::before {{
+        content: '';
+        position: absolute;
+        top: 0; left: 0;
+        width: 3px; height: 100%;
+        border-radius: 12px 0 0 12px;
+    }}
+    .comm-card:hover {{
+        border-color: {C['border_hi']};
+        transform: translateY(-1px);
+        box-shadow: 0 8px 24px rgba(0,0,0,0.4);
+    }}
+    .comm-card.active {{
+        border-color: {C['accent']} !important;
+        background: linear-gradient(135deg, {C['surface']} 0%, rgba(61,127,255,0.05) 100%);
+    }}
+    .comm-card.up::before {{ background: {C['up']}; }}
+    .comm-card.down::before {{ background: {C['down']}; }}
+    .comm-card.stable::before {{ background: {C['stable']}; }}
+    /* Signal pill */
+    .signal-pill {{
+        display: inline-block;
+        padding: 2px 8px;
+        border-radius: 20px;
+        font-size: 0.68rem;
+        font-weight: 600;
+        letter-spacing: 0.04em;
+        text-transform: uppercase;
+    }}
+    /* Macro bar */
+    .macro-item {{
+        text-align: center;
+        padding: 10px 16px;
+        background: {C['surface']};
+        border: 1px solid {C['border']};
+        border-radius: 8px;
+    }}
+    .macro-label {{ font-size: 0.65rem; color: {C['text3']}; letter-spacing: 0.08em; text-transform: uppercase; margin-bottom: 3px; }}
+    .macro-value {{ font-size: 1.05rem; font-weight: 600; font-family: 'JetBrains Mono', monospace; color: {C['text']}; }}
+    .macro-change {{ font-size: 0.68rem; margin-top: 2px; }}
+    /* AI report */
+    .ai-report {{
+        background: linear-gradient(135deg, {C['surface2']} 0%, rgba(61,127,255,0.04) 100%);
+        border: 1px solid {C['border']};
+        border-left: 3px solid {C['accent']};
+        border-radius: 10px;
+        padding: 16px 20px;
+        line-height: 1.7;
+        font-size: 0.9rem;
+        color: {C['text']};
+    }}
+    /* News row */
+    .news-row {{
+        padding: 10px 0;
+        border-bottom: 1px solid {C['border']};
+        display: flex;
+        align-items: flex-start;
+        gap: 12px;
+    }}
+    /* COT bar */
+    .cot-label {{ font-size: 0.7rem; color: {C['text2']}; margin-bottom: 4px; }}
+    .cot-bar-wrap {{
+        height: 6px;
+        background: {C['surface2']};
+        border-radius: 3px;
+        overflow: hidden;
+        margin-bottom: 10px;
+    }}
+    /* Section header */
+    .section-header {{
+        display: flex;
+        align-items: center;
+        gap: 10px;
+        margin-bottom: 12px;
+        padding-bottom: 8px;
+        border-bottom: 1px solid {C['border']};
+    }}
+    .section-title {{
+        font-size: 0.7rem;
+        font-weight: 600;
+        letter-spacing: 0.12em;
+        text-transform: uppercase;
+        color: {C['text2']};
+    }}
+    .section-dot {{ width: 6px; height: 6px; border-radius: 50%; background: {C['accent']}; }}
+    /* Confidence badge */
+    .conf-badge {{
+        display: inline-flex;
+        align-items: center;
+        gap: 5px;
+        padding: 4px 10px;
+        border-radius: 20px;
+        font-size: 0.72rem;
+        font-weight: 600;
+        letter-spacing: 0.05em;
+    }}
+    </style>
+    """, unsafe_allow_html=True)
+# ── data loaders ───────────────────────────────────────────────────────────────
+@st.cache_resource
+def _ensure_schema():
+    init_schema()
+@st.cache_data(ttl=3600)
+def _load_forecast(symbol: str) -> dict:
+    return predict(symbol)
+@st.cache_data(ttl=3600)
+def _load_all_forecasts(symbols: tuple) -> dict:
+    return {s: _load_forecast(s) for s in symbols}
+@st.cache_data(ttl=3600)
+def _load_price_history(symbol: str, days: int = 90) -> pd.DataFrame:
+    conn = get_conn()
+    try:
+        df = conn.execute(
+            "SELECT date, open, high, low, close FROM prices "
+            "WHERE symbol = ? AND date >= ? ORDER BY date",
+            [symbol, (date.today() - timedelta(days=days)).isoformat()],
+        ).df()
+    finally:
+        conn.close()  # release the connection even if the query raises
+    return df
+@st.cache_data(ttl=3600)
+def _load_sentiment_history(symbol: str, days: int = 60) -> pd.DataFrame:
+    conn = get_conn()
+    try:
+        df = conn.execute(
+            "SELECT date, sentiment_score, article_count FROM sentiment_daily "
+            "WHERE commodity = ? AND date >= ? ORDER BY date",
+            [symbol, (date.today() - timedelta(days=days)).isoformat()],
+        ).df()
+    finally:
+        conn.close()  # release the connection even if the query raises
+    return df
+@st.cache_data(ttl=3600)
+def _load_cot_history(symbol: str, weeks: int = 104) -> pd.DataFrame:
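+    conn = get_conn()
+    try:
+        df = conn.execute(
+            "SELECT date, commercial_net_pct, mm_net_pct, open_interest "
+            "FROM cot_data WHERE symbol = ? ORDER BY date DESC LIMIT ?",
+            [symbol, weeks],
+        ).df()
+    finally:
+        conn.close()  # release the connection even if the query raises
+    # The query returns the newest rows first; re-sort ascending for charting.
+    return df.sort_values("date").reset_index(drop=True) if not df.empty else df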
+@st.cache_data(ttl=3600)
+def _load_macro_env() -> dict:
+    conn = get_conn()
+    try:
+        row = conn.execute(
+            "SELECT dxy, vix, treasury_10y, fedfunds, financial_stress, copper_basis "
+            "FROM fred_data WHERE dxy IS NOT NULL ORDER BY date DESC LIMIT 1"
+        ).fetchone()
+    except Exception:
+        row = None
+    conn.close()
+    if row:
+        return {"dxy": row[0], "vix": row[1], "t10y": row[2],
+                "fedfunds": row[3], "stress": row[4], "copper_basis": row[5]}
+    return {}
+@st.cache_data(ttl=3600)
+def _load_recent_news(symbol: str, limit: int = 15) -> pd.DataFrame:
+    conn = get_conn()
+    try:
+        df = conn.execute(
+            "SELECT published_date, title, url, sentiment_score FROM news_raw "
+            "WHERE commodity_tags LIKE ? ORDER BY published_date DESC LIMIT ?",
+            [f"%{symbol}%", limit],
+        ).df()
+    finally:
+        conn.close()  # release the connection even if the query raises
+    return df
+@st.cache_data(ttl=3600)
+def _load_weather(symbol: str) -> dict:
+    from signals.weather_features import get_weather_features
+    return get_weather_features(symbol, days=30)
+@st.cache_data(ttl=3600)
+def _load_eia_history(series: str, weeks: int = 52) -> pd.DataFrame:
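+    conn = get_conn()
+    try:
+        df = conn.execute(
+            "SELECT date, value, chg_1w, vs_5yr_avg FROM eia_inventory "
+            "WHERE series = ? ORDER BY date DESC LIMIT ?",
+            [series, weeks],
+        ).df()
+    finally:
+        conn.close()  # release the connection even if the query raises
+    # The query returns the newest rows first; re-sort ascending for charting.
+    return df.sort_values("date").reset_index(drop=True) if not df.empty else df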
+# ── header ─────────────────────────────────────────────────────────────────────
+def _render_header():
+    now = datetime.utcnow()  # the header labels this timestamp "UTC", so use UTC rather than local time
+    st.markdown(f"""
+    <div style="display:flex;align-items:center;justify-content:space-between;
+                padding:16px 0 12px 0;border-bottom:1px solid {C['border']};margin-bottom:0;">
+      <div style="display:flex;align-items:center;gap:14px;">
+        <div style="font-size:1.6rem;font-weight:700;letter-spacing:-0.02em;
+                    background:linear-gradient(135deg,{C['text']} 0%,{C['accent']} 100%);
+                    -webkit-background-clip:text;-webkit-text-fill-color:transparent;">
+          ◈ CommodiSense
+        </div>
+        <div style="display:flex;align-items:center;gap:5px;
+                    background:{C['surface']};border:1px solid {C['border']};
+                    border-radius:20px;padding:3px 10px;">
+          <div style="width:6px;height:6px;border-radius:50%;background:{C['up']};
+                      box-shadow:0 0 6px {C['up']};animation:pulse 2s infinite;"></div>
+          <span style="font-size:0.68rem;color:{C['up']};font-weight:600;letter-spacing:0.06em;">LIVE</span>
+        </div>
+      </div>
+      <div style="text-align:right;">
+        <div style="font-size:0.7rem;color:{C['text3']};letter-spacing:0.06em;text-transform:uppercase;">
+          Global Commodity Intelligence
+        </div>
+        <div style="font-size:0.78rem;color:{C['text2']};font-family:'JetBrains Mono',monospace;">
+          {now.strftime('%a %d %b %Y  %H:%M')} UTC
+        </div>
+      </div>
+    </div>
+    <style>
+    @keyframes pulse {{
+      0%,100% {{ opacity:1; }} 50% {{ opacity:0.4; }}
+    }}
+    </style>
+    """, unsafe_allow_html=True)
+# ── ticker strip ───────────────────────────────────────────────────────────────
+def _render_ticker(forecasts: dict, horizon_key: str):
+    fk = "forecast_7d" if horizon_key == "7d" else "forecast_30d"
+    items_html = ""
+    for sym in ALL_SYMBOLS:
+        fc = forecasts.get(sym, {})
+        if "error" in fc or not fc:
+            continue
+        f     = fc.get(fk, {})
+        price = fc.get("current_price", 0)
+        dir_  = f.get("direction", "STABLE")
+        prob  = f.get("probability", 0)
+        icon  = DIR_ICON.get(dir_, "◆")
+        col   = DIR_COLOR.get(dir_, C["stable"])
+        name  = SYMBOL_NAMES.get(sym, sym).upper()
+        items_html += f"""
+        <div class="ticker-item">
+          <span style="color:{C['text3']};font-size:0.65rem;">{sym}</span>
+          <span style="color:{C['text']};font-weight:500;">{name}</span>
+          <span style="color:{C['text2']};">${price:,.2f}</span>
+          <span style="color:{col};font-weight:600;">{icon} {prob:.0%}</span>
+        </div>"""
+    # Render the items twice so the translateX(-50%) animation loops seamlessly
+    st.markdown(f"""
+    <div class="ticker-wrap">
+      <div class="ticker-inner">{items_html}{items_html}</div>
+    </div>
+    """, unsafe_allow_html=True)
+# ── macro environment bar ──────────────────────────────────────────────────────
+def _render_macro_bar():
+    macro = _load_macro_env()
+    if not macro:
+        return
+    def _change_html(val, neutral=0, invert=False, fmt=".2f", suffix=""):
+        if val is None:
+            return ""
+        diff = val - neutral
+        if invert:
+            diff = -diff
+        col = C["up"] if diff > 0 else (C["down"] if diff < 0 else C["stable"])
+        sign = "+" if diff > 0 else ""
+        return f'<span style="color:{col}">{sign}{diff:{fmt}}{suffix}</span>'
+    vix = macro.get("vix") or 0
+    vix_regime = "HIGH FEAR" if vix > 30 else ("CAUTION" if vix > 20 else "CALM")
+    vix_col = C["down"] if vix > 30 else (C["gold"] if vix > 20 else C["up"])
+    dxy = macro.get("dxy") or 0
+    t10y = macro.get("t10y") or 0
+    ff = macro.get("fedfunds") or 0
+    yield_inv = t10y < ff
+    spread = t10y - ff
+    st.markdown(f"""
+    <div style="display:grid;grid-template-columns:repeat(6,1fr);gap:8px;margin-bottom:20px;">
+      <div class="macro-item">
+        <div class="macro-label">USD Index (DXY)</div>
+        <div class="macro-value">{dxy:.1f}</div>
+        <div class="macro-change" style="color:{C['text3']}">Broad USD Strength</div>
+      </div>
+      <div class="macro-item">
+        <div class="macro-label">VIX Volatility</div>
+        <div class="macro-value" style="color:{vix_col}">{vix:.1f}</div>
+        <div class="macro-change" style="color:{vix_col}">{vix_regime}</div>
+      </div>
+      <div class="macro-item">
+        <div class="macro-label">10Y Treasury</div>
+        <div class="macro-value">{t10y:.2f}%</div>
+        <div class="macro-change" style="color:{C['text3']}">US Yield</div>
+      </div>
+      <div class="macro-item">
+        <div class="macro-label">Fed Funds</div>
+        <div class="macro-value">{ff:.2f}%</div>
+        <div class="macro-change" style="color:{C['text3']}">Policy Rate</div>
+      </div>
+      <div class="macro-item">
+        <div class="macro-label">Yield Spread</div>
+        <div class="macro-value" style="color:{C['down'] if yield_inv else C['up']}">{spread:+.2f}%</div>
+        <div class="macro-change" style="color:{C['down'] if yield_inv else C['text3']}">
+          {'⚠ INVERTED' if yield_inv else 'Normal'}
+        </div>
+      </div>
+      <div class="macro-item">
+        <div class="macro-label">Copper 3M Trend</div>
+        <div class="macro-value" style="color:{C['up'] if (macro.get('copper_basis') or 0) > 0 else C['down']}">
+          {(macro.get('copper_basis') or 0):+.1f}%
+        </div>
+        <div class="macro-change" style="color:{C['text3']}">Industrial Demand</div>
+      </div>
+    </div>
+    """, unsafe_allow_html=True)
+# ── commodity grid ─────────────────────────────────────────────────────────────
+def _render_commodity_grid(forecasts: dict, horizon_key: str, active_sym: str) -> str | None:
+    fk = "forecast_7d" if horizon_key == "7d" else "forecast_30d"
+    st.markdown(f"""
+    <div class="section-header">
+      <div class="section-dot"></div>
+      <div class="section-title">Market Overview — {horizon_key.upper()} Forecast</div>
+    </div>
+    """, unsafe_allow_html=True)
+    clicked = None
+    rows = [ALL_SYMBOLS[i:i+5] for i in range(0, len(ALL_SYMBOLS), 5)]
+    for row_syms in rows:
+        cols = st.columns(len(row_syms))
+        for col, sym in zip(cols, row_syms):
+            fc = forecasts.get(sym, {})
+            f  = fc.get(fk, {}) if fc and "error" not in fc else {}
+            dir_  = f.get("direction", "STABLE")
+            conf  = f.get("confidence", "LOW")
+            prob  = f.get("probability", 0.5)
+            price = fc.get("current_price", 0) if fc else 0
+            name  = SYMBOL_NAMES.get(sym, sym)
+            icon  = DIR_ICON.get(dir_, "◆")
+            dcol  = DIR_COLOR.get(dir_, C["stable"])
+            ddim  = DIR_DIM.get(dir_, C["stable_dim"])
+            ccol  = CONF_COLOR.get(conf, C["conf_low"])
+            is_active = sym == active_sym
+            warn  = fc.get("forecast_7d", {}).get("model_warning") if fc else None
+            with col:
+                st.markdown(f"""
+                <div class="comm-card {dir_.lower()} {'active' if is_active else ''}"
+                     style="background:linear-gradient(145deg,{C['surface']} 0%,{ddim} 100%);">
+                  <div style="display:flex;justify-content:space-between;align-items:flex-start;margin-bottom:8px;">
+                    <div>
+                      <div style="font-size:0.65rem;color:{C['text3']};letter-spacing:0.08em;font-family:'JetBrains Mono',monospace;">{sym}</div>
+                      <div style="font-size:0.88rem;font-weight:600;color:{C['text']};margin-top:1px;">{name}</div>
+                    </div>
+                    <div style="background:{ccol}22;border:1px solid {ccol}44;border-radius:4px;
+                                padding:2px 6px;font-size:0.6rem;font-weight:700;color:{ccol};
+                                letter-spacing:0.06em;">{conf}</div>
+                  </div>
+                  <div style="font-size:1.05rem;font-weight:600;color:{C['text']};
+                              font-family:'JetBrains Mono',monospace;margin-bottom:6px;">
+                    ${price:,.2f}
+                  </div>
+                  <div style="display:flex;align-items:center;gap:6px;">
+                    <span style="font-size:1.5rem;color:{dcol};font-weight:700;line-height:1;">{icon}</span>
+                    <div>
+                      <div style="font-size:0.82rem;color:{dcol};font-weight:600;">{dir_}</div>
+                      <div style="font-size:0.65rem;color:{C['text3']};">{prob:.0%} probability</div>
+                    </div>
+                  </div>
+                  {'<div style="margin-top:6px;font-size:0.62rem;color:' + C["gold"] + ';background:' + C["gold"] + '15;border-radius:3px;padding:2px 5px;">⚠ Use 30d model</div>' if warn and horizon_key == "7d" else ''}
+                </div>
+                """, unsafe_allow_html=True)
+                if st.button("Analyze →", key=f"btn_{sym}", use_container_width=True):
+                    clicked = sym
+    return clicked
+# ── deep dive ──────────────────────────────────────────────────────────────────
+def _price_chart(symbol: str, days: int, fc: dict, horizon_key: str):
+    df = _load_price_history(symbol, days)
+    if df.empty:
+        st.info("No price history — run the price collector.")
+        return
+    fk    = "forecast_7d" if horizon_key == "7d" else "forecast_30d"
+    fcast = fc.get(fk, {})
+    dir_  = fcast.get("direction", "STABLE")
+    dcol  = DIR_COLOR.get(dir_, C["stable"])
+    low   = fcast.get("price_range_low")
+    high  = fcast.get("price_range_high")
+    fig = go.Figure()
+    fig.add_trace(go.Candlestick(
+        x=df["date"], open=df["open"], high=df["high"],
+        low=df["low"], close=df["close"], name="Price",
+        increasing=dict(line=dict(color=C["up"], width=1), fillcolor=C["up_dim"]),
+        decreasing=dict(line=dict(color=C["down"], width=1), fillcolor=C["down_dim"]),
+    ))
+    # 20-day SMA
+    df["sma20"] = df["close"].rolling(20, min_periods=1).mean()
+    fig.add_trace(go.Scatter(
+        x=df["date"], y=df["sma20"], mode="lines",
+        line=dict(color=C["accent"], width=1.2, dash="dot"),
+        name="SMA 20", opacity=0.6,
+    ))
+    # Forecast zone
+    if low is not None and high is not None:  # df is known to be non-empty here
+        last_date = pd.to_datetime(df["date"].max())
+        fwd = last_date + timedelta(days=7 if horizon_key == "7d" else 30)
+        fig.add_shape(type="rect",
+            x0=str(last_date.date()), x1=str(fwd.date()),
+            y0=low, y1=high,
+            fillcolor=dcol, opacity=0.10,
+            line=dict(color=dcol, width=1, dash="dot"),
+        )
+        fig.add_annotation(
+            x=str(fwd.date()), y=(low + high) / 2,
+            text=f"  {DIR_ICON.get(dir_,'')} {dir_} {fcast.get('probability',0):.0%}",
+            showarrow=False, font=dict(color=dcol, size=11, family="JetBrains Mono"),
+            bgcolor=C["surface2"], bordercolor=dcol,
+        )
+    fig.update_layout(
+        template="plotly_dark",
+        paper_bgcolor=C["bg"], plot_bgcolor=C["bg"],
+        xaxis_rangeslider_visible=False,
+        height=360,
+        margin=dict(l=0, r=0, t=8, b=0),
+        legend=dict(orientation="h", x=0, y=1.06, font=dict(size=10, color=C["text2"])),
+        xaxis=dict(gridcolor=C["surface2"], showgrid=True),
+        yaxis=dict(gridcolor=C["surface2"], showgrid=True),
+        font=dict(family="Inter", color=C["text2"]),
+    )
+    st.plotly_chart(fig, use_container_width=True)
+def _shap_chart(fc: dict):
+    signals = fc.get("top_signals", [])
+    if not signals:
+        st.caption("No SHAP signals — retrain models to enable.")
+        return
+    labels  = [s.get("label", s.get("feature", ""))[:32] for s in signals]
+    weights = [s["weight"] if s["impact"] == "BULLISH" else -s["weight"] for s in signals]
+    colors  = [C["up"] if w > 0 else C["down"] for w in weights]
+    fig = go.Figure(go.Bar(
+        x=weights, y=labels, orientation="h",
+        marker=dict(color=colors, opacity=0.85),
+        text=[f"{'▲' if w>0 else '▼'} {abs(w):.3f}" for w in weights],
+        textposition="outside", textfont=dict(size=10, family="JetBrains Mono", color=C["text2"]),
+    ))
+    fig.update_layout(
+        template="plotly_dark",
+        paper_bgcolor=C["bg"], plot_bgcolor=C["bg"],
+        title=dict(text="Top Signal Drivers (SHAP)", font=dict(size=11, color=C["text2"])),
+        xaxis=dict(gridcolor=C["surface2"], zeroline=True, zerolinecolor=C["border_hi"],
+                   showticklabels=False),
+        yaxis=dict(gridcolor="rgba(0,0,0,0)"),  # fully transparent grid lines
+        height=260, margin=dict(l=0, r=40, t=32, b=0),
+        showlegend=False,
+    )
+    st.plotly_chart(fig, use_container_width=True)
+def _cot_chart(symbol: str):
+    df = _load_cot_history(symbol)
+    if df.empty:
+        st.caption("No COT data for this symbol.")
+        return
+    fig = go.Figure()
+    fig.add_trace(go.Scatter(
+        x=df["date"], y=df["commercial_net_pct"] * 100,
+        mode="lines", fill="tozeroy",
+        line=dict(color=C["up"], width=1.5),
+        fillcolor="rgba(0,217,126,0.08)",
+        name="Commercial (Smart $)",
+    ))
+    fig.add_trace(go.Scatter(
+        x=df["date"], y=df["mm_net_pct"] * 100,
+        mode="lines", fill="tozeroy",
+        line=dict(color=C["accent"], width=1.5),
+        fillcolor="rgba(61,127,255,0.08)",
+        name="Managed Money",
+    ))
+    fig.add_hline(y=0, line_dash="dot", line_color=C["border_hi"], line_width=1)
+    fig.update_layout(
+        template="plotly_dark",
+        paper_bgcolor=C["bg"], plot_bgcolor=C["bg"],
+        title=dict(text="COT Positioning — % of Open Interest", font=dict(size=11, color=C["text2"])),
+        xaxis=dict(gridcolor=C["surface2"]),
+        yaxis=dict(gridcolor=C["surface2"], ticksuffix="%"),
+        height=220, margin=dict(l=0, r=0, t=32, b=0),
+        legend=dict(orientation="h", x=0, y=1.12, font=dict(size=10)),
+        font=dict(family="Inter", color=C["text2"]),
+    )
+    st.plotly_chart(fig, use_container_width=True)
+def _sentiment_chart(symbol: str):
+    df = _load_sentiment_history(symbol)
+    if df.empty:
+        st.caption("No sentiment data — run the NLP processor.")
+        return
+    colors = [C["up"] if float(s) > 0.1 else (C["down"] if float(s) < -0.1 else C["stable"])
+              for s in df["sentiment_score"].fillna(0)]
+    fig = go.Figure()
+    fig.add_hrect(y0=0.1, y1=1, fillcolor=C["up_dim"], opacity=1, line_width=0)
+    fig.add_hrect(y0=-1, y1=-0.1, fillcolor=C["down_dim"], opacity=1, line_width=0)
+    fig.add_trace(go.Scatter(
+        x=df["date"], y=df["sentiment_score"],
+        mode="lines+markers",
+        line=dict(color=C["text2"], width=1.5),
+        marker=dict(color=colors, size=5),
+        fill="tozeroy", fillcolor="rgba(139,148,158,0.06)",
+        name="Sentiment",
+    ))
+    fig.add_hline(y=0, line_dash="solid", line_color=C["border_hi"], line_width=1)
+    fig.update_layout(
+        template="plotly_dark",
+        paper_bgcolor=C["bg"], plot_bgcolor=C["bg"],
+        title=dict(text="News Sentiment (60-day)", font=dict(size=11, color=C["text2"])),
+        yaxis=dict(range=[-1, 1], gridcolor=C["surface2"], tickformat=".1f"),
+        xaxis=dict(gridcolor=C["surface2"]),
+        height=200, margin=dict(l=0, r=0, t=32, b=0),
+        showlegend=False,
+    )
+    st.plotly_chart(fig, use_container_width=True)
+def _eia_chart(symbol: str):
+    series = {"CL=F": "crude_stocks", "NG=F": "natgas_storage"}.get(symbol)
+    if not series:
+        return
+    df = _load_eia_history(series)
+    if df.empty:
+        return
+    label = "Crude Oil Stocks (Mbbls)" if symbol == "CL=F" else "Natural Gas Storage (Bcf)"
+    div = 1000 if symbol == "CL=F" else 1
+    fig = go.Figure()
+    fig.add_trace(go.Bar(
+        x=df["date"], y=df["value"] / div,
+        name=label,
+        marker=dict(
+            color=[C["down_dim"] if (v or 0) > 0 else C["up_dim"] for v in df.get("chg_1w", [])],
+            line=dict(width=0),
+        ),
+        opacity=0.8,
+    ))
+    fig.add_trace(go.Scatter(
+        x=df["date"], y=(df["value"] / div).rolling(4).mean(),
+        mode="lines", line=dict(color=C["accent"], width=1.5, dash="dot"),
+        name="4-wk avg",
+    ))
+    fig.update_layout(
+        template="plotly_dark",
+        paper_bgcolor=C["bg"], plot_bgcolor=C["bg"],
+        title=dict(text=label, font=dict(size=11, color=C["text2"])),
+        height=200, margin=dict(l=0, r=0, t=32, b=0),
+        legend=dict(orientation="h", x=0, y=1.15, font=dict(size=10)),
+        xaxis=dict(gridcolor=C["surface2"]),
+        yaxis=dict(gridcolor=C["surface2"]),
+    )
+    st.plotly_chart(fig, use_container_width=True)
+def _render_deep_dive(symbol: str, days: int, horizon_key: str):
+    fc   = _load_forecast(symbol)
+    name = SYMBOL_NAMES.get(symbol, symbol)
+    if "error" in fc:
+        st.warning(f"No forecast for {name} — run `python model/trainer.py --symbol {symbol}`")
+        return
+    fk     = "forecast_7d" if horizon_key == "7d" else "forecast_30d"
+    fcast  = fc.get(fk, {})
+    dir_   = fcast.get("direction", "STABLE")
+    prob   = fcast.get("probability", 0.5)
+    conf   = fcast.get("confidence", "LOW")
+    price  = fc.get("current_price", 0)
+    dcol   = DIR_COLOR.get(dir_, C["stable"])
+    ddim   = DIR_DIM.get(dir_, C["stable_dim"])
+    icon   = DIR_ICON.get(dir_, "◆")
+    ccol   = CONF_COLOR.get(conf, C["conf_low"])
+    warn   = fcast.get("model_warning")
+    # Breadcrumb + headline
+    st.markdown(f"""
+    <div style="display:flex;align-items:center;gap:8px;margin-bottom:16px;">
+      <div style="font-size:0.7rem;color:{C['text3']};letter-spacing:0.08em;">ANALYSIS</div>
+      <div style="font-size:0.7rem;color:{C['text3']};">›</div>
+      <div style="font-size:0.85rem;font-weight:600;color:{C['text']};">{name}</div>
+      <div style="font-size:0.65rem;color:{C['text3']};font-family:'JetBrains Mono',monospace;">{symbol}</div>
+    </div>
+    <div style="display:flex;align-items:center;gap:16px;padding:18px 20px;
+                background:linear-gradient(135deg,{C['surface']} 0%,{ddim} 100%);
+                border:1px solid {dcol}44;border-radius:12px;margin-bottom:16px;">
+      <div style="font-size:3rem;color:{dcol};line-height:1;">{icon}</div>
+      <div>
+        <div style="font-size:1.9rem;font-weight:700;color:{C['text']};font-family:'JetBrains Mono',monospace;">
+          ${price:,.2f}
+        </div>
+        <div style="display:flex;align-items:center;gap:8px;margin-top:4px;">
+          <span style="font-size:1.1rem;font-weight:700;color:{dcol};">{dir_}</span>
+          <span style="background:{ccol}22;border:1px solid {ccol}55;color:{ccol};
+                       font-size:0.72rem;font-weight:700;padding:3px 8px;border-radius:20px;">
+            {conf} CONF
+          </span>
+          <span style="font-size:0.85rem;color:{C['text2']};">{prob:.1%} probability · {horizon_key.upper()}</span>
+        </div>
+        {f'<div style="margin-top:6px;font-size:0.72rem;color:{C["gold"]};background:{C["gold"]}18;padding:4px 10px;border-radius:6px;display:inline-block;">⚠ {warn}</div>' if warn else ''}
+      </div>
+      {f'''<div style="margin-left:auto;text-align:right;">
+        <div style="font-size:0.65rem;color:{C["text3"]};text-transform:uppercase;letter-spacing:0.1em;">Price Target Range</div>
+        <div style="font-size:1.1rem;font-weight:600;color:{C["text"]};font-family:'JetBrains Mono',monospace;">
+          ${fcast.get("price_range_low",0):,.0f} – ${fcast.get("price_range_high",0):,.0f}
+        </div>
+      </div>''' if fcast.get("price_range_low") else ''}
+    </div>
+    """, unsafe_allow_html=True)
+    # Main layout: chart | signals
+    chart_col, signal_col = st.columns([3, 2])
+    with chart_col:
+        st.markdown('<div class="section-header"><div class="section-dot"></div><div class="section-title">Price Chart</div></div>', unsafe_allow_html=True)
+        _price_chart(symbol, days, fc, horizon_key)
+    with signal_col:
+        st.markdown('<div class="section-header"><div class="section-dot"></div><div class="section-title">Signal Drivers</div></div>', unsafe_allow_html=True)
+        _shap_chart(fc)
+        # Both 7d and 30d forecast side by side
+        f7  = fc.get("forecast_7d", {})
+        f30 = fc.get("forecast_30d", {})
+        st.markdown(f"""
+        <div style="display:grid;grid-template-columns:1fr 1fr;gap:8px;margin-top:8px;">
+          <div style="background:{C['surface2']};border:1px solid {C['border']};border-radius:8px;padding:10px;text-align:center;">
+            <div style="font-size:0.6rem;color:{C['text3']};letter-spacing:0.1em;text-transform:uppercase;margin-bottom:4px;">7-Day</div>
+            <div style="font-size:1.1rem;font-weight:700;color:{DIR_COLOR.get(f7.get('direction','STABLE'),C['stable'])};">
+              {DIR_ICON.get(f7.get('direction','STABLE'),'◆')} {f7.get('direction','—')}
+            </div>
+            <div style="font-size:0.7rem;color:{C['text3']};">{f7.get('probability',0):.0%}</div>
+          </div>
+          <div style="background:{C['surface2']};border:1px solid {C['border']};border-radius:8px;padding:10px;text-align:center;">
+            <div style="font-size:0.6rem;color:{C['text3']};letter-spacing:0.1em;text-transform:uppercase;margin-bottom:4px;">30-Day</div>
+            <div style="font-size:1.1rem;font-weight:700;color:{DIR_COLOR.get(f30.get('direction','STABLE'),C['stable'])};">
+              {DIR_ICON.get(f30.get('direction','STABLE'),'◆')} {f30.get('direction','—')}
+            </div>
+            <div style="font-size:0.7rem;color:{C['text3']};">{f30.get('probability',0):.0%}</div>
+          </div>
+        </div>
+        """, unsafe_allow_html=True)
+    # Tabbed data panels
+    tab_labels = ["COT Positioning", "Sentiment", "EIA Inventory", "Weather", "AI Report"]
+    tabs = st.tabs(tab_labels)
+    with tabs[0]:
+        _cot_chart(symbol)
+    with tabs[1]:
+        _sentiment_chart(symbol)
+    with tabs[2]:
+        _eia_chart(symbol)
+        if symbol not in ("CL=F", "NG=F"):
+            st.caption("EIA inventory data is available for Crude Oil (CL=F) and Natural Gas (NG=F) only.")
+    with tabs[3]:
+        weather = _load_weather(symbol)
+        if weather and (weather.get("drought_index") or 0) > 0:  # tolerate a missing or None index
+            w1, w2, w3 = st.columns(3)
+            w1.metric("Drought Index", f"{weather['drought_index']:.2f}", help="0=normal, 1=extreme drought")
+            w2.metric("Heat Stress Days", weather["heat_stress_days"])
+            w3.metric("Precip Anomaly", f"{weather['precip_anomaly_pct']:+.1f}%")
+        else:
+            st.caption("No weather data available. Weather signals apply to agricultural commodities.")
+    with tabs[4]:
+        reports = load_latest_reports()
+        report_text = reports.get(symbol, "")
+        if not report_text:
+            with st.spinner("Generating AI analysis..."):
+                report_text = generate_report(fc)
+        if report_text:
+            st.markdown(f'<div class="ai-report">🤖&nbsp; <strong>AI Analyst</strong><br><br>{report_text}</div>', unsafe_allow_html=True)
+        else:
+            st.caption("AI report unavailable — set GROQ_API_KEY in your .env file.")
+# ── news feed ──────────────────────────────────────────────────────────────────
+def _render_news(symbol: str):
+    st.markdown(f"""
+    <div class="section-header" style="margin-top:8px;">
+      <div class="section-dot"></div>
+      <div class="section-title">Recent News — {SYMBOL_NAMES.get(symbol, symbol)}</div>
+    </div>
+    """, unsafe_allow_html=True)
+    df = _load_recent_news(symbol)
+    if df.empty:
+        st.caption("No news data — run the news collector.")
+        return
+    for _, row in df.iterrows():
+        raw   = row.get("sentiment_score")
+        score = 0.0 if pd.isna(raw) else float(raw)  # NULL scores arrive as NaN/None
+        scol  = C["up"] if score > 0.1 else (C["down"] if score < -0.1 else C["stable"])
+        sign  = "+" if score > 0 else ""
+        title = str(row.get("title", ""))[:120]
+        url   = str(row.get("url", "#"))
+        pub   = str(row.get("published_date", ""))[:10]
+        st.markdown(f"""
+        <div class="news-row">
+          <div style="min-width:80px;font-size:0.68rem;color:{C['text3']};
+                      font-family:'JetBrains Mono',monospace;padding-top:1px;">{pub}</div>
+          <div style="min-width:42px;text-align:center;">
+            <span style="background:{scol}22;color:{scol};border-radius:4px;
+                         padding:2px 6px;font-size:0.68rem;font-weight:600;
+                         font-family:'JetBrains Mono',monospace;">{sign}{score:.2f}</span>
+          </div>
+          <div style="flex:1;font-size:0.84rem;color:{C['text']};">
+            <a href="{url}" target="_blank" rel="noopener noreferrer"
+               style="color:{C['text']};text-decoration:none;"
+               onmouseover="this.style.color='{C['accent']}'"
+               onmouseout="this.style.color='{C['text']}'">{title}</a>
+          </div>
+        </div>
+        """, unsafe_allow_html=True)
+# ── sidebar controls ───────────────────────────────────────────────────────────
+def _render_sidebar() -> tuple[str, int]:
+    with st.sidebar:
+        st.markdown(f"""
+        <div style="padding:12px 0 16px 0;border-bottom:1px solid {C['border']};margin-bottom:16px;">
+          <div style="font-size:1.1rem;font-weight:700;
+                      background:linear-gradient(135deg,{C['text']} 0%,{C['accent']} 100%);
+                      -webkit-background-clip:text;-webkit-text-fill-color:transparent;">
+            ◈ CommodiSense
+          </div>
+          <div style="font-size:0.65rem;color:{C['text3']};margin-top:3px;letter-spacing:0.06em;">
+            COMMODITY INTELLIGENCE
+          </div>
+        </div>
+        """, unsafe_allow_html=True)
+        horizon = st.radio("Forecast Horizon", ["7d", "30d"], index=0,
+                           format_func=lambda x: "7-Day" if x == "7d" else "30-Day")
+        days = st.slider("Chart History", 30, 365, 90, step=15,
+                         format="%d days")
+        st.markdown("---")
+        if st.button("↺ Refresh Data", use_container_width=True):
+            st.cache_data.clear()
+            st.rerun()
+        st.markdown(f"""
+        <div style="margin-top:16px;">
+          <div style="font-size:0.65rem;color:{C['text3']};letter-spacing:0.08em;
+                      text-transform:uppercase;margin-bottom:10px;">Data Sources</div>
+        """, unsafe_allow_html=True)
+        sources = [
+            ("Prices", "yfinance", "12,613 rows"),
+            ("COT", "CFTC", "8,826 rows"),
+            ("Macro", "FRED", "7,193 rows"),
+            ("EIA", "DOE", "3,134 rows"),
+            ("USDA", "NASS", "1,104 rows"),
+            ("News", "GDELT", "392 articles"),
+            ("Weather", "Open-Meteo", "210 rows"),
+        ]
+        for name, src, count in sources:
+            st.markdown(f"""
+            <div style="display:flex;justify-content:space-between;align-items:center;
+                        padding:5px 0;border-bottom:1px solid {C['border']};">
+              <div style="font-size:0.72rem;color:{C['text2']};font-weight:500;">{name}</div>
+              <div style="text-align:right;">
+                <div style="font-size:0.62rem;color:{C['text3']};font-family:'JetBrains Mono',monospace;">{count}</div>
+              </div>
+            </div>
+            """, unsafe_allow_html=True)
+        st.markdown("</div>", unsafe_allow_html=True)
+        st.markdown(f"""
+        <div style="margin-top:20px;padding:10px;background:{C['surface']};
+                    border:1px solid {C['border']};border-radius:8px;font-size:0.65rem;
+                    color:{C['text3']};line-height:1.6;">
+          <div style="color:{C['text2']};font-weight:600;margin-bottom:4px;">Pipeline</div>
+          GitHub Actions · Mon–Fri 06:00 UTC<br>
+          XGBoost + LightGBM ensemble<br>
+          SHAP explainability · FinBERT NLP
+        </div>
+        """, unsafe_allow_html=True)
+    return horizon, days
+# ── main ───────────────────────────────────────────────────────────────────────
+def main():
+    _ensure_schema()
+    _inject_css()
+    _render_header()
+    horizon, days = _render_sidebar()
+    # Load all forecasts at once
+    forecasts = _load_all_forecasts(tuple(ALL_SYMBOLS))
+    # Ticker strip
+    _render_ticker(forecasts, horizon)
+    # Macro environment
+    _render_macro_bar()
+    # Commodity grid — track active symbol in session state
+    clicked = _render_commodity_grid(forecasts, horizon,
+                                     st.session_state.get("active_sym", ALL_SYMBOLS[0]))
+    if clicked:
+        st.session_state["active_sym"] = clicked
+    active = st.session_state.get("active_sym")
+    if not active:
+        active = ALL_SYMBOLS[0]
+        st.session_state["active_sym"] = active
+    # Divider
+    st.markdown(f'<div style="height:1px;background:linear-gradient(90deg,transparent,{C["border_hi"]},transparent);margin:20px 0;"></div>', unsafe_allow_html=True)
+    # Deep dive
+    _render_deep_dive(active, days, horizon)
+    # News
+    st.markdown(f'<div style="height:1px;background:linear-gradient(90deg,transparent,{C["border_hi"]},transparent);margin:20px 0 16px;"></div>', unsafe_allow_html=True)
+    _render_news(active)
+if __name__ == "__main__":
+    main()
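
The news rows above colour each sentiment score with a ±0.1 neutral band and prefix positive scores with a plus sign. A minimal sketch of that rule as a standalone, testable helper follows; the name `sentiment_badge` and the tone strings are hypothetical and do not appear in `dashboard/app.py`.

```python
def sentiment_badge(score: float, neutral_band: float = 0.1) -> tuple[str, str]:
    """Return (tone, label) for a sentiment score.

    tone is "up", "down" or "stable"; label is the score formatted to two
    decimals with an explicit "+" for positive values.
    """
    if score > neutral_band:
        tone = "up"
    elif score < -neutral_band:
        tone = "down"
    else:
        tone = "stable"
    sign = "+" if score > 0 else ""  # negative numbers already carry "-"
    return tone, f"{sign}{score:.2f}"


print(sentiment_badge(0.25))
print(sentiment_badge(-0.3))
print(sentiment_badge(0.05))
```

Keeping the rule in one function would let the dashboard look up the colour with `C[tone]` while the thresholds stay unit-testable without Streamlit.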

model/__init__.py ADDED Viewed

File without changes

model/explainer.py ADDED Viewed

	@@ -0,0 +1,266 @@

+"""
+Explainer — generates plain-English 3-sentence forecast reports using
+Groq API (llama-3.3-70b-versatile, free tier: 14,400 req/day).
+Falls back to a deterministic template if Groq is unavailable.
+Reports are cached to data/reports/report_{date}.json.
+Usage:
+    python model/explainer.py --symbol ZW=F
+    python model/explainer.py --all
+"""
+import json
+import logging
+import os
+import sys
+from datetime import date
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from model.predictor import predict, predict_all, SYMBOL_NAMES
+log = logging.getLogger(__name__)
+REPORTS_DIR = Path(__file__).parent.parent / "data" / "reports"
+REPORTS_DIR.mkdir(parents=True, exist_ok=True)
+GROQ_MODEL = "llama-3.3-70b-versatile"
+# ── Groq client (lazy) ─────────────────────────────────────────────────────────
+_groq_client = None
+def _get_groq_client():
+    global _groq_client
+    if _groq_client is not None:
+        return _groq_client
+    api_key = os.getenv("GROQ_API_KEY")
+    if not api_key:
+        log.warning("GROQ_API_KEY not set — using template fallback")
+        return None
+    try:
+        from groq import Groq
+        _groq_client = Groq(api_key=api_key)
+        return _groq_client
+    except ImportError:
+        log.warning("groq package not installed — using template fallback")
+        return None
+# ── helpers ────────────────────────────────────────────────────────────────────
+def _format_signals(signals: list[dict]) -> str:
+    """Format top signals as numbered list for the LLM prompt."""
+    lines = []
+    for i, sig in enumerate(signals[:5], 1):
+        label = sig.get("label", sig.get("feature", "unknown"))
+        value = sig.get("value", 0)
+        impact = sig.get("impact", "NEUTRAL")
+        weight = sig.get("weight", 0)
+        lines.append(f"  {i}. {label}: {value:.3g} | Impact: {impact} | Weight: {weight:.3f}")
+    return "\n".join(lines) if lines else "  (no signal data available)"
+def _pick_risk_factor(prediction: dict) -> str:
+    """Return the top bearish signal as the risk factor for the report."""
+    signals = prediction.get("top_signals", [])
+    bearish = [s for s in signals if s.get("impact") == "BEARISH"]
+    if bearish:
+        return bearish[0].get("label", "adverse signal reversal")
+    # Generic risks per commodity type
+    symbol = prediction.get("symbol", "")
+    risk_map = {
+        "CL=F":    "unexpected OPEC output increase",
+        "NG=F":    "warmer-than-expected winter forecast",
+        "GC=F":    "stronger-than-expected US jobs data",
+        "ZW=F":    "favourable Black Sea weather reducing supply fears",
+        "ZC=F":    "USDA upward crop estimate revision",
+        "ZS=F":    "Brazil harvest beating expectations",
+        "CT=F":    "recovery in monsoon rainfall",
+        "SB=F":    "Brazil supply-side recovery",
+        "USDINR=X":"RBI unexpected rate cut",
+        "HG=F":    "China demand slowdown data",
+    }
+    return risk_map.get(symbol, "unexpected policy reversal")
+def _template_report(prediction: dict) -> str:
+    """
+    Deterministic template-based report. Used when Groq is unavailable.
+    No LLM needed — readable and fast.
+    """
+    name     = prediction.get("commodity_name", prediction.get("symbol", "Commodity"))
+    fc7      = prediction.get("forecast_7d", {})
+    fc30     = prediction.get("forecast_30d", {})
+    direction= fc7.get("direction", "STABLE")
+    prob     = fc7.get("probability", 0.5)
+    conf     = fc7.get("confidence", "LOW")
+    dir30    = fc30.get("direction", "STABLE")
+    prob30   = fc30.get("probability", 0.5)
+    signals  = prediction.get("top_signals", [])
+    sig1 = signals[0] if len(signals) > 0 else {}
+    sig2 = signals[1] if len(signals) > 1 else {}
+    s1_label = sig1.get("label", "market momentum")
+    s1_val   = sig1.get("value", 0)
+    s2_label = sig2.get("label", "news sentiment")
+    s2_val   = sig2.get("value", 0)
+    risk     = _pick_risk_factor(prediction)
+    dir_phrase = {
+        "UP":     "rise",
+        "DOWN":   "fall",
+        "STABLE": "remain stable",
+    }.get(direction, "remain stable")
+    return (
+        f"{name} is forecast to {dir_phrase} over the next 7 days "
+        f"({prob:.0%} confidence, {conf}); 30-day view is {dir30} ({prob30:.0%}). "
+        f"Primary drivers are {s1_label} at {s1_val:.3g} and "
+        f"{s2_label} at {s2_val:.3g}. "
+        f"Key risk: {risk} could invalidate this forecast."
+    )
+def _groq_report(prediction: dict) -> str:
+    """Call Groq API to generate a 3-sentence analyst report."""
+    client = _get_groq_client()
+    if client is None:
+        return _template_report(prediction)
+    name    = prediction.get("commodity_name", prediction.get("symbol"))
+    price   = prediction.get("current_price", 0)
+    fc7     = prediction.get("forecast_7d", {})
+    fc30    = prediction.get("forecast_30d", {})
+    signals = prediction.get("top_signals", [])
+    prompt = f"""You are a commodity market analyst. Based on the following data signals, write a 3-sentence forecast report. Be specific. Cite the signals. Use numbers.
+Commodity: {name}
+Current price: {price}
+7-day forecast: {fc7.get('direction')} with {fc7.get('probability', 0):.0%} confidence ({fc7.get('confidence')} tier)
+30-day forecast: {fc30.get('direction')} with {fc30.get('probability', 0):.0%} confidence
+Top 5 driving signals:
+{_format_signals(signals)}
+Rules:
+- Sentence 1: State the forecast and confidence level.
+- Sentence 2: Name the top 2 signals and their specific values.
+- Sentence 3: Name one risk factor that could invalidate this forecast.
+- Write in plain English. No jargon. Max 80 words total.
+- Do not use phrases like "based on the data" or "analysis suggests".
+- Start directly: "{name} is forecast to..."
+"""
+    try:
+        response = client.chat.completions.create(
+            model=GROQ_MODEL,
+            messages=[{"role": "user", "content": prompt}],
+            max_tokens=150,
+            temperature=0.3,
+        )
+        text = response.choices[0].message.content.strip()
+        # Sanity: if response is empty or too short, fall back to template
+        if len(text) < 30:
+            return _template_report(prediction)
+        return text
+    except Exception as exc:
+        log.warning("Groq API error: %s — using template fallback", exc)
+        return _template_report(prediction)
+# ── public API ─────────────────────────────────────────────────────────────────
+def generate_report(prediction: dict) -> str:
+    """
+    Generate a plain-English 3-sentence forecast report for a commodity.
+    Uses Groq if GROQ_API_KEY is set, otherwise falls back to template.
+    Args:
+        prediction: Dict returned by predictor.predict()
+    Returns:
+        3-sentence report string.
+    """
+    if "error" in prediction:
+        return f"{prediction.get('symbol', 'Commodity')}: forecast unavailable ({prediction['error']})."
+    return _groq_report(prediction)
+def generate_all_reports(as_of_date: str | None = None) -> dict[str, str]:
+    """
+    Generate reports for all 10 commodities.
+    Calls predict() + generate_report() for each.
+    Caches results to data/reports/report_{date}.json.
+    Args:
+        as_of_date: ISO date string. Defaults to today.
+    Returns:
+        Dict mapping symbol → report string.
+    """
+    today = as_of_date or date.today().isoformat()
+    cache_path = REPORTS_DIR / f"report_{today}.json"
+    # Return cached reports if already generated today
+    if cache_path.exists():
+        log.info("Loading cached reports from %s", cache_path)
+        with open(cache_path) as f:
+            return json.load(f)
+    forecasts = predict_all(as_of_date)
+    reports: dict[str, str] = {}
+    for symbol, fc in forecasts.items():
+        report = generate_report(fc)
+        reports[symbol] = report
+        name = SYMBOL_NAMES.get(symbol, symbol)
+        log.info("%s: report generated", name)
+    # Cache to disk
+    with open(cache_path, "w") as f:
+        json.dump(reports, f, indent=2)
+    log.info("Reports saved to %s", cache_path)
+    return reports
+def load_latest_reports() -> dict[str, str]:
+    """
+    Return the most recently generated report file, or empty dict if none.
+    Used by the dashboard to display reports without regenerating.
+    """
+    report_files = sorted(REPORTS_DIR.glob("report_*.json"), reverse=True)
+    if not report_files:
+        return {}
+    with open(report_files[0]) as f:
+        return json.load(f)
+if __name__ == "__main__":
+    import argparse
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+    parser = argparse.ArgumentParser(description="CommodiSense explainer")
+    parser.add_argument("--symbol", default=None)
+    parser.add_argument("--all",    action="store_true")
+    parser.add_argument("--date",   default=None)
+    args = parser.parse_args()
+    if args.all:
+        reports = generate_all_reports(args.date)
+        for sym, report in reports.items():
+            print(f"\n[{sym}]\n{report}")
+    elif args.symbol:
+        fc = predict(args.symbol, args.date)
+        report = generate_report(fc)
+        print(report)
+    else:
+        parser.print_help()
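
`generate_all_reports` returns the day's cached JSON file when it exists and only generates reports on a cache miss. The sketch below isolates that load-or-generate pattern so it can be exercised without Groq or the database; `load_or_generate` and `fake_generate` are hypothetical names used for illustration, and the temporary directory stands in for `data/reports`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_generate(cache_dir: Path, day: str,
                     generate: Callable[[], dict[str, str]]) -> dict[str, str]:
    """Return cached reports for `day`, generating and saving them on a miss."""
    cache_path = cache_dir / f"report_{day}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    reports = generate()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(reports, indent=2))
    return reports


calls = []  # records how many times the (stand-in) generator actually runs


def fake_generate() -> dict[str, str]:
    calls.append(1)
    return {"ZW=F": "Wheat is forecast to remain stable over the next 7 days."}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_generate(Path(tmp), "2024-01-02", fake_generate)
    second = load_or_generate(Path(tmp), "2024-01-02", fake_generate)

print(first == second, len(calls))
```

One consequence of this design is that a report file written early in the day is never refreshed, even if forecasts change later; deleting the file is the only way to force regeneration.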

model/feature_builder.py ADDED Viewed

	@@ -0,0 +1,374 @@

+"""
+Feature Builder — assembles all signals (price, sentiment, events, weather,
+geopolitical) into a single feature matrix per commodity.
+CRITICAL: zero lookahead. All signal windows use T-1 to T-N only.
+Target variable uses T+7 and T+30 prices (shifted forward, excluded from features).
+Usage:
+    from model.feature_builder import build_training_data, build_prediction_features
+"""
+import logging
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+import numpy as np
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from data.db import get_conn
+from signals.price_features import build_feature_matrix, ALL_SYMBOLS
+from signals.weather_features import get_weather_dataframe
+from signals.macro_features import build_macro_dataframe, get_macro_features
+log = logging.getLogger(__name__)
+# Per-commodity direction thresholds — calibrated to each asset's typical volatility.
+# USDINR is a managed float (rarely moves ±2% in 7 days → extreme STABLE imbalance).
+# NG=F is highly volatile → needs wider threshold to avoid noise.
+DIRECTION_THRESHOLDS: dict[str, float] = {
+    "CL=F":     2.0,
+    "NG=F":     3.5,
+    "GC=F":     1.5,
+    "ZW=F":     2.0,
+    "ZC=F":     2.0,
+    "ZS=F":     2.0,
+    "CT=F":     2.0,
+    "SB=F":     2.0,
+    "USDINR=X": 0.4,
+    "HG=F":     2.0,
+}
+DIRECTION_THRESHOLD_PCT = 2.0  # fallback
+# ── helpers ────────────────────────────────────────────────────────────────────
+def _load_prices_for_target(symbol: str) -> pd.DataFrame:
+    """Load close prices with enough future rows to compute T+7 and T+30 targets."""
+    conn = get_conn()
+    df = conn.execute(
+        "SELECT date, close FROM prices WHERE symbol = ? ORDER BY date",
+        [symbol],
+    ).df()
+    conn.close()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    return df.sort_values("date").reset_index(drop=True)
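+def _compute_targets(price_df: pd.DataFrame, symbol: str | None = None) -> pd.DataFrame:
+    """
+    Compute direction_7d and direction_30d target columns.
+    Labels (threshold is the per-symbol value in DIRECTION_THRESHOLDS,
+    falling back to DIRECTION_THRESHOLD_PCT = 2.0):
+         1  (UP)     if the future price is more than threshold % above current
+         0  (STABLE) if the change is within ±threshold %
+        -1  (DOWN)   if the future price is more than threshold % below current
+    Rows with no price row far enough in the future are left as None.
+    """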
+    df = price_df.copy().sort_values("date").reset_index(drop=True)
+    closes = df["close"].values
+    threshold = DIRECTION_THRESHOLDS.get(symbol, DIRECTION_THRESHOLD_PCT) if symbol else DIRECTION_THRESHOLD_PCT
+    def _direction(current: float, future: float) -> int:
+        if future == 0 or current == 0:
+            return 0
+        chg = (future - current) / current * 100
+        if chg > threshold:
+            return 1
+        if chg < -threshold:
+            return -1
+        return 0
+    dir_7d, dir_30d = [], []
+    n = len(closes)
+    for i in range(n):
+        # Find the first actual price row dated at least 7 / 30 calendar days
+        # (not trading days) after the current row
+        fwd7  = df[df["date"] >= (df.at[i, "date"] + timedelta(days=7))].head(1)
+        fwd30 = df[df["date"] >= (df.at[i, "date"] + timedelta(days=30))].head(1)
+        dir_7d.append(
+            _direction(closes[i], float(fwd7["close"].values[0])) if not fwd7.empty else None
+        )
+        dir_30d.append(
+            _direction(closes[i], float(fwd30["close"].values[0])) if not fwd30.empty else None
+        )
+    df["direction_7d"]  = dir_7d
+    df["direction_30d"] = dir_30d
+    return df
+def _load_sentiment_series(symbol: str) -> pd.DataFrame:
+    """Load daily sentiment aggregates for a commodity from DuckDB."""
+    conn = get_conn()
+    df = conn.execute(
+        """
+        SELECT date, sentiment_score, article_count, positive_count
+        FROM sentiment_daily
+        WHERE commodity = ?
+        ORDER BY date
+        """,
+        [symbol],
+    ).df()
+    conn.close()
+    if df.empty:
+        return df
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    df = df.sort_values("date").reset_index(drop=True)
+    # Rolling aggregates
+    df["sentiment_3d"]       = df["sentiment_score"].rolling(3, min_periods=1).mean()
+    df["sentiment_7d"]       = df["sentiment_score"].rolling(7, min_periods=1).mean()
+    df["article_count_7d"]   = df["article_count"].rolling(7, min_periods=1).sum()
+    df["positive_ratio_7d"]  = (
+        df["positive_count"].rolling(7, min_periods=1).sum()
+        / df["article_count_7d"].replace(0, 1)
+    )
+    return df.rename(columns={"sentiment_score": "sentiment_score_1d"})
+def _load_event_series(symbol: str) -> pd.DataFrame:
+    """Load daily event aggregates for a commodity from DuckDB."""
+    conn = get_conn()
+    df = conn.execute(
+        """
+        SELECT date, event_type, direction, severity
+        FROM extracted_events
+        WHERE commodity = ?
+        ORDER BY date
+        """,
+        [symbol],
+    ).df()
+    conn.close()
+    if df.empty:
+        return pd.DataFrame()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    df["dir_score"] = df["direction"].map({"BULLISH": 1, "BEARISH": -1, "NEUTRAL": 0}).fillna(0)
+    agg = df.groupby("date").agg(
+        bullish_events_7d  =("direction", lambda x: int((x == "BULLISH").sum())),
+        bearish_events_7d  =("direction", lambda x: int((x == "BEARISH").sum())),
+        max_severity_7d    =("severity",  "max"),
+        direction_score_7d =("dir_score", "sum"),
+        supply_shock_flag  =("event_type", lambda x: int((x == "SUPPLY_SHOCK").any())),
+        policy_change_flag =("event_type", lambda x: int((x == "POLICY_CHANGE").any())),
+    ).reset_index()
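+    # Rolling sum over the last 7 rows (days that have events, not calendar days).
+    # max_severity_7d and the two flags are left as same-day values.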
+    agg = agg.sort_values("date").reset_index(drop=True)
+    for col in ["bullish_events_7d", "bearish_events_7d", "direction_score_7d"]:
+        agg[col] = agg[col].rolling(7, min_periods=1).sum()
+    return agg
+def _load_geo_series(symbol: str) -> pd.DataFrame:
+    """Load rolling geopolitical risk scores for a commodity."""
+    conn = get_conn()
+    df = conn.execute(
+        "SELECT date, risk_score FROM geopolitical_events WHERE commodity = ? ORDER BY date",
+        [symbol],
+    ).df()
+    conn.close()
+    if df.empty:
+        return pd.DataFrame()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    agg = df.groupby("date")["risk_score"].mean().reset_index()
+    agg = agg.sort_values("date").reset_index(drop=True)
+    agg["risk_score_7d"]  = agg["risk_score"].rolling(7,  min_periods=1).mean()
+    agg["risk_score_30d"] = agg["risk_score"].rolling(30, min_periods=1).mean()
+    return agg[["date", "risk_score_7d", "risk_score_30d"]]
+def _safe_merge(base: pd.DataFrame, other: pd.DataFrame, on: str = "date") -> pd.DataFrame:
+    """Left-join `other` onto `base`, then fill every NaN in the result (including any already in `base`) with 0."""
+    if other.empty:
+        return base
+    merged = base.merge(other, on=on, how="left")
+    merged = merged.fillna(0)
+    return merged
+# ── public API ─────────────────────────────────────────────────────────────────
+def build_training_data(
+    symbol: str,
+) -> tuple[pd.DataFrame, pd.Series, pd.Series]:
+    """
+    Assemble the full feature matrix + targets for a commodity.
+    Uses all available history in DuckDB. No lookahead: signal features
+    reflect data known at close of each trading day.
+    Args:
+        symbol: Commodity ticker, e.g. "ZW=F"
+    Returns:
+        (X, y_7d, y_30d) where:
+          X      — DataFrame, one row per date, all feature columns
+          y_7d   — Series of direction labels {-1, 0, 1} for 7-day horizon
+          y_30d  — Series of direction labels {-1, 0, 1} for 30-day horizon
+    """
+    # Price features (covers ~5yr history)
+    end_date   = date.today().isoformat()
+    start_date = (date.today() - timedelta(days=365 * 5)).isoformat()
+    price_feat = build_feature_matrix(symbol, start_date, end_date)
+    if price_feat.empty:
+        log.warning("%s: no price features available", symbol)
+        return pd.DataFrame(), pd.Series(dtype=int), pd.Series(dtype=int)
+    # Targets — computed from raw close prices with per-commodity threshold
+    prices = _load_prices_for_target(symbol)
+    targets = _compute_targets(prices, symbol=symbol)[["date", "direction_7d", "direction_30d"]]
+    # All signal series
+    sentiment = _load_sentiment_series(symbol)
+    events    = _load_event_series(symbol)
+    geo       = _load_geo_series(symbol)
+    weather   = get_weather_dataframe(symbol, days=365 * 5)
+    if not weather.empty:
+        weather["date"] = pd.to_datetime(weather["date"]).dt.date
+    macro = build_macro_dataframe(symbol, start_date, end_date)
+    if not macro.empty:
+        macro["date"] = pd.to_datetime(macro["date"]).dt.date
+    # Merge signals onto price_feat (left join → zero-fill missing signal days)
+    df = price_feat.copy()
+    df = _safe_merge(df, sentiment[["date", "sentiment_score_1d", "sentiment_3d",
+                                     "sentiment_7d", "article_count_7d",
+                                     "positive_ratio_7d"]] if not sentiment.empty else pd.DataFrame(),
+                    on="date")
+    df = _safe_merge(df, events,  on="date")
+    df = _safe_merge(df, geo,     on="date")
+    df = _safe_merge(df, weather, on="date")
+    df = _safe_merge(df, macro,   on="date")
+    # Merge targets last with a plain left join. _safe_merge zero-fills every NaN,
+    # which would relabel rows lacking a future price as STABLE (0) and make the
+    # dropna on the target columns below a no-op.
+    df = df.merge(targets, on="date", how="left")
+    # Add binary indicator: 1 on days where we have real news signal, 0 elsewhere.
+    # This lets the model learn "trust sentiment when has_news_signal=1" rather than
+    # treating zero-padded sentiment rows as neutral-sentiment days.
+    if "sentiment_score_1d" in df.columns:
+        df["has_news_signal"] = (df["sentiment_score_1d"].abs() > 0.01).astype(int)
+    else:
+        df["has_news_signal"] = 0
+    # Drop rows where targets are unavailable (last 30 days have no T+30 target)
+    df = df.dropna(subset=["direction_7d", "direction_30d"])
+    df = df.sort_values("date").reset_index(drop=True)
+    feature_cols = [c for c in df.columns if c not in
+                    ("date", "direction_7d", "direction_30d")]
+    X     = df[feature_cols].fillna(0).astype(float)
+    y_7d  = df["direction_7d"].astype(int)
+    y_30d = df["direction_30d"].astype(int)
+    log.info("%s: training data shape %s, class dist 7d: %s",
+             symbol, X.shape, y_7d.value_counts().to_dict())
+    return X, y_7d, y_30d
+def build_prediction_features(symbol: str, as_of_date: str | None = None) -> pd.Series:
+    """
+    Build a single-row feature vector for inference.
+    Uses only data available up to (and including) as_of_date.
+    No future data touches this vector.
+    Args:
+        symbol:      Commodity ticker
+        as_of_date:  ISO date string. Defaults to today.
+    Returns:
+        pd.Series with the same feature names as build_training_data returns.
+    """
+    from signals.price_features import get_price_features
+    from signals.weather_features import get_weather_features
+    target_date = as_of_date or date.today().isoformat()
+    # Price features (T-1 based internally)
+    price_f = get_price_features(symbol, target_date)
+    # Sentiment: rows dated from target_date - 7 days through target_date, inclusive
+    cutoff = (datetime.strptime(target_date, "%Y-%m-%d").date() - timedelta(days=7)).isoformat()
+    conn = get_conn()
+    sent_rows = conn.execute(
+        """
+        SELECT date, sentiment_score, article_count, positive_count
+        FROM sentiment_daily
+        WHERE commodity = ? AND date >= ? AND date <= ?
+        ORDER BY date DESC
+        """,
+        [symbol, cutoff, target_date],
+    ).df()
+    conn.close()
+    sentiment_1d = float(sent_rows.iloc[0]["sentiment_score"]) if not sent_rows.empty else 0.0
+    sentiment_3d = float(sent_rows.head(3)["sentiment_score"].mean()) if len(sent_rows) >= 1 else 0.0
+    sentiment_7d = float(sent_rows["sentiment_score"].mean()) if not sent_rows.empty else 0.0
+    article_count_7d = int(sent_rows["article_count"].sum()) if not sent_rows.empty else 0
+    positive_ratio_7d = (
+        float(sent_rows["positive_count"].sum() / max(article_count_7d, 1))
+        if not sent_rows.empty else 0.0
+    )
+    # Events: last 7 days
+    conn = get_conn()
+    evt_rows = conn.execute(
+        """
+        SELECT event_type, direction, severity
+        FROM extracted_events
+        WHERE commodity = ? AND date >= ? AND date <= ?
+        """,
+        [symbol, cutoff, target_date],
+    ).df()
+    conn.close()
+    bullish_events_7d  = int((evt_rows["direction"] == "BULLISH").sum()) if not evt_rows.empty else 0
+    bearish_events_7d  = int((evt_rows["direction"] == "BEARISH").sum()) if not evt_rows.empty else 0
+    max_severity_7d    = int(evt_rows["severity"].max()) if not evt_rows.empty else 0
+    dir_map = {"BULLISH": 1, "BEARISH": -1, "NEUTRAL": 0}
+    direction_score_7d = int(evt_rows["direction"].map(dir_map).fillna(0).sum()) if not evt_rows.empty else 0
+    supply_shock_flag  = int((evt_rows["event_type"] == "SUPPLY_SHOCK").any()) if not evt_rows.empty else 0
+    policy_change_flag = int((evt_rows["event_type"] == "POLICY_CHANGE").any()) if not evt_rows.empty else 0
+    # Geopolitical risk
+    cutoff_30 = (datetime.strptime(target_date, "%Y-%m-%d").date() - timedelta(days=30)).isoformat()
+    conn = get_conn()
+    geo_rows = conn.execute(
+        "SELECT risk_score FROM geopolitical_events WHERE commodity = ? AND date >= ? AND date <= ? ORDER BY date",
+        [symbol, cutoff_30, target_date],
+    ).df()
+    conn.close()
+    risk_score_7d  = float(geo_rows.tail(7)["risk_score"].mean())  if not geo_rows.empty else 0.05
+    risk_score_30d = float(geo_rows["risk_score"].mean())           if not geo_rows.empty else 0.05
+    # Weather
+    weather_f = get_weather_features(symbol, days=90)
+    macro_f = get_macro_features(symbol, target_date)
+    features = {
+        **price_f,
+        "sentiment_score_1d":  sentiment_1d,
+        "sentiment_3d":        sentiment_3d,
+        "sentiment_7d":        sentiment_7d,
+        "article_count_7d":    article_count_7d,
+        "positive_ratio_7d":   positive_ratio_7d,
+        "bullish_events_7d":   bullish_events_7d,
+        "bearish_events_7d":   bearish_events_7d,
+        "max_severity_7d":     max_severity_7d,
+        "direction_score_7d":  direction_score_7d,
+        "supply_shock_flag":   supply_shock_flag,
+        "policy_change_flag":  policy_change_flag,
+        "risk_score_7d":       risk_score_7d,
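+        "risk_score_30d":      risk_score_30d,
+        # Mirrors the has_news_signal column added in build_training_data
+        "has_news_signal":     int(abs(sentiment_1d) > 0.01),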
+        **weather_f,
+        **macro_f,
+    }
+    return pd.Series(features)

model/predictor.py ADDED Viewed

	@@ -0,0 +1,387 @@

+"""
+Predictor — loads saved XGBoost + LightGBM models and generates forecasts
+at inference time. Runs entirely on CPU.
+Usage:
+    python model/predictor.py --symbol ZW=F
+    python model/predictor.py --all
+"""
+import argparse
+import json
+import logging
+import pickle
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+import numpy as np
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from model.feature_builder import build_prediction_features
+from data.db import get_conn
+log = logging.getLogger(__name__)
+MODELS_DIR = Path(__file__).parent.parent / "models"
+SYMBOL_NAMES: dict[str, str] = {
+    "CL=F":    "Crude Oil",
+    "NG=F":    "Natural Gas",
+    "GC=F":    "Gold",
+    "ZW=F":    "Wheat",
+    "ZC=F":    "Corn",
+    "ZS=F":    "Soybeans",
+    "CT=F":    "Cotton",
+    "SB=F":    "Sugar",
+    "USDINR=X":"USD/INR",
+    "HG=F":    "Copper",
+}
+# Human-readable labels for SHAP feature display
+FEATURE_LABELS: dict[str, str] = {
+    "rsi_14":              "RSI (14-day)",
+    "macd_signal":         "MACD crossover",
+    "bb_position":         "Bollinger Band position",
+    "atr_14":              "Average True Range",
+    "atr_pct":             "Volatility %",
+    "sma_20_50_cross":     "SMA 20/50 crossover",
+    "return_1d":           "1-day return %",
+    "return_7d":           "7-day return %",
+    "return_30d":          "30-day return %",
+    "momentum_score":      "Momentum score",
+    "month_sin":           "Seasonal cycle (sin)",
+    "month_cos":           "Seasonal cycle (cos)",
+    "harvest_season_flag": "Harvest season",
+    "days_to_opec_meeting":"Days to OPEC meeting",
+    "oil_gold_ratio":      "Oil/Gold ratio",
+    "dxy_proxy":           "USD strength proxy",
+    "sentiment_score_1d":  "News sentiment (24h)",
+    "sentiment_3d":        "News sentiment (3-day)",
+    "sentiment_7d":        "News sentiment (7-day)",
+    "article_count_7d":    "Article volume (7-day)",
+    "positive_ratio_7d":   "Positive news ratio",
+    "bullish_events_7d":   "Bullish events (7-day)",
+    "bearish_events_7d":   "Bearish events (7-day)",
+    "max_severity_7d":     "Max event severity",
+    "direction_score_7d":  "Net event direction",
+    "supply_shock_flag":   "Supply shock detected",
+    "policy_change_flag":  "Policy change detected",
+    "risk_score_7d":       "Geopolitical risk (7-day)",
+    "risk_score_30d":      "Geopolitical risk (30-day)",
+    "drought_index":       "Drought index",
+    "heat_stress_days":    "Heat stress days",
+    "precip_anomaly_pct":  "Precipitation anomaly %",
+}
+# Expected return by predicted direction (base, adjusted per-commodity)
+DIRECTION_EXPECTED_RETURN: dict[str, float] = {
+    "UP":     3.0,
+    "STABLE": 0.0,
+    "DOWN":  -3.0,
+}
+# ── model cache (loaded once per process) ─────────────────────────────────────
+_model_cache: dict[str, dict] = {}
+def _load_models(symbol: str, horizon: str = "7d") -> dict | None:
+    """
+    Load XGBoost, LightGBM, scaler, and feature names for a symbol.
+    Caches in memory for the process lifetime.
+    Returns None if models not found (not trained yet).
+    """
+    cache_key = f"{symbol}_{horizon}"
+    if cache_key in _model_cache:
+        return _model_cache[cache_key]
+    xgb_path    = MODELS_DIR / f"xgb_{symbol}_{horizon}.pkl"
+    lgbm_path   = MODELS_DIR / f"lgbm_{symbol}_{horizon}.pkl"
+    scaler_path = MODELS_DIR / f"scaler_{symbol}_{horizon}.pkl"
+    feat_path   = MODELS_DIR / f"feature_names_{symbol}_{horizon}.json"
+    if not all(p.exists() for p in [xgb_path, lgbm_path, scaler_path, feat_path]):
+        log.warning("Models not found for %s %s — run model/trainer.py first", symbol, horizon)
+        return None
+    with open(xgb_path, "rb") as f:
+        xgb_model = pickle.load(f)
+    with open(lgbm_path, "rb") as f:
+        lgbm_model = pickle.load(f)
+    with open(scaler_path, "rb") as f:
+        scaler = pickle.load(f)
+    with open(feat_path) as f:
+        feature_names = json.load(f)
+    bundle = {
+        "xgb":      xgb_model,
+        "lgbm":     lgbm_model,
+        "scaler":   scaler,
+        "features": feature_names,
+    }
+    _model_cache[cache_key] = bundle
+    return bundle
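+def _get_shap_top5(xgb_model, X_row: np.ndarray, feature_names: list[str], pred_class: int) -> list[dict]:
+    """
+    Compute SHAP values for XGBoost and return top 5 features by |shap_value|
+    for the predicted class.
+    Returns an empty list if SHAP is not installed or the explainer fails, for
+    example when the pickled model is a calibration wrapper rather than a raw
+    XGBoost model.
+    """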
+    try:
+        import shap
+        explainer  = shap.TreeExplainer(xgb_model)
+        shap_vals  = explainer.shap_values(X_row)
+        # Return type varies by SHAP version: a list of per-class (n_samples, n_features)
+        # arrays, or a single (n_samples, n_features, n_classes) array.
+        if isinstance(shap_vals, list):
+            vals = shap_vals[pred_class][0]  # for predicted class
+        else:
+            vals = shap_vals[0, :, pred_class] if shap_vals.ndim == 3 else shap_vals[0]
+        top_idx = np.argsort(np.abs(vals))[::-1][:5]
+        result = []
+        for i in top_idx:
+            fname = feature_names[i] if i < len(feature_names) else f"feature_{i}"
+            fval  = float(X_row[0][i])
+            shap_v = float(vals[i])
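+            result.append({
+                "feature": fname,
+                "label":   FEATURE_LABELS.get(fname, fname),
+                "value":   round(fval, 4),
+                # SHAP is relative to the predicted class: a positive value supports that
+                # class, so it reads as bearish when the prediction is DOWN (class 0).
+                "impact":  "BULLISH" if (shap_v > 0) != (pred_class == 0) else "BEARISH",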
+                "weight":  round(abs(shap_v), 4),
+            })
+        return result
+    except Exception as exc:
+        log.debug("SHAP error: %s", exc)
+        return []
+def _get_current_price(symbol: str) -> tuple[float, float]:
+    """
+    Return (current_close, atr_pct) from the two most recent price rows.
+    Falls back to (0.0, 0.02) when the symbol has no price rows.
+    """
+    conn = get_conn()
+    try:
+        rows = conn.execute(
+            "SELECT close FROM prices WHERE symbol = ? ORDER BY date DESC LIMIT 2",
+            [symbol],
+        ).fetchall()
+    finally:
+        conn.close()
+    if not rows:
+        return 0.0, 0.02
+    close = float(rows[0][0])
+    # Rough ATR proxy: |today - yesterday| / today; default 2% if only one row or a zero close
+    if len(rows) > 1 and close > 0:
+        atr_pct = abs(close - float(rows[1][0])) / close
+    else:
+        atr_pct = 0.02
+    return close, atr_pct
+# ── public API ─────────────────────────────────────────────────────────────────
+def predict(symbol: str, as_of_date: str | None = None) -> dict:
+    """
+    Generate a forecast for a single commodity.
+    Args:
+        symbol:      Commodity ticker, e.g. "ZW=F"
+        as_of_date:  ISO date string. Defaults to today.
+    Returns:
+        Forecast dict with symbol, current price, 7d + 30d forecasts,
+        top_signals, and confidence levels. Returns error dict if models
+        are not trained.
+    """
+    as_of = as_of_date or date.today().isoformat()
+    bundle_7d  = _load_models(symbol, "7d")
+    bundle_30d = _load_models(symbol, "30d")
+    if bundle_7d is None:
+        return {"symbol": symbol, "error": "models_not_trained", "as_of_date": as_of}
+    # Build feature vector
+    features_series = build_prediction_features(symbol, as_of)
+    if features_series.empty:
+        return {"symbol": symbol, "error": "no_features", "as_of_date": as_of}
+    # Align to trained feature names
+    feat_names_7d = bundle_7d["features"]
+    X_raw = features_series.reindex(feat_names_7d, fill_value=0).values.reshape(1, -1)
+    X_scaled_7d = bundle_7d["scaler"].transform(pd.DataFrame(X_raw, columns=feat_names_7d))
+    # Ensemble prediction — 7d
+    X_df_7d = pd.DataFrame(X_scaled_7d, columns=feat_names_7d)
+    xgb_proba_7d  = bundle_7d["xgb"].predict_proba(X_df_7d)[0]
+    lgbm_proba_7d = bundle_7d["lgbm"].predict_proba(X_df_7d)[0]
+    ensemble_proba_7d = (xgb_proba_7d + lgbm_proba_7d) / 2
+    pred_class_7d = int(ensemble_proba_7d.argmax())
+    # Map encoded class back: 0=DOWN, 1=STABLE, 2=UP
+    direction_map = {0: "DOWN", 1: "STABLE", 2: "UP"}
+    direction_7d  = direction_map[pred_class_7d]
+    prob_7d       = float(ensemble_proba_7d[pred_class_7d])
+    # Ensemble prediction — 30d (may not be trained)
+    direction_30d, prob_30d = "STABLE", 0.5
+    if bundle_30d:
+        feat_names_30d = bundle_30d["features"]
+        X_raw_30d = features_series.reindex(feat_names_30d, fill_value=0).values.reshape(1, -1)
+        X_scaled_30d = bundle_30d["scaler"].transform(pd.DataFrame(X_raw_30d, columns=feat_names_30d))
+        X_df_30d = pd.DataFrame(X_scaled_30d, columns=feat_names_30d)
+        xgb_proba_30d  = bundle_30d["xgb"].predict_proba(X_df_30d)[0]
+        lgbm_proba_30d = bundle_30d["lgbm"].predict_proba(X_df_30d)[0]
+        ensemble_proba_30d = (xgb_proba_30d + lgbm_proba_30d) / 2
+        pred_class_30d = int(ensemble_proba_30d.argmax())
+        direction_30d  = direction_map[pred_class_30d]
+        prob_30d       = float(ensemble_proba_30d[pred_class_30d])
+    # Confidence tier — base probability threshold
+    def _confidence(prob: float) -> str:
+        if prob >= 0.70:
+            return "HIGH"
+        if prob >= 0.55:
+            return "MEDIUM"
+        return "LOW"
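+    # Signal confirmation: a MEDIUM tier is upgraded to HIGH when 2+ independent signals
+    # agree with the forecast and downgraded to LOW when none do; HIGH and LOW pass through.
+    # Signals checked: price momentum, COT commercial positioning, EIA supply, USDA crop condition.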
+    def _confirmed_confidence(prob: float, direction: str, feat: pd.Series) -> str:
+        base = _confidence(prob)
+        if base == "LOW":
+            return "LOW"
+        confirming = 0
+        # Signal 1: price momentum agrees
+        mom = float(feat.get("momentum_score", 0) or 0)
+        ret7 = float(feat.get("return_7d", 0) or 0)
+        if direction == "UP"   and (mom > 0 or ret7 > 0):  confirming += 1
+        if direction == "DOWN" and (mom < 0 or ret7 < 0):  confirming += 1
+        # Signal 2: COT commercial positioning agrees (commercials = smart money)
+        cot_net = float(feat.get("cot_commercial_net_pct", 0) or 0)
+        cot_chg = float(feat.get("cot_commercial_chg_1w", 0) or 0)
+        if direction == "UP"   and (cot_net > 0.05 or cot_chg > 0):  confirming += 1
+        if direction == "DOWN" and (cot_net < -0.05 or cot_chg < 0): confirming += 1
+        # Signal 3: EIA supply signal agrees (for CL=F and NG=F)
+        eia_draw   = float(feat.get("eia_crude_draw", 0) or feat.get("eia_natgas_draw", 0) or 0)
+        eia_vs5yr  = float(feat.get("eia_crude_vs_5yr", 0) or feat.get("eia_natgas_vs_5yr", 0) or 0)
+        if direction == "UP"   and (eia_draw > 0 or eia_vs5yr < -0.5): confirming += 1
+        if direction == "DOWN" and eia_vs5yr > 0.5:                     confirming += 1
+        # Signal 4: USDA crop condition trend agrees (for grain/ag symbols)
+        crop_chg = float(feat.get("usda_crop_good_exc_chg", 0) or 0)
+        if direction == "DOWN" and crop_chg < -2:  confirming += 1
+        if direction == "UP"   and crop_chg >  2:  confirming += 1
+        # Upgrade if 2+ signals confirm; downgrade if none confirm
+        if confirming >= 2 and base == "MEDIUM":
+            return "HIGH"
+        if confirming == 0 and base == "MEDIUM":
+            return "LOW"
+        return base
+    # Price range using ATR
+    current_price, atr_pct = _get_current_price(symbol)
+    exp_ret = DIRECTION_EXPECTED_RETURN.get(direction_7d, 0.0) / 100
+    price_range_low  = round(current_price * (1 + exp_ret - 1.5 * atr_pct), 2)
+    price_range_high = round(current_price * (1 + exp_ret + 1.5 * atr_pct), 2)
+    # SHAP top signals
+    top_signals = _get_shap_top5(bundle_7d["xgb"], X_scaled_7d, feat_names_7d, pred_class_7d)
+    conf_7d  = _confirmed_confidence(prob_7d,  direction_7d,  features_series)
+    conf_30d = _confirmed_confidence(prob_30d, direction_30d, features_series)
+    # Symbols where 7d model has known accuracy issues — surface a warning.
+    UNRELIABLE_7D = {"ZC=F", "HG=F"}
+    model_warning = (
+        "7d model accuracy is low for this symbol — use 30d forecast instead"
+        if symbol in UNRELIABLE_7D else None
+    )
+    return {
+        "symbol":           symbol,
+        "commodity_name":   SYMBOL_NAMES.get(symbol, symbol),
+        "as_of_date":       as_of,
+        "current_price":    current_price,
+        "forecast_7d": {
+            "direction":        direction_7d,
+            "probability":      round(prob_7d, 4),
+            "price_range_low":  price_range_low,
+            "price_range_high": price_range_high,
+            "confidence":       conf_7d,
+            "model_warning":    model_warning,
+        },
+        "forecast_30d": {
+            "direction":   direction_30d,
+            "probability": round(prob_30d, 4),
+            "confidence":  conf_30d,
+        },
+        "top_signals": top_signals,
+    }
+def predict_all(as_of_date: str | None = None) -> dict[str, dict]:
+    """
+    Generate forecasts for all 10 commodities and save to DuckDB.
+    Returns:
+        Dict mapping symbol → forecast dict.
+    """
+    from signals.price_features import ALL_SYMBOLS
+    results = {}
+    for symbol in ALL_SYMBOLS:
+        try:
+            fc = predict(symbol, as_of_date)
+            results[symbol] = fc
+            if "error" not in fc:
+                _save_forecast(fc)
+        except Exception as exc:
+            log.error("predict %s failed: %s", symbol, exc)
+            results[symbol] = {"symbol": symbol, "error": str(exc)}
+    return results
+def _save_forecast(fc: dict) -> None:
+    """Persist a forecast to DuckDB for accuracy tracking."""
+    conn = get_conn()
+    try:
+        conn.execute(
+            """
+            INSERT OR REPLACE INTO accuracy_log
+                (date, symbol, forecast_direction, actual_direction, was_correct, confidence)
+            VALUES (?, ?, ?, NULL, NULL, ?)
+            """,
+            [
+                fc["as_of_date"],
+                fc["symbol"],
+                fc["forecast_7d"]["direction"],
+                fc["forecast_7d"]["confidence"],
+            ],
+        )
+    except Exception as exc:
+        log.warning("Forecast save error: %s", exc)
+    finally:
+        conn.close()
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="CommodiSense predictor")
+    parser.add_argument("--symbol", default=None, help="Single symbol to predict")
+    parser.add_argument("--all",    action="store_true", help="Predict all symbols")
+    parser.add_argument("--date",   default=None, help="As-of date YYYY-MM-DD")
+    args = parser.parse_args()
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+    if args.all:
+        results = predict_all(args.date)
+        for sym, fc in results.items():
+            if "error" not in fc:
+                d7 = fc["forecast_7d"]
+                print(f"{sym:<12} {d7['direction']:<7} {d7['probability']:.0%} [{d7['confidence']}]")
+    elif args.symbol:
+        fc = predict(args.symbol, args.date)
+        print(json.dumps(fc, indent=2, default=str))
+    else:
+        parser.print_help()
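
The forecast arithmetic in `predict()` above can be followed in isolation. The snippet below is a minimal sketch of that logic, not the module's code: it averages two class-probability vectors (soft voting), maps the winning probability to a tier with the same 0.70 / 0.55 thresholds, and builds the price band from an expected return in percent plus or minus 1.5 times the ATR proxy. The helper names `soft_vote`, `confidence_tier`, and `price_band`, and all input numbers, are illustrative assumptions.

```python
# Illustrative sketch only: helper names and sample numbers are assumptions,
# chosen to mirror the arithmetic used in predict().

DIRECTIONS = {0: "DOWN", 1: "STABLE", 2: "UP"}


def soft_vote(proba_a, proba_b):
    """Average two class-probability vectors element-wise."""
    return [(a + b) / 2 for a, b in zip(proba_a, proba_b)]


def confidence_tier(prob):
    """Map the winning class probability to a tier (same thresholds as predict())."""
    if prob >= 0.70:
        return "HIGH"
    if prob >= 0.55:
        return "MEDIUM"
    return "LOW"


def price_band(price, expected_return_pct, atr_pct, width=1.5):
    """Centre the band on the expected return and widen it by width * atr_pct."""
    centre = 1 + expected_return_pct / 100
    low = round(price * (centre - width * atr_pct), 2)
    high = round(price * (centre + width * atr_pct), 2)
    return low, high


# Hypothetical model outputs in the order [DOWN, STABLE, UP]
xgb_proba = [0.10, 0.20, 0.70]
lgbm_proba = [0.20, 0.20, 0.60]

ensemble = soft_vote(xgb_proba, lgbm_proba)
pred_class = max(range(len(ensemble)), key=ensemble.__getitem__)
direction = DIRECTIONS[pred_class]
prob = ensemble[pred_class]
tier = confidence_tier(prob)
low, high = price_band(100.0, 3.0, 0.02)

print(direction, round(prob, 4), tier, low, high)
```

With these inputs the averaged UP probability is about 0.65, which lands in the MEDIUM tier before any signal confirmation is applied.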

model/trainer.py ADDED Viewed

	@@ -0,0 +1,496 @@

+"""
+Model Trainer — trains XGBoost + LightGBM ensemble per commodity.
+Designed to run on Kaggle free notebooks (GPU available there), but
+also works on CPU locally.
+Saves trained models to the models/ directory.
+Usage:
+    python model/trainer.py                    # train all symbols
+    python model/trainer.py --symbol GC=F      # train one symbol
+    python model/trainer.py --symbol ZW=F --horizon 7d
+"""
+import argparse
+import json
+import logging
+import pickle
+from datetime import date, timedelta
+import sys
+import warnings
+from pathlib import Path
+import numpy as np
+import pandas as pd
+from sklearn.calibration import CalibratedClassifierCV
+from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
+from sklearn.model_selection import TimeSeriesSplit
+from sklearn.preprocessing import StandardScaler
+warnings.filterwarnings("ignore")
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from model.feature_builder import build_training_data
+from signals.price_features import ALL_SYMBOLS
+MODELS_DIR = Path(__file__).parent.parent / "models"
+MODELS_DIR.mkdir(exist_ok=True)
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+)
+log = logging.getLogger(__name__)
+# Label encoding: -1 → 0 (DOWN), 0 → 1 (STABLE), 1 → 2 (UP) for XGBoost
+LABEL_MAP     = {-1: 0, 0: 1, 1: 2}
+LABEL_REVERSE = {0: -1, 1: 0, 2: 1}
+LABEL_NAMES   = {0: "DOWN", 1: "STABLE", 2: "UP"}
+# ── Phase 6: Booster 2 — commodity-specific feature weight multipliers ─────────
+# Applied to sample weights at training time so the model learns that certain
+# features matter more for specific commodities.
+COMMODITY_FEATURE_WEIGHTS: dict[str, dict[str, float]] = {
+    "CL=F":     {"risk_score_7d": 1.5, "risk_score_30d": 1.5, "days_to_opec_meeting": 1.4,
+                 "drought_index": 0.5},
+    "NG=F":     {"days_to_opec_meeting": 1.4, "return_60d": 1.3, "atr_14": 1.3},
+    "GC=F":     {"dxy_proxy": 1.8, "risk_score_7d": 1.3, "sentiment_score_1d": 1.2},
+    "ZW=F":     {"drought_index": 2.0, "sentiment_score_1d": 1.2, "precip_anomaly_pct": 1.5},
+    "ZC=F":     {"harvest_season_flag": 1.5, "drought_index": 1.8, "precip_anomaly_pct": 1.4},
+    "ZS=F":     {"harvest_season_flag": 1.5, "drought_index": 1.6, "precip_anomaly_pct": 1.3},
+    "CT=F":     {"harvest_season_flag": 1.6, "heat_stress_days": 2.0, "precip_anomaly_pct": 1.5},
+    "SB=F":     {"harvest_season_flag": 1.5, "precip_anomaly_pct": 1.4},
+    "USDINR=X": {"return_60d": 1.4, "momentum_score": 1.3, "macd_signal": 1.2},
+    "HG=F":     {"risk_score_7d": 1.3, "return_60d": 1.4, "momentum_score": 1.2},
+}
+# ── model configs ──────────────────────────────────────────────────────────────
+XGB_PARAMS = {
+    "n_estimators":         500,
+    "max_depth":            6,
+    "learning_rate":        0.05,
+    "subsample":            0.8,
+    "colsample_bytree":     0.8,
+    "objective":            "multi:softprob",
+    "num_class":            3,
+    "eval_metric":          "mlogloss",
+    "early_stopping_rounds": 50,   # constructor param (no longer accepted by fit() in XGBoost 2.0+)
+    "random_state":         42,
+    "n_jobs":               -1,
+}
+LGBM_PARAMS = {
+    "n_estimators":    500,
+    "num_leaves":      31,
+    "learning_rate":   0.05,
+    "feature_fraction": 0.8,
+    "bagging_fraction": 0.8,
+    "bagging_freq":    5,
+    "objective":       "multiclass",
+    "num_class":       3,
+    "metric":          "multi_logloss",
+    "verbose":         -1,
+    "random_state":    42,
+    "n_jobs":          -1,
+}
+# ── helpers ────────────────────────────────────────────────────────────────────
+def _encode_labels(y: pd.Series) -> np.ndarray:
+    return y.map(LABEL_MAP).values
+def _compute_sample_weights(y_encoded: np.ndarray) -> np.ndarray:
+    """Inverse-frequency sample weights. Falls back to uniform if not all 3 classes present."""
+    from sklearn.utils.class_weight import compute_sample_weight
+    if len(np.unique(y_encoded)) < 3:
+        return np.ones(len(y_encoded), dtype=float)
+    return compute_sample_weight("balanced", y_encoded)
+def _select_top_features(
+    X: pd.DataFrame,
+    importances: np.ndarray,
+    top_n: int = 20,
+    min_importance: float = 0.01,
+) -> list[str]:
+    """Return top_n feature names by importance, filtering below min_importance."""
+    feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=False)
+    selected = feat_imp[feat_imp >= min_importance].head(top_n).index.tolist()
+    if len(selected) < 5:
+        selected = feat_imp.head(top_n).index.tolist()
+    return selected
+def _detect_regime(X: pd.DataFrame) -> np.ndarray:
+    """
+    Booster 3 — Regime Detection.
+    Returns per-row regime array: 0=RANGE_BOUND, 1=TRENDING, 2=VOLATILE.
+    Uses ATR% and absolute 30-day return to classify market state; VOLATILE
+    takes precedence over TRENDING when both conditions hold.
+    Returns all zeros (RANGE_BOUND) when X has fewer than 60 rows or no atr_pct column.
+    """
+    if len(X) < 60 or "atr_pct" not in X.columns:
+        return np.zeros(len(X), dtype=int)
+    atr_pct   = X["atr_pct"].fillna(0)
+    ret_30d   = X.get("return_30d", pd.Series(0, index=X.index)).abs().fillna(0)
+    atr_mean  = atr_pct.rolling(60, min_periods=20).mean().fillna(atr_pct.mean())
+    atr_std   = atr_pct.rolling(60, min_periods=20).std().fillna(atr_pct.std())
+    atr_thresh_volatile = atr_mean + 1.5 * atr_std
+    regime = np.zeros(len(X), dtype=int)
+    regime[ret_30d.values > 10.0]                      = 1   # TRENDING
+    regime[atr_pct.values > atr_thresh_volatile.values] = 2  # VOLATILE
+    return regime
+def _apply_commodity_weights(
+    sample_weights: np.ndarray,
+    X: pd.DataFrame,
+    symbol: str,
+    regime: np.ndarray,
+) -> np.ndarray:
+    """
+    Booster 2+3 combined: scale sample weights by commodity-specific feature
+    multipliers (rows where a key feature is in its top quartile by magnitude),
+    then dampen VOLATILE-regime rows and mildly up-weight TRENDING rows.
+    """
+    w = sample_weights.copy().astype(float)
+    # Commodity-specific: up-weight rows where the key signal is strong
+    for feat, mult in COMMODITY_FEATURE_WEIGHTS.get(symbol, {}).items():
+        if feat in X.columns:
+            signal_strength = X[feat].abs().fillna(0)
+            percentile_75   = np.percentile(signal_strength, 75)
+            if percentile_75 > 0:
+                strong_rows = (signal_strength >= percentile_75).values
+                w[strong_rows] *= mult
+    # Regime: dampen volatile rows (Booster 3 — "trust nothing when volatile")
+    w[regime == 2] *= 0.6
+    # Trending rows: trust momentum features more — mild up-weight
+    w[regime == 1] *= 1.2
+    # Renormalise so total weight is unchanged
+    total = w.sum()
+    if total > 0:
+        w = w / total * len(w)
+    return w
+def _directional_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
+    """Accuracy of predicting UP/DOWN/STABLE direction correctly."""
+    return float(np.mean(y_true == y_pred))
+def _sharpe_ratio(y_true_raw: pd.Series, y_pred_encoded: np.ndarray) -> float:
+    """
+    Naive Sharpe: long when model predicts UP, short when DOWN, flat when STABLE.
+    Uses the true direction label (-1/0/1) as a proxy for the return sign, so this
+    is a directional hit-rate score annualised as if daily, not a tradable Sharpe.
+    """
+    pred_dirs = pd.Series(y_pred_encoded).map(LABEL_REVERSE)
+    returns = pred_dirs * y_true_raw.values  # +1 right direction, -1 opposite, 0 if either is STABLE
+    mu = returns.mean()
+    sigma = returns.std()
+    return round(float(mu / sigma * np.sqrt(252)) if sigma > 0 else 0.0, 3)
+# ── training ───────────────────────────────────────────────────────────────────
+def train_symbol(
+    symbol: str,
+    horizon: str = "7d",
+    add_lag_features: bool = True,
+    last_days: int | None = None,
+) -> dict:
+    """
+    Train XGBoost + LightGBM ensemble for a single commodity and horizon.
+    Args:
+        symbol:            Commodity ticker, e.g. "ZW=F"
+        horizon:           "7d" or "30d"
+        add_lag_features:  Add interaction features (accuracy booster)
+        last_days:         If set, train only on the most recent N calendar days.
+                           Use this when NLP signals only cover a short window —
+                           avoids 4+ years of zero-padded sentiment rows diluting the model.
+    Returns:
+        Dict with accuracy metrics for this symbol/horizon.
+    """
+    log.info("Training %s | horizon=%s | window=%s",
+             symbol, horizon, f"last {last_days}d" if last_days else "full")
+    X, y_7d, y_30d = build_training_data(symbol)
+    if X.empty:
+        log.warning("%s: no training data, skipping", symbol)
+        return {"symbol": symbol, "horizon": horizon, "error": "no_data"}
+    # Trim to a recent window so the long zero-padded NLP history is dropped
+    if last_days is not None:
+        cutoff = date.today() - timedelta(days=last_days)
+        if "date" in X.columns:
+            mask = pd.to_datetime(X["date"]).dt.date >= cutoff
+        else:
+            # No date column: rows are in date order, so keep the last ~0.71 * last_days rows
+            # (roughly 5 trading days per 7 calendar days). max(1, ...) avoids a -0 slice,
+            # which would select every row.
+            trading_days = max(1, int(last_days * 0.71))
+            mask = pd.Series([False] * len(X))
+            mask.iloc[-trading_days:] = True
+        X = X[mask.values].reset_index(drop=True)
+        y_7d  = y_7d[mask.values].reset_index(drop=True)
+        y_30d = y_30d[mask.values].reset_index(drop=True)
+        log.info("%s: trimmed to %d rows (last %d days)", symbol, len(X), last_days)
+    y = y_7d if horizon == "7d" else y_30d
+    # Skip if one class dominates >95% — model would just memorise the majority class
+    class_counts = y.value_counts(normalize=True)
+    if class_counts.max() > 0.95:
+        log.warning("%s %s: dominant class %.0f%% — skipping (too imbalanced to learn from)",
+                    symbol, horizon, class_counts.max() * 100)
+        return {"symbol": symbol, "horizon": horizon, "error": "extreme_class_imbalance"}
+    # ── Phase 6 Booster 4: lag + interaction features ──
+    if add_lag_features:
+        X = X.copy()
+        # Interaction: sentiment × momentum (strong when both agree)
+        if "sentiment_score_1d" in X.columns and "momentum_score" in X.columns:
+            X["sentiment_x_momentum"] = X["sentiment_score_1d"] * X["momentum_score"]
+        # Interaction: event direction × price momentum
+        if "direction_score_7d" in X.columns and "return_7d" in X.columns:
+            X["event_x_momentum"] = X["direction_score_7d"] * np.sign(X["return_7d"].fillna(0))
+        # Volatility regime flag (standalone feature for the model)
+        if "atr_pct" in X.columns and len(X) >= 60:
+            atr_mean = X["atr_pct"].rolling(60, min_periods=20).mean().fillna(X["atr_pct"].mean())
+            X["high_volatility_flag"] = (X["atr_pct"] > atr_mean * 1.5).astype(int)
+    y_enc = _encode_labels(y)
+    sample_weights = _compute_sample_weights(y_enc)
+    # Phase 6 Boosters 2+3: regime detection + commodity-specific weights
+    if len(X) >= 60:
+        regime = _detect_regime(X)
+        sample_weights = _apply_commodity_weights(sample_weights, X, symbol, regime)
+        trending_pct  = (regime == 1).mean() * 100
+        volatile_pct  = (regime == 2).mean() * 100
+        log.info("%s: regime — %.0f%% trending, %.0f%% volatile, %.0f%% range-bound",
+                 symbol, trending_pct, volatile_pct, 100 - trending_pct - volatile_pct)
+    # Short-window mode: use fewer folds + lighter model to avoid overfitting
+    is_short_window = last_days is not None and len(X) < 200
+    n_splits = 3 if is_short_window else 5
+    xgb_params_cv = {**XGB_PARAMS, "n_estimators": 200, "max_depth": 3} if is_short_window else XGB_PARAMS
+    tscv = TimeSeriesSplit(n_splits=n_splits)
+    fold_accs: list[float] = []
+    best_features: list[str] | None = None
+    last_fold_idx = n_splits - 1
+    # ── cross-validation to find stable feature set ──
+    for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
+        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
+        y_train, y_val = y_enc[train_idx], y_enc[val_idx]
+        sw_train = sample_weights[train_idx]
+        # Skip folds where the validation set has fewer than 3 samples
+        if len(y_val) < 3:
+            continue
+        scaler_fold = StandardScaler()
+        X_tr_s = scaler_fold.fit_transform(X_train)
+        X_vl_s = scaler_fold.transform(X_val)
+        import xgboost as xgb
+        xgb_fold = xgb.XGBClassifier(**xgb_params_cv)
+        xgb_fold.fit(
+            X_tr_s, y_train,
+            sample_weight=sw_train,
+            eval_set=[(X_vl_s, y_val)],
+            verbose=False,
+        )
+        fold_acc = accuracy_score(y_val, xgb_fold.predict(X_vl_s))
+        fold_accs.append(fold_acc)
+        if fold == last_fold_idx:
+            best_features = _select_top_features(X, xgb_fold.feature_importances_)
+    if not fold_accs:
+        return {"symbol": symbol, "horizon": horizon, "error": "all_folds_skipped"}
+    cv_accuracy = float(np.mean(fold_accs))
+    log.info("%s %s: CV accuracy %.3f (folds: %s)",
+             symbol, horizon, cv_accuracy, [f"{a:.3f}" for a in fold_accs])
+    # Short window: use lighter final model to avoid overfitting on small data
+    if is_short_window:
+        XGB_PARAMS_BOOSTED  = {**XGB_PARAMS,  "n_estimators": 300, "max_depth": 4, "learning_rate": 0.03}
+        LGBM_PARAMS_BOOSTED = {**LGBM_PARAMS, "n_estimators": 300, "num_leaves": 15}
+    elif cv_accuracy < 0.90 and add_lag_features:
+        log.info("%s: below 90%%, boosting n_estimators to 1000", symbol)
+        XGB_PARAMS_BOOSTED  = {**XGB_PARAMS,  "n_estimators": 1000}
+        LGBM_PARAMS_BOOSTED = {**LGBM_PARAMS, "n_estimators": 1000}
+    else:
+        XGB_PARAMS_BOOSTED  = XGB_PARAMS
+        LGBM_PARAMS_BOOSTED = LGBM_PARAMS
+    # ── final training on the chronological train split using best_features ──
+    X_selected = X[best_features] if best_features else X
+    scaler = StandardScaler()
+    X_s = scaler.fit_transform(X_selected)
+    # Short window: 70/30 split to keep a meaningful test set; else 80/20
+    test_frac = 0.30 if is_short_window else 0.20
+    split = int(len(X_s) * (1 - test_frac))
+    X_train_f, X_test_f = X_s[:split], X_s[split:]
+    y_train_f, y_test_f = y_enc[:split], y_enc[split:]
+    sw_f = sample_weights[:split]
+    import xgboost as xgb
+    import lightgbm as lgb
+    xgb_model = xgb.XGBClassifier(**XGB_PARAMS_BOOSTED)
+    xgb_model.fit(
+        X_train_f, y_train_f,
+        sample_weight=sw_f,
+        eval_set=[(X_test_f, y_test_f)],
+        verbose=False,
+    )
+    lgbm_model = lgb.LGBMClassifier(**LGBM_PARAMS_BOOSTED)
+    lgbm_model.fit(
+        X_train_f, y_train_f,
+        sample_weight=sw_f,
+        eval_set=[(X_test_f, y_test_f)],
+        callbacks=[lgb.early_stopping(50, verbose=False), lgb.log_evaluation(period=-1)],
+    )
+    # Phase 6 Booster 5: isotonic calibration of the fitted XGBoost model.
+    # The test split doubles as calibration data (cv="prefit"), so the test metrics
+    # reported below are measured on data the calibrator has seen and are optimistic.
+    try:
+        xgb_calibrated = CalibratedClassifierCV(xgb_model, method="isotonic", cv="prefit")
+        xgb_calibrated.fit(X_test_f, y_test_f)
+    except Exception as exc:
+        log.warning("%s %s: calibration failed (%s), using uncalibrated XGBoost", symbol, horizon, exc)
+        xgb_calibrated = xgb_model  # fallback: uncalibrated
+    # Soft-voting ensemble on test set (calibrated XGB + raw LGBM)
+    xgb_proba  = xgb_calibrated.predict_proba(X_test_f)
+    lgbm_proba = lgbm_model.predict_proba(X_test_f)
+    ensemble_proba = (xgb_proba + lgbm_proba) / 2
+    ensemble_pred  = ensemble_proba.argmax(axis=1)
+    test_accuracy = _directional_accuracy(y_test_f, ensemble_pred)
+    sharpe = _sharpe_ratio(y.iloc[split:].reset_index(drop=True), ensemble_pred)
+    # Classification report
+    report = classification_report(
+        y_test_f, ensemble_pred,
+        target_names=["DOWN", "STABLE", "UP"],
+        output_dict=True,
+    )
+    # Feature importance (top 10 for report)
+    top10_features = (
+        pd.Series(xgb_model.feature_importances_, index=X_selected.columns)
+        .sort_values(ascending=False)
+        .head(10)
+        .to_dict()
+    )
+    log.info("%s %s: test accuracy=%.3f, Sharpe=%.2f", symbol, horizon, test_accuracy, sharpe)
+    # ── save artifacts ──
+    with open(MODELS_DIR / f"xgb_{symbol}_{horizon}.pkl", "wb") as f:
+        pickle.dump(xgb_calibrated, f)
+    with open(MODELS_DIR / f"lgbm_{symbol}_{horizon}.pkl", "wb") as f:
+        pickle.dump(lgbm_model, f)
+    with open(MODELS_DIR / f"scaler_{symbol}_{horizon}.pkl", "wb") as f:
+        pickle.dump(scaler, f)
+    with open(MODELS_DIR / f"feature_names_{symbol}_{horizon}.json", "w") as f:
+        json.dump(X_selected.columns.tolist(), f)
+    return {
+        "symbol":           symbol,
+        "horizon":          horizon,
+        "cv_accuracy":      round(cv_accuracy, 4),
+        "test_accuracy":    round(test_accuracy, 4),
+        "sharpe_ratio":     sharpe,
+        "n_features":       len(X_selected.columns),
+        "n_train_samples":  split,
+        "n_test_samples":   len(X_test_f),
+        "top10_features":   top10_features,
+        "classification_report": report,
+    }
+def train_all(horizons: list[str] | None = None, last_days: int | None = None) -> dict:
+    """
+    Train models for all 10 commodities and save accuracy report.
+    Args:
+        horizons:  List of horizons to train. Default: ["7d", "30d"]
+        last_days: If set, train each symbol on only the most recent N days.
+    Returns:
+        Dict mapping symbol → accuracy metrics per horizon.
+    """
+    if horizons is None:
+        horizons = ["7d", "30d"]
+    results: dict = {}
+    for symbol in ALL_SYMBOLS:
+        results[symbol] = {}
+        for horizon in horizons:
+            try:
+                metrics = train_symbol(symbol, horizon=horizon, last_days=last_days)
+                results[symbol][horizon] = metrics
+            except Exception as exc:
+                log.error("Failed to train %s %s: %s", symbol, horizon, exc)
+                results[symbol][horizon] = {"error": str(exc)}
+    # Save combined accuracy report
+    report_path = MODELS_DIR / "accuracy_report.json"
+    with open(report_path, "w") as f:
+        json.dump(results, f, indent=2, default=str)
+    log.info("Accuracy report saved to %s", report_path)
+    # Print summary table
+    print("\n" + "=" * 85)
+    print(f"{'Commodity':<15} {'7d Accuracy':>12} {'30d Accuracy':>13} {'Sharpe (7d)':>12} {'Samples':>8}")
+    print("=" * 85)
+    for symbol, res in results.items():
+        r7  = res.get("7d", {})
+        r30 = res.get("30d", {})
+        acc7  = f"{r7.get('test_accuracy', 0):.1%}" if "test_accuracy" in r7 else "ERR"
+        acc30 = f"{r30.get('test_accuracy', 0):.1%}" if "test_accuracy" in r30 else "ERR"
+        sh7   = f"{r7.get('sharpe_ratio', 0):.2f}"   if "sharpe_ratio"  in r7 else "ERR"
+        n     = r7.get("n_train_samples", 0)
+        print(f"{symbol:<15} {acc7:>12} {acc30:>13} {sh7:>12} {n:>8}")
+    print("=" * 85)
+    return results
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="CommodiSense model trainer")
+    parser.add_argument("--symbol",  default=None, help="Single symbol to train")
+    parser.add_argument("--horizon", default="both", choices=["7d", "30d", "both"])
+    parser.add_argument("--days",    default=None, type=int,
+                        help="Train on only the most recent N calendar days (short-window mode)")
+    args = parser.parse_args()
+    if args.symbol:
+        horizons = ["7d", "30d"] if args.horizon == "both" else [args.horizon]
+        for h in horizons:
+            result = train_symbol(args.symbol, horizon=h, last_days=args.days)
+            print(json.dumps({k: v for k, v in result.items()
+                               if k != "classification_report"}, indent=2, default=str))
+    else:
+        train_all(last_days=args.days)
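
The regime step in `_apply_commodity_weights` above is easier to check on a toy input. The following is a minimal sketch under stated assumptions, not the trainer's code: it applies the same 0.6 multiplier to VOLATILE rows (regime 2) and 1.2 to TRENDING rows (regime 1), then renormalises so the weights sum to the number of rows. The function name `reweight_by_regime` and the sample values are hypothetical.

```python
# Illustrative sketch only: reweight_by_regime is a hypothetical helper that
# mirrors the regime multipliers and renormalisation used in the trainer.

def reweight_by_regime(weights, regimes, volatile_mult=0.6, trending_mult=1.2):
    """Scale per-row weights by regime, then rescale so they sum to len(weights)."""
    scaled = []
    for w, regime in zip(weights, regimes):
        if regime == 2:        # VOLATILE: trust these rows less
            w *= volatile_mult
        elif regime == 1:      # TRENDING: mild up-weight
            w *= trending_mult
        scaled.append(w)       # regime 0 (RANGE_BOUND) is left unchanged
    total = sum(scaled)
    if total <= 0:
        return list(weights)   # nothing sensible to normalise against
    n = len(scaled)
    return [w / total * n for w in scaled]


weights = [1.0, 1.0, 1.0, 1.0]
regimes = [0, 1, 2, 0]         # range-bound, trending, volatile, range-bound
adjusted = reweight_by_regime(weights, regimes)

print([round(w, 4) for w in adjusted])
```

Because of the renormalisation, down-weighting volatile rows shifts relative emphasis between rows without changing the overall weight mass seen by the booster.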

requirements.txt ADDED Viewed

	@@ -0,0 +1,18 @@

+# CommodiSense Dashboard — Hugging Face Spaces deployment
+# Streamlit Cloud removed (forces Python 3.14, incompatible with numba ecosystem)
+# pandas-ta removed (requires numba which doesn't support Python 3.14)
+# shap removed (all versions require numba)
+# All technical indicators implemented with pure pandas/numpy
+duckdb>=0.10.0
+pandas>=2.0.0
+numpy>=1.24.0
+yfinance>=0.2.0
+requests>=2.28.0
+xgboost>=2.0.0
+lightgbm>=4.0.0
+scikit-learn>=1.3.0
+groq>=0.4.0
+streamlit>=1.28.0
+plotly>=5.15.0
+python-dotenv>=1.0.0

signals/__init__.py ADDED Viewed

File without changes

signals/macro_features.py ADDED Viewed

	@@ -0,0 +1,457 @@

+"""
+Macro Feature Engineering — COT, FRED, EIA, USDA signals.
+No lookahead guarantee: all features use data available at or before as_of_date.
+Missing data returns zero (model learns to weight it accordingly via has_*_data flags).
+Public API:
+    build_macro_dataframe(symbol, start_date, end_date) → pd.DataFrame  (training)
+    get_macro_features(symbol, as_of_date)              → dict            (inference)
+"""
+import logging
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from data.db import get_conn
+log = logging.getLogger(__name__)
+# Symbols that have EIA inventory data
+EIA_SYMBOL_MAP = {
+    "CL=F": "crude_stocks",
+    "NG=F": "natgas_storage",
+}
+# Symbols that have USDA crop data
+USDA_SYMBOLS = {"ZW=F", "ZC=F", "ZS=F", "CT=F"}
+# All macro feature names — used to guarantee consistent columns across training and inference
+ALL_MACRO_FEATURES = [
+    # COT
+    "cot_commercial_net",
+    "cot_commercial_net_pct",
+    "cot_mm_net",
+    "cot_mm_net_pct",
+    "cot_commercial_chg_1w",
+    "cot_mm_chg_1w",
+    "cot_open_interest",
+    "has_cot_data",
+    # FRED
+    "fred_dxy",
+    "fred_dxy_chg_1w",
+    "fred_dxy_chg_4w",
+    "fred_inflation_exp",
+    "fred_vix",
+    "fred_vix_chg_1w",
+    "fred_vix_high",
+    "fred_treasury_10y",
+    "fred_financial_stress",
+    "fred_indpro",
+    "fred_fedfunds",
+    "fred_yield_inv",
+    "fred_china_pmi",
+    "fred_copper_basis",
+    "has_fred_data",
+    # EIA
+    "eia_crude_stocks",
+    "eia_crude_chg_1w",
+    "eia_crude_vs_5yr",
+    "eia_crude_draw",
+    "eia_natgas_stocks",
+    "eia_natgas_chg_1w",
+    "eia_natgas_vs_5yr",
+    "eia_natgas_draw",
+    "has_eia_data",
+    # USDA
+    "usda_crop_good_exc",
+    "usda_crop_good_exc_chg",
+    "usda_stocks",
+    "usda_stocks_yoy",
+    "usda_production",
+    "has_usda_data",
+]
+_ZERO_ROW = {k: 0.0 for k in ALL_MACRO_FEATURES}
+# ── training dataframes ────────────────────────────────────────────────────────
+def _load_cot(symbol: str, start_date: str, end_date: str) -> pd.DataFrame:
+    conn = get_conn()
+    df = conn.execute("""
+        SELECT date,
+               commercial_net_long  AS cot_commercial_net,
+               commercial_net_pct   AS cot_commercial_net_pct,
+               mm_net_long          AS cot_mm_net,
+               mm_net_pct           AS cot_mm_net_pct,
+               commercial_chg_1w    AS cot_commercial_chg_1w,
+               mm_chg_1w            AS cot_mm_chg_1w,
+               open_interest        AS cot_open_interest
+        FROM cot_data
+        WHERE symbol = ? AND date >= ? AND date <= ?
+        ORDER BY date
+    """, [symbol, start_date, end_date]).df()
+    conn.close()
+    if df.empty:
+        return pd.DataFrame()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    df["has_cot_data"] = 1
+    return df.sort_values("date").reset_index(drop=True)
+def _load_fred(start_date: str, end_date: str) -> pd.DataFrame:
+    conn = get_conn()
+    # Try to select new columns; fall back gracefully if they don't exist yet
+    try:
+        df = conn.execute("""
+            SELECT date, dxy, inflation_exp, vix, treasury_10y,
+                   financial_stress, indpro, fedfunds, china_pmi, copper_basis
+            FROM fred_data
+            WHERE date >= ? AND date <= ?
+            ORDER BY date
+        """, [start_date, end_date]).df()
+    except Exception:
+        df = conn.execute("""
+            SELECT date, dxy, inflation_exp, vix, treasury_10y,
+                   financial_stress, indpro, fedfunds
+            FROM fred_data
+            WHERE date >= ? AND date <= ?
+            ORDER BY date
+        """, [start_date, end_date]).df()
+        df["china_pmi"]    = None
+        df["copper_basis"] = None
+    conn.close()
+    if df.empty:
+        return pd.DataFrame()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    df = df.sort_values("date").reset_index(drop=True)
+    for col in df.columns[1:]:
+        df[col] = df[col].ffill()
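+    # Row-based diffs: 5 and 20 rows approximate 1 and 4 weeks only if fred_data
+    # has one row per business day (inference uses 7 and 28 calendar days instead).
+    df["fred_dxy_chg_1w"]  = df["dxy"].diff(5)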
+    df["fred_dxy_chg_4w"]  = df["dxy"].diff(20)
+    df["fred_vix_chg_1w"]  = df["vix"].diff(5)
+    df["fred_vix_high"]    = (df["vix"] > 25).astype(float)
+    fedfunds_safe          = df["fedfunds"].fillna(0)
+    t10y_safe              = df["treasury_10y"].fillna(0)
+    df["fred_yield_inv"]   = (t10y_safe < fedfunds_safe).astype(float)
+    df["has_fred_data"]    = 1
+    return df.rename(columns={
+        "dxy":              "fred_dxy",
+        "inflation_exp":    "fred_inflation_exp",
+        "vix":              "fred_vix",
+        "treasury_10y":     "fred_treasury_10y",
+        "financial_stress": "fred_financial_stress",
+        "indpro":           "fred_indpro",
+        "fedfunds":         "fred_fedfunds",
+        "china_pmi":        "fred_china_pmi",
+        "copper_basis":     "fred_copper_basis",
+    })
+def _load_eia(symbol: str, start_date: str, end_date: str) -> pd.DataFrame:
+    series_name = EIA_SYMBOL_MAP.get(symbol)
+    if not series_name:
+        return pd.DataFrame()
+    conn = get_conn()
+    df = conn.execute("""
+        SELECT date, value, chg_1w, vs_5yr_avg
+        FROM eia_inventory
+        WHERE series = ? AND date >= ? AND date <= ?
+        ORDER BY date
+    """, [series_name, start_date, end_date]).df()
+    conn.close()
+    if df.empty:
+        return pd.DataFrame()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    prefix = "eia_crude" if symbol == "CL=F" else "eia_natgas"
+    df = df.rename(columns={
+        "value":      f"{prefix}_stocks",
+        "chg_1w":     f"{prefix}_chg_1w",
+        "vs_5yr_avg": f"{prefix}_vs_5yr",
+    })
+    # Drawdown flag: weekly change below -500 (a large inventory draw, read as a bullish supply signal)
+    chg_col = f"{prefix}_chg_1w"
+    df[f"{prefix}_draw"] = (df[chg_col].fillna(0) < -500).astype(float)
+    df["has_eia_data"] = 1
+    return df.sort_values("date").reset_index(drop=True)
+def _load_usda(symbol: str, start_date: str, end_date: str) -> pd.DataFrame:
+    if symbol not in USDA_SYMBOLS:
+        return pd.DataFrame()
+    conn = get_conn()
+    df = conn.execute("""
+        SELECT date, metric, value, yoy_chg_pct
+        FROM usda_crop
+        WHERE commodity = ? AND date >= ? AND date <= ?
+        ORDER BY date
+    """, [symbol, start_date, end_date]).df()
+    conn.close()
+    if df.empty:
+        return pd.DataFrame()
+    df["date"] = pd.to_datetime(df["date"]).dt.date
+    # Crop condition: sum % good + % excellent per date
+    cond = (
+        df[df["metric"].str.upper().str.contains("PCT GOOD|PCT EXCELLENT", na=False)]
+        .groupby("date")["value"].sum()
+        .reset_index()
+        .rename(columns={"value": "usda_crop_good_exc"})
+        .sort_values("date")
+    )
+    cond["usda_crop_good_exc_chg"] = cond["usda_crop_good_exc"].diff()
+    # Stocks
+    stk = (
+        df[df["metric"].str.upper().str.contains("STOCKS", na=False)]
+        .groupby("date")
+        .agg(usda_stocks=("value", "mean"), usda_stocks_yoy=("yoy_chg_pct", "mean"))
+        .reset_index()
+        .sort_values("date")
+    )
+    # Annual production (sparse here; forward-filled later in build_macro_dataframe)
+    prd = (
+        df[df["metric"].str.upper().str.contains("PRODUCTION", na=False)]
+        .groupby("date")
+        .agg(usda_production=("value", "mean"))
+        .reset_index()
+        .sort_values("date")
+    )
+    parts = [p for p in [cond, stk, prd] if not p.empty]
+    if not parts:
+        return pd.DataFrame()
+    result = parts[0]
+    for p in parts[1:]:
+        result = result.merge(p, on="date", how="outer")
+    result["has_usda_data"] = 1
+    return result.sort_values("date").reset_index(drop=True)
+def _safe_merge(base: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
+    """Left-merge other onto base by date, zero-fill NaN."""
+    if other.empty:
+        return base
+    merged = base.merge(other, on="date", how="left")
+    return merged.fillna(0)
+def build_macro_dataframe(symbol: str, start_date: str, end_date: str) -> pd.DataFrame:
+    """
+    Assemble all macro feature columns for a symbol over a date range.
+    Returns a DataFrame keyed on 'date' with one row per date in the spine
+    (FRED dates, else COT dates, else every calendar day). Missing data → zeros;
+    columns that are non-zero in fewer than 5% of rows are dropped.
+    Designed for left-joining onto the price feature matrix in feature_builder.
+    """
+    cot   = _load_cot(symbol, start_date, end_date)
+    fred  = _load_fred(start_date, end_date)
+    eia   = _load_eia(symbol, start_date, end_date)
+    usda  = _load_usda(symbol, start_date, end_date)
+    if all(df.empty for df in [cot, fred, eia, usda]):
+        return pd.DataFrame()
+    # Use FRED as the date spine (widest coverage); fall back to other sources
+    if not fred.empty:
+        base = fred[["date"]].copy()
+    elif not cot.empty:
+        base = cot[["date"]].copy()
+    else:
+        base = pd.DataFrame({"date": pd.date_range(start_date, end_date, freq="D").date})
+    df = base.copy()
+    df = _safe_merge(df, cot)
+    df = _safe_merge(df, fred)
+    df = _safe_merge(df, eia)
+    df = _safe_merge(df, usda)
+    # Ensure all expected columns present
+    for col in ALL_MACRO_FEATURES:
+        if col not in df.columns:
+            df[col] = 0.0
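+    # COT, EIA and USDA are lower-frequency; forward-fill within the merged frame.
+    # Note: zeros are treated as missing here, so a genuine 0 (e.g. a *_draw flag) is overwritten.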
+    cot_cols  = [c for c in df.columns if c.startswith("cot_")]
+    eia_cols  = [c for c in df.columns if c.startswith("eia_")]
+    usda_cols = [c for c in df.columns if c.startswith("usda_")]
+    for col_group in [cot_cols, eia_cols, usda_cols]:
+        df[col_group] = df[col_group].replace(0, float("nan")).ffill().fillna(0)
+    # Drop columns that are >95% zero — they have no signal and add noise.
+    # This auto-excludes EIA/USDA when no API keys are set.
+    feature_cols = [c for c in ALL_MACRO_FEATURES if c in df.columns]
+    nonzero_frac = (df[feature_cols].abs() > 0).mean()
+    active_cols  = nonzero_frac[nonzero_frac >= 0.05].index.tolist()
+    if not active_cols:
+        return pd.DataFrame()
+    return df[["date"] + active_cols].sort_values("date").reset_index(drop=True)
+# ── inference: single-row feature dict ────────────────────────────────────────
+def get_macro_features(symbol: str, as_of_date: str = None) -> dict:
+    """
+    Return a flat dict of all macro features for the given symbol and date.
+    Guaranteed to return all keys in ALL_MACRO_FEATURES (zeros for missing data).
+    """
+    target = as_of_date or date.today().isoformat()
+    conn   = get_conn()
+    result = dict(_ZERO_ROW)
+    # ── COT ──────────────────────────────────────────────────────────────────
+    row = conn.execute("""
+        SELECT commercial_net_long, commercial_net_pct, mm_net_long, mm_net_pct,
+               commercial_chg_1w, mm_chg_1w, open_interest
+        FROM cot_data WHERE symbol = ? AND date <= ?
+        ORDER BY date DESC LIMIT 1
+    """, [symbol, target]).fetchone()
+    if row:
+        result.update({
+            "cot_commercial_net":     row[0] or 0,
+            "cot_commercial_net_pct": row[1] or 0,
+            "cot_mm_net":             row[2] or 0,
+            "cot_mm_net_pct":         row[3] or 0,
+            "cot_commercial_chg_1w":  row[4] or 0,
+            "cot_mm_chg_1w":          row[5] or 0,
+            "cot_open_interest":      row[6] or 0,
+            "has_cot_data":           1.0,
+        })
+    # ── FRED ─────────────────────────────────────────────────────────────────
+    try:
+        fred_now = conn.execute("""
+            SELECT dxy, inflation_exp, vix, treasury_10y, financial_stress,
+                   indpro, fedfunds, china_pmi, copper_basis
+            FROM fred_data WHERE date <= ? ORDER BY date DESC LIMIT 1
+        """, [target]).fetchone()
+    except Exception:
+        fred_now = conn.execute("""
+            SELECT dxy, inflation_exp, vix, treasury_10y, financial_stress,
+                   indpro, fedfunds
+            FROM fred_data WHERE date <= ? ORDER BY date DESC LIMIT 1
+        """, [target]).fetchone()
+        fred_now = (fred_now + (None, None)) if fred_now else None
+    week_ago = (datetime.strptime(target, "%Y-%m-%d").date() - timedelta(days=7)).isoformat()
+    fred_wk   = conn.execute("""
+        SELECT dxy, vix FROM fred_data WHERE date <= ? ORDER BY date DESC LIMIT 1
+    """, [week_ago]).fetchone()
+    month_ago = (datetime.strptime(target, "%Y-%m-%d").date() - timedelta(days=28)).isoformat()
+    fred_mo   = conn.execute("""
+        SELECT dxy FROM fred_data WHERE date <= ? ORDER BY date DESC LIMIT 1
+    """, [month_ago]).fetchone()
+    if fred_now:
+        dxy   = fred_now[0] or 0
+        vix   = fred_now[2] or 0
+        t10y  = fred_now[3] or 0
+        ff    = fred_now[6] or 0
+        dxy_w = (fred_wk[0] or dxy) if fred_wk else dxy
+        vix_w = (fred_wk[1] or vix) if fred_wk else vix
+        dxy_m = (fred_mo[0] or dxy) if fred_mo else dxy
+        result.update({
+            "fred_dxy":              dxy,
+            "fred_dxy_chg_1w":       dxy - dxy_w,
+            "fred_dxy_chg_4w":       dxy - dxy_m,
+            "fred_inflation_exp":    fred_now[1] or 0,
+            "fred_vix":              vix,
+            "fred_vix_chg_1w":       vix - vix_w,
+            "fred_vix_high":         float(vix > 25),
+            "fred_treasury_10y":     t10y,
+            "fred_financial_stress": fred_now[4] or 0,
+            "fred_indpro":           fred_now[5] or 0,
+            "fred_fedfunds":         ff,
+            "fred_yield_inv":        float(t10y < ff),
+            "fred_china_pmi":        float(fred_now[7]) if fred_now[7] is not None else 0,
+            "fred_copper_basis":     float(fred_now[8]) if fred_now[8] is not None else 0,
+            "has_fred_data":         1.0,
+        })
+    # ── EIA ──────────────────────────────────────────────────────────────────
+    series_name = EIA_SYMBOL_MAP.get(symbol)
+    prefix      = "eia_crude" if symbol == "CL=F" else "eia_natgas"
+    if series_name:
+        eia_row = conn.execute("""
+            SELECT value, chg_1w, vs_5yr_avg FROM eia_inventory
+            WHERE series = ? AND date <= ? ORDER BY date DESC LIMIT 1
+        """, [series_name, target]).fetchone()
+        if eia_row:
+            chg = eia_row[1] or 0
+            result.update({
+                f"{prefix}_stocks":  eia_row[0] or 0,
+                f"{prefix}_chg_1w":  chg,
+                f"{prefix}_vs_5yr":  eia_row[2] or 0,
+                f"{prefix}_draw":    float(chg < -500),
+                "has_eia_data":      1.0,
+            })
+    # ── USDA ─────────────────────────────────────────────────────────────────
+    if symbol in USDA_SYMBOLS:
+        latest_date_row = conn.execute("""
+            SELECT MAX(date) FROM usda_crop WHERE commodity = ? AND date <= ?
+        """, [symbol, target]).fetchone()
+        latest = latest_date_row[0] if latest_date_row and latest_date_row[0] else None
+        if latest:
+            cond_row = conn.execute("""
+                SELECT SUM(value) FROM usda_crop
+                WHERE commodity = ? AND date = ?
+                  AND (UPPER(metric) LIKE '%PCT GOOD%' OR UPPER(metric) LIKE '%PCT EXCELLENT%')
+            """, [symbol, latest]).fetchone()
+            stk_row = conn.execute("""
+                SELECT AVG(value), AVG(yoy_chg_pct) FROM usda_crop
+                WHERE commodity = ? AND date = ? AND UPPER(metric) LIKE '%STOCKS%'
+            """, [symbol, latest]).fetchone()
+            # Previous week for crop condition change
+            prev_date = (datetime.strptime(str(latest), "%Y-%m-%d").date() - timedelta(days=7)).isoformat()
+            prev_cond = conn.execute("""
+                SELECT SUM(value) FROM usda_crop
+                WHERE commodity = ? AND date = ?
+                  AND (UPPER(metric) LIKE '%PCT GOOD%' OR UPPER(metric) LIKE '%PCT EXCELLENT%')
+            """, [symbol, prev_date]).fetchone()
+            prod_row = conn.execute("""
+                SELECT value FROM usda_crop
+                WHERE commodity = ? AND UPPER(metric) LIKE '%PRODUCTION%'
+                  AND date <= ?
+                ORDER BY date DESC LIMIT 1
+            """, [symbol, target]).fetchone()
+            crop_now  = float(cond_row[0])  if cond_row and cond_row[0]  else 0
+            crop_prev = float(prev_cond[0]) if prev_cond and prev_cond[0] else crop_now
+            result.update({
+                "usda_crop_good_exc":     crop_now,
+                "usda_crop_good_exc_chg": crop_now - crop_prev,
+                "usda_stocks":            float(stk_row[0]) if stk_row and stk_row[0] else 0,
+                "usda_stocks_yoy":        float(stk_row[1]) if stk_row and stk_row[1] else 0,
+                "usda_production":        float(prod_row[0]) if prod_row and prod_row[0] else 0,
+                "has_usda_data":          1.0,
+            })
+    conn.close()
+    return result
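Reviewer note on the forward-fill step in `build_macro_dataframe`: `replace(0, NaN).ffill()` cannot tell a genuine zero from a missing value, so a real `0` (for example a `*_draw` flag that switches off) is overwritten by the previous observation. One possible alternative, sketched below under assumptions, is an as-of join that carries the latest observation at or before each date forward and records availability separately. The helper name `align_to_daily` and the `has_data` column are hypothetical; `pd.merge_asof` with `direction="backward"` is the only library call relied on.

```python
import pandas as pd


def align_to_daily(daily_dates, sparse: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Carry the latest sparse observation (at or before each date) onto a daily spine."""
    spine = pd.DataFrame({"date": pd.to_datetime(list(daily_dates))})
    spine["date"] = spine["date"].astype("datetime64[ns]")
    spine = spine.sort_values("date")

    obs = sparse[["date", value_col]].copy()
    obs["date"] = pd.to_datetime(obs["date"]).astype("datetime64[ns]")
    obs = obs.sort_values("date")

    # direction="backward" only looks at observations dated <= the spine date,
    # so no future value can leak into an earlier row.
    out = pd.merge_asof(spine, obs, on="date", direction="backward")
    out["has_data"] = out[value_col].notna().astype(float)
    out[value_col] = out[value_col].fillna(0.0)
    return out.reset_index(drop=True)


if __name__ == "__main__":
    days = pd.date_range("2024-01-01", periods=10, freq="D")
    weekly = pd.DataFrame({"date": ["2024-01-02", "2024-01-09"],
                           "eia_crude_draw": [1.0, 0.0]})
    print(align_to_daily(days, weekly, "eia_crude_draw"))
```

With this shape, the second weekly report's genuine `0.0` replaces the earlier `1.0` from its own date onward, and rows before the first report are marked as having no data instead of being indistinguishable from a real zero.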

signals/nlp_events.py ADDED

	@@ -0,0 +1,313 @@

+"""
+Event Extractor — uses spaCy + rule-based patterns to detect commodity-relevant
+events in news headlines and classify them as BULLISH / BEARISH / NEUTRAL.
+Usage:
+    python signals/nlp_events.py          # process recent news_raw articles
+    python signals/nlp_events.py --limit 200
+"""
+import argparse
+import logging
+import sys
+from datetime import date, timedelta
+from pathlib import Path
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from data.db import get_conn, init_schema
+LOG_PATH = Path(__file__).parent.parent / "data" / "logs" / "events.log"
+LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_PATH), logging.StreamHandler()],
+)
+log = logging.getLogger(__name__)
+# ── event pattern definitions ──────────────────────────────────────────────────
+# Each entry: (event_type, trigger_phrases, default_direction, base_severity)
+EVENT_PATTERNS: list[tuple[str, list[str], str, int]] = [
+    ("SUPPLY_SHOCK",    ["production cut", "harvest failure", "pipeline explosion",
+                         "pipeline attack", "port strike", "port closure",
+                         "sanctions imposed", "export ban", "output cut",
+                         "supply disruption", "refinery fire", "mine closure"],     "BULLISH",  4),
+    ("SUPPLY_INCREASE", ["production increase", "record output", "supply glut",
+                         "oversupply", "output raised", "inventory build",
+                         "stockpile rise"],                                          "BEARISH",  3),
+    ("DEMAND_SURGE",    ["record imports", "stockpile build", "demand forecast raised",
+                         "strong demand", "demand surge", "buying spree"],           "BULLISH",  3),
+    ("DEMAND_DROP",     ["demand falls", "demand drop", "weak demand",
+                         "economic slowdown", "recession fears", "demand cut"],      "BEARISH",  3),
+    ("POLICY_CHANGE",   ["opec decision", "fed rate", "interest rate hike",
+                         "interest rate cut", "tariff imposed", "trade deal",
+                         "subsidy cut", "subsidy increase", "central bank"],         "NEUTRAL",  2),
+    ("WEATHER_EVENT",   ["drought", "flood", "frost", "la niña", "el niño",
+                         "monsoon failure", "heatwave", "crop damage",
+                         "hurricane", "cyclone", "typhoon"],                         "BULLISH",  4),
+    ("GEOPOLITICAL",    ["war", "armed conflict", "sanctions", "embargo",
+                         "coup", "invasion", "airstrike", "blockade"],               "BULLISH",  5),
+]
+# Commodity-specific policy direction overrides
+# (event_type, commodity) → direction
+POLICY_DIRECTION_OVERRIDES: dict[tuple[str, str], str] = {
+    ("POLICY_CHANGE", "GC=F"):    "BULLISH",   # rate cuts → gold up
+    ("POLICY_CHANGE", "USDINR=X"):"BEARISH",   # rate hikes → stronger USD → bearish INR
+    ("POLICY_CHANGE", "CL=F"):    "BEARISH",   # trade deal → supply up
+}
+# Region → commodities most affected by weather in that region
+REGION_COMMODITY_WEATHER: dict[str, list[str]] = {
+    "ukraine": ["ZW=F", "ZC=F"],
+    "russia":  ["ZW=F"],
+    "brazil":  ["ZS=F", "CT=F", "SB=F"],
+    "india":   ["CT=F", "SB=F"],
+    "us":      ["ZC=F", "ZS=F", "CL=F", "NG=F"],
+    "texas":   ["CL=F", "NG=F"],
+    "chile":   ["HG=F"],
+    "middle east": ["CL=F", "NG=F"],
+    "opec":    ["CL=F", "NG=F"],
+    "gulf":    ["CL=F", "NG=F"],
+}
+# Commodity keywords for tagging (same as news collector)
+COMMODITY_KEYWORDS: dict[str, list[str]] = {
+    "CL=F":    ["oil", "petroleum", "crude", "opec", "brent", "wti"],
+    "NG=F":    ["natural gas", "lng", "gas pipeline"],
+    "GC=F":    ["gold", "bullion", "safe haven"],
+    "ZW=F":    ["wheat", "grain", "flour"],
+    "ZC=F":    ["corn", "maize"],
+    "ZS=F":    ["soybean", "soy"],
+    "CT=F":    ["cotton"],
+    "SB=F":    ["sugar", "cane"],
+    "USDINR=X":["rupee", "inr", "india forex"],
+    "HG=F":    ["copper"],
+}
+# ── spaCy loader (lazy) ────────────────────────────────────────────────────────
+_nlp = None
+def _load_nlp():
+    global _nlp
+    if _nlp is None:
+        import spacy
+        try:
+            _nlp = spacy.load("en_core_web_sm")
+        except OSError:
+            log.warning("en_core_web_sm not found — run: python -m spacy download en_core_web_sm")
+            _nlp = None
+    return _nlp
+# ── helpers ────────────────────────────────────────────────────────────────────
+def _detect_commodities(text: str) -> list[str]:
+    lower = text.lower()
+    return [sym for sym, kws in COMMODITY_KEYWORDS.items() if any(k in lower for k in kws)]
+def _detect_location(text: str) -> str:
+    """Extract the first recognised location from text using spaCy GPE/LOC entities."""
+    nlp = _load_nlp()
+    if nlp is None:
+        return "unknown"
+    doc = nlp(text[:300])
+    for ent in doc.ents:
+        if ent.label_ in ("GPE", "LOC"):
+            return ent.text
+    return "unknown"
+def _resolve_direction(event_type: str, commodities: list[str], default: str) -> str:
+    """Apply commodity-specific overrides to the default direction."""
+    if not commodities:
+        return default
+    for commodity in commodities:
+        override = POLICY_DIRECTION_OVERRIDES.get((event_type, commodity))
+        if override:
+            return override
+    return default
+def _severity_from_text(text: str, base: int) -> int:
+    """Bump severity by 1 for each intensifier found in the text, capped at 5."""
+    intensifiers = ["massive", "unprecedented", "historic", "emergency",
+                    "catastrophic", "record", "major", "severe"]
+    lower = text.lower()
+    bump = sum(1 for w in intensifiers if w in lower)
+    return min(5, base + bump)
+# ── public API ─────────────────────────────────────────────────────────────────
+def extract_events(text: str, event_date: str) -> list[dict]:
+    """
+    Extract commodity-relevant events from a text string.
+    Args:
+        text:       Article headline or summary.
+        event_date: ISO date string "YYYY-MM-DD" for the event.
+    Returns:
+        List of dicts with keys: date, headline, event_type, commodity,
+        location, severity, direction, source.
+    """
+    lower = text.lower()
+    events: list[dict] = []
+    for evt_type, phrases, default_direction, base_severity in EVENT_PATTERNS:
+        matched_phrase = next((p for p in phrases if p in lower), None)
+        if not matched_phrase:
+            continue
+        commodities = _detect_commodities(text)
+        if not commodities:
+            # No keyword match: try to infer commodities from the detected location
+            # (this runs for every event type, not only weather/geopolitical)
+            location = _detect_location(text)
+            loc_lower = location.lower()
+            for region, syms in REGION_COMMODITY_WEATHER.items():
+                if region in loc_lower:
+                    commodities = syms
+                    break
+        if not commodities:
+            commodities = ["CL=F"]  # fallback to crude oil as most globally traded
+        direction = _resolve_direction(evt_type, commodities, default_direction)
+        severity = _severity_from_text(text, base_severity)
+        location = _detect_location(text)
+        for commodity in commodities:
+            events.append({
+                "date": event_date,
+                "headline": text[:500],
+                "event_type": evt_type,
+                "commodity": commodity,
+                "location": location,
+                "severity": severity,
+                "direction": direction,
+                "source": "nlp_events",
+            })
+    return events
+def process_batch(limit: int = 100) -> int:
+    """
+    Extract events from recent news_raw articles and store in extracted_events.
+    Args:
+        limit: Max articles to scan.
+    Returns:
+        Count of events extracted.
+    """
+    conn = get_conn()
+    df = conn.execute(
+        """
+        SELECT id, title, summary, published_date
+        FROM news_raw
+        ORDER BY published_date DESC
+        LIMIT ?
+        """,
+        [limit],
+    ).df()
+    conn.close()
+    if df.empty:
+        return 0
+    total_events = 0
+    conn = get_conn()
+    for _, row in df.iterrows():
+        text = f"{row.get('title', '')} {row.get('summary', '')}".strip()
+        pub = str(row.get("published_date", date.today()))[:10]
+        events = extract_events(text, pub)
+        for evt in events:
+            try:
+                conn.execute(
+                    """
+                    INSERT INTO extracted_events
+                        (date, headline, event_type, commodity, location,
+                         severity, direction, source)
+                    VALUES (?, ?, ?, ?, ?, ?, ?, ?)
+                    """,
+                    [
+                        evt["date"], evt["headline"], evt["event_type"],
+                        evt["commodity"], evt["location"], evt["severity"],
+                        evt["direction"], evt["source"],
+                    ],
+                )
+                total_events += 1
+            except Exception as exc:
+                log.debug("Event insert error: %s", exc)
+    conn.close()
+    log.info("Extracted %d events from %d articles", total_events, len(df))
+    return total_events
+def get_event_features(commodity: str, days: int = 30) -> pd.DataFrame:
+    """
+    Return aggregated event features for a commodity over a date window.
+    Args:
+        commodity: Ticker symbol, e.g. "CL=F"
+        days:      Look-back window in calendar days
+    Returns:
+        DataFrame with one row per date, columns:
+        event_count, bullish_count, bearish_count, max_severity,
+        direction_score (bullish=+1, bearish=-1, neutral=0, summed),
+        supply_shock_flag (1 if any SUPPLY_SHOCK that day),
+        policy_change_flag (1 if any POLICY_CHANGE that day)
+    """
+    cutoff = date.today() - timedelta(days=days)
+    conn = get_conn()
+    df = conn.execute(
+        """
+        SELECT date, event_type, direction, severity
+        FROM extracted_events
+        WHERE commodity = ? AND date >= ?
+        ORDER BY date
+        """,
+        [commodity, cutoff],
+    ).df()
+    conn.close()
+    if df.empty:
+        return pd.DataFrame(columns=[
+            "date", "event_count", "bullish_count", "bearish_count",
+            "max_severity", "direction_score", "supply_shock_flag", "policy_change_flag",
+        ])
+    df["dir_score"] = df["direction"].map({"BULLISH": 1, "BEARISH": -1, "NEUTRAL": 0}).fillna(0)
+    agg = df.groupby("date").agg(
+        event_count=("event_type", "count"),
+        bullish_count=("direction", lambda x: (x == "BULLISH").sum()),
+        bearish_count=("direction", lambda x: (x == "BEARISH").sum()),
+        max_severity=("severity", "max"),
+        direction_score=("dir_score", "sum"),
+        supply_shock_flag=("event_type", lambda x: int((x == "SUPPLY_SHOCK").any())),
+        policy_change_flag=("event_type", lambda x: int((x == "POLICY_CHANGE").any())),
+    ).reset_index()
+    return agg
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--limit", type=int, default=100)
+    args = parser.parse_args()
+    init_schema()
+    n = process_batch(limit=args.limit)
+    print(f"Extracted {n} events")
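Reviewer note on trigger matching: `extract_events` tests phrases with a plain substring check (`p in lower`), so a short trigger such as "war" also fires on "warning" or "award". A minimal sketch of a word-boundary variant follows. The helper name `find_trigger` is hypothetical and this is one possible approach, not the module's current behaviour.

```python
import re
from typing import List, Optional


def find_trigger(text: str, phrases: List[str]) -> Optional[str]:
    """Return the first phrase that occurs in text as whole words, else None."""
    lower = text.lower()
    for phrase in phrases:
        # re.escape guards against regex metacharacters in a phrase;
        # \b anchors the match at word boundaries on both sides.
        if re.search(rf"\b{re.escape(phrase)}\b", lower):
            return phrase
    return None


if __name__ == "__main__":
    print(find_trigger("Storm warning issued for Gulf ports", ["war", "embargo"]))
    print(find_trigger("War disrupts grain exports", ["war", "embargo"]))
```

The trade-off is that whole-word matching no longer catches inflected forms ("droughts" would not match "drought"), so plural variants would need to be listed explicitly or handled with an optional suffix in the pattern.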

signals/nlp_sentiment.py ADDED

	@@ -0,0 +1,337 @@

+"""
+NLP Sentiment Engine — scores commodity news articles using FinBERT,
+aggregates into daily sentiment features per commodity.
+Usage:
+    python signals/nlp_sentiment.py          # process all unscored articles
+    python signals/nlp_sentiment.py --limit 200
+"""
+import argparse
+import logging
+import sys
+from datetime import date, datetime, timedelta, timezone
+from pathlib import Path
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from data.db import get_conn, init_schema
+from data.collector_news import get_unprocessed_news, mark_processed
+LOG_PATH = Path(__file__).parent.parent / "data" / "logs" / "sentiment.log"
+LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+    handlers=[logging.FileHandler(LOG_PATH), logging.StreamHandler()],
+)
+log = logging.getLogger(__name__)
+# Max tokens for FinBERT input (model limit is 512, we use 256 for speed)
+MAX_TOKENS = 256
+# ── model loading (lazy, cached) ───────────────────────────────────────────────
+_pipeline = None
+def _load_pipeline():
+    """Load FinBERT pipeline once and cache it. Falls back to DistilBERT."""
+    global _pipeline
+    if _pipeline is not None:
+        return _pipeline
+    from transformers import pipeline as hf_pipeline
+    try:
+        log.info("Loading FinBERT (ProsusAI/finbert)...")
+        _pipeline = hf_pipeline(
+            "text-classification",
+            model="ProsusAI/finbert",
+            tokenizer="ProsusAI/finbert",
+            top_k=None,          # return all 3 class probabilities
+            device=-1,           # CPU
+            truncation=True,
+            max_length=MAX_TOKENS,
+        )
+        log.info("FinBERT loaded")
+    except Exception as exc:
+        log.warning("FinBERT load failed (%s), falling back to DistilBERT", exc)
+        _pipeline = hf_pipeline(
+            "text-classification",
+            model="distilbert-base-uncased-finetuned-sst-2-english",
+            top_k=None,
+            device=-1,
+            truncation=True,
+            max_length=MAX_TOKENS,
+        )
+        log.info("DistilBERT loaded as fallback")
+    return _pipeline
+# ── keyword-based baseline (for ensemble uncertainty check) ────────────────────
+_BULLISH_WORDS = {
+    "surge", "rally", "gain", "rise", "boom", "shortage", "record high",
+    "supply cut", "output cut", "strong demand", "bullish",
+}
+_BEARISH_WORDS = {
+    "fall", "drop", "crash", "decline", "surplus", "oversupply",
+    "demand drop", "weak demand", "bearish", "glut",
+}
+def _keyword_sentiment(text: str) -> float:
+    """Fast keyword-based sentiment score in [-1, +1]."""
+    lower = text.lower()
+    pos = sum(1 for w in _BULLISH_WORDS if w in lower)
+    neg = sum(1 for w in _BEARISH_WORDS if w in lower)
+    total = pos + neg
+    return (pos - neg) / total if total > 0 else 0.0
+# ── public API ─────────────────────────────────────────────────────────────────
+def score_article(text: str) -> float:
+    """
+    Score a single text string using FinBERT.
+    Returns:
+        Sentiment score in [-1.0, +1.0].
+        positive_prob - negative_prob from FinBERT.
+        Falls back to keyword score if model unavailable.
+    """
+    if not text or len(text.strip()) < 10:
+        return 0.0
+    try:
+        pipe = _load_pipeline()
+        # Pre-trim to 512 characters; the tokenizer then truncates to MAX_TOKENS tokens.
+        results = pipe(text[:512])[0]  # list of {label, score} dicts
+        scores = {r["label"].lower(): r["score"] for r in results}
+        # FinBERT labels: positive / negative / neutral
+        # DistilBERT labels: POSITIVE / NEGATIVE — normalize
+        pos = scores.get("positive", scores.get("label_1", 0.0))
+        neg = scores.get("negative", scores.get("label_0", 0.0))
+        ml_score = pos - neg  # range [-1, +1]
+        # Ensemble uncertainty check: if ML and keyword disagree strongly, use neutral
+        kw_score = _keyword_sentiment(text)
+        if abs(ml_score - kw_score) > 0.4 and abs(kw_score) > 0.1:
+            return 0.0
+        return round(ml_score, 4)
+    except Exception as exc:
+        log.debug("score_article error: %s", exc)
+        return _keyword_sentiment(text)
+def process_batch(limit: int = 100) -> int:
+    """
+    Score unprocessed articles from news_raw using batched inference and store
+    aggregated daily sentiment.
+    Batched pipeline call is 10-30x faster than scoring one article at a time.
+    Args:
+        limit: Max articles to process per call.
+    Returns:
+        Count of articles processed.
+    """
+    df = get_unprocessed_news(limit=limit)
+    if df.empty:
+        log.info("No unprocessed articles found")
+        return 0
+    log.info("Processing %d articles (batched)...", len(df))
+    # Build text list and IDs together
+    texts = [
+        f"{row.get('title', '')} {row.get('summary', '')}".strip()[:512]
+        for _, row in df.iterrows()
+    ]
+    ids = df["id"].tolist()
+    # Single batched pipeline call — far faster than N individual calls
+    pipe = _load_pipeline()
+    try:
+        batch_results = pipe(texts, batch_size=16, truncation=True)
+    except Exception as exc:
+        log.warning("Batched inference failed (%s), falling back to per-article", exc)
+        batch_results = [pipe(t)[0] for t in texts]
+    scores: list[float] = []
+    for text, result in zip(texts, batch_results):
+        try:
+            # result is a list of dicts when top_k=None
+            label_scores = {r["label"].lower(): r["score"] for r in result}
+            pos = label_scores.get("positive", label_scores.get("label_1", 0.0))
+            neg = label_scores.get("negative", label_scores.get("label_0", 0.0))
+            ml_score = pos - neg
+            kw_score = _keyword_sentiment(text)
+            if abs(ml_score - kw_score) > 0.4 and abs(kw_score) > 0.1:
+                scores.append(0.0)
+            else:
+                scores.append(round(ml_score, 4))
+        except Exception:
+            scores.append(_keyword_sentiment(text))
+    # Bulk update in one connection
+    conn = get_conn()
+    try:
+        for article_id, score in zip(ids, scores):
+            conn.execute(
+                "UPDATE news_raw SET sentiment_score = ?, processed = TRUE WHERE id = ?",
+                [score, article_id],
+            )
+    finally:
+        # Release the connection even if an update fails
+        conn.close()
+    _aggregate_daily_sentiment()
+    log.info("Processed %d articles", len(ids))
+    return len(ids)
+def _aggregate_daily_sentiment() -> None:
+    """
+    Recompute sentiment_daily table from scored news_raw rows.
+    Applies time-decay weights: 1.0 (<24h), 0.5 (24–48h), 0.25 (48h and older, up to the 7-day cutoff).
+    """
+    conn = get_conn()
+    # Get scored articles from last 7 days
+    cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d")
+    df = conn.execute(
+        """
+        SELECT id, published_date, commodity_tags, sentiment_score
+        FROM news_raw
+        WHERE processed = TRUE
+          AND sentiment_score IS NOT NULL
+          AND published_date >= ?
+        """,
+        [cutoff],
+    ).df()
+    conn.close()
+    if df.empty:
+        return
+    now = datetime.now(timezone.utc)
+    # Explode commodity tags — one row per commodity mention
+    records = []
+    for _, row in df.iterrows():
+        tags = str(row.get("commodity_tags") or "").split(",")
+        pub = row["published_date"]
+        if isinstance(pub, str):
+            try:
+                pub = datetime.fromisoformat(pub.replace("Z", "+00:00"))
+            except Exception:
+                pub = now
+        if pub.tzinfo is None:
+            pub = pub.replace(tzinfo=timezone.utc)
+        age_hours = (now - pub).total_seconds() / 3600
+        weight = 1.0 if age_hours < 24 else (0.5 if age_hours < 48 else 0.25)
+        for tag in tags:
+            tag = tag.strip()
+            if tag:
+                records.append({
+                    "date": pub.date(),
+                    "commodity": tag,
+                    "score": row["sentiment_score"],
+                    "weight": weight,
+                })
+    if not records:
+        return
+    df_exp = pd.DataFrame(records)
+    # Weighted average per (date, commodity)
+    def _wavg(g):
+        w = g["weight"]
+        s = g["score"]
+        total_w = w.sum()
+        return {
+            "sentiment_score": (s * w).sum() / total_w if total_w > 0 else 0.0,
+            "article_count": len(g),
+            "positive_count": int((s > 0.1).sum()),
+            "negative_count": int((s < -0.1).sum()),
+        }
+    summary = df_exp.groupby(["date", "commodity"]).apply(_wavg).reset_index()
+    conn = get_conn()
+    for _, row in summary.iterrows():
+        vals = row[0] if isinstance(row[0], dict) else row.to_dict()
+        conn.execute(
+            """
+            INSERT OR REPLACE INTO sentiment_daily
+                (date, commodity, sentiment_score, article_count,
+                 positive_count, negative_count)
+            VALUES (?, ?, ?, ?, ?, ?)
+            """,
+            [
+                row["date"],
+                row["commodity"],
+                vals.get("sentiment_score", 0.0),
+                vals.get("article_count", 0),
+                vals.get("positive_count", 0),
+                vals.get("negative_count", 0),
+            ],
+        )
+    conn.close()
+def get_sentiment_features(commodity: str, days: int = 30) -> pd.DataFrame:
+    """
+    Return daily sentiment features for a commodity with rolling averages.
+    Args:
+        commodity: Ticker symbol, e.g. "ZW=F"
+        days:      Look-back window in calendar days
+    Returns:
+        DataFrame with columns: date, sentiment_score, article_count,
+        sentiment_3d, sentiment_7d, positive_ratio_7d
+    """
+    cutoff = date.today() - timedelta(days=days)
+    conn = get_conn()
+    df = conn.execute(
+        """
+        SELECT * FROM sentiment_daily
+        WHERE commodity = ? AND date >= ?
+        ORDER BY date
+        """,
+        [commodity, cutoff],
+    ).df()
+    conn.close()
+    if df.empty:
+        return df
+    df = df.sort_values("date").reset_index(drop=True)
+    df["sentiment_3d"] = df["sentiment_score"].rolling(3, min_periods=1).mean()
+    df["sentiment_7d"] = df["sentiment_score"].rolling(7, min_periods=1).mean()
+    df["positive_ratio_7d"] = (
+        df["positive_count"].rolling(7, min_periods=1).sum()
+        / df["article_count"].rolling(7, min_periods=1).sum().replace(0, 1)
+    )
+    return df
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--limit", type=int, default=100)
+    args = parser.parse_args()
+    init_schema()
+    n = process_batch(limit=args.limit)
+    print(f"Processed {n} articles")
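
`_keyword_sentiment` above tests keywords with plain substring checks (`w in lower`), so "fall" also matches inside "rainfall" and "gain" inside "against". A minimal sketch of a whole-word variant follows. The trimmed word lists and the names `keyword_sentiment` and `_compile` are illustrative assumptions, not part of the commit.

```python
import re

# Hypothetical, trimmed-down word lists for illustration only.
BULLISH = {"surge", "rally", "gain", "rise", "shortage", "supply cut"}
BEARISH = {"fall", "drop", "crash", "decline", "glut", "weak demand"}


def _compile(words: set[str]) -> re.Pattern[str]:
    # \b anchors keep "fall" from matching inside "rainfall".
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    return re.compile(rf"\b(?:{alternatives})\b")


_BULL_RE = _compile(BULLISH)
_BEAR_RE = _compile(BEARISH)


def keyword_sentiment(text: str) -> float:
    """Score in [-1, +1] from distinct whole-word keyword hits."""
    lower = text.lower()
    pos = len(set(_BULL_RE.findall(lower)))  # distinct bullish keywords found
    neg = len(set(_BEAR_RE.findall(lower)))  # distinct bearish keywords found
    total = pos + neg
    return (pos - neg) / total if total else 0.0


if __name__ == "__main__":
    print(keyword_sentiment("Heavy rainfall expected across the region"))
    print(keyword_sentiment("Wheat prices surge on supply cut"))
```

The trade-off is that strict word boundaries also stop matching inflected forms such as "surges" or "falling", which the substring check does catch. Whether that is acceptable depends on the headlines being scored.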

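The time-decay aggregation in `_aggregate_daily_sentiment` can be illustrated without pandas. The sketch below uses the same step weights as the code (1.0, 0.5, then 0.25) and computes a weighted mean per (date, commodity). The function names and the tuple layout of `articles` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def decay_weight(age_hours: float) -> float:
    """Step weights: 1.0 under 24h, 0.5 under 48h, 0.25 otherwise."""
    if age_hours < 24:
        return 1.0
    if age_hours < 48:
        return 0.5
    return 0.25


def daily_weighted_sentiment(
    articles: list[tuple[datetime, str, float]],
    now: datetime,
) -> dict[tuple[date, str], float]:
    """Weighted mean score per (publication date, commodity)."""
    num: dict[tuple[date, str], float] = defaultdict(float)
    den: dict[tuple[date, str], float] = defaultdict(float)
    for published, commodity, score in articles:
        age_hours = (now - published).total_seconds() / 3600
        w = decay_weight(age_hours)
        key = (published.date(), commodity)
        num[key] += w * score
        den[key] += w
    # Every weight is positive, so each denominator is non-zero.
    return {key: num[key] / den[key] for key in num}


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2024, 6, 3, 12, 0, tzinfo=utc)
    articles = [
        (datetime(2024, 6, 2, 18, 0, tzinfo=utc), "ZW=F", 0.8),   # 18h old
        (datetime(2024, 6, 2, 10, 0, tzinfo=utc), "ZW=F", -0.4),  # 26h old
    ]
    print(daily_weighted_sentiment(articles, now))
```

Because rows are grouped by publication date, the weights inside one group differ only when articles from the same day fall on opposite sides of a 24h or 48h boundary, as in the usage example.
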
signals/price_features.py ADDED

	@@ -0,0 +1,365 @@

+"""
+Price Feature Engineer — computes all technical, momentum, seasonality, and
+cross-commodity features from stored price data.
+All features are derived from DuckDB prices table — no live API calls.
+Usage (standalone):
+    python signals/price_features.py --symbol GC=F --date 2024-06-01
+"""
+import argparse
+import json
+import logging
+import sys
+from datetime import date, datetime, timedelta
+from pathlib import Path
+import numpy as np
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from data.db import get_conn
+log = logging.getLogger(__name__)
+# ── commodity metadata ─────────────────────────────────────────────────────────
+SYMBOL_NAMES: dict[str, str] = {
+    "CL=F":    "Crude Oil",
+    "NG=F":    "Natural Gas",
+    "GC=F":    "Gold",
+    "ZW=F":    "Wheat",
+    "ZC=F":    "Corn",
+    "ZS=F":    "Soybeans",
+    "CT=F":    "Cotton",
+    "SB=F":    "Sugar",
+    "USDINR=X":"USD/INR",
+    "HG=F":    "Copper",
+}
+ALL_SYMBOLS = list(SYMBOL_NAMES.keys())
+# Harvest season windows (month_start, month_end) — inclusive
+HARVEST_SEASONS: dict[str, list[tuple[int, int]]] = {
+    "ZW=F":    [(6, 8)],          # Northern hemisphere wheat: June–August
+    "ZC=F":    [(9, 11)],         # US corn: September–November
+    "ZS=F":    [(9, 11), (3, 5)], # US + Brazil soy harvest windows
+    "CT=F":    [(9, 12)],         # US cotton: September–December
+    "SB=F":    [(4, 6), (10, 12)],# Brazil + India sugar
+}
+# OPEC meeting dates (ISO strings) — extend annually
+OPEC_MEETING_DATES: list[str] = [
+    "2024-06-02", "2024-11-26",
+    "2025-03-03", "2025-05-28", "2025-11-05",
+    "2026-03-02", "2026-06-01",
+]
+# ── helpers ────────────────────────────────────────────────────────────────────
+def _load_prices(symbol: str, days: int = 400) -> pd.DataFrame:
+    """
+    Load OHLCV data for a symbol from DuckDB.
+    Returns a DataFrame sorted by date ascending, covering the last `days` calendar days.
+    """
+    cutoff = date.today() - timedelta(days=days)
+    conn = get_conn()
+    df = conn.execute(
+        "SELECT date, open, high, low, close, volume, adj_close FROM prices "
+        "WHERE symbol = ? AND date >= ? ORDER BY date",
+        [symbol, cutoff],
+    ).df()
+    conn.close()
+    df["date"] = pd.to_datetime(df["date"])
+    return df.sort_values("date").reset_index(drop=True)
+def _load_all_prices_latest() -> dict[str, float]:
+    """Return latest close price for every symbol (used for cross-commodity ratios)."""
+    conn = get_conn()
+    rows = conn.execute(
+        """
+        SELECT symbol, close FROM prices p
+        WHERE date = (SELECT MAX(date) FROM prices p2 WHERE p2.symbol = p.symbol)
+        """
+    ).fetchall()
+    conn.close()
+    return {r[0]: r[1] for r in rows}
+def _days_to_next_opec(as_of: date) -> int:
+    """Return calendar days until the next OPEC meeting on or after `as_of`."""
+    future = [
+        (datetime.strptime(d, "%Y-%m-%d").date() - as_of).days
+        for d in OPEC_MEETING_DATES
+        if datetime.strptime(d, "%Y-%m-%d").date() >= as_of
+    ]
+    return min(future) if future else 180  # default if no upcoming date in list
+def _harvest_season_flag(symbol: str, month: int) -> int:
+    """Return 1 if `month` falls within any harvest window for the symbol."""
+    windows = HARVEST_SEASONS.get(symbol, [])
+    for start_m, end_m in windows:
+        if start_m <= month <= end_m:
+            return 1
+    return 0
+def _compute_ta_features(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Append technical analysis columns using pandas-ta.
+    Works on a copy of df — returns augmented DataFrame.
+    """
+    try:
+        import pandas_ta as ta
+        df = df.copy()
+        df.ta.rsi(length=14, append=True)           # RSI_14
+        df.ta.macd(fast=12, slow=26, signal=9, append=True)   # MACD_12_26_9, etc.
+        df.ta.bbands(length=20, std=2, append=True) # BBL_20_2.0, BBM_20_2.0, BBU_20_2.0
+        df.ta.atr(length=14, append=True)           # ATRr_14
+        df.ta.sma(length=20, append=True)           # SMA_20
+        df.ta.sma(length=50, append=True)           # SMA_50
+    except ImportError:
+        log.warning("pandas-ta not installed; TA columns will be missing and features fall back to defaults")
+    return df
+def _safe(val) -> float:
+    """Return 0.0 for NaN/None values to keep feature vector clean."""
+    if val is None:
+        return 0.0
+    try:
+        v = float(val)
+        return 0.0 if (v != v) else v  # NaN check without numpy
+    except (TypeError, ValueError):
+        return 0.0
+# ── public API ─────────────────────────────────────────────────────────────────
+def get_price_features(symbol: str, as_of_date: str | None = None) -> dict:
+    """
+    Compute all price-based features for a symbol on a given date.
+    Args:
+        symbol:      Commodity ticker, e.g. "GC=F"
+        as_of_date:  ISO date string. Defaults to today.
+    Returns:
+        Flat dict of feature_name → float value.
+        All values are guaranteed non-NaN (NaN → 0.0).
+    """
+    target_date = (
+        datetime.strptime(as_of_date, "%Y-%m-%d").date()
+        if as_of_date
+        else date.today()
+    )
+    df = _load_prices(symbol, days=400)
+    if df.empty or len(df) < 20:
+        log.warning("%s: insufficient price history for feature engineering", symbol)
+        return {}
+    df = _compute_ta_features(df)
+    # Use the latest row on or before target_date (never a later row, to avoid lookahead)
+    df["_date"] = df["date"].dt.date
+    available = df[df["_date"] <= target_date]
+    if available.empty:
+        return {}
+    row = available.iloc[-1]
+    idx = available.index[-1]
+    close = _safe(row["close"])
+    if close == 0:
+        return {}
+    # ── momentum / returns ──
+    def _pct_change(lookback_days: int) -> float:
+        past = df[df["_date"] <= (target_date - timedelta(days=lookback_days))]
+        if past.empty:
+            return 0.0
+        past_close = _safe(past.iloc[-1]["close"])
+        return round((close - past_close) / past_close * 100, 4) if past_close else 0.0
+    ret_1d  = _pct_change(1)
+    ret_7d  = _pct_change(7)
+    ret_14d = _pct_change(14)
+    ret_30d = _pct_change(30)
+    ret_60d = _pct_change(60)
+    momentum_score = float(np.sign(ret_7d) + np.sign(ret_30d))  # -2 to +2
+    # ── technical indicators from pandas-ta ──
+    rsi = _safe(row.get("RSI_14"))
+    macd = _safe(row.get("MACD_12_26_9"))
+    macd_signal_line = _safe(row.get("MACDs_12_26_9"))
+    macd_signal = 1 if macd > macd_signal_line else (-1 if macd < macd_signal_line else 0)
+    bb_lower = _safe(row.get("BBL_20_2.0"))
+    bb_upper = _safe(row.get("BBU_20_2.0"))
+    bb_range = bb_upper - bb_lower
+    bb_position = ((close - bb_lower) / bb_range) if bb_range > 0 else 0.5
+    atr = _safe(row.get("ATRr_14"))
+    atr_pct = (atr / close * 100) if close > 0 else 0.0
+    sma20 = _safe(row.get("SMA_20"))
+    sma50 = _safe(row.get("SMA_50"))
+    sma_20_50_cross = 1 if sma20 > sma50 else -1
+    # ── seasonality ──
+    month = target_date.month
+    day_of_week = target_date.weekday()  # 0=Monday
+    month_sin = float(np.sin(2 * np.pi * month / 12))
+    month_cos = float(np.cos(2 * np.pi * month / 12))
+    harvest_flag = _harvest_season_flag(symbol, month)
+    # Oil/gas: days to next OPEC meeting
+    days_opec = _days_to_next_opec(target_date) if symbol in ("CL=F", "NG=F") else 0
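+    # ── cross-commodity features ──
+    # NOTE: these are the latest stored closes, not closes as of target_date, so a
+    # historical as_of_date still sees present-day CL/GC prices here.
+    latest_prices = _load_all_prices_latest()
+    cl_price = latest_prices.get("CL=F", 0)
+    gc_price = latest_prices.get("GC=F", 0)
+    oil_gold_ratio = round(cl_price / gc_price, 6) if gc_price > 0 else 0.0
+    # DXY proxy: inverted gold price normalised (gold up → USD weak)
+    # NOTE: df holds `symbol`'s prices, so this mean is gold's historical mean
+    # only when symbol == "GC=F".
+    gc_hist_mean = df["close"].mean() if not df.empty else 1.0
+    dxy_proxy = round(1 - (gc_price / gc_hist_mean) if gc_hist_mean > 0 else 0.5, 4)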
+    return {
+        # Technical
+        "rsi_14":           round(rsi, 4),
+        "macd_signal":      macd_signal,
+        "bb_position":      round(bb_position, 4),
+        "atr_14":           round(atr, 4),
+        "atr_pct":          round(atr_pct, 4),
+        "sma_20_50_cross":  sma_20_50_cross,
+        # Momentum
+        "return_1d":        ret_1d,
+        "return_7d":        ret_7d,
+        "return_14d":       ret_14d,
+        "return_30d":       ret_30d,
+        "return_60d":       ret_60d,
+        "momentum_score":   momentum_score,
+        # Seasonality
+        "month_sin":        round(month_sin, 4),
+        "month_cos":        round(month_cos, 4),
+        "day_of_week":      day_of_week,
+        "harvest_season_flag": harvest_flag,
+        "days_to_opec_meeting": days_opec,
+        # Cross-commodity
+        "oil_gold_ratio":   oil_gold_ratio,
+        "dxy_proxy":        dxy_proxy,
+    }
+def build_feature_matrix(
+    symbol: str,
+    start_date: str,
+    end_date: str,
+) -> pd.DataFrame:
+    """
+    Build a feature matrix for model training — one row per trading day.
+    Args:
+        symbol:     Commodity ticker
+        start_date: ISO date string "YYYY-MM-DD"
+        end_date:   ISO date string "YYYY-MM-DD"
+    Returns:
+        DataFrame with one row per date, all price feature columns.
+        Does NOT include target variable — caller adds that.
+    """
+    start = datetime.strptime(start_date, "%Y-%m-%d").date()
+    end   = datetime.strptime(end_date, "%Y-%m-%d").date()
+    # Load full price history once
+    df_prices = _load_prices(symbol, days=(end - start).days + 500)
+    if df_prices.empty:
+        return pd.DataFrame()
+    df_prices = _compute_ta_features(df_prices)
+    df_prices["_date"] = df_prices["date"].dt.date
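+    # NOTE: cross-commodity inputs use the latest stored closes, so oil_gold_ratio
+    # and dxy_proxy are constant across all rows and reflect prices after row_date.
+    latest_prices = _load_all_prices_latest()
+    cl_price = latest_prices.get("CL=F", 0)
+    gc_price = latest_prices.get("GC=F", 0)
+    # NOTE: df_prices holds `symbol`'s prices, so this is gold's mean only for "GC=F".
+    gc_hist_mean = df_prices["close"].mean() if not df_prices.empty else 1.0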
+    rows: list[dict] = []
+    for _, price_row in df_prices.iterrows():
+        row_date = price_row["_date"]
+        if not (start <= row_date <= end):
+            continue
+        close = _safe(price_row["close"])
+        if close == 0:
+            continue
+        # Returns — look back within df_prices to avoid reloading
+        def _ret(days: int) -> float:
+            past = df_prices[df_prices["_date"] <= (row_date - timedelta(days=days))]
+            if past.empty:
+                return 0.0
+            pc = _safe(past.iloc[-1]["close"])
+            return round((close - pc) / pc * 100, 4) if pc else 0.0
+        ret_1d  = _ret(1)
+        ret_7d  = _ret(7)
+        ret_14d = _ret(14)
+        ret_30d = _ret(30)
+        ret_60d = _ret(60)
+        rsi = _safe(price_row.get("RSI_14"))
+        macd = _safe(price_row.get("MACD_12_26_9"))
+        macd_sig = _safe(price_row.get("MACDs_12_26_9"))
+        bb_lower = _safe(price_row.get("BBL_20_2.0"))
+        bb_upper = _safe(price_row.get("BBU_20_2.0"))
+        bb_range = bb_upper - bb_lower
+        atr = _safe(price_row.get("ATRr_14"))
+        sma20 = _safe(price_row.get("SMA_20"))
+        sma50 = _safe(price_row.get("SMA_50"))
+        month = row_date.month
+        rows.append({
+            "date":                 row_date,
+            "rsi_14":               round(rsi, 4),
+            "macd_signal":          1 if macd > macd_sig else (-1 if macd < macd_sig else 0),
+            "bb_position":          round((close - bb_lower) / bb_range, 4) if bb_range > 0 else 0.5,
+            "atr_14":               round(atr, 4),
+            "atr_pct":              round(atr / close * 100, 4) if close > 0 else 0.0,
+            "sma_20_50_cross":      1 if sma20 > sma50 else -1,
+            "return_1d":            ret_1d,
+            "return_7d":            ret_7d,
+            "return_14d":           ret_14d,
+            "return_30d":           ret_30d,
+            "return_60d":           ret_60d,
+            "momentum_score":       float(np.sign(ret_7d) + np.sign(ret_30d)),
+            "month_sin":            round(float(np.sin(2 * np.pi * month / 12)), 4),
+            "month_cos":            round(float(np.cos(2 * np.pi * month / 12)), 4),
+            "day_of_week":          row_date.weekday(),
+            "harvest_season_flag":  _harvest_season_flag(symbol, month),
+            "days_to_opec_meeting": _days_to_next_opec(row_date) if symbol in ("CL=F", "NG=F") else 0,
+            "oil_gold_ratio":       round(cl_price / gc_price, 6) if gc_price > 0 else 0.0,
+            "dxy_proxy":            round(1 - gc_price / gc_hist_mean, 4) if gc_hist_mean > 0 else 0.5,
+        })
+    return pd.DataFrame(rows)
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--symbol", default="GC=F")
+    parser.add_argument("--date", default=None)
+    args = parser.parse_args()
+    features = get_price_features(args.symbol, args.date)
+    print(json.dumps(features, indent=2))
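
The `month_sin` / `month_cos` pair above places months on a circle, so December and January end up as close as any other adjacent pair. The sketch below shows that property. It also includes a window check that handles ranges crossing the year end, which the current `HARVEST_SEASONS` entries do not need. `month_cyclical` and `in_window` are hypothetical names used only for this illustration.

```python
import math


def month_cyclical(month: int) -> tuple[float, float]:
    """Encode a month (1-12) as a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def in_window(month: int, start: int, end: int) -> bool:
    """Inclusive month window; supports wrap-around such as (11, 2)."""
    if start <= end:
        return start <= month <= end
    # Window crosses the year end, e.g. November through February.
    return month >= start or month <= end


if __name__ == "__main__":
    dec_jan = math.dist(month_cyclical(12), month_cyclical(1))
    jan_feb = math.dist(month_cyclical(1), month_cyclical(2))
    print(round(dec_jan, 4), round(jan_feb, 4))  # both distances are equal
    print(in_window(1, 11, 2), in_window(6, 11, 2))
```

A raw month number would put December (12) and January (1) eleven units apart, which is why the two-column encoding is usually preferred for distance-sensitive models.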

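Both `get_price_features` and `build_feature_matrix` read the cross-commodity closes from `_load_all_prices_latest()`, which returns the most recent close regardless of the requested date. One possible way to keep such features point-in-time is an as-of lookup over sorted dates. The following is a sketch under that assumption: `close_as_of`, `pct_return`, and `ratio_as_of` are hypothetical helpers, and the in-memory lists stand in for rows from the `prices` table.

```python
from bisect import bisect_right
from datetime import date, timedelta
from typing import Optional


def close_as_of(dates: list[date], closes: list[float], as_of: date) -> Optional[float]:
    """Latest close on or before `as_of`; `dates` must be sorted ascending."""
    i = bisect_right(dates, as_of)
    return closes[i - 1] if i else None


def pct_return(dates: list[date], closes: list[float], as_of: date, lookback_days: int) -> float:
    """Percent change between the as-of close and the close `lookback_days` earlier."""
    current = close_as_of(dates, closes, as_of)
    past = close_as_of(dates, closes, as_of - timedelta(days=lookback_days))
    if not current or not past:
        return 0.0  # mirror the 0.0 default used by the feature code
    return round((current - past) / past * 100, 4)


def ratio_as_of(num: tuple[list[date], list[float]],
                den: tuple[list[date], list[float]],
                as_of: date) -> float:
    """Ratio of two series, each evaluated as of the same date."""
    n = close_as_of(num[0], num[1], as_of)
    d = close_as_of(den[0], den[1], as_of)
    return round(n / d, 6) if n is not None and d else 0.0


if __name__ == "__main__":
    days = [date(2024, 5, 31), date(2024, 6, 3), date(2024, 6, 4)]
    oil = (days, [80.0, 85.0, 90.0])
    gold = (days, [2000.0, 2100.0, 2400.0])
    print(pct_return(days, oil[1], date(2024, 6, 4), 1))
    print(ratio_as_of(oil, gold, date(2024, 5, 31)))
    print(ratio_as_of(oil, gold, date(2024, 6, 4)))
```

With this shape, a training row dated 2024-05-31 would see the ratio from that day rather than the ratio implied by the newest stored prices.
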
signals/weather_features.py ADDED

	@@ -0,0 +1,118 @@

+"""
+Weather Features — thin wrapper that surfaces weather_features table data
+as commodity-specific signals for the feature builder.
+The heavy lifting (fetching + engineering drought_index etc.) is done in
+data/collector_weather.py. This module just shapes the data for ML consumption.
+"""
+import sys
+from datetime import date, timedelta
+from pathlib import Path
+import pandas as pd
+sys.path.insert(0, str(Path(__file__).parent.parent))
+from data.db import get_conn
+# Which regions matter most per commodity
+COMMODITY_REGIONS: dict[str, list[str]] = {
+    "CL=F":    ["middle_east_oil", "texas_energy"],
+    "NG=F":    ["texas_energy", "middle_east_oil"],
+    "GC=F":    ["south_africa_gold"],
+    "ZW=F":    ["black_sea_ukraine"],
+    "ZC=F":    ["us_corn_belt", "black_sea_ukraine"],
+    "ZS=F":    ["us_corn_belt", "brazil_soy"],
+    "CT=F":    ["india_monsoon", "brazil_soy"],
+    "SB=F":    ["india_monsoon", "brazil_soy"],
+    "USDINR=X":["india_monsoon"],
+    "HG=F":    ["chile_copper"],
+}
+def get_weather_features(commodity: str, days: int = 90) -> dict:
+    """
+    Return the latest aggregated weather signals for a commodity.
+    Averages drought_index, heat_stress_days, and precip_anomaly_pct across
+    the commodity's primary regions over the last 30 days.
+    Args:
+        commodity: Ticker symbol, e.g. "ZW=F"
+        days:      Currently unused; the signal window is fixed at the last 30 days
+    Returns:
+        Dict with keys: drought_index, heat_stress_days, precip_anomaly_pct.
+        Returns zeros if no data found.
+    """
+    regions = COMMODITY_REGIONS.get(commodity, [])
+    if not regions:
+        return {"drought_index": 0.0, "heat_stress_days": 0, "precip_anomaly_pct": 0.0}
+    cutoff = date.today() - timedelta(days=30)  # use last 30 days for signal
+    placeholders = ",".join(["?"] * len(regions))
+    conn = get_conn()
+    df = conn.execute(
+        f"""
+        SELECT drought_index, heat_stress_days, precip_anomaly_pct
+        FROM weather_features
+        WHERE commodity = ?
+          AND region IN ({placeholders})
+          AND date >= ?
+        """,
+        [commodity] + regions + [cutoff],
+    ).df()
+    conn.close()
+    if df.empty:
+        return {"drought_index": 0.0, "heat_stress_days": 0, "precip_anomaly_pct": 0.0}
+    return {
+        "drought_index":      round(float(df["drought_index"].mean()), 4),
+        "heat_stress_days":   int(df["heat_stress_days"].mean()),
+        "precip_anomaly_pct": round(float(df["precip_anomaly_pct"].mean()), 2),
+    }
+def get_weather_dataframe(commodity: str, days: int = 90) -> pd.DataFrame:
+    """
+    Return time-series weather data for a commodity (all relevant regions).
+    Used by the feature builder to join weather signals into the training matrix.
+    """
+    regions = COMMODITY_REGIONS.get(commodity, [])
+    if not regions:
+        return pd.DataFrame()
+    cutoff = date.today() - timedelta(days=days)
+    placeholders = ",".join(["?"] * len(regions))
+    conn = get_conn()
+    df = conn.execute(
+        f"""
+        SELECT date, region,
+               drought_index, heat_stress_days, precip_anomaly_pct
+        FROM weather_features
+        WHERE commodity = ?
+          AND region IN ({placeholders})
+          AND date >= ?
+        ORDER BY date
+        """,
+        [commodity] + regions + [cutoff],
+    ).df()
+    conn.close()
+    if df.empty:
+        return df
+    # Average across regions per date
+    return (
+        df.groupby("date")
+        .agg(
+            drought_index=("drought_index", "mean"),
+            heat_stress_days=("heat_stress_days", "mean"),
+            precip_anomaly_pct=("precip_anomaly_pct", "mean"),
+        )
+        .reset_index()
+        .sort_values("date")
+    )
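
The per-date averaging across regions in `get_weather_dataframe` can be sketched in plain Python to make the grouping explicit. This is an illustration only. The row dictionaries, the `average_by_date` name, and the 0.0 fallback for a field that no region reported are assumptions. The pandas version would yield NaN in that case.

```python
import math
from collections import defaultdict
from statistics import fmean

FIELDS = ("drought_index", "heat_stress_days", "precip_anomaly_pct")


def _valid(value) -> bool:
    """True for real numbers; False for None and NaN."""
    return value is not None and not math.isnan(value)


def average_by_date(rows: list[dict]) -> list[dict]:
    """Average each weather field across regions, one output row per date."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["date"]].append(row)

    out = []
    for day in sorted(grouped):  # ISO date strings sort chronologically
        record = {"date": day}
        for field in FIELDS:
            values = [r[field] for r in grouped[day] if _valid(r.get(field))]
            # Assumed fallback: 0.0 when no region reported this field.
            record[field] = fmean(values) if values else 0.0
        out.append(record)
    return out


if __name__ == "__main__":
    rows = [
        {"date": "2024-06-02", "region": "us_corn_belt",
         "drought_index": 0.25, "heat_stress_days": 2, "precip_anomaly_pct": -10.0},
        {"date": "2024-06-02", "region": "brazil_soy",
         "drought_index": 0.75, "heat_stress_days": 4, "precip_anomaly_pct": None},
        {"date": "2024-06-01", "region": "us_corn_belt",
         "drought_index": 0.5, "heat_stress_days": 0, "precip_anomaly_pct": 5.0},
    ]
    for record in average_by_date(rows):
        print(record)
```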