Spaces: khushalcodiste / testruk


khushalcodiste committed on Apr 18

Commit

a2f5871

1 Parent(s): c18b91b

Add application file


Files changed (14)

.dockerignore +15 -0
DEPLOY.md +92 -0
Dockerfile +31 -0
README.md +80 -8
app.py +364 -0
ml/__init__.py +0 -0
ml/features.py +226 -0
ml/features_v2.py +275 -0
models/gaussian_nb__parity.v2.joblib +3 -0
models/mlp__number.joblib +3 -0
models/svc__color.v2.joblib +3 -0
models/svc__column.v2.joblib +3 -0
models/xgboost__dozen.joblib +3 -0
requirements.txt +9 -0

.dockerignore ADDED

	@@ -0,0 +1,15 @@

+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+.pytest_cache
+.ruff_cache
+.mypy_cache
+.venv
+venv/
+env/
+.git
+.gitignore
+.DS_Store
+*.egg-info

DEPLOY.md ADDED

	@@ -0,0 +1,92 @@

+# Deploying to HuggingFace Spaces
+## Option A — web UI (easiest)
+1. Create a new Space at https://huggingface.co/new-space
+2. Owner: your username (or org). Name: e.g. `roulette-predictor`.
+3. SDK: **Docker**. License: MIT. Hardware: CPU basic (2 vCPU / 16 GB RAM is enough).
+4. Visibility: Public or Private.
+5. Click **Create Space**, then on the Space page choose **Files → Upload files**
+   and upload **everything inside `deployment/`**:
+   ```
+   app.py
+   requirements.txt
+   Dockerfile
+   README.md
+   .dockerignore
+   ml/
+   models/
+   ```
+   Keep directory structure (drag the `ml/` and `models/` folders as-is).
+6. HuggingFace will build the container automatically. First build takes
+   3–5 minutes. When it finishes, the Space serves at:
+   ```
+   https://<username>-roulette-predictor.hf.space
+   ```
+   Interactive docs live at `/docs`.
+## Option B — git push (repeatable)
+```bash
+# 1. Create the Space (Docker SDK) on the HF website first.
+# 2. Clone it locally:
+git clone https://huggingface.co/spaces/<username>/roulette-predictor
+cd roulette-predictor
+# 3. Copy all deployment/ files into this directory:
+cp -r /path/to/tej/deployment/. .   # "/." also copies dotfiles such as .dockerignore, which "/*" skips
+# 4. Commit and push:
+git add .
+git commit -m "initial deploy"
+git push                            # for models over 10 MB, set up git lfs before "git add" (see LFS note)
+```
+> **LFS note:** `svc__column.v2.joblib` is ~4 MB and `xgboost__dozen.joblib` is
+> ~3 MB — both fit under the 10 MB normal-git-push limit on HF. If you ever
+> add a model over 10 MB, run `git lfs install && git lfs track "models/*.joblib"`
+> first.
+## Sanity checks after deploy
+```bash
+export SPACE=https://<username>-roulette-predictor.hf.space
+# Health
+curl -s "$SPACE/" | jq
+# Predict via JSON
+curl -s -X POST "$SPACE/predict" \
+  -H 'Content-Type: application/json' \
+  -d '{"numbers":[28,35,36,31,12,17,12,34,6,10,15,14,19,19,22,2,9,11,33,16],"steps":10}' \
+  | jq
+# Predict via CSV upload (test.csv with a "Winner" or "number" column)
+curl -s -X POST "$SPACE/predict/file?steps=10" -F "file=@test.csv" | jq
+```
+## Local test before deploying
+```bash
+cd deployment
+docker build -t roulette-predictor .
+docker run --rm -p 7860:7860 roulette-predictor
+# then open http://localhost:7860/docs
+```
+## What's inside
+| File | Purpose |
+|------|---------|
+| `app.py` | FastAPI service with `/predict`, `/predict/file`, `/models`, `/` |
+| `Dockerfile` | Python 3.11-slim, non-root user UID 1000, port 7860 (HF convention) |
+| `requirements.txt` | fastapi, uvicorn, pandas, numpy, scikit-learn, xgboost, joblib |
+| `README.md` | HF Space metadata frontmatter + user docs |
+| `ml/features.py` | v1 hand-crafted features (window=10, 25 dims) |
+| `ml/features_v2.py` | v2 features (window=20, 51 dims, run-length, autocorrelation, wheel-neighbor) |
+| `models/*.joblib` | Five winning model artefacts (number, color, parity, dozen, column) |
+Total image size is ~1.3 GB (mostly sklearn + xgboost + numpy wheels).
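
Reviewer note on the sanity checks in DEPLOY.md: the JSON `curl` call can also be issued from Python. The sketch below uses only the standard library and assumes the `/predict` contract shown in this commit (`numbers` in 0 to 36, `steps` between 1 and 50). The helper name `build_predict_request` is hypothetical and not part of the repository.

```python
import json
import urllib.request


def build_predict_request(space_url, numbers, steps=10):
    """Build (but do not send) a POST request for the /predict endpoint.

    Hypothetical helper: it mirrors the validation rules of PredictRequest
    so obviously bad input fails before any network traffic happens.
    """
    if not numbers or any(not 0 <= n <= 36 for n in numbers):
        raise ValueError("numbers must be a non-empty list of integers in [0, 36]")
    if not 1 <= steps <= 50:
        raise ValueError("steps must be between 1 and 50")
    body = json.dumps({"numbers": list(numbers), "steps": steps}).encode("utf-8")
    return urllib.request.Request(
        url=space_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage against a running Space or local container (not executed here):
# req = build_predict_request("http://localhost:7860", [28, 35, 36, 31, 12])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["predictions"][0])
```

Building the request separately from sending it keeps the payload checkable without a live Space.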

Dockerfile ADDED

	@@ -0,0 +1,31 @@

+# syntax=docker/dockerfile:1.6
+FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1 \
+    HF_HOME=/tmp/hf \
+    XDG_CACHE_HOME=/tmp/cache
+# System deps: libgomp needed by XGBoost/LightGBM runtime.
+RUN apt-get update \
+ && apt-get install -y --no-install-recommends libgomp1 \
+ && rm -rf /var/lib/apt/lists/*
+# HuggingFace Spaces expects a non-root user with UID 1000 and writable /home.
+RUN useradd -m -u 1000 app
+WORKDIR /home/app
+COPY --chown=app:app requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY --chown=app:app app.py ./
+COPY --chown=app:app ml ./ml
+COPY --chown=app:app models ./models
+COPY --chown=app:app README.md ./README.md
+USER app
+EXPOSE 7860
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED

@@ -1,12 +1,84 @@
 ---
-title: Testruk
-emoji: 🏆
-colorFrom: yellow
-colorTo: blue
-sdk: gradio
-sdk_version: 6.12.0
-app_file: app.py
+title: Roulette Next-Spin Predictor
+emoji: 🎰
+colorFrom: red
+colorTo: gray
+sdk: docker
+app_port: 7860
 pinned: false
+license: mit
+short_description: Predict next roulette spins from a history of past numbers
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Roulette Next-Spin Predictor
+FastAPI service that predicts the next N roulette spins from a history of past
+winning numbers (European single-zero wheel, 0–36). The best-in-class model per
+target was picked from a sweep across **30+ algorithms**: classical ML
+(LogReg, RandomForest, XGBoost, LightGBM, CatBoost, SVC, ExtraTrees, KNN,
+GaussianNB, MultinomialNB, BernoulliNB, AdaBoost, DecisionTree, Ridge, SGD,
+MLP, HistGradientBoosting), deep learning (LSTM, GRU, Transformer, vanilla
+RNN, 1D-CNN, TabNet), and ensembling (stacking with LogReg meta-learner).
+## Endpoints
+| Method | Path | Purpose |
+|-------|------|---------|
+| `GET`  | `/` | Health check and route summary |
+| `GET`  | `/models` | Active model per target + rolling test accuracy |
+| `POST` | `/predict` | Predict from JSON `{numbers: [...], steps: N}` |
+| `POST` | `/predict/file` | Predict from uploaded CSV (column `Winner`/`number`) |
+| `GET`  | `/docs` | Interactive Swagger UI |
+## Example — curl
+```bash
+curl -X POST https://<your-space>.hf.space/predict \
+  -H 'Content-Type: application/json' \
+  -d '{"numbers":[28,35,36,31,12,17,12,34,6,10,15,14,19,19,22,2,9,11,33,16],"steps":10}'
+```
+## Example — file upload
+```bash
+curl -X POST "https://<your-space>.hf.space/predict/file?steps=10" \
+  -F "file=@test.csv"
+```
+The uploaded CSV must have a column named one of `Winner`, `winning number`,
+or `number` containing integers in `[0, 36]`. If none match, the last column
+is used.
+## Model selection (after the full sweep)
+| Target | Winning algorithm | Rolling test accuracy |
+|--------|-------------------|----------------------:|
+| number (0–36) | MLPClassifier | **4.16%** |
+| color (red/black/green) | SVC (RBF) | **52.63%** |
+| parity (odd/even/none) | GaussianNB | **51.88%** |
+| dozen (1st/2nd/3rd) | XGBoost | **38.14%** |
+| column (1st/2nd/3rd) | SVC (RBF) | **38.85%** |
+For parity, the best absolute accuracy came from a stacking ensemble (52.13%);
+GaussianNB is used in deployment because it is a single cheap model with
+almost identical accuracy.
+## Honest disclaimer
+On a fair roulette wheel, consecutive spins are **statistically independent**,
+so past outcomes carry no information about the next spin. The `number`
+accuracy sits just above the 2.70% (1/37) uniform-random baseline, and the
+higher per-target accuracies mostly reflect their coarser classes (a constant
+guess scores 18/37 ≈ 48.6% for color or parity and 12/37 ≈ 32.4% for dozen or
+column), not learned temporal patterns.
+**Do not gamble money based on these outputs.** The service is built for
+educational and demonstration purposes.
+## Local run
+```bash
+docker build -t roulette-predictor .
+docker run -p 7860:7860 roulette-predictor
+open http://localhost:7860/docs   # macOS; on Linux use xdg-open
+```
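
Reviewer note on the README's accuracy table and disclaimer: the reported accuracies are easier to judge next to the score of a constant guess. The sketch below computes those baselines and a rough binomial z-score for each target. The sample sizes are an assumption, not something the commit states: they take the 419-row test set mentioned in `app.py` and subtract each model's window (10 for v1, 20 for v2).

```python
import math

# Score of always guessing the most common class on a fair single-zero wheel.
BASELINES = {
    "number": 1 / 37,
    "color": 18 / 37,
    "parity": 18 / 37,
    "dozen": 12 / 37,
    "column": 12 / 37,
}

# Accuracies as reported in the README table.
REPORTED = {
    "number": 0.0416,
    "color": 0.5263,
    "parity": 0.5188,
    "dozen": 0.3814,
    "column": 0.3885,
}

# ASSUMPTION: evaluated samples = 419 test rows minus the feature window.
N_SAMPLES = {"number": 409, "color": 399, "parity": 399, "dozen": 409, "column": 399}


def z_score(accuracy, baseline, n):
    """Standard errors by which `accuracy` exceeds `baseline` over n trials."""
    standard_error = math.sqrt(baseline * (1 - baseline) / n)
    return (accuracy - baseline) / standard_error


for target, acc in REPORTED.items():
    base = BASELINES[target]
    z = z_score(acc, base, N_SAMPLES[target])
    print(f"{target:7s} baseline={base:.4f} reported={acc:.4f} z={z:+.2f}")
```

Because each winner was chosen as the best of 30+ algorithms, apparently on the same test set, a z-score from this calculation likely overstates the evidence for any real signal.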

app.py ADDED

	@@ -0,0 +1,364 @@

+"""Roulette next-spin prediction API.
+FastAPI server exposing the best per-target models selected after an exhaustive
+sweep across 30+ algorithms. Designed to run inside a HuggingFace Space with
+Docker SDK on port 7860.
+Endpoints
+---------
+GET  /              Health + metadata
+GET  /models        Active model selection and their rolling-test accuracies
+POST /predict       Predict next N spins from a JSON list of past numbers
+POST /predict/file  Predict next N spins from an uploaded CSV (as test.csv)
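+The recommended minimum context is 20 past spins (matching WINDOW_V2). The
+service automatically left-pads with zeros if the caller supplies fewer. The
+padding is indistinguishable from real green-zero spins, so short histories
+inflate the zero-count features.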
+"""
+from __future__ import annotations
+import io
+import logging
+from pathlib import Path
+from typing import Any
+import joblib
+import numpy as np
+import pandas as pd
+from fastapi import FastAPI, File, HTTPException, UploadFile
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel, Field, field_validator
+from ml.features import (
+    WINDOW,
+    _features_from_window,
+    derive_color,
+    derive_column,
+    derive_dozen,
+    derive_parity,
+)
+from ml.features_v2 import WINDOW_V2, _features_v2
+logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
+LOG = logging.getLogger("app")
+APP_ROOT = Path(__file__).resolve().parent
+MODELS_DIR = APP_ROOT / "models"
+# ---------------------------------------------------------------------------
+# Metadata recorded after the full v1+v2+v3 sweep on the 419-row test.csv
+# ---------------------------------------------------------------------------
+BEST_MODELS: dict[str, dict[str, Any]] = {
+    "number": {
+        "algo": "MLPClassifier",
+        "test_accuracy": 0.0416,
+        "feature_version": "v1",
+        "window": WINDOW,
+        "file": "mlp__number.joblib",
+    },
+    "color": {
+        "algo": "SVC (RBF)",
+        "test_accuracy": 0.5263,
+        "feature_version": "v2",
+        "window": WINDOW_V2,
+        "file": "svc__color.v2.joblib",
+    },
+    "parity": {
+        "algo": "GaussianNB",
+        "test_accuracy": 0.5188,
+        "feature_version": "v2",
+        "window": WINDOW_V2,
+        "file": "gaussian_nb__parity.v2.joblib",
+        "notes": "Best stacking was +0.25pp higher but needs 5 base models; GaussianNB chosen for deployment simplicity.",
+    },
+    "dozen": {
+        "algo": "XGBoost",
+        "test_accuracy": 0.3814,
+        "feature_version": "v1",
+        "window": WINDOW,
+        "file": "xgboost__dozen.joblib",
+    },
+    "column": {
+        "algo": "SVC (RBF)",
+        "test_accuracy": 0.3885,
+        "feature_version": "v2",
+        "window": WINDOW_V2,
+        "file": "svc__column.v2.joblib",
+    },
+}
+# Loaded at startup
+MODELS: dict[str, dict[str, Any]] = {}
+def load_models() -> None:
+    for target, spec in BEST_MODELS.items():
+        path = MODELS_DIR / spec["file"]
+        if not path.exists():
+            LOG.warning("Model file missing for %s: %s", target, path)
+            continue
+        MODELS[target] = joblib.load(path)
+        LOG.info("Loaded %s -> %s", target, path.name)
+app = FastAPI(
+    title="Roulette Next-Spin Predictor",
+    description=(
+        "Predict the next spins of a European single-zero roulette wheel from a "
+        "history of past winning numbers. Best-in-class models per target selected "
+        "from a 30+ algorithm sweep."
+    ),
+    version="1.0.0",
+)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+@app.on_event("startup")
+def _startup() -> None:
+    load_models()
+    LOG.info("Startup complete. Models: %s", list(MODELS.keys()))
+# ---------------------------------------------------------------------------
+# Schemas
+# ---------------------------------------------------------------------------
+class PredictRequest(BaseModel):
+    numbers: list[int] = Field(
+        ...,
+        description="Sequence of past winning numbers (0-36). Most recent spin goes last.",
+        examples=[[28, 35, 36, 31, 12, 17, 12, 34, 6, 10, 15, 14, 19, 19, 22, 2, 9, 11, 33, 16]],
+    )
+    steps: int = Field(10, ge=1, le=50, description="How many future spins to forecast.")
+    @field_validator("numbers")
+    @classmethod
+    def _check_numbers(cls, v: list[int]) -> list[int]:
+        if not v:
+            raise ValueError("numbers cannot be empty")
+        if any(n < 0 or n > 36 for n in v):
+            raise ValueError("all numbers must be in [0, 36]")
+        return v
+class Prediction(BaseModel):
+    step: int
+    predicted_number: int
+    top3_numbers: list[int]
+    number_confidence: float
+    predicted_color: str
+    predicted_parity: str
+    predicted_dozen: str
+    predicted_column: str
+    derived_from_number_color: str
+    derived_from_number_parity: str
+    derived_from_number_dozen: str
+    derived_from_number_column: str
+class PredictResponse(BaseModel):
+    model_config = {"protected_namespaces": ()}
+    model_info: dict[str, Any]
+    predictions: list[Prediction]
+    notes: list[str]
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _prepare_windows(sequence: list[int]) -> tuple[np.ndarray, np.ndarray]:
+    """Return (window_v1, window_v2) of the last WINDOW / WINDOW_V2 numbers.
+    Pads with leading zeros if the input is shorter than required.
+    """
+    arr = np.asarray(sequence, dtype=np.int64)
+    if len(arr) < WINDOW_V2:
+        pad = np.zeros(WINDOW_V2 - len(arr), dtype=np.int64)
+        arr = np.concatenate([pad, arr])
+    w_v2 = arr[-WINDOW_V2:]
+    w_v1 = arr[-WINDOW:]
+    return w_v1, w_v2
+def _softmax(x: np.ndarray) -> np.ndarray:
+    e = np.exp(x - x.max())
+    return e / e.sum()
+def _predict_number(w_v1: np.ndarray) -> tuple[int, list[int], float]:
+    bundle = MODELS["number"]
+    model, scaler = bundle["model"], bundle.get("scaler")
+    feats = _features_from_window(w_v1).reshape(1, -1)
+    if scaler is not None:
+        feats = scaler.transform(feats)
+    if hasattr(model, "predict_proba"):
+        proba = model.predict_proba(feats)[0]
+    else:
+        logits = np.atleast_1d(model.decision_function(feats)[0])
+        proba = _softmax(logits)
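+    # Columns of `proba` follow model.classes_, which can skip numbers absent
+    # from training, so scatter by class label rather than by position.
+    classes = np.asarray(getattr(model, "classes_", np.arange(len(proba))), dtype=int)
+    if len(proba) < 37 and len(classes) == len(proba):
+        padded = np.zeros(37)
+        padded[classes] = proba
+        proba = padded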
+    n = int(np.argmax(proba))
+    top3 = [int(i) for i in np.argsort(proba)[-3:][::-1]]
+    return n, top3, float(proba[n])
+def _predict_target_v2(target: str, w_v2: np.ndarray) -> int:
+    bundle = MODELS[target]
+    model, scaler = bundle["model"], bundle.get("scaler")
+    feats = _features_v2(w_v2).reshape(1, -1)
+    if scaler is not None:
+        feats = scaler.transform(feats)
+    return int(model.predict(feats)[0])
+def _predict_target_v1(target: str, w_v1: np.ndarray) -> int:
+    bundle = MODELS[target]
+    model, scaler = bundle["model"], bundle.get("scaler")
+    feats = _features_from_window(w_v1).reshape(1, -1)
+    if scaler is not None:
+        feats = scaler.transform(feats)
+    return int(model.predict(feats)[0])
+COLOR_LABELS = ("red", "black", "green")
+PARITY_LABELS = ("odd", "even", "none")
+DOZEN_LABELS = ("first", "second", "third", "none")
+COLUMN_LABELS = ("first", "second", "third", "none")
+def _predict_one_step(w_v1: np.ndarray, w_v2: np.ndarray, step: int) -> Prediction:
+    num, top3, conf = _predict_number(w_v1)
+    return Prediction(
+        step=step,
+        predicted_number=num,
+        top3_numbers=top3,
+        number_confidence=conf,
+        predicted_color=COLOR_LABELS[_predict_target_v2("color", w_v2)],
+        predicted_parity=PARITY_LABELS[_predict_target_v2("parity", w_v2)],
+        predicted_dozen=DOZEN_LABELS[_predict_target_v1("dozen", w_v1)],
+        predicted_column=COLUMN_LABELS[_predict_target_v2("column", w_v2)],
+        derived_from_number_color=derive_color(num),
+        derived_from_number_parity=derive_parity(num),
+        derived_from_number_dozen=derive_dozen(num),
+        derived_from_number_column=derive_column(num),
+    )
+def _forecast(sequence: list[int], steps: int) -> list[Prediction]:
+    w_v1, w_v2 = _prepare_windows(sequence)
+    out: list[Prediction] = []
+    for step in range(1, steps + 1):
+        pred = _predict_one_step(w_v1, w_v2, step)
+        out.append(pred)
+        w_v1 = np.append(w_v1[1:], pred.predicted_number)
+        w_v2 = np.append(w_v2[1:], pred.predicted_number)
+    return out
+# ---------------------------------------------------------------------------
+# Routes
+# ---------------------------------------------------------------------------
+@app.get("/")
+def root() -> dict[str, Any]:
+    return {
+        "service": "Roulette Next-Spin Predictor",
+        "version": "1.0.0",
+        "wheel": "European single-zero (0-36)",
+        "endpoints": {
+            "GET /models": "Active models and their rolling-test accuracy",
+            "POST /predict": "Predict from JSON {numbers: [...], steps: N}",
+            "POST /predict/file": "Predict from uploaded CSV (column 'Winner' or 'number')",
+            "GET /docs": "Interactive Swagger UI",
+        },
+        "models_loaded": list(MODELS.keys()),
+    }
+@app.get("/models")
+def model_info() -> dict[str, Any]:
+    return {
+        "targets": BEST_MODELS,
+        "disclaimer": (
+            "Roulette on a fair wheel produces independent draws. The 'number' "
+            "prediction accuracy (~4%) is only marginally above the 2.70% uniform "
+            "random baseline. Higher per-target accuracies come largely from the "
+            "wheel's structural class imbalance (18 red / 18 black / 1 green), not "
+            "from learned patterns. Do not gamble money based on these outputs."
+        ),
+    }
+@app.post("/predict", response_model=PredictResponse)
+def predict(req: PredictRequest) -> PredictResponse:
+    if any(t not in MODELS for t in BEST_MODELS):  # a partial load would otherwise surface as a KeyError (HTTP 500)
+        raise HTTPException(status_code=503, detail="models not loaded")
+    notes: list[str] = []
+    if len(req.numbers) < WINDOW_V2:
+        notes.append(
+            f"Input had {len(req.numbers)} numbers; padded with leading zeros up to {WINDOW_V2} for the v2 window."
+        )
+    preds = _forecast(req.numbers, req.steps)
+    return PredictResponse(
+        model_info={t: {"algo": s["algo"], "test_accuracy": s["test_accuracy"]} for t, s in BEST_MODELS.items()},
+        predictions=preds,
+        notes=notes,
+    )
+@app.post("/predict/file", response_model=PredictResponse)
+async def predict_file(file: UploadFile = File(...), steps: int = 10) -> PredictResponse:
+    if any(t not in MODELS for t in BEST_MODELS):  # a partial load would otherwise surface as a KeyError (HTTP 500)
+        raise HTTPException(status_code=503, detail="models not loaded")
+    try:
+        content = await file.read()
+        df = pd.read_csv(io.BytesIO(content))
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail=f"could not read CSV: {exc}") from exc
+    col = next(
+        (c for c in df.columns if c.lower() in {"winner", "winning number", "number"}),
+        None,
+    )
+    if col is None:
+        col = df.columns[-1]
+    try:
+        numbers = [int(x) for x in df[col].tolist()]
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail=f"column {col!r} is not integer-coercible: {exc}") from exc
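+    if not numbers:  # match the JSON endpoint, which rejects an empty history
+        raise HTTPException(status_code=400, detail=f"column {col!r} contains no rows")
+    if any(n < 0 or n > 36 for n in numbers):
+        raise HTTPException(status_code=400, detail="values must be in [0, 36]")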
+    if steps < 1 or steps > 50:
+        raise HTTPException(status_code=400, detail="steps must be between 1 and 50")
+    notes = [f"Loaded column {col!r} with {len(numbers)} rows from upload."]
+    if len(numbers) < WINDOW_V2:
+        notes.append(f"Padded to window={WINDOW_V2} with leading zeros.")
+    preds = _forecast(numbers, steps)
+    return PredictResponse(
+        model_info={t: {"algo": s["algo"], "test_accuracy": s["test_accuracy"]} for t, s in BEST_MODELS.items()},
+        predictions=preds,
+        notes=notes,
+    )
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run("app:app", host="0.0.0.0", port=7860, reload=False)
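
Reviewer note on `_prepare_windows` and `_forecast`: the multi-step forecast is autoregressive, so each predicted number is pushed back into both windows before the next step. The pure-Python sketch below reproduces that control flow with a stub in place of the trained models. `stub_predict` is hypothetical and exists only to make the rollout visible.

```python
WINDOW_V1 = 10  # mirrors WINDOW in ml/features.py
WINDOW_V2 = 20  # mirrors WINDOW_V2 in ml/features_v2.py


def prepare_windows(sequence):
    """Left-pad with zeros up to WINDOW_V2, then slice both windows from the tail."""
    padded = [0] * max(0, WINDOW_V2 - len(sequence)) + list(sequence)
    return padded[-WINDOW_V1:], padded[-WINDOW_V2:]


def forecast(sequence, steps, predict):
    """Roll both windows forward, feeding each prediction back in as history."""
    w_v1, w_v2 = prepare_windows(sequence)
    out = []
    for _ in range(steps):
        nxt = predict(w_v1, w_v2)
        out.append(nxt)
        w_v1 = w_v1[1:] + [nxt]  # drop the oldest spin, append the forecast
        w_v2 = w_v2[1:] + [nxt]
    return out


def stub_predict(w_v1, w_v2):
    """Hypothetical stand-in for the models: last number plus one, wrapped to 0-36."""
    return (w_v1[-1] + 1) % 37


print(prepare_windows([1, 2, 3])[0])    # v1 window: seven padding zeros, then 1, 2, 3
print(forecast([5], 3, stub_predict))
```

After the first step every prediction depends on earlier predictions rather than observed spins, so any error in one step carries into all later steps.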

ml/__init__.py ADDED

File without changes

ml/features.py ADDED

	@@ -0,0 +1,226 @@

+"""Sliding-window feature engineering for roulette next-spin prediction.
+Features are built per source sequence so windows never cross source boundaries.
+Every row of the feature matrix holds the last ``WINDOW`` winning numbers plus
+derived counts; labels are the next spin's number, color, parity, dozen, column.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Final
+import numpy as np
+import pandas as pd
+WINDOW: Final[int] = 10
+RED_NUMBERS: Final[frozenset[int]] = frozenset(
+    {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
+)
+NUMBER_CLASSES: Final[int] = 37
+COLOR_CLASSES: Final[tuple[str, ...]] = ("red", "black", "green")
+PARITY_CLASSES: Final[tuple[str, ...]] = ("odd", "even", "none")
+DOZEN_CLASSES: Final[tuple[str, ...]] = ("first", "second", "third", "none")
+COLUMN_CLASSES: Final[tuple[str, ...]] = ("first", "second", "third", "none")
+TARGETS: Final[tuple[str, ...]] = ("number", "color", "parity", "dozen", "column")
+@dataclass(frozen=True)
+class WindowedDataset:
+    X: np.ndarray
+    y: dict[str, np.ndarray]
+    feature_names: list[str]
+def derive_color(n: int) -> str:
+    if n == 0:
+        return "green"
+    return "red" if n in RED_NUMBERS else "black"
+def derive_parity(n: int) -> str:
+    if n == 0:
+        return "none"
+    return "even" if n % 2 == 0 else "odd"
+def derive_dozen(n: int) -> str:
+    if n == 0:
+        return "none"
+    if n <= 12:
+        return "first"
+    if n <= 24:
+        return "second"
+    return "third"
+def derive_column(n: int) -> str:
+    if n == 0:
+        return "none"
+    rem = n % 3
+    return "first" if rem == 1 else ("second" if rem == 2 else "third")
+def encode_label(value: str, classes: tuple[str, ...]) -> int:
+    return classes.index(value)
+def build_windows_from_sequence(numbers: np.ndarray, window: int = WINDOW) -> WindowedDataset:
+    """Build sliding-window features from a single contiguous number sequence."""
+    if len(numbers) <= window:
+        empty_y = {name: np.empty(0, dtype=np.int64) for name in TARGETS}
+        return WindowedDataset(
+            X=np.empty((0, _feature_count(window)), dtype=np.float32),
+            y=empty_y,
+            feature_names=_feature_names(window),
+        )
+    n_samples = len(numbers) - window
+    X = np.empty((n_samples, _feature_count(window)), dtype=np.float32)
+    y_number = np.empty(n_samples, dtype=np.int64)
+    y_color = np.empty(n_samples, dtype=np.int64)
+    y_parity = np.empty(n_samples, dtype=np.int64)
+    y_dozen = np.empty(n_samples, dtype=np.int64)
+    y_column = np.empty(n_samples, dtype=np.int64)
+    for i in range(n_samples):
+        win = numbers[i : i + window]
+        target = int(numbers[i + window])
+        X[i] = _features_from_window(win)
+        y_number[i] = target
+        y_color[i] = encode_label(derive_color(target), COLOR_CLASSES)
+        y_parity[i] = encode_label(derive_parity(target), PARITY_CLASSES)
+        y_dozen[i] = encode_label(derive_dozen(target), DOZEN_CLASSES)
+        y_column[i] = encode_label(derive_column(target), COLUMN_CLASSES)
+    return WindowedDataset(
+        X=X,
+        y={
+            "number": y_number,
+            "color": y_color,
+            "parity": y_parity,
+            "dozen": y_dozen,
+            "column": y_column,
+        },
+        feature_names=_feature_names(window),
+    )
+def build_windows_grouped(
+    df: pd.DataFrame,
+    number_col: str = "number",
+    group_col: str = "source",
+    window: int = WINDOW,
+) -> WindowedDataset:
+    """Build windows per source group and concatenate. Never crosses sources."""
+    parts: list[WindowedDataset] = []
+    for _, group in df.groupby(group_col, sort=False):
+        numbers = group[number_col].to_numpy(dtype=np.int64)
+        parts.append(build_windows_from_sequence(numbers, window=window))
+    non_empty = [p for p in parts if len(p.X) > 0]
+    if not non_empty:
+        return build_windows_from_sequence(np.empty(0, dtype=np.int64), window=window)
+    X = np.vstack([p.X for p in non_empty])
+    y = {
+        name: np.concatenate([p.y[name] for p in non_empty]) for name in TARGETS
+    }
+    return WindowedDataset(X=X, y=y, feature_names=non_empty[0].feature_names)
+def _features_from_window(win: np.ndarray) -> np.ndarray:
+    """Extract features from a window of length WINDOW of integer numbers."""
+    window = len(win)
+    feats = np.empty(_feature_count(window), dtype=np.float32)
+    feats[:window] = win
+    red_count = 0
+    black_count = 0
+    zero_count = 0
+    even_count = 0
+    odd_count = 0
+    low_count = 0
+    high_count = 0
+    dozen_counts = [0, 0, 0]
+    column_counts = [0, 0, 0]
+    number_sum = 0
+    for n in win:
+        n_int = int(n)
+        number_sum += n_int
+        if n_int == 0:
+            zero_count += 1
+            continue
+        if n_int in RED_NUMBERS:
+            red_count += 1
+        else:
+            black_count += 1
+        if n_int % 2 == 0:
+            even_count += 1
+        else:
+            odd_count += 1
+        if n_int <= 18:
+            low_count += 1
+        else:
+            high_count += 1
+        if n_int <= 12:
+            dozen_counts[0] += 1
+        elif n_int <= 24:
+            dozen_counts[1] += 1
+        else:
+            dozen_counts[2] += 1
+        rem = n_int % 3
+        if rem == 1:
+            column_counts[0] += 1
+        elif rem == 2:
+            column_counts[1] += 1
+        else:
+            column_counts[2] += 1
+    offset = window
+    feats[offset + 0] = red_count
+    feats[offset + 1] = black_count
+    feats[offset + 2] = zero_count
+    feats[offset + 3] = even_count
+    feats[offset + 4] = odd_count
+    feats[offset + 5] = low_count
+    feats[offset + 6] = high_count
+    feats[offset + 7] = dozen_counts[0]
+    feats[offset + 8] = dozen_counts[1]
+    feats[offset + 9] = dozen_counts[2]
+    feats[offset + 10] = column_counts[0]
+    feats[offset + 11] = column_counts[1]
+    feats[offset + 12] = column_counts[2]
+    feats[offset + 13] = number_sum / window
+    feats[offset + 14] = int(win[-1])  # last number
+    return feats
+def _feature_count(window: int) -> int:
+    return window + 15
+def _feature_names(window: int) -> list[str]:
+    lags = [f"lag_{i}" for i in range(window, 0, -1)]
+    extras = [
+        "red_count",
+        "black_count",
+        "zero_count",
+        "even_count",
+        "odd_count",
+        "low_count",
+        "high_count",
+        "dozen1_count",
+        "dozen2_count",
+        "dozen3_count",
+        "col1_count",
+        "col2_count",
+        "col3_count",
+        "mean_number",
+        "last_number",
+    ]
+    return lags + extras
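
Reviewer note on `build_windows_from_sequence`: each training row pairs `window` consecutive spins with the spin that follows them, so a sequence of length `n` yields `n - window` rows, and none when `n <= window`. The sketch below shows only that pairing, without the count features. The name `window_pairs` is hypothetical.

```python
def window_pairs(numbers, window=10):
    """Pair each run of `window` consecutive spins with the spin that follows it."""
    return [
        (numbers[i:i + window], numbers[i + window])
        for i in range(len(numbers) - window)
    ]


spins = list(range(15))          # toy sequence 0..14, not real roulette data
pairs = window_pairs(spins, window=10)
print(len(pairs))                # 15 - 10 rows
print(pairs[0])                  # first window and its target
print(window_pairs([1, 2, 3]))   # shorter than the window: no rows
```

Under the same rule, a 419-row sequence would give 409 rows at `window=10` and 399 at `window=20`.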

ml/features_v2.py ADDED

	@@ -0,0 +1,275 @@

+"""Richer feature engineering + source-matching preprocessing.
+Adds on top of the v1 sliding-window features:
+  - Run-length features (current streak of red/black, odd/even, same dozen, same column)
+  - Hot/cold counts within the window (most frequent number, its count, smallest count)
+  - Autocorrelation-lag features (repeat rate at lag 1..5)
+  - Wheel-position stats (mean and std of wheel position, distance between the last two spins)
+Also computes Jensen-Shannon divergence between test.csv's number distribution
+and each training source, so we can train on matched sources only.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Final
+import numpy as np
+import pandas as pd
+from ml.features import (
+    RED_NUMBERS,
+    TARGETS,
+    derive_color,
+    derive_column,
+    derive_dozen,
+    derive_parity,
+    encode_label,
+    COLOR_CLASSES,
+    PARITY_CLASSES,
+    DOZEN_CLASSES,
+    COLUMN_CLASSES,
+)
+WINDOW_V2: Final[int] = 20
+# Single-zero European wheel order (clockwise starting from 0)
+WHEEL_ORDER: Final[tuple[int, ...]] = (
+    0, 32, 15, 19, 4, 21, 2, 25, 17, 34, 6, 27, 13, 36, 11, 30, 8, 23, 10,
+    5, 24, 16, 33, 1, 20, 14, 31, 9, 22, 18, 29, 7, 28, 12, 35, 3, 26,
+)
+WHEEL_POS: Final[dict[int, int]] = {n: i for i, n in enumerate(WHEEL_ORDER)}
+def _color_id(n: int) -> int:
+    if n == 0:
+        return 2
+    return 0 if n in RED_NUMBERS else 1
+def _parity_id(n: int) -> int:
+    if n == 0:
+        return 2
+    return 1 if n % 2 == 0 else 0
+def _dozen_id(n: int) -> int:
+    if n == 0:
+        return 3
+    if n <= 12:
+        return 0
+    if n <= 24:
+        return 1
+    return 2
+def _column_id(n: int) -> int:
+    if n == 0:
+        return 3
+    rem = n % 3
+    return 0 if rem == 1 else (1 if rem == 2 else 2)
+@dataclass(frozen=True)
+class V2Dataset:
+    X: np.ndarray
+    y: dict[str, np.ndarray]
+    feature_names: list[str]
+    source: np.ndarray  # per-row source label (for debugging)
+def _features_v2(win: np.ndarray) -> np.ndarray:
+    """Rich feature vector for one window of length WINDOW_V2."""
+    w = len(win)
+    feats: list[float] = []
+    feats.extend(win.astype(np.float32).tolist())
+    red = sum(1 for x in win if int(x) != 0 and int(x) in RED_NUMBERS)
+    black = sum(1 for x in win if int(x) != 0 and int(x) not in RED_NUMBERS)
+    zero = int(np.sum(win == 0))
+    even = sum(1 for x in win if int(x) != 0 and int(x) % 2 == 0)
+    odd = sum(1 for x in win if int(x) != 0 and int(x) % 2 != 0)
+    low = sum(1 for x in win if 1 <= int(x) <= 18)
+    high = sum(1 for x in win if 19 <= int(x) <= 36)
+    doz = [0, 0, 0]
+    col = [0, 0, 0]
+    for x in win:
+        xi = int(x)
+        if xi == 0:
+            continue
+        if xi <= 12:
+            doz[0] += 1
+        elif xi <= 24:
+            doz[1] += 1
+        else:
+            doz[2] += 1
+        rem = xi % 3
+        if rem == 1:
+            col[0] += 1
+        elif rem == 2:
+            col[1] += 1
+        else:
+            col[2] += 1
+    feats.extend([red, black, zero, even, odd, low, high, *doz, *col])
+    feats.append(float(np.mean(win)))
+    feats.append(float(np.std(win)))
+    feats.append(int(win[-1]))
+    # Run-length features: current streak of same color/parity/dozen/column at end
+    last_color = _color_id(int(win[-1]))
+    streak_color = 1
+    for x in win[-2::-1]:
+        if _color_id(int(x)) == last_color:
+            streak_color += 1
+        else:
+            break
+    last_parity = _parity_id(int(win[-1]))
+    streak_parity = 1
+    for x in win[-2::-1]:
+        if _parity_id(int(x)) == last_parity:
+            streak_parity += 1
+        else:
+            break
+    last_dozen = _dozen_id(int(win[-1]))
+    streak_dozen = 1
+    for x in win[-2::-1]:
+        if _dozen_id(int(x)) == last_dozen:
+            streak_dozen += 1
+        else:
+            break
+    last_column = _column_id(int(win[-1]))
+    streak_column = 1
+    for x in win[-2::-1]:
+        if _column_id(int(x)) == last_column:
+            streak_column += 1
+        else:
+            break
+    feats.extend([streak_color, streak_parity, streak_dozen, streak_column])
+    # Autocorrelation-ish features: repeat rate at lags 1..5
+    for lag in range(1, 6):
+        if w > lag:
+            same = sum(1 for i in range(lag, w) if int(win[i]) == int(win[i - lag]))
+            feats.append(same / (w - lag))
+        else:
+            feats.append(0.0)
+    # Wheel-neighbor stats: mean wheel position, std, distance last→prev
+    positions = [WHEEL_POS.get(int(x), 0) for x in win]
+    feats.append(float(np.mean(positions)))
+    feats.append(float(np.std(positions)))
+    if w >= 2:
+        feats.append(float(abs(positions[-1] - positions[-2])))
+    else:
+        feats.append(0.0)
+    # Multi-horizon hot/cold (simply: most/least-frequent number & its count in window)
+    from collections import Counter
+    c = Counter(int(x) for x in win)
+    most = c.most_common(1)[0]
+    feats.append(most[0])
+    feats.append(most[1])
+    feats.append(float(min(c.values())))
+    return np.asarray(feats, dtype=np.float32)
+def _feature_names_v2(window: int) -> list[str]:
+    lags = [f"lag_{i}" for i in range(window, 0, -1)]
+    block_counts = [
+        "red", "black", "zero", "even", "odd", "low", "high",
+        "doz1", "doz2", "doz3", "col1", "col2", "col3",
+        "mean", "std", "last",
+    ]
+    streaks = ["streak_color", "streak_parity", "streak_dozen", "streak_column"]
+    autocorrs = [f"autocorr_lag{k}" for k in range(1, 6)]
+    wheel = ["wheel_mean_pos", "wheel_std_pos", "wheel_last_dist"]
+    hotcold = ["hot_num", "hot_count", "cold_count"]
+    return lags + block_counts + streaks + autocorrs + wheel + hotcold
+def build_windows_v2(
+    df: pd.DataFrame,
+    number_col: str = "number",
+    group_col: str = "source",
+    window: int = WINDOW_V2,
+) -> V2Dataset:
+    """Build rich windowed features per source (never crosses source boundaries)."""
+    X_parts: list[np.ndarray] = []
+    y_parts: dict[str, list[np.ndarray]] = {t: [] for t in TARGETS}
+    src_parts: list[np.ndarray] = []
+    for source, group in df.groupby(group_col, sort=False):
+        nums = group[number_col].to_numpy(dtype=np.int64)
+        if len(nums) <= window:
+            continue
+        n = len(nums) - window
+        Xg = np.empty((n, len(_feature_names_v2(window))), dtype=np.float32)
+        yg_num = np.empty(n, dtype=np.int64)
+        yg_col = np.empty(n, dtype=np.int64)
+        yg_par = np.empty(n, dtype=np.int64)
+        yg_doz = np.empty(n, dtype=np.int64)
+        yg_colm = np.empty(n, dtype=np.int64)
+        for i in range(n):
+            win = nums[i : i + window]
+            nxt = int(nums[i + window])
+            Xg[i] = _features_v2(win)
+            yg_num[i] = nxt
+            yg_col[i] = encode_label(derive_color(nxt), COLOR_CLASSES)
+            yg_par[i] = encode_label(derive_parity(nxt), PARITY_CLASSES)
+            yg_doz[i] = encode_label(derive_dozen(nxt), DOZEN_CLASSES)
+            yg_colm[i] = encode_label(derive_column(nxt), COLUMN_CLASSES)
+        X_parts.append(Xg)
+        y_parts["number"].append(yg_num)
+        y_parts["color"].append(yg_col)
+        y_parts["parity"].append(yg_par)
+        y_parts["dozen"].append(yg_doz)
+        y_parts["column"].append(yg_colm)
+        src_parts.append(np.array([str(source)] * n, dtype=object))
+    X = np.vstack(X_parts) if X_parts else np.empty((0, len(_feature_names_v2(window))), dtype=np.float32)
+    y = {t: (np.concatenate(parts) if parts else np.empty(0, dtype=np.int64)) for t, parts in y_parts.items()}
+    src = np.concatenate(src_parts) if src_parts else np.empty(0, dtype=object)
+    return V2Dataset(X=X, y=y, feature_names=_feature_names_v2(window), source=src)
+# ----------------------------------------------------------------------------
+# Source-matching via Jensen-Shannon divergence
+# ----------------------------------------------------------------------------
+def _prob(counts: np.ndarray, smoothing: float = 1e-6) -> np.ndarray:
+    p = counts.astype(np.float64) + smoothing
+    return p / p.sum()
+def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
+    p = _prob(p)
+    q = _prob(q)
+    m = 0.5 * (p + q)
+    def kl(a: np.ndarray, b: np.ndarray) -> float:
+        return float(np.sum(a * np.log(a / b)))
+    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
+def number_histogram(nums: np.ndarray, n_classes: int = 37) -> np.ndarray:
+    return np.bincount(nums.astype(np.int64), minlength=n_classes)
+def rank_sources_by_similarity(train_df: pd.DataFrame, test_numbers: np.ndarray) -> list[tuple[str, float]]:
+    """Return list of (source, js_divergence) sorted ascending (closest first)."""
+    test_hist = number_histogram(test_numbers)
+    scores: list[tuple[str, float]] = []
+    for source, group in train_df.groupby("source", sort=False):
+        src_hist = number_histogram(group["number"].to_numpy())
+        if src_hist.sum() == 0:
+            continue
+        scores.append((str(source), js_divergence(test_hist, src_hist)))
+    scores.sort(key=lambda x: x[1])
+    return scores
+def select_training_df(train_df: pd.DataFrame, top_k_sources: list[str]) -> pd.DataFrame:
+    mask = train_df["source"].isin(top_k_sources)
+    return train_df.loc[mask].copy()
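
The source-matching helpers that close `ml/features_v2.py` can be tried in isolation. The sketch below copies `_prob`, `js_divergence` and `number_histogram` from the diff and ranks two made-up sources against a made-up test sample. The source names, the toy data and the plain-dict loop (standing in for the pandas `groupby` in `rank_sources_by_similarity`) are assumptions for illustration and are not part of the commit.

```python
import numpy as np


def _prob(counts: np.ndarray, smoothing: float = 1e-6) -> np.ndarray:
    # Additive smoothing keeps every bin strictly positive so log() is defined.
    p = counts.astype(np.float64) + smoothing
    return p / p.sum()


def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    # Jensen-Shannon divergence (natural log) between two count vectors.
    p, q = _prob(p), _prob(q)
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def number_histogram(nums: np.ndarray, n_classes: int = 37) -> np.ndarray:
    return np.bincount(np.asarray(nums, dtype=np.int64), minlength=n_classes)


# Hypothetical training sources (names and data invented for this sketch).
sources = {
    # every pocket 0..36 appears 10 times
    "uniform": np.tile(np.arange(37), 10),
    # pockets 0..4 appear 45 times, the rest 5 times
    "low_heavy": np.concatenate([np.tile(np.arange(37), 5), np.tile(np.arange(5), 40)]),
}

# Hypothetical test sample with the same 9:1 shape as "low_heavy".
test_numbers = np.concatenate([np.arange(37), np.tile(np.arange(5), 8)])
test_hist = number_histogram(test_numbers)

# Rank sources by divergence from the test histogram, closest first.
ranked = sorted(
    ((name, js_divergence(test_hist, number_histogram(nums))) for name, nums in sources.items()),
    key=lambda pair: pair[1],
)

for name, score in ranked:
    print(f"{name}: {score:.6f}")
```

`low_heavy` should rank first here, since its histogram is a scaled copy of the test histogram and its divergence is therefore close to zero. With natural logarithms the divergence is bounded above by ln 2, which gives a rough scale for reading the scores.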

models/gaussian_nb__parity.v2.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7814989dc6dee8828cb129a0cdce45b8544f61a8390717eed58f52687d7360b3
+size 4864

models/mlp__number.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:48e230050e4985813daae68a1edd0e4ea12612b386378f8d677e0650737506b1
+size 175048

models/svc__color.v2.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5ed6be933d3fa20dce51d8ca4c6957d0bb35a00bbaafa7ddcc3e9842bb603514
+size 2746407

models/svc__column.v2.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8ba96b67344c79c86e9a260dde028c254544f9c97c49cf777fc9b147d12bcf4
+size 4068072

models/xgboost__dozen.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ce55dc8fd72d946cdc4e04e512782503eb9482c1f5ced9bd1b2d8b0857e82252
+size 3226293
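
The `.joblib` entries above are Git LFS pointer files (three text lines: `version`, `oid`, `size`), not the serialized models themselves. If a checkout skips the LFS download, loading such a file as a model would be expected to fail, so a small check can help. The sketch below is a minimal parser under that assumption; `parse_lfs_pointer` is a hypothetical helper name and does not appear in the commit.

```python
from typing import Optional


def parse_lfs_pointer(text: str) -> Optional[dict]:
    """Return the pointer's key/value fields, or None if `text` is not an LFS pointer."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines around the pointer
        key, sep, value = line.partition(" ")
        if not sep:
            return None  # every pointer line is "<key> <value>"
        fields[key] = value
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        return None
    if "oid" not in fields or "size" not in fields:
        return None
    return fields


# Pointer text for models/xgboost__dozen.joblib, as shown in the diff.
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:ce55dc8fd72d946cdc4e04e512782503eb9482c1f5ced9bd1b2d8b0857e82252\n"
    "size 3226293\n"
)

pointer = parse_lfs_pointer(pointer_text)
if pointer is not None:
    algo, _, digest = pointer["oid"].partition(":")
    print(f"LFS pointer: {algo} digest {digest[:12]}..., real file is {pointer['size']} bytes")
```

A deployment script could run this on the first few hundred bytes of each file under `models/` and stop early if any of them is still a pointer.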

requirements.txt ADDED

	@@ -0,0 +1,9 @@

+fastapi>=0.109
+uvicorn[standard]>=0.27
+pydantic>=2.5
+python-multipart>=0.0.9
+numpy>=1.26,<2.3
+pandas>=2.1
+scikit-learn==1.6.1
+xgboost>=2.1,<3.0
+joblib>=1.3