# MelodyDeterminism Patch (GPU + determinism + benchmark)

## What's included
- NumPy/CuPy backend with automatic selection and a deterministic RNG (Philox/PCG64).
- Deterministic reductions (TreeFixed, KahanFixed), a numerically robust softmax, and canonical sampling.
- Tolerance metadata (max_abs_err / max_rel_err).
- Overhead benchmark across batch size `n` and vocabulary size `v` on CPU/GPU.
- Edge-case tests (extreme logits, masks, dtypes, invariants).
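
The two deterministic reductions can be illustrated in plain NumPy. This is a minimal sketch of the general techniques (compensated summation and a fixed-shape pairwise tree), not the code in `core/deterministic.py`; the names `kahan_sum` and `tree_sum_fixed` are hypothetical and chosen only for this example.

```python
import numpy as np

def kahan_sum(values):
    """Compensated (Kahan) summation in a fixed left-to-right order."""
    total = np.float64(0.0)
    comp = np.float64(0.0)  # running compensation for lost low-order bits
    for v in np.asarray(values, dtype=np.float64).ravel():
        y = v - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return float(total)

def tree_sum_fixed(values):
    """Pairwise sum whose tree shape depends only on the input length."""
    buf = np.asarray(values, dtype=np.float64).ravel().copy()
    if buf.size == 0:
        return 0.0
    while buf.size > 1:
        if buf.size % 2:  # pad odd lengths with a neutral zero
            buf = np.append(buf, 0.0)
        buf = buf[0::2] + buf[1::2]  # add fixed neighbor pairs
    return float(buf[0])
```

Because the order of additions is fixed by the input length alone, repeated calls on the same data return bit-identical results, regardless of how a parallel backend would otherwise schedule the reduction.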

## Quick setup
1. Copy `core/` and `tests/` into your Space/repo.
2. Merge `requirements.txt` (add CuPy if your Space uses a GPU).
3. In `app.py`, import the functions and wire them into your Gradio UI.
4. **Space hardware**: select a GPU (e.g. T4/A10) on Hugging Face.
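
Since CuPy is optional (step 2), automatic backend selection could look roughly like the sketch below. It is an assumption about how such a switch might be written, not the contents of `core/backend.py`; `select_backend` and `make_cpu_rng` are hypothetical names.

```python
import importlib
import numpy as np

def select_backend(prefer_gpu=True):
    """Return (module, name): CuPy if importable with a visible device, else NumPy."""
    if prefer_gpu:
        try:
            cp = importlib.import_module("cupy")
            if cp.cuda.runtime.getDeviceCount() > 0:
                return cp, "cupy"
        except Exception:
            # CuPy not installed, or the CUDA runtime is unusable: fall back to CPU
            pass
    return np, "numpy"

def make_cpu_rng(seed):
    """Seeded CPU generator backed by the PCG64 bit generator."""
    return np.random.Generator(np.random.PCG64(int(seed)))

xp, name = select_backend()
```

On a CPU-only Space this falls back to NumPy without raising, so the same `app.py` can run on both hardware tiers.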

## Gradio (snippet)
```python
import gradio as gr
from core.backend import set_seed, backend_name
from core.bench import bench_suite
from core.softmax import softmax_canonical
from core.sampling import sample_canonical
from core.metrics import tol_stats
from core.deterministic import reduce_tree_fixed, sum_kahan_fixed

def run_suite(seed, n, v, dtype):
    seed = int(seed)
    set_seed(seed)
    # Synthetic input (xp is the active NumPy/CuPy module)
    from core.backend import xp
    x = xp.random.standard_normal((int(n), int(v))).astype(getattr(xp, dtype))
    # One example row: softmax, then sample a single token
    p = softmax_canonical(x[0])
    idx = sample_canonical(p, seed=seed, token_idx=0)
    stats = {"backend": backend_name(), "token0": int(idx)}
    return stats

with gr.Blocks(theme=gr.themes.Soft()) as demo:
    with gr.Tab("Deterministic"):
        seed = gr.Number(42, precision=0, label="Seed")
        n = gr.Slider(1, 64, 8, step=1, label="Batch n")
        v = gr.Dropdown([1024, 8192, 32768], value=8192, label="Vocab v")
        dtype = gr.Radio(["float32", "float64"], value="float32", label="dtype")
        run = gr.Button("Run suite")
        out = gr.JSON(label="Output + metadata")
        run.click(run_suite, [seed, n, v, dtype], [out])
    with gr.Tab("Benchmark"):
        runb = gr.Button("Benchmark")
        table = gr.Dataframe(headers=["n","v","t_std_ms","t_can_ms","overhead_pct"], label="Latencies (ms)")
        def _bench():
            return bench_suite()
        runb.click(_bench, outputs=[table])
# demo.queue(default_concurrency_limit=2, max_size=8).launch()  # Gradio 4+; Gradio 3.x used concurrency_count=2
```

## Determinism notes
- RNG: Philox (GPU) / PCG64 (CPU); a uniform draw `u` is mapped to a token id with `searchsorted(CDF, u, side='left')`.
- BLAS: the tests force OMP/MKL threads to 1 to reduce run-to-run variability.
- Dtypes: prefer float32; softmax and reductions accumulate in float64.
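
The dtype note can be made concrete with a small NumPy sketch: compute the softmax in float64, cast back to the input dtype, and report the tolerance metadata against a float64 reference. The functions `softmax_f64_accum` and `error_stats` are hypothetical stand-ins; the actual `softmax_canonical` and `tol_stats` in `core/` may differ.

```python
import numpy as np

def softmax_f64_accum(logits):
    """Softmax computed in float64 and returned in the input's float dtype."""
    x = np.asarray(logits)
    out_dtype = x.dtype if np.issubdtype(x.dtype, np.floating) else np.float64
    z = x.astype(np.float64)
    z = z - np.max(z)  # shift so the largest exponent is 0 (avoids overflow)
    e = np.exp(z)
    return (e / e.sum()).astype(out_dtype)

def error_stats(approx, reference):
    """max_abs_err / max_rel_err of approx against a float64 reference."""
    a = np.asarray(approx, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    abs_err = np.abs(a - r)
    # Floor the denominator so exact zeros in the reference do not divide by zero
    rel_err = abs_err / np.maximum(np.abs(r), np.finfo(np.float64).tiny)
    return {"max_abs_err": float(abs_err.max()), "max_rel_err": float(rel_err.max())}

logits = np.array([1000.0, 999.0, -1000.0], dtype=np.float32)  # extreme logits
p32 = softmax_f64_accum(logits)
p64 = softmax_f64_accum(logits.astype(np.float64))
stats = error_stats(p32, p64)
```

Here the only error left in `p32` is the final rounding to float32, which is the kind of bound the tolerance metadata is meant to document.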

## Tie-break policy
`searchsorted(..., side='left')` ⇒ when CDF values tie, the lowest token id wins.
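
A small NumPy sketch shows the effect of `side='left'` when the uniform draw lands exactly on a repeated CDF value. The helper `sample_index` is hypothetical and only illustrates the policy; it is not `sample_canonical`.

```python
import numpy as np

def sample_index(probs, u):
    """Map a uniform draw u in [0, 1) to a token id via the CDF."""
    cdf = np.cumsum(np.asarray(probs, dtype=np.float64))
    idx = int(np.searchsorted(cdf, u, side="left"))
    return min(idx, cdf.size - 1)  # guard against cdf[-1] < u from rounding

probs = [0.25, 0.0, 0.25, 0.5]   # CDF = [0.25, 0.25, 0.5, 1.0]
tie = sample_index(probs, 0.25)  # ids 0 and 1 share CDF value 0.25
inside = sample_index(probs, 0.3)
```

With `side='left'` the draw `u = 0.25` resolves to id 0, the smallest id whose CDF value reaches `u`; `side='right'` would skip past both tied entries and return id 2.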

## Running the tests
```
pytest -q
```
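
The single-thread setting for BLAS mentioned in the notes could be applied from a `conftest.py`, sketched below. This is an assumption about where the pinning lives, not the patch's actual test configuration; `pin_threads` is a hypothetical helper, and `OPENBLAS_NUM_THREADS` is added here as an extra beyond the OMP/MKL variables the notes name.

```python
# conftest.py (hypothetical): pin BLAS/OpenMP thread counts for the test run
import os

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")

def pin_threads(n=1):
    """Set the thread-count variables and return the values now in effect."""
    for var in THREAD_VARS:
        os.environ[var] = str(n)
    return {var: os.environ[var] for var in THREAD_VARS}

# These variables are typically read when the numerical library is loaded,
# so this should run before the first `import numpy`.
pinned = pin_threads(1)
```

Exporting the same variables in the shell before invoking `pytest -q` has the same effect and avoids any import-order concerns.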