**New flagship dataset, and an argument about what a dataset card should be.**
Most synthetic datasets on the Hub ship row counts, a license, and little else: pipeline opaque, rejection criteria unstated, compliance unaudited. We published the opposite.
**SynthEval Cloud: Regulated-Domain Synthetic Instruction Dataset**
pbhappliedsystems/syntheval-cloud-regulated-instruct-1k
**1,116** quality-gated instruction records across **7 regulated domains** (medical, legal, GDPR, privacy, education, e-commerce, transport). Every record cleared a documented cascade, not a vibe check:
- **Dual-signal hallucination gate**: rejects only when embedding cosine *and* keyword overlap both fail; a low score on one signal alone never rejects.
- **Layered PII masking + independent leak audit**: a separate, deliberately over-reporting scanner found **0.0% residual leak** across all 1,116 records.
- **Whole-corpus evaluation, not a sample**: MATTR **0.769**, mean cosine **0.73**, **0%** near-duplicates, **96.9%** yield.
- **The 36 rejections ship too**, each tagged with its failing gate. Removal at the gate is the product; we show our work.
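To make the dual-signal rule concrete, here is a minimal sketch of a gate that rejects only when both signals fail. It is an illustration under stated assumptions, not the SynthEval implementation: the threshold values, the function names, and the word-level overlap measure are all hypothetical, and the embedding vectors are assumed to be computed elsewhere.

```python
import math
import re

# Hypothetical thresholds; the post does not state the real values.
COSINE_MIN = 0.5
OVERLAP_MIN = 0.2


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(reference, candidate):
    """Fraction of the reference's distinct words that also appear in the candidate."""
    ref = set(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = set(re.findall(r"[a-z0-9]+", candidate.lower()))
    return len(ref & cand) / len(ref) if ref else 0.0


def gate(ref_vec, cand_vec, reference, candidate):
    """Return 'reject' only when BOTH signals fall below their thresholds."""
    cosine_fails = cosine(ref_vec, cand_vec) < COSINE_MIN
    overlap_fails = keyword_overlap(reference, candidate) < OVERLAP_MIN
    return "reject" if (cosine_fails and overlap_fails) else "accept"
```

With this shape, a record whose embedding similarity is low but whose wording still tracks the reference passes, which is the behavior the bullet describes: one weak signal is treated as noise, two as evidence.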
Every number on the card is a field in the evaluation_report.json shipped beside the data, with full methodology and provenance (Mistral-Nemo AWQ W4A16 · vLLM 0.8.5.post1 · Modal A10G).
One release from **SynthEval**: Studio (local GPU) + Cloud (Modal+vLLM), proving quality parity across substrates.
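Two of the card's headline figures can be sanity-checked by hand. The sketch below assumes that yield means accepted records divided by accepted plus rejected, which matches the 1,116 and 36 quoted above, and it uses the standard moving-average type-token ratio (MATTR) definition. The window size of 50 and the function names are assumptions; the post does not state the settings used.

```python
def yield_rate(accepted, rejected):
    """Share of generated records that cleared every gate."""
    return accepted / (accepted + rejected)


def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean distinct/total over sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


print(f"{yield_rate(1116, 36):.1%}")  # 96.9%
```

Because MATTR averages over fixed-size windows, it is less sensitive to text length than a plain type-token ratio, which is why it is a common choice for whole-corpus lexical diversity.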
Whitepaper: https://pbhappliedsystems.com/SynthEval_Studio_and_Cloud_Quality-Gated_Synthetic_Data_Generation.pdf
Overview: https://pbhappliedsystems.com/synthetic-data.html
**CC BY 4.0**: commercial use welcome, just credit it. Need defensible synthetic data at scale? Let's talk.
Patrick Hill, PBH Applied Systems