pbhappliedsystems
posted an update 1 day ago
🚀 **New flagship dataset, and an argument about what a dataset card should be.**

Most synthetic datasets on the Hub ship row counts, a license, and little else: the pipeline is opaque, the rejection criteria are unstated, and compliance is unaudited. We published the opposite.

**SynthEval Cloud: Regulated-Domain Synthetic Instruction Dataset**
👉 pbhappliedsystems/syntheval-cloud-regulated-instruct-1k

**1,116** quality-gated instruction records across **7 regulated domains** (medical, legal, GDPR, privacy, education, e-commerce, transport). Every record cleared a documented cascade, not a vibe check:

- 🧪 **Dual-signal hallucination gate**: rejects a record only when embedding cosine *and* keyword overlap both fail; one low score alone never rejects.
- 🔒 **Layered PII masking + independent leak audit**: a separate over-reporting scanner found **0.0% residual leak** across all 1,116 records.
- 📊 **Whole-corpus evaluation, not a sample**: MATTR **0.769**, mean cosine **0.73**, **0%** near-duplicates, **96.9%** yield.
- 🧾 **The 36 rejections ship too**, each tagged with its failing gate. Removal at the gate is the product; we show our work.
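To make the dual-signal rule concrete, here is a minimal sketch of an "AND" gate over two scores. It is not the SynthEval implementation: the thresholds `COSINE_MIN` and `OVERLAP_MIN`, the function names, and the simple regex tokenizer are all hypothetical, and the embedding vectors are assumed to be computed elsewhere.

```python
import math
import re

# Hypothetical thresholds; the post does not state the real values.
COSINE_MIN = 0.5
OVERLAP_MIN = 0.2


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_overlap(source, answer):
    """Fraction of distinct answer tokens that also appear in the source text."""
    tokenize = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    src, ans = tokenize(source), tokenize(answer)
    if not ans:
        return 0.0
    return len(src & ans) / len(ans)


def rejects(cos_score, overlap_score):
    """Reject only when BOTH signals fall below their thresholds."""
    return cos_score < COSINE_MIN and overlap_score < OVERLAP_MIN


# A low cosine alone does not reject; both signals must fail.
print(rejects(0.1, 0.9))   # False
print(rejects(0.1, 0.05))  # True
```

Requiring both signals to fail trades some recall for fewer false rejections: a paraphrased answer with low lexical overlap still passes if its embedding stays close to the source, and vice versa.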

Every number on the card is a field in the evaluation_report.json shipped beside the data, along with full methodology and provenance (Mistral-Nemo AWQ W4A16 · vLLM 0.8.5.post1 · Modal A10G).
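The yield figure can be cross-checked from the two counts quoted in this post. This is a small sketch that assumes yield means accepted records divided by all generated records (accepted plus rejected); that definition and the function name `yield_rate` are assumptions, not taken from the evaluation report.

```python
# Counts quoted in the post.
ACCEPTED = 1116
REJECTED = 36


def yield_rate(accepted, rejected):
    """Share of generated records that cleared every gate (assumed definition)."""
    total = accepted + rejected
    if total == 0:
        raise ValueError("no records generated")
    return accepted / total


rate = yield_rate(ACCEPTED, REJECTED)
print(f"{rate:.1%}")  # 96.9%
```

Under that assumption, 1,116 of 1,152 generated records is consistent with the 96.9% on the card.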

One release from **SynthEval**: Studio (local GPU) and Cloud (Modal + vLLM), built to demonstrate quality parity across the two substrates.

📄 Whitepaper: https://pbhappliedsystems.com/SynthEval_Studio_and_Cloud_Quality-Gated_Synthetic_Data_Generation.pdf
🔎 Overview: https://pbhappliedsystems.com/synthetic-data.html

**CC BY 4.0**: commercial use welcome, just credit it. Need defensible synthetic data at scale? Let's talk.

Patrick Hill, PBH Applied Systems