# ExpertData-Factory 🏭
### Industrial-scale expert-verified reasoning datasets for LLM fine-tuning
We build **high-fidelity reasoning data** for post-training, alignment, and evaluation, with an **engineering-grade QA pipeline** and domain specialization in **high-rarity niches**.
---
## 🚀 What we deliver
- **Schema-stable reasoning records** designed for fine-tuning & eval
- **PII scanning + redaction** for enterprise safety
- **Embedding-grounded verification + consistency checks** (`text-embedding-005`)
- Exports: **JSONL** / **Parquet**
- **Public samples + gated enterprise datasets** (access on request)
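
To make "schema-stable" concrete, here is a minimal sketch of a record type with a strict validator. The field names (`record_id`, `domain`, `question`, `reasoning_steps`, `answer`, `source_url`) are hypothetical placeholders, not the published schema; check the dataset card of each release for the real columns.

```python
from dataclasses import dataclass, fields

# Hypothetical field set, chosen for illustration only.
@dataclass(frozen=True)
class ReasoningRecord:
    record_id: str
    domain: str
    question: str
    reasoning_steps: list
    answer: str
    source_url: str

def validate_record(raw: dict) -> ReasoningRecord:
    """Reject records with missing, extra, or wrongly typed fields."""
    expected = {f.name for f in fields(ReasoningRecord)}
    missing = expected - set(raw)
    extra = set(raw) - expected
    if missing or extra:
        raise ValueError(
            f"schema mismatch: missing={sorted(missing)} extra={sorted(extra)}"
        )
    steps = raw["reasoning_steps"]
    if not isinstance(steps, list) or not steps:
        raise ValueError("reasoning_steps must be a non-empty list")
    if not all(isinstance(step, str) for step in steps):
        raise ValueError("every reasoning step must be a string")
    for name in sorted(expected - {"reasoning_steps"}):
        if not isinstance(raw[name], str) or not raw[name].strip():
            raise ValueError(f"{name} must be a non-empty string")
    return ReasoningRecord(**raw)
```

Rejecting extra fields as well as missing ones is one way to keep a schema from drifting silently between releases.
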
---
## 🏗️ Factory Pipeline (Production QA)
Our production pipeline is designed like a data platform, not a script.
**Stage 1: Acquisition**
- Curated expert sources (technical docs, scientific papers, reports)
- URL seed mining + dedup + domain routing
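
As a rough illustration of the dedup and routing step, the sketch below normalizes seed URLs, drops duplicates, and assigns each URL to a domain by keyword. The `ROUTES` table and the normalization rules are assumptions made for this example, not the production logic.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical keyword-to-domain routing table.
ROUTES = {
    "cybersecurity": ("cve", "attack", "vuln"),
    "scientific": ("arxiv", "journal", "paper"),
}

def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

def dedup_urls(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def route(url: str) -> str:
    """Return the first domain whose keywords appear in the URL."""
    lowered = url.lower()
    for domain, keywords in ROUTES.items():
        if any(keyword in lowered for keyword in keywords):
            return domain
    return "unrouted"
```

Keyword routing on the URL alone is deliberately crude; a real router would more likely look at page content as well.
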
**Stage 2: Reasoning Extraction**
- *Alchemist Agent*: converts raw material → structured reasoning assets
- Robust parsing (JSON fallback / truncation recovery)
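
One possible shape for "JSON fallback / truncation recovery" is sketched below: try strict parsing first, then a fenced block inside the text, then a repair pass that closes whatever strings and brackets were left open. The function names are hypothetical, and the repair handles only simple truncation.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this sample stays fence-safe
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def close_truncated(text: str) -> str:
    """Append the closers that a truncated JSON text is missing."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    repaired = text + ('"' if in_string else "")
    repaired = repaired.rstrip().rstrip(",")  # drop a dangling comma
    return repaired + "".join(reversed(stack))

def parse_model_output(text: str):
    """Try strict JSON, then a fenced block, then truncation repair."""
    candidates = [text]
    match = FENCE_RE.search(text)
    if match:
        candidates.append(match.group(1))
    start = text.find("{")
    if start != -1:
        candidates.append(close_truncated(text[start:]))
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # caller decides whether to retry or discard
```

A repaired record has lost whatever came after the cut, so it is reasonable to flag it for the validation stage rather than treat it as complete.
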
**Stage 3: Validation & Grounding**
- *Inspector Agent*: schema checks + reasoning integrity checks
- Embedding-grounded verification for factual anchoring (`text-embedding-005`)
- Consistency tests + anomaly flags
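
The idea behind embedding-grounded verification can be sketched as a similarity check between a claim and its source passages. In the example below, `toy_embed` is a stand-in that counts letters so the code runs without network access; it is not `text-embedding-005`, and the `0.8` threshold is an arbitrary assumption. A real run would pass a function that calls the embedding model instead.

```python
import math
from collections import Counter
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text: str) -> list:
    """Stand-in embedder (letter counts). NOT a real embedding model."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts.get(chr(ord("a") + i), 0)) for i in range(26)]

def is_grounded(
    claim: str,
    passages: Sequence[str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
    threshold: float = 0.8,  # hypothetical cutoff
):
    """Return (grounded, best_score) for the closest source passage."""
    claim_vec = embed(claim)
    best = max((cosine(claim_vec, embed(p)) for p in passages), default=0.0)
    return best >= threshold, best
```

Returning the best score alongside the boolean leaves room for anomaly flags on borderline cases instead of a hard accept or reject.
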
**Stage 4: Safety & Sanitization**
- PII detection + redaction
- Enterprise-safe output policy
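
For a sense of what detection plus redaction involves, here is a minimal regex-based sketch that replaces matches with typed placeholders. The two patterns are illustrative assumptions and nowhere near complete PII coverage; the phone pattern in particular can misfire on long digit runs such as timestamps.

```python
import re

# Illustrative patterns only; real PII coverage needs far more than this.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str):
    """Replace matches with typed placeholders and count what was removed."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn(f"[{label}]", text)
        counts[label] = hits
    return text, counts
```

Keeping per-type counts makes it possible to report redaction statistics per release without storing the redacted values.
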
**Stage 5: Packaging**
- Deterministic exports + versioned releases
- JSONL / Parquet with stable schemas and dataset cards
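
A deterministic JSONL export can be sketched as follows: fix the record order, fix the key order, and publish a checksum of the resulting bytes. The `record_id` sort key is a hypothetical field name, and Parquet output (which would need a library such as `pyarrow`) is left out of this example.

```python
import hashlib
import json

def to_jsonl_bytes(records, sort_key="record_id"):
    """Serialize in a fixed order so that equal inputs give equal bytes."""
    ordered = sorted(records, key=lambda record: record[sort_key])
    lines = [
        json.dumps(record, sort_keys=True, ensure_ascii=False,
                   separators=(",", ":"))
        for record in ordered
    ]
    return "".join(line + "\n" for line in lines).encode("utf-8")

def export_jsonl(records, path):
    """Write the JSONL file and return its SHA-256 checksum."""
    payload = to_jsonl_bytes(records)
    with open(path, "wb") as handle:
        handle.write(payload)
    return hashlib.sha256(payload).hexdigest()
```

Two runs over the same records then produce byte-identical files, so the checksum can be recorded in the changelog of a versioned release.
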
---
## 📌 Domains
✅ **Cybersecurity (Public)**
Threat logic, vulnerability analysis, MITRE-aligned reasoning.

🔒 **Scientific Reasoning (Gated, launching soon)**
Methods, causality, hypothesis validation, experimental reasoning.
*(AI / Bio / Physics: public sample first, full dataset via access request.)*
---
## 📊 Quality Guarantees
We treat datasets like production artifacts:
- **Versioned releases** with changelogs
- **Reproducible generation** (stable pipelines, deterministic exports)
- **QA-first**: schema validation, safety checks, grounding verification
---
## 🤝 Enterprise
We support:
- **Gated datasets** for commercial fine-tuning
- **Custom domain builds** (high-rarity, high-complexity)
- **Evaluation bundles** (hard cases + stratified splits)
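
To illustrate what a stratified split means for an evaluation bundle, the sketch below holds out roughly the same fraction of every label group, using a seeded generator so the split is reproducible. The label key (for example a `difficulty` field) and the default fraction are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(records, label_key, eval_fraction=0.2, seed=0):
    """Hold out about eval_fraction of each label group for evaluation."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)
    rng = random.Random(seed)  # seeded for reproducible splits
    train, evaluation = [], []
    for label in sorted(by_label):  # fixed label order keeps runs identical
        group = by_label[label]
        rng.shuffle(group)
        # Groups of one stay in train; larger groups give at least one item.
        cut = max(1, round(len(group) * eval_fraction)) if len(group) > 1 else 0
        evaluation.extend(group[:cut])
        train.extend(group[cut:])
    return train, evaluation
```

Splitting per label keeps rare categories, such as hard cases, represented in the evaluation set instead of leaving their presence to chance.
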
### 🔐 Request Access / Partnerships
To request access to gated datasets or custom generation:
- Submit an access request on the gated dataset page, or
- Message the organization on Hugging Face
---
## 🧾 Releases
- **cybersecurity-reasoning-cot-v1** (Public)
- **scientific-reasoning-sample-v1** (Public sample, coming soon)
- **scientific-reasoning-cot-v1** (Gated full release, coming soon)