Update README.md
README.md
CHANGED
|
@@ -1,18 +1,29 @@
|
|
| 1 |
# ExpertData-Factory
|
| 2 |
-
**
|
| 3 |
|
| 4 |
-
ExpertData-Factory is a
|
| 5 |
|
| 6 |
-
##
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
4. **Sanitization**: Rigorous PII scanning and redaction for enterprise safety.
|
| 12 |
|
| 13 |
-
##
|
| 14 |
-
|
| 15 |
-
* **Scientific Reasoning (Upcoming)**: Methodological logic, hypothesis validation, and experimental analysis (Rarity Score 1.0).
|
| 16 |
|
| 17 |
-
|
| 18 |
-
|
|
| 1 |
# ExpertData-Factory
|
| 2 |
+
**Expert-verified reasoning datasets for LLM fine-tuning**
|
| 3 |
|
| 4 |
+
ExpertData-Factory is a data refinement lab building high-fidelity reasoning datasets for post-training, alignment, and evaluation. We specialize in **high-rarity domains** where structured reasoning quality is the primary bottleneck for model performance.
|
| 5 |
|
| 6 |
+
## ✅ What you get
|
| 7 |
+
- **Schema-stable** reasoning records (ideal for fine-tuning + eval)
|
| 8 |
+
- **PII scanning & redaction** for enterprise safety
|
| 9 |
+
- **Verification layer** using embedding-based grounding + consistency checks (`text-embedding-005`)
|
| 10 |
+
- Exports: **JSONL** and **Parquet**
|
| 11 |
|
| 12 |
+
## 🧪 Methodology (Factory Pipeline)
|
| 13 |
+
Our "Alchemist" pipeline transforms expert-grade sources (technical docs + scientific papers) into structured reasoning assets through a multi-stage QA process:
|
| 14 |
|
| 15 |
+
1. **Extraction**: automated mining from expert-grade sources
|
| 16 |
+
2. **Refinement**: normalization into consistent reasoning structures
|
| 17 |
+
3. **Verification**: embedding-grounded checks + schema validation (powered by `text-embedding-005`)
|
| 18 |
+
4. **Sanitization**: rigorous PII detection and redaction
|
| 19 |
+
|
| 20 |
+
## 🎯 Domains
|
| 21 |
+
- **Cybersecurity (Public)**: threat logic, vulnerability analysis, and MITRE-aligned reasoning
|
| 22 |
+
- **Scientific Reasoning (Gated, launching soon)**: methods, causality, hypothesis testing, and experimental reasoning
|
| 23 |
+
*(AI / Bio / Physics: public sample released first, full dataset via access request)*
|
| 24 |
+
|
| 25 |
+
## Enterprise & Licensing
|
| 26 |
+
We provide **public samples** for community research and **gated/commercial** datasets for enterprise fine-tuning.
|
| 27 |
+
|
| 28 |
+
**Request access / custom builds**
|
| 29 |
+
- If you're an enterprise or lab and want full gated access or custom data generation, **request access on the gated dataset page** or message the organization via Hugging Face.
|
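The updated README mentions schema-stable records, PII redaction, and JSONL export, but does not show what a record looks like. As a rough illustration only, the sketch below chains a redaction pass, a schema check, and a JSONL writer. Everything specific in it is an assumption: the field names in `RECORD_SCHEMA`, the two regex patterns, and the helpers `redact`, `validate`, `sanitize`, and `export_jsonl` are hypothetical and are not taken from the ExpertData-Factory pipeline. The embedding-based verification step is left out.

```python
import io
import json
import re

# Hypothetical schema (field name -> expected type); not from the README.
RECORD_SCHEMA = {
    "id": str,
    "domain": str,
    "question": str,
    "reasoning": list,
    "answer": str,
}

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matches of the illustrative PII patterns with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


def validate(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    for field, expected in RECORD_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    extra = set(record) - set(RECORD_SCHEMA)
    problems.extend(f"unexpected field: {name}" for name in sorted(extra))
    return problems


def sanitize(record: dict) -> dict:
    """Redact string fields and string items inside list fields."""
    clean = dict(record)
    for key, value in record.items():
        if isinstance(value, str):
            clean[key] = redact(value)
        elif isinstance(value, list):
            clean[key] = [redact(v) if isinstance(v, str) else v for v in value]
    return clean


def export_jsonl(records, stream) -> int:
    """Write sanitized, schema-valid records as JSON Lines; return the count written."""
    written = 0
    for record in records:
        clean = sanitize(record)
        if validate(clean):
            continue  # skip records that fail the schema check
        stream.write(json.dumps(clean, ensure_ascii=False) + "\n")
        written += 1
    return written
```

Passing an `io.StringIO()` or an open file handle as `stream` keeps the writer independent of where the data ends up. Records that fail validation are skipped silently here; a real pipeline would more likely log or quarantine them.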