Update org README: corporate clean, Swiss datasets with use-case tags
README.md
CHANGED
@@ -1,75 +1,86 @@
-
-
-
-
- colorTo: green
- sdk: static
- pinned: false
  ---

-

-

-
-
-

  ---

- ##

-

-
- - 🤖 **LLM Pre-training & Fine-tuning** — Sovereign language data at scale
- - 🔍 **RAG Pipelines** — Pre-chunked, embedding-ready corpora with quality scores
- - 🏛️ **Government & Regulatory NLP** — Domain-classified, jurisdiction-specific data
- - 📊 **Academic Research** — Reproducible, well-documented datasets with full metadata

- --

-

-
- |---|---|---|---|
- | 🇱🇮 [Liechtenstein Ultra Premium](https://huggingface.co/datasets/OptiTransferData/liechtenstein-ultra-premium-li) | 35,748 | Full `.li` domain | JSONL · 37 fields |
- | 🇫🇷 [France Sovereign RAG Chunks](https://huggingface.co/datasets/OptiTransferData/france-sovereign-rag-chunks) | 348,829 | French government & institutional web | JSONL · 8 fields |

-

-

  ---

- ##

-
- - 🔐 **SHA-256 integrity verification** on all production files
- - 📊 **Quality scoring** per record (0–100 scale)
- - 🏷️ **Domain classification** and language detection
- - 📜 **EU AI Act compliance** — full data provenance and licensing transparency
- - 🧹 **Deduplication** — content-level and URL-level

  ---

- ##

-
- |---|---|
- | **Sample** | Free with gated access — evaluate data quality |
- | **Full Dataset** | Commercial licence — complete production data |
- | **Enterprise** | Custom pricing — dedicated support, SLA, bespoke corpora |

-

-
- 🏦 Bank Transfer (SEPA/SWIFT) · 📱 TWINT (Swiss) · ₿ Crypto (BTC/ETH/SOL — addresses on request)

  ---

-
- 🏔️ Curated in Switzerland · <a href="https://optitransfer.ch">optitransfer.ch</a> · <a href="mailto:data@optitransfer.ch">data@optitransfer.ch</a>
- </p>

-
+ # OptiTransferData
+
+ Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP.
+
  ---

+ ## About

+ OptiTransferData is the data arm of [OptiTransfer AG](https://optitransfer.ch), a Swiss-registered technology company. We produce compliance-ready, quality-scored web datasets for AI teams building in regulated markets.

+ Every dataset ships with:
+
+ - Full data provenance and SHA256 verification
+ - PII detection and redaction
+ - Multi-dimensional quality scoring (0-100 per document)
+ - EU AI Act and Swiss FADP compliance documentation
+ - Croissant metadata for ML interoperability
+ - Multiple export formats (Parquet, JSONL, language splits, RAG chunks)
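
The SHA256 verification listed above can be reproduced on the consumer side with standard tooling. The following is a minimal Python sketch, assuming each downloaded file comes with an expected hex digest. The README does not say how checksums are distributed, and the function names `sha256_of_file` and `verify_file` are illustrative rather than part of any OptiTransfer tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value (hypothetical checksum source)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters here because the shards are large (the README quotes 554 MB in total), so loading a whole file into memory just to hash it is unnecessary.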

  ---

+ ## Available Datasets

+ ### *.ch Swiss Web Premium (A+)

+ 110,491 documents | 78 fields | 554 MB total | Quality score: 62.3 mean

+ The flagship Swiss web corpus, extracted and quality-scored from the .ch ccTLD. Multilingual coverage across German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and additional languages. Nine-component quality model with full provenance chain.
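
For a rough sense of scale, the language percentages above can be converted into approximate document counts. This is back-of-the-envelope arithmetic on the figures stated in the README, not numbers published by OptiTransfer. The percentages are rounded, so the per-language counts are estimates, and the `other` bucket is simply whatever remains.

```python
# Figures copied from the README text above.
TOTAL_DOCUMENTS = 110_491
LANGUAGE_SHARES_PERCENT = {"de": 61.2, "fr": 19.0, "en": 10.5, "it": 4.7}


def approximate_counts(total: int, shares_percent: dict[str, float]) -> dict[str, int]:
    """Estimate per-language document counts from rounded percentage shares."""
    counts = {lang: round(total * pct / 100) for lang, pct in shares_percent.items()}
    # Everything not covered by the named languages goes into a remainder bucket.
    counts["other"] = total - sum(counts.values())
    return counts


if __name__ == "__main__":
    for lang, count in approximate_counts(TOTAL_DOCUMENTS, LANGUAGE_SHARES_PERCENT).items():
        print(f"{lang}: ~{count} documents")
```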
+
+ **Best suited for:**

+ `LLM Pre-Training` `Supervised Fine-Tuning (SFT)` `Retrieval-Augmented Generation (RAG)` `Multilingual NLP` `German Language Models` `French Language Models` `Swiss Market AI` `Regulatory Compliance (EU AI Act)` `Domain-Specific Training` `Web Corpus Research` `Text Classification` `Summarisation` `Question Answering` `Translation`

+ **Formats:** Parquet (7 shards) | JSONL (7 shards) | Language Splits (DE, FR, EN, IT) | RAG Chunks (4 files)

+ | Repository | Description | Access |
+ |---|---|---|
+ | [swiss-web-premium-ch](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch) | 10,000-record stratified sample with full documentation and QA report | Gated (evaluation) |
+ | [swiss-web-premium-ch-full](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch-full) | Complete 110,491-record production dataset | Gated (licensed) |

+ ---
+
+ ## Data Pipeline
+
+ All datasets are processed through the OptiTransfer pipeline:
+
+ 1. **Source Selection** -- Common Crawl filtered by ccTLD and domain trust scoring
+ 2. **Extraction** -- Text extraction with deduplication, language detection, and structural analysis
+ 3. **Quality Scoring** -- Nine-component quality model producing a composite 0-100 score per document
+ 4. **Enrichment** -- Content categorisation, trust tier assignment, academic/news detection, skill tagging
+ 5. **Compliance** -- PII scanning, redaction, and regulatory documentation
+ 6. **Verification** -- SHA256 checksums, QA reporting, and independent audit readiness
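
Step 3 above yields a composite 0-100 score per document, which a downstream user would typically apply as a threshold. Below is a minimal sketch of filtering JSONL records by that score. It assumes the score is stored in a field named `quality_score`, which is a hypothetical name: the README mentions 78 fields but does not list them, so check the dataset documentation for the real field.

```python
import json
from typing import Iterable, Iterator


def filter_by_quality(
    lines: Iterable[str],
    min_score: float,
    score_field: str = "quality_score",  # hypothetical field name, not confirmed by the README
) -> Iterator[dict]:
    """Yield parsed JSONL records whose score is at least min_score."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        score = record.get(score_field)
        # Records without a numeric score are skipped rather than guessed at.
        if isinstance(score, (int, float)) and score >= min_score:
            yield record
```

Because the function accepts any iterable of strings, it can be passed an open file handle directly (`filter_by_quality(open(path, encoding="utf-8"), 70)`), which streams a shard without loading it fully into memory.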

  ---

+ ## Quality Assurance
+
+ Each dataset is accompanied by a full QA report covering:
+
+ - Pipeline configuration and processing parameters
+ - Quality score distributions and statistical analysis
+ - Language detection accuracy and coverage
+ - Content categorisation breakdown
+ - PII detection results
+ - Domain trust tier analysis
+ - SHA256 integrity verification

+ QA reports are available in both the sample and full product repositories.

  ---

+ ## Licensing

+ All datasets are available under the OptiTransfer Commercial License. Sample repositories provide gated evaluation access. Full datasets require a commercial license agreement.

+ **Payment methods:** Bank Transfer (SEPA/SWIFT), TWINT, Cryptocurrency (BTC/ETH/SOL)

+ For pricing, volume licensing, or custom extraction requests, contact [data@optitransfer.ch](mailto:data@optitransfer.ch).

  ---

+ ## Contact

+ - **Email:** [data@optitransfer.ch](mailto:data@optitransfer.ch)
+ - **Web:** [optitransfer.ch](https://optitransfer.ch)
+ - **Location:** Switzerland