---
title: OptiTransfer Data
emoji: ◼️
sdk: static
pinned: false
colorFrom: gray
colorTo: blue
---
# OptiTransfer Data

Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP.

---
## About

OptiTransfer Data is the data division of [OptiTransfer AG](https://optitransfer.ch), a Swiss-registered technology company. We produce compliance-ready, quality-scored web datasets for AI teams building in regulated markets.

Every dataset ships with:

- Full data provenance and SHA256 verification
- PII detection and redaction
- Multi-dimensional quality scoring (0-100 per document)
- EU AI Act and Swiss FADP compliance documentation
- Croissant metadata for ML interoperability
- Multiple export formats (Parquet, JSONL, language splits, RAG chunks)
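
As an illustration of the SHA256 verification mentioned above, the following is a minimal sketch of how downloaded shards could be checked against published checksums. It assumes a `sha256sum`-style manifest (one `<hex digest>  <file name>` pair per line) stored next to the shards; the manifest name and layout are assumptions, not the documented OptiTransfer format, so adapt them to whatever ships with the dataset. A call such as `verify_manifest(Path("SHA256SUMS"))` (hypothetical file name) would return a pass/fail flag per file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> dict[str, bool]:
    """Check each '<hex digest>  <file name>' line of an assumed manifest layout.

    File names are resolved relative to the manifest's own directory.
    """
    results = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        results[name] = sha256_of_file(manifest.parent / name) == expected.lower()
    return results
```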
---
## Available Datasets

### *.ch Swiss Web Premium (A+)

110,491 documents | 78 fields | 554 MB total | Mean quality score: 62.3 (0-100 scale)

The flagship Swiss web corpus, extracted from the .ch ccTLD and quality-scored. Multilingual coverage spans German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and additional languages. Each document is scored by a nine-component quality model and carries a full provenance chain.

**Best suited for:** LLM Pre-Training, Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), Multilingual NLP, German Language Models, French Language Models, Swiss Market AI, Regulatory Compliance (EU AI Act), Domain-Specific Training, Web Corpus Research, Text Classification, Summarisation, Question Answering, Translation

**Formats:** Parquet (7 shards) | JSONL (7 shards) | Language Splits (DE, FR, EN, IT) | RAG Chunks (4 files)

| Repository | Description | Access |
|---|---|---|
| [swiss-web-premium-ch](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch) | 10,000-record stratified sample with full documentation and QA report | Gated (evaluation) |
| [swiss-web-premium-ch-full](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch-full) | Complete 110,491-record production dataset | Gated (licensed) |
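
Once access has been granted and a JSONL shard downloaded, a typical first step is to select documents by language and quality score. The sketch below shows one way to do that with the standard library only. The field names `language` and `quality_score` are hypothetical placeholders, and the three records are made up for illustration; consult the dataset documentation for the actual 78-field schema. For a real shard, pass an open file handle (for example `open(path, encoding="utf-8")`) in place of `sample_shard`.

```python
import json
from typing import Iterable, Iterator

# Hypothetical field names; the real schema may differ.
LANG_FIELD = "language"
SCORE_FIELD = "quality_score"


def filter_records(lines: Iterable[str], lang: str, min_score: float) -> Iterator[dict]:
    """Yield JSONL records in `lang` whose score is at least `min_score`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get(LANG_FIELD) == lang and record.get(SCORE_FIELD, 0) >= min_score:
            yield record


# Tiny in-memory stand-in for a downloaded shard (made-up records).
sample_shard = [
    '{"language": "de", "quality_score": 81.5, "text": "Beispieltext"}',
    '{"language": "fr", "quality_score": 74.0, "text": "Exemple"}',
    '{"language": "de", "quality_score": 40.2, "text": "Kurz"}',
]

german_high_quality = list(filter_records(sample_shard, lang="de", min_score=60))
print(len(german_high_quality))  # number of German records scoring 60 or more
```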
---
## Data Pipeline

All datasets are processed through the OptiTransfer pipeline:

1. **Source Selection** -- Common Crawl filtered by ccTLD and domain trust scoring
2. **Extraction** -- Text extraction with deduplication, language detection, and structural analysis
3. **Quality Scoring** -- Nine-component quality model producing a composite 0-100 score per document
4. **Enrichment** -- Content categorisation, trust tier assignment, academic/news detection, skill tagging
5. **Compliance** -- PII scanning, redaction, and regulatory documentation
6. **Verification** -- SHA256 checksums, QA reporting, and independent audit readiness
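
To make two of the steps above more concrete, here is a toy sketch of deduplication (step 2) and PII redaction (step 5). It is not the OptiTransfer implementation, whose methods are not described here: it assumes exact-duplicate removal by hashing whitespace-normalised, lowercased text, and it redacts only e-mail addresses with a deliberately simple regular expression. Production pipelines generally also use near-duplicate detection and much broader PII coverage.

```python
import hashlib
import re

# Deliberately simple pattern; real PII detection covers far more than e-mail addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies hash alike."""
    return " ".join(text.split()).lower()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by SHA256 of its normalised text."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def redact_emails(text: str) -> str:
    """Replace anything that looks like an e-mail address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


docs = [
    "Kontakt: info@example.ch",
    "KONTAKT:   info@example.ch",  # duplicate of the first after normalisation
    "Bienvenue à Genève",
]
cleaned = [redact_emails(d) for d in deduplicate(docs)]
print(cleaned)
```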
---
## Quality Assurance

Each dataset is accompanied by a full QA report covering:

- Pipeline configuration and processing parameters
- Quality score distributions and statistical analysis
- Language detection accuracy and coverage
- Content categorisation breakdown
- PII detection results
- Domain trust tier analysis
- SHA256 integrity verification

QA reports are available in both the sample and full product repositories.

---
## Licensing

All datasets are available under the OptiTransfer Commercial License. Sample repositories provide gated evaluation access. Full datasets require a commercial license agreement.

**Payment methods:** Bank Transfer (SEPA/SWIFT) | TWINT | Cryptocurrency (BTC / ETH / SOL)

For pricing, volume licensing, or custom extraction requests, contact [data@optitransfer.ch](mailto:data@optitransfer.ch).

---
## Contact

- **Email:** [data@optitransfer.ch](mailto:data@optitransfer.ch)
- **Web:** [optitransfer.ch](https://optitransfer.ch)
- **Location:** Switzerland