---
title: OptiTransfer Data
emoji: ◼️
sdk: static
pinned: false
colorFrom: gray
colorTo: blue
---
# OptiTransfer Data
Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP.
---
## About
OptiTransfer Data is the data division of [OptiTransfer AG](https://optitransfer.ch), a Swiss-registered technology company. We produce compliance-ready, quality-scored web datasets for AI teams building in regulated markets.
Every dataset ships with:
- Full data provenance and SHA256 verification
- PII detection and redaction
- Multi-dimensional quality scoring (0-100 per document)
- EU AI Act and Swiss FADP compliance documentation
- Croissant metadata for ML interoperability
- Multiple export formats (Parquet, JSONL, language splits, RAG chunks)
---
## Available Datasets
### *.ch Swiss Web Premium (A+)
110,491 documents | 78 fields | 554 MB total | Mean quality score: 62.3 (0-100 scale)
The flagship Swiss web corpus, extracted from the .ch ccTLD and quality-scored per document. Multilingual coverage spans German (61.2%), French (19.0%), English (10.5%), and Italian (4.7%), with the remaining share (about 4.6%) in additional languages. Each document is scored by a nine-component quality model and carries a full provenance chain.
**Best suited for:** LLM Pre-Training, Supervised Fine-Tuning (SFT), Retrieval-Augmented Generation (RAG), Multilingual NLP, German Language Models, French Language Models, Swiss Market AI, Regulatory Compliance (EU AI Act), Domain-Specific Training, Web Corpus Research, Text Classification, Summarisation, Question Answering, Translation
**Formats:** Parquet (7 shards) | JSONL (7 shards) | Language Splits (DE, FR, EN, IT) | RAG Chunks (4 files)
| Repository | Description | Access |
|---|---|---|
| [swiss-web-premium-ch](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch) | 10,000-record stratified sample with full documentation and QA report | Gated (evaluation) |
| [swiss-web-premium-ch-full](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch-full) | Complete 110,491-record production dataset | Gated (licensed) |
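
Once a JSONL shard has been downloaded, a common first step is to keep only the documents in one language above a quality threshold. The sketch below shows one way to do that with the standard library. The field names `language` and `quality_score` are assumptions made for illustration; check the dataset documentation for the actual names among the 78 fields. Passing an open file handle for a shard as `lines` streams the records without loading the whole shard into memory.

```python
import json
from typing import Iterable, Iterator


def filter_records(
    lines: Iterable[str], language: str, min_score: float
) -> Iterator[dict]:
    """Yield JSONL records in one language at or above a quality threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "language" and "quality_score" are assumed field names,
        # not confirmed by the dataset schema.
        if (
            record.get("language") == language
            and record.get("quality_score", 0) >= min_score
        ):
            yield record
```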
---
## Data Pipeline
All datasets are processed through the OptiTransfer pipeline:
1. **Source Selection** -- Common Crawl filtered by ccTLD and domain trust scoring
2. **Extraction** -- Text extraction with deduplication, language detection, and structural analysis
3. **Quality Scoring** -- Nine-component quality model producing a composite 0-100 score per document
4. **Enrichment** -- Content categorisation, trust tier assignment, academic/news detection, skill tagging
5. **Compliance** -- PII scanning, redaction, and regulatory documentation
6. **Verification** -- SHA256 checksums, QA reporting, and independent audit readiness
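
The quality scoring step combines several component scores into one 0-100 value per document. The components and their weights are not described in this README, so the sketch below only illustrates one plausible way to combine them: a weighted mean, assuming every component is already on a 0-100 scale. The function name, component names, and weights are all hypothetical.

```python
from typing import Mapping


def composite_score(
    components: Mapping[str, float], weights: Mapping[str, float]
) -> float:
    """Weighted mean of component scores, clamped to the 0-100 range.

    Illustrative only: the real model's components and weights are
    not documented here, and each input is assumed to be 0-100.
    """
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(components[name] * weights[name] for name in components)
    return max(0.0, min(100.0, score / total_weight))
```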
---
## Quality Assurance
Each dataset is accompanied by a full QA report covering:
- Pipeline configuration and processing parameters
- Quality score distributions and statistical analysis
- Language detection accuracy and coverage
- Content categorisation breakdown
- PII detection results
- Domain trust tier analysis
- SHA256 integrity verification
QA reports are included in both the sample repository and the full dataset repository.
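
To check the integrity of a downloaded file yourself, you can recompute its SHA256 digest and compare it with the published checksum. The following is a minimal sketch using Python's standard `hashlib` module. The helper names are illustrative, and where the expected checksum comes from (for example, a checksum file shipped with the dataset) is an assumption to confirm against the repository contents.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published checksum, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```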
---
## Licensing
All datasets are available under the OptiTransfer Commercial License. Sample repositories provide gated evaluation access. Full datasets require a commercial license agreement.
**Payment methods:** Bank Transfer (SEPA/SWIFT) | TWINT | Cryptocurrency (BTC / ETH / SOL)
For pricing, volume licensing, or custom extraction requests, contact [data@optitransfer.ch](mailto:data@optitransfer.ch).
---
## Contact
- **Email:** [data@optitransfer.ch](mailto:data@optitransfer.ch)
- **Web:** [optitransfer.ch](https://optitransfer.ch)
- **Location:** Switzerland