Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP. Swiss-registered. EU AI Act compliant. Quality-scored, PII-redacted, SHA256-verified.
Sovereign national web corpora at scale for pre-training and supervised fine-tuning
Pre-chunked, embedding-ready corpora with quality scores per chunk
Domain-classified, jurisdiction-specific government and institutional data
Reproducible datasets with full metadata, provenance tracking, and QA reports
| Dataset | Records | Formats | Access |
|---|---|---|---|
| *.ch Swiss Web Premium (A+) | 110,491 | Parquet, JSONL, Language Splits, RAG Chunks | Sample / Full |
Flagship Swiss web corpus from the .ch ccTLD. 112.4M tokens, with 78 fields per record. Multilingual coverage: German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and 25 additional languages making up the remainder. Nine-component quality model, full provenance chain, and independent QA report.
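For buyers who want to check the SHA256 verification themselves, a downloaded file can be hashed locally and compared with the digest published for it. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and how and where the expected digest is published is an assumption not stated here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in blocks so that large files do not need to fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's digest with a published value.

    Surrounding whitespace and letter case in the published value are ignored.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_published_digest("corpus.parquet", expected)` returns `True` only when the local file is byte-identical to the one the digest was computed from. Both the file name and the `expected` value are placeholders here.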
Free gated samples available on each dataset. Request access to evaluate before purchasing.
Gated access. Evaluate data quality, schema, and documentation before committing.
Complete production data with commercial licence. All formats included.
Dedicated support, SLA, bespoke corpora, volume pricing.
Contact us for a quote: data@optitransfer.ch