AI & ML interests
EU AI Act compliant sovereign web corpora for LLM training and RAG pipelines
OptiTransfer Data
Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP. Swiss-registered. EU AI Act compliant. Quality-scored, PII-redacted, SHA-256-verified.
Capabilities
- Sovereign national web corpora at scale for pre-training and supervised fine-tuning
- Pre-chunked, embedding-ready corpora with quality scores per chunk
- Domain-classified, jurisdiction-specific government and institutional data
- Reproducible datasets with full metadata, provenance tracking, and QA reports
Available Datasets
| Dataset | Records | Formats | Access |
|---|---|---|---|
| *.ch Swiss Web Premium (A+) | 110,491 | Parquet, JSONL, Language Splits, RAG Chunks | Sample / Full |
Flagship Swiss web corpus from the .ch ccTLD: 112.4M tokens and a 78-field schema. Multilingual coverage: German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and 25 additional languages. Includes a nine-component quality model, a full provenance chain, and an independent QA report.
Free gated samples are available for each dataset. Request access to evaluate the data before purchasing.
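When evaluating a sample, a first pass often filters records by language and quality score. The sketch below shows one way to do that over JSONL lines using only the Python standard library. The field names `quality_score`, `language`, and `url`, and the threshold of 80, are assumptions for illustration; check the dataset documentation for the actual schema.

```python
import json
from typing import Iterable, Iterator


def filter_records(
    lines: Iterable[str],
    min_score: float = 80.0,
    language: str = "de",
) -> Iterator[dict]:
    """Yield JSONL records in `language` scoring at least `min_score`.

    Field names are hypothetical, not taken from the published schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if (
            record.get("quality_score", 0) >= min_score
            and record.get("language") == language
        ):
            yield record


# Three made-up records standing in for lines read from a sample file.
sample = [
    '{"url": "https://example.ch/a", "language": "de", "quality_score": 91.5}',
    '{"url": "https://example.ch/b", "language": "fr", "quality_score": 88.0}',
    '{"url": "https://example.ch/c", "language": "de", "quality_score": 42.0}',
]
kept = list(filter_records(sample))
print([r["url"] for r in kept])
```

With a real file, the same function can be called on an open file handle, since iterating over a text file yields one line at a time.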
Quality Standards
- Independent QA audits with documented accuracy metrics
- SHA-256 integrity verification on all production files
- Quality scoring per record (0 to 100 scale, nine components)
- Domain classification and language detection
- EU AI Act compliance with full data provenance and licensing transparency
- Content-level and URL-level deduplication
- PII detection and redaction (email, phone, IBAN, AHV, credit card)
- Croissant metadata for ML interoperability
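The SHA-256 check can be reproduced on the recipient's side after download. The following is a minimal sketch using Python's standard `hashlib`. How the expected digests are published (for example, a checksum file shipped with the data) is not specified here, so the function names `sha256_of_file` and `verify_file` and the `expected_hex` argument are illustrative assumptions.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks
    so that large Parquet or JSONL files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_file(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest
    (hypothetical input; case and surrounding whitespace are ignored)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify_file("data.parquet", published_digest)` returns `True` only if the downloaded file matches the published digest byte for byte; both argument values here are placeholders.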
Licensing and Pricing
Sample
Gated access. Evaluate data quality, schema, and documentation before committing.
Full Dataset
Complete production data with commercial licence. All formats included.
Enterprise
Dedicated support, SLA, bespoke corpora, volume pricing.
Contact us for a quote: data@optitransfer.ch