Optitransfer committed on
Commit 8860b80 · verified · 1 Parent(s): 93934a8

Update org README: corporate clean, Swiss datasets with use-case tags

Files changed (1)
  1. README.md +58 -47
README.md CHANGED
@@ -1,75 +1,86 @@
- ---
- title: README
- emoji: 🏔️
- colorFrom: blue
- colorTo: green
- sdk: static
- pinned: false
  ---

- <div style="max-width: 800px; margin: 0 auto;">

- <h2>🏔️ OptiTransferData Sovereign AI Data for Europe</h2>

- <p style="font-size: 1.1em; color: #555;">
- Production-grade, EU AI Act compliant web corpora for LLM training, RAG pipelines, and NLP research. Curated in Switzerland 🇨🇭 with independent quality assurance.
- </p>

  ---

- ### 🎯 What We Do

- We build **gold-standard national web corpora** — comprehensive, deduplicated, and quality-scored datasets covering entire country-level web domains. Each dataset is independently audited and delivered with full provenance tracking, SHA-256 integrity verification, and commercial licensing.

- **Our focus areas:**
- - 🤖 **LLM Pre-training & Fine-tuning** — Sovereign language data at scale
- - 🔍 **RAG Pipelines** — Pre-chunked, embedding-ready corpora with quality scores
- - 🏛️ **Government & Regulatory NLP** — Domain-classified, jurisdiction-specific data
- - 📊 **Academic Research** — Reproducible, well-documented datasets with full metadata

- ---

- ### 📦 Available Datasets

- | Dataset | Records | Coverage | Format |
- |---|---|---|---|
- | 🇱🇮 [Liechtenstein Ultra Premium](https://huggingface.co/datasets/OptiTransferData/liechtenstein-ultra-premium-li) | 35,748 | Full `.li` domain | JSONL · 37 fields |
- | 🇫🇷 [France Sovereign RAG Chunks](https://huggingface.co/datasets/OptiTransferData/france-sovereign-rag-chunks) | 348,829 | French government & institutional web | JSONL · 8 fields |

- > **Free gated samples** available on each dataset — request access to evaluate before purchasing.

- **Coming soon:** 🇩🇪 Germany · 🇦🇹 Austria · 🇨🇭 Switzerland · 🇮🇹 Italy · 🇪🇸 Spain

  ---

- ### Quality Standards

- - 📋 **Independent QA audits** with documented accuracy metrics
- - 🔐 **SHA-256 integrity verification** on all production files
- - 📊 **Quality scoring** per record (0–100 scale)
- - 🏷️ **Domain classification** and language detection
- - 📜 **EU AI Act compliance** — full data provenance and licensing transparency
- - 🧹 **Deduplication** — content-level and URL-level

  ---

- ### 💼 Licensing & Access

- | Tier | Access |
- |---|---|
- | **Sample** | Free with gated access — evaluate data quality |
- | **Full Dataset** | Commercial licence — complete production data |
- | **Enterprise** | Custom pricing — dedicated support, SLA, bespoke corpora |

- 📧 **Contact us for a quote:** [data@optitransfer.ch](mailto:data@optitransfer.ch)

- **Payment methods:**
- 🏦 Bank Transfer (SEPA/SWIFT) · 📱 TWINT (Swiss) · ₿ Crypto (BTC/ETH/SOL — addresses on request)

  ---

- <p style="text-align: center; color: #888; font-size: 0.9em;">
- 🏔️ Curated in Switzerland · <a href="https://optitransfer.ch">optitransfer.ch</a> · <a href="mailto:data@optitransfer.ch">data@optitransfer.ch</a>
- </p>

- </div>
+ # OptiTransferData
+
+ Premium web corpora for LLM pre-training, fine-tuning, RAG, and multilingual NLP.
+
  ---

+ ## About

+ OptiTransferData is the data arm of [OptiTransfer AG](https://optitransfer.ch), a Swiss-registered technology company. We produce compliance-ready, quality-scored web datasets for AI teams building in regulated markets.

+ Every dataset ships with:
+
+ - Full data provenance and SHA256 verification
+ - PII detection and redaction
+ - Multi-dimensional quality scoring (0-100 per document)
+ - EU AI Act and Swiss FADP compliance documentation
+ - Croissant metadata for ML interoperability
+ - Multiple export formats (Parquet, JSONL, language splits, RAG chunks)

  ---

+ ## Available Datasets

+ ### *.ch Swiss Web Premium (A+)

+ 110,491 documents | 78 fields | 554 MB total | Quality score: 62.3 mean

+ The flagship Swiss web corpus, extracted and quality-scored from the .ch ccTLD. Multilingual coverage across German (61.2%), French (19.0%), English (10.5%), Italian (4.7%), and additional languages. Nine-component quality model with full provenance chain.
+
+ **Best suited for:**

+ `LLM Pre-Training` `Supervised Fine-Tuning (SFT)` `Retrieval-Augmented Generation (RAG)` `Multilingual NLP` `German Language Models` `French Language Models` `Swiss Market AI` `Regulatory Compliance (EU AI Act)` `Domain-Specific Training` `Web Corpus Research` `Text Classification` `Summarisation` `Question Answering` `Translation`

+ **Formats:** Parquet (7 shards) | JSONL (7 shards) | Language Splits (DE, FR, EN, IT) | RAG Chunks (4 files)

+ | Repository | Description | Access |
+ |---|---|---|
+ | [swiss-web-premium-ch](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch) | 10,000-record stratified sample with full documentation and QA report | Gated (evaluation) |
+ | [swiss-web-premium-ch-full](https://huggingface.co/datasets/OptiTransferData/swiss-web-premium-ch-full) | Complete 110,491-record production dataset | Gated (licensed) |

+ ---
+
+ ## Data Pipeline
+
+ All datasets are processed through the OptiTransfer pipeline:
+
+ 1. **Source Selection** -- Common Crawl filtered by ccTLD and domain trust scoring
+ 2. **Extraction** -- Text extraction with deduplication, language detection, and structural analysis
+ 3. **Quality Scoring** -- Nine-component quality model producing a composite 0-100 score per document
+ 4. **Enrichment** -- Content categorisation, trust tier assignment, academic/news detection, skill tagging
+ 5. **Compliance** -- PII scanning, redaction, and regulatory documentation
+ 6. **Verification** -- SHA256 checksums, QA reporting, and independent audit readiness

  ---

+ ## Quality Assurance
+
+ Each dataset is accompanied by a full QA report covering:
+
+ - Pipeline configuration and processing parameters
+ - Quality score distributions and statistical analysis
+ - Language detection accuracy and coverage
+ - Content categorisation breakdown
+ - PII detection results
+ - Domain trust tier analysis
+ - SHA256 integrity verification

+ QA reports are available in both the sample and full product repositories.

  ---

+ ## Licensing

+ All datasets are available under the OptiTransfer Commercial License. Sample repositories provide gated evaluation access. Full datasets require a commercial license agreement.

+ **Payment methods:** Bank Transfer (SEPA/SWIFT), TWINT, Cryptocurrency (BTC/ETH/SOL)

+ For pricing, volume licensing, or custom extraction requests, contact [data@optitransfer.ch](mailto:data@optitransfer.ch).

  ---

+ ## Contact

+ - **Email:** [data@optitransfer.ch](mailto:data@optitransfer.ch)
+ - **Web:** [optitransfer.ch](https://optitransfer.ch)
+ - **Location:** Switzerland
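
Reviewer note on the SHA256 verification mentioned in the new README: the README does not say how checksums are published, so the following is only a minimal sketch of how a consumer might check downloaded shards. It assumes a hypothetical manifest file in the common `sha256sum` layout (one `<hex digest>  <relative path>` pair per line) sitting next to the data files; the manifest name and layout are assumptions, not something the repository documents.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> dict[str, bool]:
    """Check each '<hex digest>  <relative path>' line of a manifest.

    The manifest layout is an assumption (sha256sum-style); paths are
    resolved relative to the directory that contains the manifest.
    """
    results: dict[str, bool] = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        target = manifest.parent / name
        results[name] = (
            target.is_file() and sha256_of_file(target) == expected.lower()
        )
    return results
```

Calling `verify_manifest(Path("SHA256SUMS"))` (a hypothetical file name) returns a mapping from each listed file to `True` or `False`, so a missing or altered shard shows up as `False` rather than raising an exception.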