Upload via push_to_hf.py
report/latest/header.md
ADDED
@@ -0,0 +1,36 @@
# nanochat training report

Generated: 2025-12-17 08:44:33

## Environment

### Git Information
- Branch: recursive
- Commit: f008f9b (dirty)
- Message: feat: add Poisson sampling to mid/sft training and multi-recur chat eval

### Hardware
- Platform: Linux
- CPUs: 128 cores (256 logical)
- Memory: 1511.5 GB
- GPUs: 8x NVIDIA H100 80GB HBM3
- GPU Memory: 632.8 GB total
- CUDA Version: 12.8
- Hourly Rate: $24.00/hour

### Software
- Python: 3.10.12
- PyTorch: 2.8.0+cu128

### Bloat
- Characters: 464,005
- Lines: 11,225
- Files: 55
- Tokens (approx): 116,001
- Dependencies (uv.lock lines): 2,252

Run started: 2025-12-17 08:44:37

---
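The Bloat numbers in the header are consistent with a simple walk over the source tree, with the token count estimated as characters divided by 4 (464,005 / 4 ≈ 116,001). A minimal sketch of that computation, assuming the chars/4 heuristic and an illustrative extension filter (the exact divisor and file set nanochat uses are assumptions):

```python
import os

def bloat_stats(root: str, exts=(".py", ".md", ".toml")) -> dict:
    """Count characters, lines, and files under root; estimate tokens as chars/4."""
    chars = lines = files = 0
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            chars += len(text)
            lines += text.count("\n")
            files += 1
    # chars/4 is a rough rule of thumb for BPE token counts on English/code text.
    return {"chars": chars, "lines": lines, "files": files, "tokens_approx": chars // 4}
```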
report/latest/tokenizer-evaluation.md
ADDED
@@ -0,0 +1,27 @@
## Tokenizer evaluation
timestamp: 2025-12-17 08:46:14

### Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 721 | 1.24 | +3.2% |
| code | 1259 | 576 | 2.19 | 493 | 2.55 | +14.4% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 225 | 4.94 | +13.5% |
| fwe-train | 4208518 | 900364 | 4.67 | 856901 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010356 | 4.86 | +4.6% |

### Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 721 | 1.24 | -98.1% |
| code | 1259 | 309 | 4.07 | 493 | 2.55 | -59.5% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 225 | 4.94 | +9.6% |
| fwe-train | 4208518 | 874799 | 4.81 | 856901 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010356 | 4.86 | +1.9% |
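The derived columns in the tables above follow directly from the raw byte and token counts: each Ratio is bytes per token (higher means better compression), and Relative Diff compares our token count against the baseline's, positive when ours needs fewer tokens. A small sketch reproducing those columns, assuming this is the formula used (the function name is illustrative):

```python
def compression_stats(n_bytes: int, baseline_tokens: int, ours_tokens: int):
    """Return (baseline ratio, ours ratio, relative diff %) as in the report tables."""
    baseline_ratio = n_bytes / baseline_tokens  # bytes per token for the baseline
    ours_ratio = n_bytes / ours_tokens          # bytes per token for our tokenizer
    # Positive when our tokenizer needs fewer tokens than the baseline.
    rel_diff_pct = (baseline_tokens - ours_tokens) / baseline_tokens * 100
    return round(baseline_ratio, 2), round(ours_ratio, 2), round(rel_diff_pct, 1)

# The "news" row of the GPT-2 comparison:
print(compression_stats(1819, 404, 375))  # → (4.5, 4.85, 7.2)
```

The same formula explains the large negative Korean entry in the GPT-4 table: GPT-4's multilingual vocabulary needs 364 tokens where ours needs 721, i.e. (364 − 721) / 364 ≈ −98.1%.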
report/latest/tokenizer-training.md
ADDED
@@ -0,0 +1,13 @@
## Tokenizer training
timestamp: 2025-12-17 08:46:10

- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 86.1852
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9151
- token_bytes_std: 2.8736
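The `token_bytes_*` figures summarize the byte lengths of the learned vocabulary entries. A minimal sketch of how such stats could be derived from a mapping of token id to raw bytes, assuming a population standard deviation; the toy vocabulary stands in for the actual 65,536-entry one:

```python
import statistics

def token_byte_stats(vocab: dict[int, bytes]) -> dict:
    """Summarize the byte lengths of the tokens in a trained vocabulary."""
    lengths = [len(b) for b in vocab.values()]
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": round(statistics.mean(lengths), 4),
        "token_bytes_std": round(statistics.pstdev(lengths), 4),
    }

# Toy 4-token vocabulary for illustration:
toy_vocab = {0: b"a", 1: b"the", 2: b" hello", 3: b"tokenizer"}
print(token_byte_stats(toy_vocab))
```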