RonanMcGovern committed on
Commit d8c00d0 · verified · 1 Parent(s): 0bfc9d7

Upload via push_to_hf.py
report/latest/header.md ADDED
@@ -0,0 +1,36 @@
+ # nanochat training report
+
+ Generated: 2025-12-17 08:44:33
+
+ ## Environment
+
+ ### Git Information
+ - Branch: recursive
+ - Commit: f008f9b (dirty)
+ - Message: feat: add Poisson sampling to mid/sft training and multi-recur chat eval
+
+ ### Hardware
+ - Platform: Linux
+ - CPUs: 128 cores (256 logical)
+ - Memory: 1511.5 GB
+ - GPUs: 8x NVIDIA H100 80GB HBM3
+ - GPU Memory: 632.8 GB total
+ - CUDA Version: 12.8
+ - Hourly Rate: $24.00/hour
+
+ ### Software
+ - Python: 3.10.12
+ - PyTorch: 2.8.0+cu128
+
+
+ ### Bloat
+ - Characters: 464,005
+ - Lines: 11,225
+ - Files: 55
+ - Tokens (approx): 116,001
+ - Dependencies (uv.lock lines): 2,252
+
+ Run started: 2025-12-17 08:44:37
+
+ ---
+
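The "Bloat" counts above can be gathered with a short script. A minimal sketch, not the report generator's actual code — the glob pattern and the 4-characters-per-token divisor are assumptions, though the divisor does match the report's numbers (464,005 // 4 = 116,001):

```python
from pathlib import Path

def bloat_stats(root: str, pattern: str = "*.py") -> dict:
    """Count characters, lines, and files under root, plus a rough
    token estimate assuming ~4 characters per token (an assumption
    that happens to reproduce 464,005 chars -> 116,001 tokens)."""
    files = sorted(Path(root).rglob(pattern))
    texts = [p.read_text(errors="ignore") for p in files]
    chars = sum(len(t) for t in texts)
    lines = sum(t.count("\n") for t in texts)
    return {
        "chars": chars,
        "lines": lines,
        "files": len(files),
        "tokens_approx": chars // 4,
    }
```

The dependency count would come from `wc -l uv.lock` rather than this function.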
report/latest/tokenizer-evaluation.md ADDED
@@ -0,0 +1,27 @@
+ ## Tokenizer evaluation
+ timestamp: 2025-12-17 08:46:14
+
+ ### Comparison with GPT-2
+
+ | Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
+ |-----------|-------|--------------|--------------|-------------|------------|-----------------|
+ | news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
+ | korean | 893 | 745 | 1.20 | 721 | 1.24 | +3.2% |
+ | code | 1259 | 576 | 2.19 | 493 | 2.55 | +14.4% |
+ | math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
+ | science | 1112 | 260 | 4.28 | 225 | 4.94 | +13.5% |
+ | fwe-train | 4208518 | 900364 | 4.67 | 856901 | 4.91 | +4.8% |
+ | fwe-val | 4908443 | 1059062 | 4.63 | 1010356 | 4.86 | +4.6% |
+
+ ### Comparison with GPT-4
+
+ | Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
+ |-----------|-------|--------------|--------------|-------------|------------|-----------------|
+ | news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
+ | korean | 893 | 364 | 2.45 | 721 | 1.24 | -98.1% |
+ | code | 1259 | 309 | 4.07 | 493 | 2.55 | -59.5% |
+ | math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
+ | science | 1112 | 249 | 4.47 | 225 | 4.94 | +9.6% |
+ | fwe-train | 4208518 | 874799 | 4.81 | 856901 | 4.91 | +2.0% |
+ | fwe-val | 4908443 | 1029691 | 4.77 | 1010356 | 4.86 | +1.9% |
+
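Reading the tables above: each Ratio column is bytes per token (higher = better compression), and Relative Diff % is the percentage reduction in token count versus the baseline tokenizer, so positive means ours emits fewer tokens. A minimal sketch of that arithmetic, checked against the news row of the GPT-2 table (function names are illustrative, not the report script's):

```python
def compression_ratio(n_bytes: int, n_tokens: int) -> float:
    # bytes per token: higher means fewer tokens for the same text
    return n_bytes / n_tokens

def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    # positive: our tokenizer uses fewer tokens than the baseline
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100

# news row, GPT-2 table: 1819 bytes, 404 GPT-2 tokens vs 375 of ours
print(round(compression_ratio(1819, 404), 2))  # 4.5
print(round(compression_ratio(1819, 375), 2))  # 4.85
print(round(relative_diff_pct(404, 375), 1))   # 7.2
```

The same formula reproduces the korean row of the GPT-4 table: (364 - 721) / 364 ≈ -98.1%.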
report/latest/tokenizer-training.md ADDED
@@ -0,0 +1,13 @@
+ ## Tokenizer training
+ timestamp: 2025-12-17 08:46:10
+
+ - max_chars: 2,000,000,000
+ - doc_cap: 10,000
+ - vocab_size: 65,536
+ - train_time: 86.1852
+ - num_special_tokens: 9
+ - token_bytes_min: 1
+ - token_bytes_max: 32
+ - token_bytes_mean: 6.9151
+ - token_bytes_std: 2.8736
+
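The token_bytes_* lines summarize the byte length of each entry in the learned 65,536-token vocabulary (min 1 for raw bytes, max 32 for the longest merged token). A hedged sketch of how such statistics could be computed — the three-entry toy vocab and the choice of sample standard deviation are assumptions, not the training script's actual code:

```python
import statistics

# toy stand-in for the real 65,536-entry vocab: token id -> token bytes
vocab = {0: b"a", 1: b" the", 2: b" tokenizer"}

lengths = [len(tok) for tok in vocab.values()]
stats = {
    "token_bytes_min": min(lengths),
    "token_bytes_max": max(lengths),
    "token_bytes_mean": round(statistics.mean(lengths), 4),
    "token_bytes_std": round(statistics.stdev(lengths), 4),
}
print(stats)
```

With the real vocabulary in place of the toy dict, the same four lines would yield the 1 / 32 / 6.9151 / 2.8736 figures reported above.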