ttj
/

nanochat-cache

Model card Files Files and versions

nanochat-cache / report /report.md

ttj's picture

Add files using upload-large-folder tool

85a524c verified 6 months ago

|

history blame contribute delete

6.53 kB

Tokenizer training

timestamp: 2025-11-03 05:56:12

max_chars: 2,000,000,000
doc_cap: 10,000
vocab_size: 65,536
train_time: 57.1515
num_special_tokens: 9
token_bytes_min: 1
token_bytes_max: 32
token_bytes_mean: 6.9197
token_bytes_std: 2.8748

Tokenizer evaluation

timestamp: 2025-11-03 05:56:20

Comparison with GPT-2

Text Type	Bytes	GPT-2 Tokens	GPT-2 Ratio	Ours Tokens	Ours Ratio	Relative Diff %
news	1819	404	4.50	375	4.85	+7.2%
korean	893	745	1.20	712	1.25	+4.4%
code	1259	576	2.19	492	2.56	+14.6%
math	1834	936	1.96	966	1.90	-3.2%
science	1112	260	4.28	228	4.88	+12.3%
fwe-train	4208518	900364	4.67	856883	4.91	+4.8%
fwe-val	4908443	1059062	4.63	1010352	4.86	+4.6%

Comparison with GPT-4

Text Type	Bytes	GPT-4 Tokens	GPT-4 Ratio	Ours Tokens	Ours Ratio	Relative Diff %
news	1819	387	4.70	375	4.85	+3.1%
korean	893	364	2.45	712	1.25	-95.6%
code	1259	309	4.07	492	2.56	-59.2%
math	1834	832	2.20	966	1.90	-16.1%
science	1112	249	4.47	228	4.88	+8.4%
fwe-train	4208518	874799	4.81	856883	4.91	+2.0%
fwe-val	4908443	1029691	4.77	1010352	4.86	+1.9%

Base model training

timestamp: 2025-11-03 09:00:42

Base model loss

timestamp: 2025-11-03 09:09:42

train bpb: 0.8147
val bpb: 0.8119
sample 0: <|bos|>The capital of France is Paris. It is the largest city in France and the second largest in Europe.
sample 1: <|bos|>The chemical symbol of gold is Au. It is a soft, malleable, ductile, and malleable metal. It
sample 2: <|bos|>If yesterday was Friday, then tomorrow will be Tuesday. If tomorrow is Tuesday, then tomorrow will be Wednesday. If tomorrow is
sample 3: <|bos|>The opposite of hot is cold. The opposite of hot is cold. The opposite of hot is cold.
sample 4: <|bos|>The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
sample 5: <|bos|>My favorite color is red. I love the color red because it is a very strong color. I
sample 6: <|bos|>If 5x + 3 = 13, then x is a multiple of 5. If 5x + 3 =

Base model evaluation

timestamp: 2025-11-03 09:13:28

Model: base_model (step 21400)
CORE metric: 0.2137
hellaswag_zeroshot: 0.2687
jeopardy: 0.1214
bigbench_qa_wikidata: 0.5278
arc_easy: 0.5314
arc_challenge: 0.1251
copa: 0.3600
commonsense_qa: 0.1145
piqa: 0.3917
openbook_qa: 0.1360
lambada_openai: 0.3549
hellaswag: 0.2634
winograd: 0.2601
winogrande: 0.1018
bigbench_dyck_languages: 0.1080
agi_eval_lsat_ar: 0.1359
bigbench_cs_algorithms: 0.3720
bigbench_operators: 0.1429
bigbench_repeat_copy_logic: 0.0000
squad: 0.2528
coqa: 0.1932
boolq: -0.2369
bigbench_language_identification: 0.1762

Midtraining

timestamp: 2025-11-03 09:23:03

run: fal-mid
device_type:
dtype: bfloat16
num_iterations: -1
max_seq_len: 2048
device_batch_size: 32
unembedding_lr: 0.0040
embedding_lr: 0.2000
matrix_lr: 0.0200
init_lr_frac: 1.0000
weight_decay: 0.0000
eval_every: 150
eval_tokens: 10,485,760
total_batch_size: 524,288
dry_run: 0
Number of iterations: 809
DDP world size: 8
Minimum validation bpb: 0.3953

Chat evaluation mid

timestamp: 2025-11-03 09:30:47

source: mid
task_name: None
dtype: bfloat16
temperature: 0.0000
max_new_tokens: 512
num_samples: 1
top_k: 50
batch_size: 8
model_tag: None
step: None
max_problems: None
device_type:
ARC-Easy: 0.3443
ARC-Challenge: 0.2927
MMLU: 0.3040
GSM8K: 0.0417
HumanEval: 0.0732
SpellingBee: 1.0000
ChatCORE metric: 0.2282

Chat SFT

timestamp: 2025-11-03 09:34:55

run: fal-sft
source: mid
device_type:
dtype: bfloat16
device_batch_size: 4
num_epochs: 1
num_iterations: -1
target_examples_per_step: 32
unembedding_lr: 0.0040
embedding_lr: 0.2000
matrix_lr: 0.0200
weight_decay: 0.0000
init_lr_frac: 0.0200
eval_every: 100
eval_steps: 100
eval_metrics_every: 200
eval_metrics_max_problems: 1024
Training rows: 22,439
Number of iterations: 701
Training loss: 0.5668
Validation loss: 1.0105

Chat evaluation sft

timestamp: 2025-11-03 09:42:00

source: sft
task_name: None
dtype: bfloat16
temperature: 0.0000
max_new_tokens: 512
num_samples: 1
top_k: 50
batch_size: 8
model_tag: None
step: None
max_problems: None
device_type:
ARC-Easy: 0.3342
ARC-Challenge: 0.2969
MMLU: 0.2954
GSM8K: 0.0629
HumanEval: 0.0427
SpellingBee: 1.0000
ChatCORE metric: 0.2235

Summary

[bloat data missing]

Metric	BASE	MID	SFT	RL
CORE	0.2137	-	-	-
ARC-Challenge	-	0.2927	0.2969	-
ARC-Easy	-	0.3443	0.3342	-
GSM8K	-	0.0417	0.0629	-
HumanEval	-	0.0732	0.0427	-
MMLU	-	0.3040	0.2954	-
ChatCORE	-	0.2282	0.2235	-

Total wall clock time: unknown