Use custom config type to prevent Qwen3 fallback

9632173 verified 7 days ago

3.62 kB

	---
	library_name: transformers
	base_model: Qwen/Qwen3-0.6B
	tags:
	- qwen3
	- causal-lm
	- tiny-language-model
	- novelty-gated-attention
	- trust-remote-code
	---

	# tinyLM-8M-exp

	Tiny 8M-class Qwen3-config causal LM with math-only novelty-gated GQA.

	## Architecture

	\| Item \| Value \|
	\| --- \| ---: \|
	\| Config type \| `tinyqwen3_novelty` \|
	\| Parameters \| 8.132M \|
	\| Layers \| 8 \|
	\| Hidden size \| 256 \|
	\| MLP size \| 896 \|
	\| Query heads \| 8 \|
	\| KV heads \| 4 \|
	\| Head dim \| 32 \|
	\| RoPE theta \| 2500 \|
	\| Tied embeddings \| yes \|

	\| Attention \| Value \|
	\| --- \| --- \|
	\| Type \| GQA \|
	\| Novelty gate \| math-only element-wise RMS-normalized abs-delta \|
	\| Gate floor \| 0.05 \|

	## Training

	\| Item \| Value \|
	\| --- \| --- \|
	\| Tokenizer \| `AxiomicLabs/GPT-S2-5M` \|
	\| Sequence length \| 512 \|
	\| Microbatch size \| 1024 \|
	\| Gradient accumulation \| 4 \|
	\| Effective batch size \| 4096 \|
	\| Steps \| 10,000 \|
	\| Validation cadence \| every 1,000 steps \|
	\| Official lm-eval \| after final Hub upload on ARC-Easy, ARC-Challenge, PIQA, HellaSwag \|
	\| LR schedule \| warmup, cosine to min by 10,000 \|
	\| Optimizer \| Muon for middle 2D weights, AdamW for the rest \|
	\| Special-token policy \| BOS/EOS are document-level; `<\|im_start\|>`/`<\|im_end\|>` are sequence-level \|

	\| Dataset \| Share \| Config \|
	\| --- \| ---: \| --- \|
	\| `HuggingFaceFW/fineweb-edu` \| 60.0% \| `sample-100BT` \|
	\| `HuggingFaceTB/smollm-corpus` \| 30.0% \| `cosmopedia-v2` only \|
	\| `epfml/FineWeb-HQ` \| 10.0% \| `default` \|

	## Validation

	\| Metric \| Value \|
	\| --- \| ---: \|
	\| Dataset \| `Salesforce/wikitext`, `wikitext-103-raw-v1`, validation \|
	\| Context / stride \| 512 / 256 \|
	\| Loss \| 3.2769 \|
	\| Perplexity \| 26.49 \|
	\| UTF-8 BPB \| 1.4992 \|
	\| Scored tokens \| 365,258 \|
	\| UTF-8 bytes \| 1,151,766 \|

	## Evaluation

	Scores were run after Hub upload against revision
	`d95a00a6edafab4bc2d6b60a28e6893b00f52699`.
	ARC-Easy, ARC-Challenge, PIQA, and HellaSwag use official `lm_eval`
	0-shot log-likelihood scoring. ArithMark-2.0 uses the same continuation
	NLL scoring style with a custom scorer because it is not available in
	`lm_eval`.

	\| Task \| n \| acc \| acc stderr \| acc_norm \| acc_norm stderr \|
	\| --- \| ---: \| ---: \| ---: \| ---: \| ---: \|
	\| ARC-Easy \| 2,376 \| 37.04% \| 0.99% \| 35.86% \| 0.98% \|
	\| ARC-Challenge \| 1,172 \| 18.77% \| 1.14% \| 22.87% \| 1.23% \|
	\| PIQA \| 1,838 \| 57.67% \| 1.15% \| 57.89% \| 1.15% \|
	\| HellaSwag \| 10,042 \| 26.88% \| 0.44% \| 27.88% \| 0.45% \|
	\| ArithMark-2.0 \| 2,500 \| 25.12% \| 0.87% \| 24.44% \| 0.86% \|

	## Load And Generate

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	repo = "User01110/tinyLM-8M-exp"
	tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	repo,
	trust_remote_code=True,
	torch_dtype="auto",
	device_map="auto",
	)

	prompt = "The future of AI is"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	print(inputs.input_ids[0][:2].tolist()) # auto-prefix: [<\|im_start\|>, <bos>]

	with torch.no_grad():
	output = model.generate(
	**inputs,
	max_new_tokens=512,
	do_sample=True,
	temperature=0.65,
	top_k=30,
	repetition_penalty=1.2,
	pad_token_id=tokenizer.pad_token_id,
	eos_token_id=tokenizer.convert_tokens_to_ids("<\|im_end\|>"),
	)

	print(tokenizer.decode(output[0], skip_special_tokens=True))
	```

	This repo uses a self-contained remote `TinyQwen3NoveltyConfig` plus model code for a Qwen3-style dense decoder with a math-only novelty-gated attention block.

	---
	library_name: transformers
	base_model: Qwen/Qwen3-0.6B
	tags:
	- qwen3
	- causal-lm
	- tiny-language-model
	- novelty-gated-attention
	- trust-remote-code
	---

	# tinyLM-8M-exp

	Tiny 8M-class Qwen3-config causal LM with math-only novelty-gated GQA.

	## Architecture

	\| Item \| Value \|
	\| --- \| ---: \|
	\| Config type \| `tinyqwen3_novelty` \|
	\| Parameters \| 8.132M \|
	\| Layers \| 8 \|
	\| Hidden size \| 256 \|
	\| MLP size \| 896 \|
	\| Query heads \| 8 \|
	\| KV heads \| 4 \|
	\| Head dim \| 32 \|
	\| RoPE theta \| 2500 \|
	\| Tied embeddings \| yes \|

	\| Attention \| Value \|
	\| --- \| --- \|
	\| Type \| GQA \|
	\| Novelty gate \| math-only element-wise RMS-normalized abs-delta \|
	\| Gate floor \| 0.05 \|

	## Training

	\| Item \| Value \|
	\| --- \| --- \|
	\| Tokenizer \| `AxiomicLabs/GPT-S2-5M` \|
	\| Sequence length \| 512 \|
	\| Microbatch size \| 1024 \|
	\| Gradient accumulation \| 4 \|
	\| Effective batch size \| 4096 \|
	\| Steps \| 10,000 \|
	\| Validation cadence \| every 1,000 steps \|
	\| Official lm-eval \| after final Hub upload on ARC-Easy, ARC-Challenge, PIQA, HellaSwag \|
	\| LR schedule \| warmup, cosine to min by 10,000 \|
	\| Optimizer \| Muon for middle 2D weights, AdamW for the rest \|
	\| Special-token policy \| BOS/EOS are document-level; `<\|im_start\|>`/`<\|im_end\|>` are sequence-level \|

	\| Dataset \| Share \| Config \|
	\| --- \| ---: \| --- \|
	\| `HuggingFaceFW/fineweb-edu` \| 60.0% \| `sample-100BT` \|
	\| `HuggingFaceTB/smollm-corpus` \| 30.0% \| `cosmopedia-v2` only \|
	\| `epfml/FineWeb-HQ` \| 10.0% \| `default` \|

	## Validation

	\| Metric \| Value \|
	\| --- \| ---: \|
	\| Dataset \| `Salesforce/wikitext`, `wikitext-103-raw-v1`, validation \|
	\| Context / stride \| 512 / 256 \|
	\| Loss \| 3.2769 \|
	\| Perplexity \| 26.49 \|
	\| UTF-8 BPB \| 1.4992 \|
	\| Scored tokens \| 365,258 \|
	\| UTF-8 bytes \| 1,151,766 \|

	## Evaluation

	Scores were run after Hub upload against revision
	`d95a00a6edafab4bc2d6b60a28e6893b00f52699`.
	ARC-Easy, ARC-Challenge, PIQA, and HellaSwag use official `lm_eval`
	0-shot log-likelihood scoring. ArithMark-2.0 uses the same continuation
	NLL scoring style with a custom scorer because it is not available in
	`lm_eval`.

	\| Task \| n \| acc \| acc stderr \| acc_norm \| acc_norm stderr \|
	\| --- \| ---: \| ---: \| ---: \| ---: \| ---: \|
	\| ARC-Easy \| 2,376 \| 37.04% \| 0.99% \| 35.86% \| 0.98% \|
	\| ARC-Challenge \| 1,172 \| 18.77% \| 1.14% \| 22.87% \| 1.23% \|
	\| PIQA \| 1,838 \| 57.67% \| 1.15% \| 57.89% \| 1.15% \|
	\| HellaSwag \| 10,042 \| 26.88% \| 0.44% \| 27.88% \| 0.45% \|
	\| ArithMark-2.0 \| 2,500 \| 25.12% \| 0.87% \| 24.44% \| 0.86% \|

	## Load And Generate

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	repo = "User01110/tinyLM-8M-exp"
	tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	repo,
	trust_remote_code=True,
	torch_dtype="auto",
	device_map="auto",
	)

	prompt = "The future of AI is"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	print(inputs.input_ids[0][:2].tolist()) # auto-prefix: [<\|im_start\|>, <bos>]

	with torch.no_grad():
	output = model.generate(
	**inputs,
	max_new_tokens=512,
	do_sample=True,
	temperature=0.65,
	top_k=30,
	repetition_penalty=1.2,
	pad_token_id=tokenizer.pad_token_id,
	eos_token_id=tokenizer.convert_tokens_to_ids("<\|im_end\|>"),
	)

	print(tokenizer.decode(output[0], skip_special_tokens=True))
	```

	This repo uses a self-contained remote `TinyQwen3NoveltyConfig` plus model code for a Qwen3-style dense decoder with a math-only novelty-gated attention block.