---
library_name: transformers
base_model: Qwen/Qwen3-0.6B
tags:
- qwen3
- causal-lm
- tiny-language-model
- novelty-gated-attention
- trust-remote-code
---

# tinyLM-8M-exp

Tiny 8M-class Qwen3-config causal LM with math-only novelty-gated GQA.

## Architecture

| Item | Value |
| --- | ---: |
| Config type | `tinyqwen3_novelty` |
| Parameters | 8.132M |
| Layers | 8 |
| Hidden size | 256 |
| MLP size | 896 |
| Query heads | 8 |
| KV heads | 4 |
| Head dim | 32 |
| RoPE theta | 2500 |
| Tied embeddings | yes |

| Attention | Value |
| --- | --- |
| Type | GQA |
| Novelty gate | math-only element-wise RMS-normalized abs-delta |
| Gate floor | 0.05 |

## Training

| Item | Value |
| --- | --- |
| Tokenizer | `AxiomicLabs/GPT-S2-5M` |
| Sequence length | 512 |
| Microbatch size | 1024 |
| Gradient accumulation | 4 |
| Effective batch size | 4096 |
| Steps | 10,000 |
| Validation cadence | every 1,000 steps |
| Official lm-eval | after final Hub upload on ARC-Easy, ARC-Challenge, PIQA, HellaSwag |
| LR schedule | warmup, cosine to min by 10,000 |
| Optimizer | Muon for middle 2D weights, AdamW for the rest |
| Special-token policy | BOS/EOS are document-level; `<\|im_start\|>`/`<\|im_end\|>` are sequence-level |

| Dataset | Share | Config |
| --- | ---: | --- |
| `HuggingFaceFW/fineweb-edu` | 60.0% | `sample-100BT` |
| `HuggingFaceTB/smollm-corpus` | 30.0% | `cosmopedia-v2` only |
| `epfml/FineWeb-HQ` | 10.0% | `default` |

## Validation

| Metric | Value |
| --- | ---: |
| Dataset | `Salesforce/wikitext`, `wikitext-103-raw-v1`, validation |
| Context / stride | 512 / 256 |
| Loss | 3.2769 |
| Perplexity | 26.49 |
| UTF-8 BPB | 1.4992 |
| Scored tokens | 365,258 |
| UTF-8 bytes | 1,151,766 |

## Evaluation

Scores were run after Hub upload against revision `d95a00a6edafab4bc2d6b60a28e6893b00f52699`.
ARC-Easy, ARC-Challenge, PIQA, and HellaSwag use official `lm_eval` 0-shot log-likelihood scoring.
ArithMark-2.0 is not available in `lm_eval`, so it is scored by a custom scorer that uses the same continuation NLL scoring style.

| Task | n | acc | acc stderr | acc_norm | acc_norm stderr |
| --- | ---: | ---: | ---: | ---: | ---: |
| ARC-Easy | 2,376 | 37.04% | 0.99% | 35.86% | 0.98% |
| ARC-Challenge | 1,172 | 18.77% | 1.14% | 22.87% | 1.23% |
| PIQA | 1,838 | 57.67% | 1.15% | 57.89% | 1.15% |
| HellaSwag | 10,042 | 26.88% | 0.44% | 27.88% | 0.45% |
| ArithMark-2.0 | 2,500 | 25.12% | 0.87% | 24.44% | 0.86% |

## Load And Generate

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "User01110/tinyLM-8M-exp"

# The repo ships custom config and model code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# First two IDs: the tokenizer's automatic prefix, starting with <|im_start|>.
print(inputs.input_ids[0][:2].tolist())

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.65,
        top_k=30,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.pad_token_id,
        # Stop at the sequence-level end marker.
        eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

This repo ships self-contained remote code (`TinyQwen3NoveltyConfig` plus the model implementation) for a Qwen3-style dense decoder with a math-only novelty-gated attention block, which is why loading requires `trust_remote_code=True`.
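
## Checking The Reported Numbers

The figures in the Training, Validation, and Evaluation tables are related by simple arithmetic, and the sketch below recomputes them from the table values. It is not the author's evaluation code. It assumes that the loss is a mean per-token value in nats, that BPB is total loss in bits divided by UTF-8 bytes, that every training sequence is packed to the full 512 tokens, and that the stderr columns are binomial standard errors with the sample (n - 1) variance. All variable and function names are illustrative.

```python
import math

# Values copied from the Validation table.
loss_nats = 3.2769          # assumed: mean loss per scored token, in nats
scored_tokens = 365_258
utf8_bytes = 1_151_766

# Perplexity is the exponential of the mean per-token loss.
perplexity = math.exp(loss_nats)

# Bits per byte: total loss converted from nats to bits, divided by byte count.
bits_per_byte = loss_nats * scored_tokens / (utf8_bytes * math.log(2))

# Values copied from the Training table.
effective_batch = 1024 * 4                  # microbatch size x gradient accumulation
tokens_per_step = effective_batch * 512     # assumed: sequences fully packed to 512
total_tokens = tokens_per_step * 10_000     # upper bound over all steps


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# (task, n, acc, reported acc stderr) copied from the Evaluation table.
rows = [
    ("ARC-Easy", 2376, 0.3704, 0.0099),
    ("ARC-Challenge", 1172, 0.1877, 0.0114),
    ("PIQA", 1838, 0.5767, 0.0115),
    ("HellaSwag", 10042, 0.2688, 0.0044),
    ("ArithMark-2.0", 2500, 0.2512, 0.0087),
]
stderr_gaps = [abs(binomial_stderr(acc, n) - rep) for _, n, acc, rep in rows]

print(f"perplexity      = {perplexity:.2f}")
print(f"UTF-8 BPB       = {bits_per_byte:.4f}")
print(f"effective batch = {effective_batch}")
print(f"tokens per step = {tokens_per_step:,}")
print(f"total tokens    = {total_tokens:,}")
for (task, n, acc, rep), gap in zip(rows, stderr_gaps):
    print(f"{task:<14} stderr = {100 * binomial_stderr(acc, n):.2f}% (reported {100 * rep:.2f}%)")
```

Under these assumptions the recomputed perplexity, BPB, and standard errors agree with the tables to the reported precision, and the training budget comes to at most about 21B tokens.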