Datatest-460m

A 460M-parameter language model trained from scratch on a single NVIDIA RTX 2080 Ti to help with scikit-learn, matplotlib, and general ML coding tasks. Trained on ~24.6B tokens of curated code, ML papers, library docs, and Stack Overflow Q&A; instruction-tuned with SmolTalk + ~950 hand-crafted sklearn/matplotlib examples.

⚠️ Requires trust_remote_code=True. The architecture has several non-standard features (value embeddings, ReLU², custom RMSNorm) that are not in the standard transformers model registry. The repo ships its own modeling_nanochat.py that loads via the auto class.

Quick start

For best results, use self-consistency: generate N=5 candidates per query and pick the first one whose first python block parses cleanly. This gives about +14pp on a custom 18-problem sklearn benchmark compared with N=1; the small model degenerates often, but rarely on every parallel sample. Cost: ~2× wall time for ~14pp accuracy.

import ast, re, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "scottejin/Datatest-460m"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto"
).eval()

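# Matches the first fenced block, with or without a "python" tag.
CODE_BLOCK = re.compile(r"```(?:python)?\s*\n(.*?)\n```", re.DOTALL)

def pick_best(candidates):
    """Return the first candidate whose first code block parses; else the first candidate."""
    for c in candidates:
        m = CODE_BLOCK.search(c)
        if m:
            try:
                ast.parse(m.group(1))
                return c
            except SyntaxError:
                pass
    return candidates[0]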

messages = [{"role": "user", "content": "How do I tune SVM hyperparameters with GridSearchCV? Show a complete pipeline."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
attention_mask = torch.ones_like(inputs)  # single unpadded prompt: attend to every token, including any EOS in the template
with torch.inference_mode():
    out = model.generate(
        inputs, attention_mask=attention_mask,
        max_new_tokens=512, do_sample=True,
        temperature=0.6, top_k=50,        # measured optimal with N=5
        num_return_sequences=5,           # self-consistency
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
prompt_len = inputs.shape[1]
candidates = [tokenizer.decode(out[i, prompt_len:], skip_special_tokens=True) for i in range(5)]
print(pick_best(candidates))

For a single-sample / streaming version (faster, lower quality), use num_return_sequences=1 and do_sample=True, temperature=0.6.
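
The selection rule in pick_best can be checked without loading the model. The sketch below repeats the same function and feeds it three hand-written, hypothetical candidate strings (they are not real model outputs). The fence marker is built at runtime only so that this example can itself sit inside a fenced block.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built at runtime

# Same pattern as the quick-start snippet: first fenced block, optional "python" tag.
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)\n" + FENCE, re.DOTALL)

def pick_best(candidates):
    """Return the first candidate whose first code block parses; else the first candidate."""
    for c in candidates:
        m = CODE_BLOCK.search(c)
        if m:
            try:
                ast.parse(m.group(1))
                return c
            except SyntaxError:
                pass
    return candidates[0]

# Hypothetical outputs, written by hand for illustration only.
no_code = "You should use GridSearchCV with a Pipeline."
broken = f"Try this:\n{FENCE}python\nclf = SVC(\n{FENCE}"
valid = f"Try this:\n{FENCE}python\nfrom sklearn.svm import SVC\nclf = SVC(C=1.0)\n{FENCE}"

chosen = pick_best([no_code, broken, valid])   # skips the first two, returns `valid`
fallback = pick_best([no_code, broken])        # nothing parses, falls back to the first
```

Note that ast.parse only checks syntax. A candidate that calls a nonexistent sklearn API still passes this filter, so running the selected code remains necessary.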

What this model is good at

  • ✅ sklearn classification recipes: SVMs (linear/RBF/poly), Logistic Regression, Random Forest, Gradient Boosting, KNN, Naive Bayes, Decision Trees
  • ✅ Pipelines: Pipeline, make_pipeline, ColumnTransformer, FunctionTransformer
  • ✅ Cross-validation: KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, validation_curve, learning_curve
  • ✅ Metrics: classification_report, confusion_matrix, ROC/AUC (binary + multiclass OvR), regression metrics (MSE, RMSE, R², MAE)
  • ✅ Model interpretation: permutation_importance, PartialDependenceDisplay, plot_tree, CalibratedClassifierCV
  • ✅ Matplotlib recipes: subplots, scatter, bar, hist, violin, heatmap, contour, errorbar, custom legends
  • ✅ Debugging common errors: ConvergenceWarning, NotFittedError, shape mismatches, scaling-after-split leakage
  • ✅ Conceptual explanations: precision vs recall, RMSE vs MAE, regularization, kernel trick, bias-variance

What this model is NOT good at

  • ❌ Anything outside the ML/data-science scope: general chit-chat, history, philosophy, current events. It will confidently confabulate.
  • ❌ Sklearn APIs that don't exist: small models invent plausible-looking API names. Always run the generated code to verify.
  • ❌ Deep learning: almost no PyTorch/TensorFlow training data. Don't ask about transformers, CNNs, or LLMs.
  • ❌ Long-form reasoning: 460M params and a 1024-token context (768 during pretraining) are small. Multi-step proofs, complex algorithm derivations, etc. are unreliable.
  • ❌ Math beyond basic statistics: no symbolic math training.
  • ❌ Code that needs to be exactly correct on the first try: treat outputs as a starting draft, not production code.

Important: 1024-token context window

Pretrained at sequence_len=768, then SFT-extended to 1024. Keep prompts and history under ~900 tokens for reliable behavior. The architecture supports more (RoPE base 100k), but quality drops outside the trained range.
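
One way to stay under the ~900-token budget in a multi-turn chat is to drop the oldest turns first. The following is a minimal sketch, not part of this repo: trim_history and its count_tokens argument are hypothetical names, and the whitespace counter in the example is a stand-in for something like `lambda s: len(tokenizer.encode(s))`. It also assumes the chat template expects the history to start on a user turn.

```python
def trim_history(messages, count_tokens, budget=900):
    """Drop the oldest turns until the summed content tokens fit the budget.

    The latest message is always kept, even if it alone exceeds the budget.
    """
    kept = list(messages)

    def total():
        return sum(count_tokens(m["content"]) for m in kept)

    while len(kept) > 1 and total() > budget:
        kept.pop(0)
        # Assumption: the template wants the history to begin with a user turn.
        while len(kept) > 1 and kept[0]["role"] != "user":
            kept.pop(0)
    return kept

# Stand-in token counter (whitespace split); swap in the real tokenizer in practice.
count_words = lambda s: len(s.split())

history = [
    {"role": "user", "content": "a " * 500},
    {"role": "assistant", "content": "b " * 300},
    {"role": "user", "content": "c " * 200},
]
trimmed = trim_history(history, count_words, budget=900)
```

The count covers message contents only. Special tokens added by the chat template are not included, which is one reason to keep the budget below the full 1024.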

Recommended generation settings

| Use case | temperature | top_k | num_return_sequences | Notes |
|---|---|---|---|---|
| Best quality (recommended) | 0.6 | 50 | 5 | Self-consistency: +~14pp accuracy, ~2× compute |
| Fast (single-sample) | 0.6 | 50 | 1 | Quick streaming chat; degenerates more often |
| Strict / deterministic | 0.0 | 1 | 1 | Greedy; fully reproducible |
| Brainstorming | 0.8 | 50 | 1 | More creative, more hallucination |
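
The settings above can be turned into keyword arguments for model.generate. This is a sketch under assumptions: the preset names and the generation_kwargs helper are hypothetical, and the deterministic row is mapped to do_sample=False because transformers typically rejects temperature=0.0 when sampling is enabled.

```python
# Values copied from the recommended-settings table; the preset names are made up here.
PRESETS = {
    "best_quality":  {"temperature": 0.6, "top_k": 50, "num_return_sequences": 5},
    "fast":          {"temperature": 0.6, "top_k": 50, "num_return_sequences": 1},
    "deterministic": {"temperature": 0.0, "top_k": 1,  "num_return_sequences": 1},
    "brainstorming": {"temperature": 0.8, "top_k": 50, "num_return_sequences": 1},
}

def generation_kwargs(preset, max_new_tokens=512):
    """Build a kwargs dict for model.generate from one of the presets."""
    cfg = PRESETS[preset]
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "num_return_sequences": cfg["num_return_sequences"],
    }
    if cfg["temperature"] == 0.0:
        # Greedy decoding: turn sampling off instead of passing temperature=0.0.
        kwargs["do_sample"] = False
    else:
        kwargs.update(do_sample=True, temperature=cfg["temperature"], top_k=cfg["top_k"])
    return kwargs

greedy = generation_kwargs("deterministic")
best = generation_kwargs("best_quality")
```

Usage would look like `model.generate(inputs, attention_mask=attention_mask, **generation_kwargs("fast"))`, with the EOS and pad token IDs passed as in the quick start.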

Training summary

  • Pretraining: 500k iterations × 49,152 tokens = ~24.6B tokens, ~21 days on a single 2080 Ti (FP16, 32-step gradient accumulation)
  • Pretrain corpus (17.4M docs after dedup): FineWeb-Edu (general web 65%, technical 20%, code 5%, ML-tagged 0.6%) + Cosmopedia STEM 4.5% + library docs (sklearn 40%, matplotlib 25%, numpy/pandas/scipy) 1.4% + ML supplemental (Kaggle notebooks, Starcoder Python ML, ML ArXiv abstracts, StackOverflow Python)
  • SFT: SmolTalk 460k + MMLU 1 epoch + GSM8K 4 epochs + 953 hand-crafted sklearn/matplotlib examples × 45 epochs (~6% of mixture). 8000 optimizer steps.
  • Optimizer: Muon (matrices) + AdamW (embeddings + scalars)
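
The pretraining figures above can be sanity-checked with a few lines of arithmetic. The per-micro-step breakdown is an inference from the stated numbers (49,152 tokens per iteration, 32 accumulation steps, 768-token sequences), not something the training logs are quoted as saying.

```python
iterations = 500_000
tokens_per_iteration = 49_152
total_tokens = iterations * tokens_per_iteration          # 24_576_000_000, i.e. ~24.6B

accumulation_steps = 32
tokens_per_micro_step = tokens_per_iteration // accumulation_steps   # 1536

pretrain_seq_len = 768
# Assumption: each micro-step holds full-length sequences, which gives 2 per micro-step.
sequences_per_micro_step = tokens_per_micro_step // pretrain_seq_len

days = 21
tokens_per_second = total_tokens / (days * 24 * 3600)     # roughly 13.5k tokens/s on average

print(f"total tokens: {total_tokens / 1e9:.1f}B")
print(f"micro-step: {sequences_per_micro_step} x {pretrain_seq_len} tokens")
print(f"average throughput: {tokens_per_second:,.0f} tokens/s")
```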

Architecture (why trust_remote_code=True)

This is a 20-layer GPT with several non-standard features. The HF wrapper code (modeling_nanochat.py) is shipped in this repo and loaded automatically via auto_map.

| Feature | Standard | This model |
|---|---|---|
| Activation | SiLU/SwiGLU | ReLU² (square of ReLU) |
| RMSNorm | Has learnable weight | No learned scale |
| Embedding ↔ LM head | Often tied | Untied |
| RoPE base | 10000 | 100000 |
| Q/K scaling | 1.0 | × 1.2 sharpening |
| Per-layer extras | None | Value embeddings (alternating layers, ResFormer-style) gated by ve_gate |
| Cross-layer | Standard residuals | x0_lambdas blends initial embedding into every layer |
| Mid-trunk | None | Backout: subtract layer L/2 residual at logit head |
| Token mixing | None | Smear: cheap bigram via gated previous-token embedding |
| Logits | Linear | Softcap: 15 * tanh(logits/15) |

Spec: n_layer=20, n_embd=1280, n_head=10, n_kv_head=1 (GQA), vocab=32768, seq_len=1024 (SFT-extended from 768 pretrain).
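
Three of the features listed above are simple enough to write out as scalar math. The sketch below is pure Python for illustration only and is not the code in modeling_nanochat.py; the function names and the eps value are assumptions.

```python
import math

def relu_squared(x):
    """ReLU² activation: zero for negative inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2

def rmsnorm_no_scale(v, eps=1e-6):
    """RMSNorm without a learned weight: divide by the root mean square only.

    eps is an assumed value for numerical stability.
    """
    mean_square = sum(x * x for x in v) / len(v)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [x * inv_rms for x in v]

def softcap(logit, cap=15.0):
    """Logit softcap: cap * tanh(logit / cap), bounded to [-cap, cap]."""
    return cap * math.tanh(logit / cap)

normed = rmsnorm_no_scale([3.0, 4.0])
mean_square_after = sum(x * x for x in normed) / len(normed)  # close to 1.0
small = softcap(0.1)      # nearly unchanged for small logits
large = softcap(1000.0)   # saturates at the cap
```

The softcap is close to the identity for small logits and flattens out near ±15, which limits how confident any single token's logit can become.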

Files in this repo

| File | Purpose |
|---|---|
| model.safetensors | Weights (fp16, ~920MB) |
| config.json | Model configuration |
| configuration_nanochat.py | NanochatConfig class |
| modeling_nanochat.py | NanochatForCausalLM class (the model) |
| tokenization_nanochat.py | NanochatTokenizer (rustbpe wrapper) |
| tokenizer_config.json | HF tokenizer config + Jinja chat template |
| tokenizer.pkl | Pickled tiktoken encoder (must ship, not just tokenizer.json) |
| generation_config.json | Default sampling parameters |

Hardware requirements

  • Inference (fp16, single sample): 1.0 GB VRAM (model) + ~250 MB (KV cache for 1024 tokens) + workspace = **1.5 GB**
  • Inference (fp16, num_return_sequences=5): 1.0 GB (model) + ~1.25 GB (KV cache × 5) + workspace = **2.5 GB**
  • Runs comfortably on: any GPU with ≥3 GB VRAM for self-consistency, ≥2 GB for single-sample
  • CPU inference: works (use torch_dtype=torch.float32), expect ~5-15 sec per response
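
The two GPU totals above follow the same pattern, so other values of num_return_sequences can be estimated the same way. This is a rough sketch using only the figures stated in this section. The 0.25 GB workspace term is inferred from the totals (1.5 GB and 2.5 GB) rather than measured, and estimate_vram_gb is a hypothetical helper.

```python
def estimate_vram_gb(num_return_sequences,
                     weights_gb=1.0,           # fp16 model weights, from this section
                     kv_cache_gb_per_seq=0.25, # ~250 MB per 1024-token sequence, from this section
                     workspace_gb=0.25):       # assumption: inferred from the stated totals
    """Rough fp16 inference VRAM estimate in GB."""
    return weights_gb + kv_cache_gb_per_seq * num_return_sequences + workspace_gb

single = estimate_vram_gb(1)   # matches the single-sample figure
five = estimate_vram_gb(5)     # matches the num_return_sequences=5 figure
three = estimate_vram_gb(3)    # an intermediate setting
```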

Acknowledgements

Built on karpathy/nanochat. Trained with substantial assistance from Claude Code.

License

MIT (model weights and wrapper code).
