nanochat-darija-73m-base

Base NanoChat causal language model for Moroccan Darija.

This repo is exported in Hugging Face Transformers format with custom model code. Load it with trust_remote_code=True.

Preview Checkpoint Notice

This is a pilot/test checkpoint, not the final full-data model. It was trained to validate the Darija data pipeline, tokenizer, NanoChat architecture export, and SFT workflow before a larger billion-plus-token training run.

The cleaned base corpus contains 5M Darija rows, which is approximately 2B tokens when encoded with the included tokenizer. That number describes the available cleaned corpus, not the data seen by this checkpoint, which was intentionally trained on a much shorter schedule.
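
As a rough sanity check on the corpus figures, the row and token counts imply an average row length and a number of full-context sequences. The sketch below uses only the approximate numbers quoted in this card (5M rows, about 2B tokens, and the 2048-token context length); the variable names are illustrative, and the results are order-of-magnitude estimates, not measured statistics.

```python
# Approximate figures quoted in this model card (not measured here).
corpus_rows = 5_000_000
corpus_tokens = 2_000_000_000
context_length = 2048

# Average number of tokens in one corpus row.
avg_tokens_per_row = corpus_tokens / corpus_rows

# How many non-overlapping 2048-token sequences the corpus could fill.
full_context_sequences = corpus_tokens // context_length

print(f"average tokens per row: {avg_tokens_per_row:.0f}")
print(f"full {context_length}-token sequences: {full_context_sequences:,}")
```

Under these rounded inputs a row averages about 400 tokens, so most rows are far shorter than the context window.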

Model Details

  • Parameters: 73.5M (73,531,538)
  • Context length: 2048
  • Vocab size: 32768
  • Layers: 6
  • Hidden size: 384
  • Attention heads: 3
  • Checkpoint tag: d6_target12
  • Checkpoint step: 1062
  • Export dtype: bfloat16
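
A few derived quantities follow directly from the numbers in this list. The sketch below is plain arithmetic on the listed values; it assumes the attention heads split the hidden size evenly (the usual multi-head convention, not confirmed against config.json here) and that every parameter is stored in bfloat16 at 2 bytes each. It does not attempt to reproduce the full parameter count, since the card does not spell out the complete architecture.

```python
# Values copied from the Model Details list.
n_params = 73_531_538
vocab_size = 32_768
hidden_size = 384
n_heads = 3
bytes_per_param = 2  # bfloat16 stores each value in 16 bits

# Per-head dimension, assuming heads split the hidden size evenly.
head_dim = hidden_size // n_heads

# Size of one vocab_size x hidden_size matrix (e.g. a token embedding table).
# How many such matrices the architecture uses is not stated in this card.
embedding_params = vocab_size * hidden_size

# Approximate size of the weights alone in bfloat16.
weight_bytes = n_params * bytes_per_param

print(f"head_dim: {head_dim}")
print(f"params in one vocab x hidden matrix: {embedding_params:,}")
print(f"bf16 weights: {weight_bytes / 1e6:.1f} MB")
```

The weight size is a lower bound on memory use at inference time; activations and the key/value cache add to it.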

Training

Pretrained on the cleaned, custom-made Moroccan Darija FineWeb-Edu translation corpus.

The instruction-tuned variant is small and experimental. It is useful for lightweight Darija chat tests, but it is not reliable for math, factuality, code debugging, translation fidelity, or safety-critical decisions.

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lyte/nanochat-darija-73m-base"

# trust_remote_code=True is required because the repo ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # matches the export dtype
    device_map="auto",
)

# This is a base model, so prompt it with text to continue, not an instruction.
inputs = tokenizer("العصور الوسطى هي", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.3,
    top_k=100,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True,
)

# The decoded string contains the prompt followed by the sampled continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
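
The usage snippet prints the prompt together with the continuation. To show only the newly generated text, the prompt tokens can be sliced off before decoding. The sketch below is a minimal illustration: `strip_prompt` is a hypothetical helper, not part of this repo or of Transformers, and it assumes `generate` returns the prompt tokens followed by the new tokens, as is usual for decoder-only models in Transformers.

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the token ids that come after the prompt."""
    return output_ids[prompt_length:]


# Toy illustration with plain lists standing in for token ids.
prompt_ids = [101, 7, 42]
generated = prompt_ids + [9, 15, 3]
continuation = strip_prompt(generated, len(prompt_ids))
print(continuation)

# With the objects from the usage snippet this would look like (not run here):
# new_ids = strip_prompt(outputs[0], inputs["input_ids"].shape[1])
# print(tokenizer.decode(new_ids, skip_special_tokens=True))
```

Slicing by prompt length is more robust than removing the prompt string from the decoded text, because decoding does not always reproduce the prompt character for character.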

Files

  • model.safetensors: model weights
  • config.json: NanoChat architecture config
  • generation_config.json: default sampling config
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json: tokenizer files
  • configuration_nanochat.py, modeling_nanochat.py: custom Transformers code
  • nanochat_export.json: source checkpoint metadata
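
The JSON files in this list can be inspected without loading the model. The sketch below lists the top-level keys of one file; the helper names are hypothetical, the field names inside each file are not documented in this card (so none are assumed), and the download step relies on `hf_hub_download` from the `huggingface_hub` package and needs network access.

```python
import json
from pathlib import Path


def top_level_keys(json_path):
    """Return the sorted top-level keys of a JSON object stored on disk."""
    with Path(json_path).open(encoding="utf-8") as f:
        data = json.load(f)
    return sorted(data)


def fetch_and_list(repo_id, filename):
    """Download one file from the Hub and list its top-level keys."""
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return top_level_keys(local_path)


# Example (requires network access and the huggingface_hub package):
# print(fetch_and_list("Lyte/nanochat-darija-73m-base", "nanochat_export.json"))
```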

Limitations

This is a tiny model. Expect fluent-looking but wrong answers, repetition on some prompts, and brittle instruction following. Use it as a research artifact or local baseline, not as a production assistant.
