Hybrid Tiny LM POC on WikiText-2

Proof-of-concept causal language model combining multiple token-mixing approaches:

  • local/full causal self-attention for exact token lookup
  • gated causal depthwise convolution for local pattern mixing
  • input-gated diagonal recurrent mixer as a tiny SSM/RWKV-like compressed memory
  • hybrid blocks that run attention + conv + recurrence in parallel and fuse them
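
Below is an illustrative PyTorch sketch of one way such a hybrid block can be wired. Module names, the gating scheme, and the fusion layer are hypothetical, not taken from hybrid_lm_poc.py:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCausalConv(nn.Module):
    """Depthwise causal conv with a sigmoid input gate (local pattern mixing)."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (B, T, D)
        h = x.transpose(1, 2)                         # (B, D, T)
        h = F.pad(h, (self.kernel_size - 1, 0))       # left-pad so the conv is causal
        h = self.conv(h).transpose(1, 2)              # back to (B, T, D)
        return h * torch.sigmoid(self.gate(x))

class DiagonalRecurrence(nn.Module):
    """Input-gated diagonal recurrence: a per-channel EMA-style compressed memory."""
    def __init__(self, d_model):
        super().__init__()
        self.decay = nn.Parameter(torch.rand(d_model))  # per-channel decay, squashed to (0, 1)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):                             # x: (B, T, D)
        a = torch.sigmoid(self.decay)
        state = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):                    # sequential scan over time
            state = a * state + (1 - a) * x[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1)
        return h * torch.sigmoid(self.gate(x))

class HybridBlock(nn.Module):
    """Runs causal attention, gated conv, and recurrence in parallel, then fuses them."""
    def __init__(self, d_model, n_head=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.conv = GatedCausalConv(d_model)
        self.rec = DiagonalRecurrence(d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, x):                             # x: (B, T, D)
        h = self.norm(x)
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        a, _ = self.attn(h, h, h, attn_mask=mask)     # causal self-attention
        c = self.conv(h)                              # local conv mixing
        r = self.rec(h)                               # compressed recurrent memory
        return x + self.fuse(torch.cat([a, c, r], dim=-1))  # residual fusion

# smoke test with the POC dimensions (d_model=96, block_size=64)
y = HybridBlock(96)(torch.randn(2, 64, 96))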

Dataset: Salesforce/wikitext, config wikitext-2-raw-v1. Tokenizer: openai-community/gpt2.
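
Loading the same dataset and tokenizer uses standard datasets/transformers calls (this snippet is not taken from the training script):

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default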

POC metrics

{
  "eval_loss": 7.388123512268066,
  "eval_runtime": 9.5676,
  "eval_samples_per_second": 26.757,
  "eval_steps_per_second": 3.345,
  "epoch": 0.78125,
  "perplexity": 1616.6696041601817,
  "train_loss": 7.8487934923172,
  "params": 10403040
}
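
The reported perplexity is simply the exponential of the eval loss:

import math

eval_loss = 7.388123512268066
perplexity = math.exp(eval_loss)  # ~= 1616.67, matching the value above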

This is intentionally tiny and trained briefly as an architecture POC, not a competitive LM.

Reproduce

# install dependencies
pip install "transformers>=4.54.0" datasets torch accelerate trackio

# short training run used for the POC metrics above
python hybrid_lm_poc.py \
  --max_steps 200 \
  --max_train_samples 2048 \
  --max_eval_samples 256 \
  --batch_size 8 \
  --d_model 96 \
  --n_layer 6 \
  --block_size 64 \
  --learning_rate 8e-4

Sample generation after the short run is in sample_generation.txt.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# load the tokenizer and model weights from the Hub
model_id = 'rahulshetty/hybrid-tiny-wikitext-poc'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.
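
A minimal generation example with the loaded model and tokenizer follows; the prompt and sampling settings are arbitrary, and if the checkpoint relies on custom modeling code, trust_remote_code=True may be needed in from_pretrained:

import torch

prompt = "The history of the city"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))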
