---
license: apache-2.0
datasets:
- shuyuej/English-Pretraining-Dataset
- HuggingFaceFW/fineweb-edu
- mattwesney/General_Inquiry_Thinking-Chain-Of-Thought
- tatsu-lab/alpaca
- databricks/databricks-dolly-15k
- TeichAI/Step-3.5-Flash-2600x
- TeichAI/convo-v1
language:
- en
tags:
- small
- haiku
---
# TinyMemoryLM

> **⚠️ IMPORTANT NOTICE**
>
> 1. **The inference script is not publicly available yet (soon!)** This release contains only the model weights and tokenizer.
> 2. **The model is really dumb.** This is a ~1M parameter research model designed for experimentation, not production use.
> 3. **Do not expect it to answer any questions.** It is prone to repetition, hallucination, and format collapse.
## Overview

TinyMemoryLM is an ultra-lightweight language model built for edge deployment and architectural experimentation. Despite its small footprint, it incorporates several training innovations aimed at stabilizing tiny-model convergence, including hybrid tokenization, loss-boosting strategies, and context-aware relevance modeling.

This release includes both **Pretrained Weights** (base language modeling) and **Instruction Weights** (fine-tuned for chat/completion).
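The hybrid tokenization idea can be sketched as a greedy longest-match over a small word vocabulary with character-level fallback. This is a toy illustration only — TinyMemoryLM's actual algorithm and its ~2,111-token vocabulary (shipped in `tokenizer.json`) are not published, and the three-word vocabulary below is invented:

```python
# Toy hybrid word/character tokenizer: greedy longest-match against a word
# vocabulary, falling back to single characters for anything unmatched.
WORD_VOCAB = {"the", "model", "token"}  # illustrative stand-in vocabulary

def hybrid_encode(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest word-vocab match starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in WORD_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # character fallback
            i += 1
    return tokens

hybrid_encode("the model!")  # → ["the", " ", "model", "!"]
```

The appeal for tiny models is that frequent words compress to one token while rare strings still encode losslessly, keeping the vocabulary (and thus the embedding table) small.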
## Files Provided

| File | Description |
| :--- | :--- |
| `tokenizer.json` | Hybrid word/character tokenizer vocabulary. |
| `pretrain.pt` | Base pretrained checkpoint (language modeling). |
| `model.pt` | Instruction-tuned checkpoint (SFT/Chat). |
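Since the inference script is not yet released, the exact checkpoint layout is unconfirmed. Assuming the `.pt` files are plain PyTorch state dicts (possibly wrapped under a `"model"` key — both are assumptions), a quick way to inspect them is:

```python
import os
import torch

def checkpoint_shapes(path):
    """Return {tensor_name: shape} from a .pt checkpoint, or None if missing.

    Assumes a plain state dict, or a dict wrapping one under a "model" key --
    an unverified assumption until the official inference script ships.
    """
    if not os.path.exists(path):
        return None
    state = torch.load(path, map_location="cpu")
    if isinstance(state, dict) and "model" in state:
        state = state["model"]
    return {k: tuple(v.shape) for k, v in state.items() if hasattr(v, "shape")}

shapes = checkpoint_shapes("model.pt")  # None until the weights are downloaded
```

Listing tensor shapes this way is a cheap sanity check against the dimensions in the table below before any inference code exists.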
## Model Specifications

| Parameter | Value |
| :--- | :--- |
| **Architecture** | Transformer Decoder |
| **Parameters** | ~1 Million |
| **Context Length** | 2,048 tokens |
| **Dimensions** | `d_model=160`, `layers=6`, `heads=4`, `ffn=256` |
| **Vocabulary** | ~2,111 tokens (Hybrid Char + Word) |
| **Normalization** | RMSNorm + QK-Norm |
| **Embeddings** | Rotary Embeddings (RoPE) |
| **Activation** | SwiGLU |
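As a sanity check, the dimensions above can be turned into a back-of-the-envelope parameter estimate. The breakdown below assumes tied input/output embeddings, a three-matrix SwiGLU FFN, and no biases — none of which is confirmed by this card, so treat the result as an order-of-magnitude figure:

```python
def estimate_params(vocab=2111, d_model=160, n_layers=6, d_ffn=256):
    emb = vocab * d_model          # token embeddings (assumed tied with output head)
    attn = 4 * d_model * d_model   # Q, K, V, O projections per layer
    ffn = 3 * d_model * d_ffn      # SwiGLU uses three weight matrices
    norms = 2 * d_model            # two RMSNorms per layer
    return emb + n_layers * (attn + ffn + norms) + d_model  # + final norm

n = estimate_params()
# Roughly 1.7M under these assumptions; an untied output head would add
# another vocab * d_model (~0.34M).
```

This lands in the same low-millions ballpark as the card's "~1 Million" figure; the exact count depends on details (embedding tying, QK-Norm parameters, biases) not listed here.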
## Architecture Highlights

TinyMemoryLM implements several research-focused modifications to standard transformer architectures:

* **Hybrid Tokenizer:** Combines character-level fallback with frequent word tokens to balance compression and vocabulary size.
* **QK-Norm:** Applies RMSNorm to Query and Key projections for improved stability in low-precision training.
* **Word Token Loss Boosting:** Upweights loss signals for multi-character tokens to prevent the model from ignoring them in favor of character-level spelling.
* **Response-Start Weighting:** Prioritizes the first tokens of assistant responses to improve prompt conditioning.
* **Pretrain Replay:** Mixes pretraining data during instruction tuning to prevent catastrophic forgetting of language fluency.
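Of these, word-token loss boosting is the easiest to illustrate: compute per-token cross-entropy, then upweight positions whose target id falls in the word-token range. The cutoff id and boost factor below are invented for illustration — TinyMemoryLM's actual values are not published:

```python
import torch
import torch.nn.functional as F

CHAR_ID_CUTOFF = 256  # assumption: ids below this are single-character tokens
WORD_BOOST = 2.0      # assumption: upweight factor for word tokens

def boosted_lm_loss(logits, targets):
    """Weighted cross-entropy: word-token targets count WORD_BOOST times."""
    # logits: (batch*seq, vocab), targets: (batch*seq,)
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(targets >= CHAR_ID_CUTOFF,
                          torch.tensor(WORD_BOOST), torch.tensor(1.0))
    return (per_token * weights).sum() / weights.sum()
```

Normalizing by the weight sum (rather than the token count) keeps the loss scale comparable to plain cross-entropy regardless of the word/character mix in a batch.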
## Training Loss Curve

Below is the training loss progression during the instruction tuning phase. Note the stability measures taken to prevent collapse in such a small parameter regime.

![Training Loss Curve]({loss})
## Limitations & Expectations

Please manage your expectations when using TinyMemoryLM:

* **Reasoning:** While trained with Chain-of-Thought markers (`<|begin_of_thought|>`, `<|begin_of_solution|>`), the model often memorizes the format scaffolding without genuine reasoning capability.
* **Repetition:** Tiny models are prone to collapsing into repetitive token loops.
* **Knowledge:** The model has limited world knowledge due to parameter constraints.
* **Usage:** This model is intended for **research, educational purposes, and architectural benchmarking**. It is not suitable for assistant tasks or reliable information retrieval.
---

*Generated for research purposes. Use responsibly.*