---
language:
- id
- en
tags:
- base-model
- pre-trained
- indonesian
- english
- tiny
- efficient
- moe
- foundation-model
license: mit
datasets: []
metrics:
- loss
pipeline_tag: text-generation
---

# TinyV4: 11M Bilingual Base Model

**TinyV4** is a compact **11 million parameter** bilingual (Indonesian & English) base model. Think of it as a solid foundation: pre-trained, ready to be fine-tuned for your specific downstream task.

At just **58 MB**, it's small enough to run anywhere. Smart enough to be worth your time.

## What is this?

Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.

TinyV4 is different. **11M parameters** with a Mixture-of-Experts architecture, pre-trained on bilingual data so it already understands both Indonesian and English. You bring the task, it brings the foundation.

## Why use TinyV4 as your base?

| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Fits anywhere: mobile, edge, browser |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | Permissive terms, commercial use included |

## Architecture

| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |

Built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention**, techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.

## What can you fine-tune it for?

TinyV4 is a blank canvas.
Some ideas:

- **Translation** (ID ↔ EN): it already has bilingual foundations
- **Text classification**: sentiment, topic, intent
- **Story generation**: fine-tune on your own narrative dataset
- **Chat / instruction following**: add conversation data
- **Code generation**: yes, even at 11M, it can learn patterns
- **Domain-specific tasks**: medical, legal, technical. Your data, your model

The point is: **you control the final model**. TinyV4 just gives you a running start.

## Quick Start

```bash
pip install transformers safetensors torch
```

### Load the base model

```python
from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")
```

### Generate text (zero-shot)

```python
import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's maximum sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        # Logits for the final position, scaled by temperature
        logits = logits[:, -1, :] / temperature
        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        # Stop at end-of-sequence (assumes a batch of one prompt)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```

### Fine-tune for your task

```python
from torch.optim import AdamW
from safetensors.torch import save_model

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task.
# `your_dataloader` and `compute_your_loss` are placeholders you supply.
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model.
# save_model handles the tied embed/head weights, which save_file
# rejects because the two tensors share memory.
save_model(model, "my-finetuned-model.safetensors")
```

## Comparison: Sub-100M Base Models

Let's be honest: most base models under 100M parameters are either:

- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)

TinyV4 is different. At **11M parameters**, it delivers:

- **Real bilingual understanding**: not just token overlap
- **MoE efficiency**: 4 experts, 2 active, more capacity per parameter
- **Proven adaptability**: fine-tunes well across diverse tasks
- **Zero-shot generation**: coherent output without any task-specific training

We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.

## Pre-training Details

| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |

## Limitations

Be realistic about what 11M parameters can do:

- **Zero-shot output** will be basic: this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data: it won't magically know medical terms or legal jargon
- **Reasoning** is limited: complex logical chains need more parameters

Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.

## License

MIT: use it, modify it, ship it. Keep the license and copyright notice with any copies, as MIT requires; citing the model is appreciated but optional.

## Citation

```bibtex
@misc{tinyv4-11m,
  title = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
  year = {2025},
  url = {https://huggingface.co/ukung/tinyv4}
}
```
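
## Appendix: a sketch of `compute_your_loss`

The fine-tuning loop above calls a placeholder, `compute_your_loss`, without defining it. For plain language-model fine-tuning, one plausible choice is the standard next-token cross-entropy, optionally adding the router's balance loss that the model returns as its third output. The sketch below assumes that `batch` is a `(batch, seq_len)` tensor of token ids and that `logits` has shape `(batch, seq_len, vocab_size)`, which matches how `generate` indexes the output. The function name `next_token_loss`, the `bal_weight` value of 0.01, and the choice to ignore `mtp_logits` are illustrative assumptions, not something the model card specifies.

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits, input_ids, bal_loss=None, bal_weight=0.01, ignore_index=-100):
    """Causal LM loss: the logits at position t are scored against token t+1.

    logits:    (batch, seq_len, vocab_size) float tensor
    input_ids: (batch, seq_len) long tensor of token ids
    bal_loss:  optional scalar load-balancing loss from the MoE router
    bal_weight is a hypothetical default; tune it for your task.
    """
    # Drop the last position's logits and the first token's label so they align.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,  # lets you mask padding by setting labels to -100
    )
    if bal_loss is not None:
        loss = loss + bal_weight * bal_loss
    return loss


# Shape check with random tensors standing in for real model output.
batch_size, seq_len, vocab_size = 2, 8, 32000
dummy_logits = torch.randn(batch_size, seq_len, vocab_size)
dummy_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
loss = next_token_loss(dummy_logits, dummy_ids)
print(loss.item())
```

Inside the loop this would be called as `loss = next_token_loss(logits, batch, bal_loss)`. Classification or translation fine-tuning would need a different target layout, so treat this as a starting point only.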