---
language:
- id
- en
tags:
- base-model
- pre-trained
- indonesian
- english
- tiny
- efficient
- moe
- foundation-model
license: mit
datasets: []
metrics:
- loss
pipeline_tag: text-generation
---

# TinyV4: 11M Bilingual Base Model

**TinyV4** is a compact **11 million parameter** bilingual (Indonesian and English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.

At just **58 MB**, it's small enough to run almost anywhere. Smart enough to be worth your time.

## What is this?

Most widely used base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.

TinyV4 is different: **11M parameters** with a Mixture-of-Experts architecture, pre-trained on bilingual data so it already handles both Indonesian and English. You bring the task, it brings the foundation.

## Why use TinyV4 as your base?

| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Small enough for mobile, edge, and browser deployment |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | Permissive: use, modify, and ship it, just keep the license notice |

## Architecture

| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |

Built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention**: techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
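
To make the routing figures concrete (4 routed experts, 1 shared, 2 active per token), here is a minimal PyTorch sketch of a top-2 MoE feed-forward layer. It is not TinyV4's actual implementation: the class name `TinyMoE`, the helper `_mlp`, the expert hidden width of 256, and the softmax-then-renormalize gating are all assumptions made for illustration, and the Sinkhorn-Knopp balancing step is left out.

```python
import torch
import torch.nn as nn


def _mlp(dim, hidden):
    """One expert: a small two-layer feed-forward block (illustrative shape)."""
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class TinyMoE(nn.Module):
    """Illustrative top-k MoE layer: routed experts plus one always-on shared expert."""

    def __init__(self, dim=128, n_routed=4, top_k=2, hidden=256):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_routed, bias=False)
        self.experts = nn.ModuleList([_mlp(dim, hidden) for _ in range(n_routed)])
        self.shared = _mlp(dim, hidden)

    def forward(self, x):
        # x: (batch, seq, dim); flatten so each row is one token
        tokens = x.reshape(-1, x.size(-1))
        probs = torch.softmax(self.router(tokens), dim=-1)
        weights, chosen = probs.topk(self.top_k, dim=-1)        # both (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts

        out = self.shared(tokens)  # the shared expert processes every token
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # no token was routed to this expert
            contribution = weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
            out = out.index_add(0, token_idx, contribution)
        return out.reshape(x.shape), chosen.reshape(*x.shape[:-1], self.top_k)
```

Calling `TinyMoE()(torch.randn(2, 5, 128))` returns an output of the same shape as the input plus the two expert indices picked for each token. Only the chosen experts run for a given token, which is where the "more capacity per parameter" argument comes from.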

## What can you fine-tune it for?

TinyV4 is a blank canvas. Some ideas:

- **Translation** (ID ↔ EN): it already has bilingual foundations
- **Text classification**: sentiment, topic, intent
- **Story generation**: fine-tune on your own narrative dataset
- **Chat / instruction following**: add conversation data
- **Code generation**: yes, even at 11M, it can learn patterns
- **Domain-specific tasks**: medical, legal, technical. Your data, your model

The point is: **you control the final model**. TinyV4 just gives you a running start.

## Quick Start

```bash
pip install transformers safetensors torch
```

### Load the base model

```python
from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

# Count parameters by iterating over the model's tensors
print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")
```
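
As a rough cross-check on the printed count, the figures in the Architecture table are enough to size the token embedding table and the raw weight footprint by hand. The sketch below assumes 4 bytes per parameter (fp32), which the card does not state, and the helper names are illustrative only.

```python
# Figures taken from the Architecture table.
TOTAL_PARAMS = 11_034_955
VOCAB_SIZE = 32_000
DIM = 128

BYTES_PER_PARAM = 4  # assumption: weights are stored as fp32


def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of weights in a vocab_size x dim token embedding table."""
    return vocab_size * dim


def size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Raw weight size in megabytes (10**6 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e6


embed = embedding_params(VOCAB_SIZE, DIM)
print(f"embedding table: {embed:,} params ({embed / TOTAL_PARAMS:.1%} of total)")
print(f"fp32 weights:    {size_mb(TOTAL_PARAMS):.1f} MB")
```

The embedding table alone accounts for over a third of the listed parameter count, so whether the output head shares those weights makes a visible difference at this scale. The card does not break down what else contributes to the 58 MB file.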

### Generate text (zero-shot)

```python
import torch

# Uses `model` and `tokenizer` from the loading step above
@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's max sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float("-inf")
        probs = torch.softmax(logits, dim=-1)

        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```

### Fine-tune for your task

```python
from safetensors.torch import save_model
from torch.optim import AdamW

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# `your_dataloader` and `compute_your_loss` are placeholders for your own
# data pipeline and task loss; each batch is assumed to be a tensor of token ids.
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model. save_model handles the tied head/embedding
# weights, which save_file(model.state_dict(), ...) rejects as shared tensors.
save_model(model, "my-finetuned-model.safetensors")
```
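
`compute_your_loss` is left to you. For plain causal language modeling, one common choice is next-token cross-entropy. The sketch below assumes `logits` has shape `(batch, seq, vocab)` and that the batch is a tensor of token ids, and it treats `bal_loss` as a scalar auxiliary term scaled by a hypothetical `bal_weight`. None of this is confirmed by the model's own code, so adjust it to what the forward pass actually returns.

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits, input_ids, bal_loss=None, bal_weight=0.01, ignore_index=-100):
    """Cross-entropy between the logits at position t and the token at position t + 1."""
    # Drop the last position's logits (nothing left to predict) and the
    # first token (it is never a target).
    shifted_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = input_ids[:, 1:].reshape(-1)
    loss = F.cross_entropy(shifted_logits, targets, ignore_index=ignore_index)
    if bal_loss is not None:
        # Assumed: bal_loss is a scalar MoE load-balancing penalty.
        loss = loss + bal_weight * bal_loss
    return loss
```

Inside the training loop this would stand in for the placeholder, for example `loss = next_token_loss(logits, batch, bal_loss)`. Padding positions can be excluded by setting their target ids to `ignore_index`.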

## Comparison: Sub-100M Base Models

Let's be honest: most base models under 100M parameters tend to be:

- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)

TinyV4 is different. At **11M parameters**, it aims to deliver:

- **Real bilingual understanding**: not just token overlap
- **MoE efficiency**: 4 experts, 2 active, more capacity per parameter
- **Adaptability**: built to be fine-tuned across diverse tasks
- **Zero-shot generation**: short, coherent output without any task-specific training (see Limitations)

We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.

## Pre-training Details

| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
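
The card names the schedule but not its hyperparameters. As an illustration of what cosine decay with warmup looks like over the 5,000 steps listed above, here is a small sketch. The peak rate, warmup length, and floor (`peak_lr`, `warmup_steps`, `min_lr`) are made-up values, not the ones used to train TinyV4.

```python
import math


def lr_at(step, total_steps=5_000, warmup_steps=200, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end of training
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the rate at a few points along the run
for step in (0, 199, 200, 2_600, 5_000):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs linearly during warmup, peaks when warmup ends, and then follows half a cosine wave down to the floor by the final step.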

## Limitations

Be realistic about what 11M parameters can do:

- **Zero-shot output** will be basic: this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data: it won't magically know medical terms or legal jargon
- **Reasoning** is limited: complex logical chains need more parameters

Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.

## License

MIT: use it, modify it, ship it. Keep the copyright and license notice with any copies; beyond that, attribution is appreciated but not required.

## Citation

```bibtex
@misc{tinyv4-11m,
  title = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
  year = {2025},
  url = {https://huggingface.co/ukung/tinyv4}
}
```