---
language:
- id
- en
tags:
- base-model
- pre-trained
- indonesian
- english
- tiny
- efficient
- moe
- foundation-model
license: mit
datasets: []
metrics:
- loss
pipeline_tag: text-generation
---
# TinyV4: 11M Bilingual Base Model
**TinyV4** is a compact **11 million parameter** bilingual (Indonesian & English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.
At just **58 MB**, it's small enough to run anywhere. Smart enough to be worth your time.
## What is this?
Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.
TinyV4 is different. It packs **11M parameters** into a Mixture-of-Experts architecture, pre-trained on bilingual data so it already understands both Indonesian and English. You bring the task, it brings the foundation.
## Why use TinyV4 as your base?
| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Fits anywhere: mobile, edge, browser |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | Permissive: commercial use and modification allowed |
## Architecture
| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |
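
As a quick sanity check on the table, the arithmetic below uses only the figures listed above. It assumes a single token-embedding matrix of shape vocab × dimension and 4-byte (fp32) weights; neither the storage dtype nor the per-layer breakdown is stated in this card, so treat the output as a rough estimate.

```python
# Figures taken from the architecture table above.
TOTAL_PARAMS = 11_034_955
VOCAB_SIZE = 32_000
DIM = 128

# Assumption: one embedding matrix of shape (vocab, dim).
embed_params = VOCAB_SIZE * DIM
embed_share = embed_params / TOTAL_PARAMS

# Assumption: weights stored as fp32 (4 bytes each).
fp32_bytes = TOTAL_PARAMS * 4

print(f"Embedding parameters: {embed_params:,}")
print(f"Share of total:       {embed_share:.1%}")
print(f"fp32 weight size:     {fp32_bytes / 1e6:.1f} MB")
```

Under these assumptions the embedding table alone is a bit over a third of the model, and the fp32 weights come to roughly 44 MB, below the listed 58 MB file. One possible reason, not confirmed by this card, is that the checkpoint stores tensors beyond the counted parameters, such as a separate copy of the output head.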

Built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention**: techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
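
To make the "4 routed + 1 shared, 2 active per token" row concrete, here is a minimal, dependency-free sketch of top-2 routing with an always-on shared expert. TinyV4's real router lives in the repository's remote code and is not shown in this card, so the gating scheme (softmax over the selected logits), the function names, and the toy scalar experts are all assumptions for illustration. Sinkhorn-Knopp balancing is not modelled here.

```python
import math

NUM_ROUTED = 4   # routed experts (from the table)
TOP_K = 2        # active routed experts per token (from the table)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(x, router_logits, routed_experts, shared_expert):
    """Combine the top-k routed experts with an always-on shared expert."""
    # Indices of the k largest router logits (ties keep the lower index first).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    # Assumption: gate weights are renormalised over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    out = shared_expert(x)
    for gate, i in zip(gates, top):
        out += gate * routed_experts[i](x)
    return out, top, gates


def make_expert(scale):
    """Hypothetical toy expert: multiplies its input by a constant."""
    return lambda x: scale * x


routed = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
shared = make_expert(0.5)

y, chosen, gates = route_token(1.0, [0.1, 2.0, -1.0, 2.0], routed, shared)
print(chosen, gates, y)  # experts 1 and 3 are selected with equal weight
```

Only two of the four routed experts run for any given token, which is where the "capacity without the bloat" claim comes from: the parameter count covers all experts, but per-token compute covers only the active ones plus the shared expert.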
## What can you fine-tune it for?
TinyV4 is a blank canvas. Some ideas:
- **Translation** (ID ↔ EN): it already has bilingual foundations
- **Text classification**: sentiment, topic, intent
- **Story generation**: fine-tune on your own narrative dataset
- **Chat / instruction following**: add conversation data
- **Code generation**: yes, even at 11M, it can learn patterns
- **Domain-specific tasks**: medical, legal, technical. Your data, your model
The point is: **you control the final model**. TinyV4 just gives you a running start.
## Quick Start
```bash
pip install transformers safetensors torch
```
### Load the base model
```python
from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

num_params = sum(p.numel() for p in model.parameters())
print(f"Loaded: {num_params:,} params")
```
### Generate text (zero-shot)
```python
import torch

# Assumes `model` and `tokenizer` are loaded as in the previous step
@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's max sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        # Take the last position's logits and apply the temperature
        logits = logits[:, -1, :] / temperature
        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```
### Fine-tune for your task
```python
from torch.optim import AdamW
from safetensors.torch import save_model

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task
# (`your_dataloader` and `compute_your_loss` are placeholders you supply)
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model
# (save_model handles the tied head/embedding weights; save_file raises an
# error on tensors that share memory)
save_model(model, "my-finetuned-model.safetensors")
```
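The loop above leaves `compute_your_loss` undefined. For plain causal language-model fine-tuning, one reasonable version is shifted next-token cross-entropy, sketched below. It assumes `batch` is a `LongTensor` of token ids with shape `(batch, seq_len)` and that `logits` has shape `(batch, seq_len, vocab)`; the optional `bal_loss` term and its `bal_weight=0.01` coefficient are hypothetical choices, not values taken from TinyV4's training code.

```python
import torch
import torch.nn.functional as F


def compute_your_loss(logits, input_ids, bal_loss=None, bal_weight=0.01):
    """Next-token cross-entropy, optionally plus a weighted balance loss."""
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    # Assumption: bal_loss is a scalar auxiliary loss returned by the model.
    if bal_loss is not None:
        loss = loss + bal_weight * bal_loss
    return loss


# Smoke test with random tensors standing in for real model output.
torch.manual_seed(0)
batch = torch.randint(0, 32_000, (2, 16))   # (batch, seq_len) token ids
logits = torch.randn(2, 16, 32_000)         # (batch, seq_len, vocab)
loss = compute_your_loss(logits, batch, bal_loss=torch.tensor(0.5))
print(loss.item())
```

If you keep the balance term, pass it from the loop as `compute_your_loss(logits, batch, bal_loss)`. Padding tokens, if your data has them, would also need masking (for example via the `ignore_index` argument of `F.cross_entropy`).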
## Comparison: Sub-100M Base Models
Let's be honest: most base models under 100M parameters are either:
- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)
TinyV4 is different. At **11M parameters**, it delivers:
- **Real bilingual understanding**: not just token overlap
- **MoE efficiency**: 4 experts, 2 active, more capacity per parameter
- **Proven adaptability**: fine-tunes well across diverse tasks
- **Zero-shot generation**: coherent output without any task-specific training
We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.
## Pre-training Details
| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
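
For readers who want to reproduce a similar schedule, here is a small sketch of cosine decay with linear warmup. Only the 5,000-step total comes from the table; the warmup length, peak learning rate, and minimum learning rate below are hypothetical placeholders, since this card does not state them.

```python
import math

TOTAL_STEPS = 5_000    # from the table above
WARMUP_STEPS = 200     # hypothetical: not stated in this card
PEAK_LR = 3e-4         # hypothetical: not stated for pre-training
MIN_LR = 0.0           # hypothetical floor


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for step in (0, 199, 200, 2_600, 5_000):
    print(step, f"{lr_at(step):.2e}")
```

The learning rate ramps up over the warmup steps, reaches its peak, and then follows half a cosine wave down to the floor by the final step.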
## Limitations
Be realistic about what 11M parameters can do:
- **Zero-shot output** will be basic: this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data: it won't magically know medical terms or legal jargon
- **Reasoning** is limited: complex logical chains need more parameters
Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.
## License
MIT: use it, modify it, ship it. Just keep the copyright and license notice with any copies you distribute.
## Citation
```bibtex
@misc{tinyv4-11m,
title = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
year = {2025},
url = {https://huggingface.co/ukung/tinyv4}
}
```