Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,195 @@
---
language:
- id
- en
tags:
- base-model
- pre-trained
- indonesian
- english
- tiny
- efficient
- moe
- foundation-model
license: mit
datasets: []
metrics:
- loss
pipeline_tag: text-generation
---

# TinyV4: 11M Bilingual Base Model

**TinyV4** is a compact **11 million parameter** bilingual (Indonesian & English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.

At just **58 MB**, it's small enough to run anywhere, and smart enough to be worth your time.

## What is this?

Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You usually need a GPU. Want to iterate fast? Good luck.

TinyV4 is different: **11M parameters** with a Mixture-of-Experts architecture, pre-trained on bilingual data so it already understands both Indonesian and English. You bring the task, it brings the foundation.

## Why use TinyV4 as your base?

| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Fits anywhere: mobile, edge, browser |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | Permissive terms for use, modification, and redistribution |

## Architecture

| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |

Built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention**, techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.

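As a rough sanity check on the architecture table, the arithmetic below works out how much of the parameter budget a 32,000 x 128 embedding matrix would take, and what the listed parameter count would weigh in float32. It assumes a single (tied) embedding matrix and 4 bytes per weight. Neither the storage dtype nor the per-layer breakdown is stated here, so treat this as a sketch, not an account of the actual checkpoint.

```python
# Figures taken from the Architecture table.
TOTAL_PARAMS = 11_034_955
VOCAB_SIZE = 32_000
DIM = 128

# Assumption: weights stored as float32 (4 bytes each); the README does not say.
BYTES_PER_PARAM = 4

# Assumption: one vocab x dim embedding matrix shared with the output head.
embedding_params = VOCAB_SIZE * DIM
embedding_share = embedding_params / TOTAL_PARAMS
fp32_megabytes = TOTAL_PARAMS * BYTES_PER_PARAM / 1e6

print(f"embedding params: {embedding_params:,}")    # 4,096,000
print(f"share of total:   {embedding_share:.1%}")   # 37.1%
print(f"float32 size:     {fp32_megabytes:.1f} MB") # 44.1 MB
```

Under these assumptions, more than a third of the parameters sit in the vocabulary embedding, which is typical for very small models with a large vocabulary. The float32 estimate comes in under the listed 58 MB; the source does not say what accounts for the difference.
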
## What can you fine-tune it for?

TinyV4 is a blank canvas. Some ideas:

- **Translation** (ID ↔ EN): it already has bilingual foundations
- **Text classification**: sentiment, topic, intent
- **Story generation**: fine-tune on your own narrative dataset
- **Chat / instruction following**: add conversation data
- **Code generation**: yes, even at 11M, it can learn patterns
- **Domain-specific tasks**: medical, legal, technical. Your data, your model

The point is: **you control the final model**. TinyV4 just gives you a running start.

## Quick Start

```bash
pip install transformers safetensors torch
```

### Load the base model

```python
from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

# Count parameters; tied (shared) tensors are only counted once
print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")
```

### Generate text (zero-shot)

```python
import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's max sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask every logit below the k-th largest
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)

        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

        # Stop early at end-of-sequence
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```

### Fine-tune for your task

```python
import torch.nn.functional as F
from torch.optim import AdamW
from safetensors.torch import save_model

def compute_your_loss(logits, batch):
    # Placeholder objective (an assumption, not the author's recipe):
    # next-token cross-entropy, predicting token t+1 from position t.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch[:, 1:].reshape(-1),
    )

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task (each batch is assumed to be a tensor of token IDs)
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model (save_model copes with the tied head/embed weights,
# which plain save_file rejects because the tensors share memory)
save_model(model, "my-finetuned-model.safetensors")
```

## Comparison: Sub-100M Base Models

Let's be honest: most base models under 100M parameters fall into one of three camps:

- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)

TinyV4 is different. At **11M parameters**, it delivers:

- **Real bilingual understanding**: not just token overlap
- **MoE efficiency**: 4 routed experts, 2 active per token, more capacity per parameter
- **Proven adaptability**: fine-tunes well across diverse tasks
- **Zero-shot generation**: coherent output without any task-specific training

We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.

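To make the "4 routed experts, 2 active per token" idea concrete, here is a minimal pure-Python sketch of top-2 routing with one always-on shared expert. It is an illustration only: it uses a plain softmax router instead of the Sinkhorn-Knopp balancing the model is described as using, and the names `route_token` and `moe_output` and the scalar toy experts are hypothetical, not taken from the TinyV4 code.

```python
import math

NUM_ROUTED = 4  # routed experts, from the Architecture table
TOP_K = 2       # active routed experts per token, from the Architecture table

def route_token(router_scores, top_k=TOP_K):
    """Pick the top_k routed experts for one token; return {expert_index: weight}."""
    # Softmax over router scores (shifted by the max for numerical stability).
    peak = max(router_scores)
    exps = [math.exp(s - peak) for s in router_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts and renormalise their weights to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}

def moe_output(x, router_scores, routed_experts, shared_expert):
    """Shared expert output plus the weighted outputs of the chosen routed experts."""
    out = shared_expert(x)  # the shared expert processes every token
    for idx, weight in route_token(router_scores).items():
        out += weight * routed_experts[idx](x)  # only TOP_K routed experts run
    return out

# Toy usage: scalar functions stand in for the feed-forward expert blocks.
experts = [lambda x, k=k: k * x for k in range(1, NUM_ROUTED + 1)]
shared = lambda x: x
scores = [0.1, 2.0, -1.0, 2.0]  # experts 1 and 3 score highest

weights = route_token(scores)
result = moe_output(1.0, scores, experts, shared)
print(sorted(weights), result)
```

Only two of the four routed experts are evaluated for each token, which is where the "more capacity per parameter" claim comes from: all four experts hold parameters, but each token pays the compute cost of two plus the shared one.
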
## Pre-training Details

| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |

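For readers unfamiliar with the schedule named in the table, here is a minimal sketch of cosine decay with linear warmup. Only the 5,000-step total comes from the table. The warmup length, peak learning rate, and floor are hypothetical placeholders (the peak simply reuses the `3e-4` from the fine-tuning example), and `lr_at` is an illustrative name, not a function from the TinyV4 code.

```python
import math

TOTAL_STEPS = 5_000   # from the Pre-training Details table
WARMUP_STEPS = 250    # hypothetical: not stated in the source
PEAK_LR = 3e-4        # hypothetical: borrowed from the fine-tuning example
MIN_LR = 0.0          # hypothetical floor

def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        # Ramp linearly from PEAK_LR / WARMUP_STEPS up to PEAK_LR.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 249, 2625, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs during warmup, peaks when warmup ends, passes through half the peak midway through the decay phase, and reaches the floor at the final step.
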
## Limitations

Be realistic about what 11M parameters can do:

- **Zero-shot output** will be basic: this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data: it won't magically know medical terms or legal jargon
- **Reasoning** is limited: complex logical chains need more parameters

Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.

## License

MIT: use it, modify it, ship it. Keep the copyright and license notice with any copies, as the MIT terms require; further attribution is appreciated but optional.

## Citation

```bibtex
@misc{tinyv4-11m,
  title = {TinyV4: A 11M Bilingual Base Model with Mixture-of-Experts},
  year = {2025},
  url = {https://huggingface.co/ukung/tinyv4}
}
```