---
language:
- id
- en
tags:
- base-model
- pre-trained
- indonesian
- english
- tiny
- efficient
- moe
- foundation-model
license: mit
datasets: []
metrics:
- loss
pipeline_tag: text-generation
---
# TinyV4: 11M Bilingual Base Model
**TinyV4** is a compact **11 million parameter** bilingual (Indonesian & English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.
At just **58 MB**, it's small enough to run anywhere. Smart enough to be worth your time.
## What is this?
Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.
TinyV4 is different: **11M parameters** with a Mixture-of-Experts architecture, pre-trained on bilingual data so it already has a foundation in both Indonesian and English. You bring the task, it brings the foundation.
## Why use TinyV4 as your base?
| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Fits anywhere: mobile, edge, browser |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | Permissive: use, modify, and ship it, commercially or not |
## Architecture
| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |
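As a rough sanity check on the table, the embedding size and a storage estimate can be worked out from the listed numbers. This is a back-of-the-envelope sketch: it assumes float32 weights (4 bytes each) and decimal megabytes, and the helper names are illustrative rather than part of TinyV4.

```python
# Back-of-the-envelope sizes from the architecture table.
TOTAL_PARAMS = 11_034_955   # from the table
VOCAB_SIZE = 32_000         # from the table
DIM = 128                   # from the table
BYTES_PER_PARAM = 4         # assumption: float32 storage


def embedding_params(vocab_size: int, dim: int) -> int:
    """Parameters in one vocab_size x dim embedding matrix."""
    return vocab_size * dim


def size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Storage in decimal megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


embed = embedding_params(VOCAB_SIZE, DIM)
print(f"Embedding matrix: {embed:,} params ({embed / TOTAL_PARAMS:.1%} of total)")
print(f"All parameters in float32: {size_mb(TOTAL_PARAMS):.1f} MB")
```

Under these assumptions the embedding matrix holds 4,096,000 parameters, a bit over a third of the model, and the weights alone come to roughly 44 MB. The listed 58 MB file is larger than that estimate; the source does not say why, so treat any explanation (for example, an output head stored separately from the embedding) as a guess.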
Built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention**, techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
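To make the "4 routed + 1 shared, 2 active per token" layout concrete, here is a minimal pure-Python sketch of generic top-k softmax gating. It is not TinyV4's actual router: the Sinkhorn-Knopp balancing step is left out, and `route_top_k`, `moe_forward`, and the toy experts are hypothetical names made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring routed experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, routed_experts, shared_expert, k=2):
    """Shared expert always runs; the top-k routed experts are added with their gate weights."""
    out = shared_expert(x)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = routed_experts[idx](x)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 4 routed experts that scale the input by 1..4, plus an identity shared expert.
routed = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: list(x)

token = [1.0, -0.5]
logits = [0.1, 2.0, -1.0, 1.0]   # made-up router scores for this token
print(route_top_k(logits))        # experts 1 and 3 are selected
print(moe_forward(token, logits, routed, shared))
```

Only two of the four routed experts run for each token, which is where the compute saving comes from; the shared expert gives every token a common path regardless of routing.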
## What can you fine-tune it for?
TinyV4 is a blank canvas. Some ideas:
- **Translation** (ID ↔ EN): it already has bilingual foundations
- **Text classification**: sentiment, topic, intent
- **Story generation**: fine-tune on your own narrative dataset
- **Chat / instruction following**: add conversation data
- **Code generation**: yes, even at 11M, it can learn patterns
- **Domain-specific tasks**: medical, legal, technical. Your data, your model
The point is: **you control the final model**. TinyV4 just gives you a running start.
## Quick Start
```bash
pip install transformers safetensors torch
```
### Load the base model
```python
from transformers import AutoTokenizer, AutoModel
# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")
# Tie the output head to the input embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()
# Count parameters; model.parameters() yields tied weights only once
print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")
```
### Generate text (zero-shot)
```python
import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's max sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        # Take the logits for the last position and apply temperature
        logits = logits[:, -1, :] / temperature
        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        # Stop early once the end-of-sequence token is sampled
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```
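To see what the temperature and top-k steps in `generate` do to a distribution, here is a tiny pure-Python version on a five-token vocabulary. The logits are made up for illustration, and `top_k_probs` is a hypothetical helper, not part of the model's API.

```python
import math


def top_k_probs(logits, temperature=0.8, top_k=2):
    """Temperature-scale the logits, keep the top_k largest, and softmax the survivors."""
    scaled = [x / temperature for x in logits]
    # The threshold is the k-th largest scaled logit; anything below it is masked out
    threshold = sorted(scaled, reverse=True)[top_k - 1]
    masked = [x if x >= threshold else float("-inf") for x in scaled]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


probs = top_k_probs([2.0, 1.0, 0.5, -1.0, 0.0], temperature=0.8, top_k=2)
print([round(p, 3) for p in probs])   # only the first two entries are nonzero
```

Lower temperatures sharpen the distribution toward the top token, and a smaller `top_k` removes the long tail of unlikely tokens, which tends to matter more for a small model whose tail probabilities are noisy.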
### Fine-tune for your task
```python
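from torch.optim import AdamW
from safetensors.torch import save_model

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task: `your_dataloader` and `compute_your_loss` are placeholders you supply
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model.
# save_model is used instead of save_file because the head and embedding share
# one tensor after tying, and save_file refuses state dicts with shared tensors.
save_model(model, "my-finetuned-model.safetensors")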
```
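The loop above leaves `compute_your_loss` to you. For plain language-model fine-tuning, one common choice is next-token cross-entropy, sketched below. It assumes `logits` has shape `(batch, seq_len, vocab)` and `batch` is a `(batch, seq_len)` tensor of token ids; the optional `bal_loss` term and its `bal_weight` value are hypothetical, and padding is not handled.

```python
import torch
import torch.nn.functional as F


def compute_your_loss(logits, batch, bal_loss=None, bal_weight=0.01):
    """Next-token cross-entropy, optionally plus a weighted load-balancing term.

    logits: (batch, seq_len, vocab) float tensor
    batch:  (batch, seq_len) tensor of token ids
    """
    # Position t predicts token t+1, so drop the last logit and the first target
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = batch[:, 1:].reshape(-1)
    loss = F.cross_entropy(pred, target)
    if bal_loss is not None:
        # bal_weight is an illustrative value, not one taken from TinyV4
        loss = loss + bal_weight * bal_loss
    return loss


# Smoke test on random data with a toy vocabulary of 32 tokens
toy_logits = torch.randn(2, 8, 32)
toy_batch = torch.randint(0, 32, (2, 8))
print(compute_your_loss(toy_logits, toy_batch).item())
```

Classification or translation tasks need a different target layout, and if your batches contain padding you would likely want to pass `ignore_index` to `F.cross_entropy` so padded positions do not contribute to the loss.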
## Comparison: Sub-100M Base Models
Let's be honest: most base models under 100M parameters fall into one of three groups:
- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)
TinyV4 is different. At **11M parameters**, it delivers:
- **Real bilingual understanding**: not just token overlap
- **MoE efficiency**: 4 routed experts, 2 active per token, more capacity per parameter
- **Proven adaptability**: fine-tunes well across diverse tasks
- **Zero-shot generation**: coherent output without any task-specific training
We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.
## Pre-training Details
| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
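For readers unfamiliar with the schedule named in the table, here is a minimal sketch of linear warmup followed by cosine decay. Only the 5,000-step total comes from the table; the warmup length, peak learning rate, and floor are assumptions chosen for illustration, since the source does not list them.

```python
import math


def lr_at_step(step, total_steps=5_000, warmup_steps=250, peak_lr=3e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    warmup_steps, peak_lr, and min_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 250, 2_625, 5_000):
    print(s, f"{lr_at_step(s):.2e}")
```

The same shape is a reasonable default for fine-tuning, usually with a shorter warmup and a lower peak than pre-training.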
## Limitations
Be realistic about what 11M parameters can do:
- **Zero-shot output** will be basic: this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data: it won't magically know medical terms or legal jargon
- **Reasoning** is limited: complex logical chains need more parameters
Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.
## License
MIT: use it, modify it, ship it. Keep the copyright and license notice in copies you redistribute; further attribution is not required (but appreciated).
## Citation
```bibtex
@misc{tinyv4-11m,
title = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
year = {2025},
url = {https://huggingface.co/ukung/tinyv4}
}
```