---
language:
  - id
  - en
tags:
  - base-model
  - pre-trained
  - indonesian
  - english
  - tiny
  - efficient
  - moe
  - foundation-model
license: mit
datasets: []
metrics:
  - loss
pipeline_tag: text-generation
---

# TinyV4 — 11M Bilingual Base Model

**TinyV4** is a compact **11 million parameter** bilingual (Indonesian & English) base model. Think of it as a solid foundation — pre-trained, ready to be fine-tuned for your specific downstream task.

At just **58 MB**, it's small enough to run anywhere. Smart enough to be worth your time.

## What is this?

Many base models start at 100M+ parameters. Want to experiment with fine-tuning? You usually need a GPU. Want to iterate fast? Good luck.

TinyV4 is different. **11M parameters** with a Mixture-of-Experts architecture — pre-trained on bilingual data so it already understands both Indonesian and English. You bring the task, it brings the foundation.

## Why use TinyV4 as your base?

| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Fits anywhere — mobile, edge, browser |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | Permissive: commercial use allowed, just keep the license notice |

## Architecture

| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |
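
To put the table in perspective, here is a quick back-of-the-envelope check of how much of the parameter budget the token embedding takes. It assumes a single embedding matrix of shape vocab size × dimension that is shared with the output head; the split of the remaining parameters across attention and experts is not stated in this card, so it is not estimated here.

```python
# Values taken from the architecture table above.
vocab_size = 32_000
dim = 128
total_params = 11_034_955

# Assumption: one (vocab_size x dim) embedding matrix, tied with the output head.
embedding_params = vocab_size * dim
share = embedding_params / total_params

print(f"{embedding_params:,} embedding params ({share:.1%} of total)")
# prints: 4,096,000 embedding params (37.1% of total)
```

Under that assumption, more than a third of the model is vocabulary, which is typical for very small models with a 32k vocabulary.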

Built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention** — techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
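
To make "4 routed + 1 shared, 2 active per token" concrete, here is a minimal, dependency-free sketch of generic top-2 routing. It is not TinyV4's actual implementation: the function names, the toy experts, and the choice to renormalize the top-2 weights are all assumptions made for illustration, and the Sinkhorn-Knopp balancing step is left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-probability experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, router_logits, routed_experts, shared_expert, k=2):
    """Shared expert output plus the weighted outputs of the k selected routed experts."""
    out = list(shared_expert(x))  # the shared expert processes every token
    for index, weight in route_top_k(router_logits, k):
        expert_out = routed_experts[index](x)  # only k of the routed experts run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy experts: each just scales its input by a constant (hypothetical stand-ins for MLPs).
routed_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda x: [0.5 * v for v in x]

token = [1.0, -2.0]
router_logits = [0.1, 2.0, -1.0, 1.5]  # one score per routed expert
selection = route_top_k(router_logits)
output = moe_forward(token, router_logits, routed_experts, shared_expert)
print(selection)  # experts 1 and 3 are selected for this token
print(output)
```

The efficiency argument is visible in `moe_forward`: all four routed experts contribute parameters, but only two of them do any work for a given token.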

## What can you fine-tune it for?

TinyV4 is a blank canvas. Some ideas:

- **Translation** (ID ↔ EN) — it already has bilingual foundations
- **Text classification** — sentiment, topic, intent
- **Story generation** — fine-tune on your own narrative dataset
- **Chat / instruction following** — add conversation data
- **Code generation** — yes, even at 11M, it can learn patterns
- **Domain-specific tasks** — medical, legal, technical — your data, your model

The point is: **you control the final model**. TinyV4 just gives you a running start.

## Quick Start

```bash
pip install transformers safetensors torch
```

### Load the base model

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie the output head to the input embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

n_params = sum(p.numel() for p in model.parameters())
print(f"Loaded: {n_params:,} params")
```

### Generate text (zero-shot)

```python
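import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's maximum sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        # Take the logits for the last position and apply temperature
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)

        # Sample the next token and append it to the sequence
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

        # Stop early at end-of-sequence
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)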

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```

### Fine-tune for your task

```python
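from safetensors.torch import save_model
from torch.optim import AdamW

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# your_dataloader and compute_your_loss are placeholders for your own data and task loss
for batch in your_dataloader:
    # Judging by the names: main logits, multi-token-prediction logits, MoE balance loss
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model. save_model handles the tied head/embedding weights,
# which save_file(model.state_dict(), ...) rejects because the tensors share memory.
save_model(model, "my-finetuned-model.safetensors")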
```
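
The loop above leaves `compute_your_loss` open. For plain language-model fine-tuning, one possible stand-in is the standard shifted next-token cross-entropy sketched below. The function name `next_token_loss`, the `bal_weight` value, and the tensor shapes (logits of shape batch × sequence × vocab, a batch of token ids of shape batch × sequence) are assumptions, not something this card specifies. The multi-token-prediction logits are ignored because their layout is not documented here.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids, bal_loss=None, bal_weight=0.01, ignore_index=-100):
    """Causal LM loss: the logits at position t are scored against the token at t+1.

    Assumed shapes: logits (batch, seq_len, vocab), input_ids (batch, seq_len).
    bal_weight is a hypothetical coefficient for the MoE balance loss.
    """
    shift_logits = logits[:, :-1, :]   # drop the last position (nothing to predict)
    shift_labels = input_ids[:, 1:]    # drop the first token (nothing predicts it)
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,     # set padded label positions to -100 to skip them
    )
    if bal_loss is not None:
        loss = loss + bal_weight * bal_loss
    return loss

# Sanity check on dummy data: uniform logits over 10 classes give a loss of ln(10).
dummy_ids = torch.randint(0, 10, (2, 5))
dummy_logits = torch.zeros(2, 5, 10)
demo_loss = next_token_loss(dummy_logits, dummy_ids)
print(demo_loss.item())
```

In the training loop this would be called as `loss = next_token_loss(logits, batch, bal_loss)`, provided the model's outputs really have the shapes assumed above.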

## Comparison: Sub-100M Base Models

Let's be honest — most base models under 100M parameters are either:

- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)

TinyV4 is different. At **11M parameters**, it delivers:

- **Real bilingual understanding** — not just token overlap
- **MoE efficiency** — 4 experts, 2 active, more capacity per parameter
- **Proven adaptability** — fine-tunes well across diverse tasks
- **Zero-shot generation** — coherent output without any task-specific training

We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.

## Pre-training Details

| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
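
For readers who want to reproduce a similar schedule, here is a minimal sketch of cosine decay with linear warmup over the 5,000 steps listed above. The warmup length, peak learning rate, and minimum learning rate are not given in this card, so the defaults below are placeholders rather than the values used for pre-training.

```python
import math

def lr_at(step, total_steps=5000, warmup_steps=200, peak_lr=3e-4, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr.

    warmup_steps, peak_lr, and min_lr are hypothetical defaults.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 199, 2600, 5000):
    print(step, lr_at(step))
```

With PyTorch, the same function can be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / peak_lr` as the multiplier.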

## Limitations

Be realistic about what 11M parameters can do:

- **Zero-shot output** will be basic — this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data — it won't magically know medical terms or legal jargon
- **Reasoning** is limited — complex logical chains need more parameters

Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.

## License

MIT: use it, modify it, ship it. Keep the copyright and license notice in any copies, as the MIT license requires.

## Citation

```bibtex
@misc{tinyv4-11m,
  title  = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
  year   = {2025},
  url    = {https://huggingface.co/ukung/tinyv4}
}
```