---
language:
  - id
  - en
tags:
  - base-model
  - pre-trained
  - indonesian
  - english
  - tiny
  - efficient
  - moe
  - foundation-model
license: mit
datasets: []
metrics:
  - loss
pipeline_tag: text-generation
---

# TinyV4: 11M Bilingual Base Model

**TinyV4** is a compact **11 million parameter** bilingual (Indonesian & English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.

At just **58 MB**, it's small enough to run anywhere. Smart enough to be worth your time.

## What is this?

Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.

TinyV4 is different. It has **11M parameters** with a Mixture-of-Experts architecture and is pre-trained on bilingual data, so it already understands both Indonesian and English. You bring the task, it brings the foundation.

## Why use TinyV4 as your base?

| Reason | Why it matters |
|---|---|
| **11M params** | Fine-tune in minutes, not days |
| **58 MB** | Fits anywhere: mobile, edge, browser |
| **CPU-friendly** | No GPU? No problem |
| **Bilingual** | Already understands ID + EN |
| **MoE architecture** | Efficient capacity without the bloat |
| **MIT license** | No restrictions, no strings |

## Architecture

| Component | Spec |
|---|---|
| Parameters | **11,034,955** |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |
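
As a rough sanity check on the table, the arithmetic below estimates how much of the parameter budget the token embedding takes. It is a sketch under one assumption not stated in the card: that the reported total counts a single `vocab x dim` embedding matrix shared with the output head.

```python
# Figures taken from the architecture table.
TOTAL_PARAMS = 11_034_955
VOCAB_SIZE = 32_000
DIM = 128

# Assumption: one (vocab x dim) embedding matrix, counted once in the total.
embedding_params = VOCAB_SIZE * DIM
other_params = TOTAL_PARAMS - embedding_params
embedding_share = embedding_params / TOTAL_PARAMS

print(f"embedding: {embedding_params:,} ({embedding_share:.1%})")
print(f"everything else: {other_params:,}")
# embedding: 4,096,000 (37.1%)
# everything else: 6,938,955
```

Under that assumption, a bit over a third of the parameters sit in the vocabulary table, which is typical for very small models with a 32k vocabulary.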

TinyV4 is built with **Mixture-of-Experts (MoE)**, **Sinkhorn-Knopp load balancing**, **Multi-Token Prediction (MTP)**, and **Hierarchical Compressed Attention**. These techniques are typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
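
To make "4 routed + 1 shared, 2 active per token" concrete, here is a toy sketch of top-2 routing in plain Python. It is not TinyV4's implementation: it uses an ordinary softmax gate rather than Sinkhorn-Knopp balancing, treats each expert as a scalar function, and assumes the shared expert's output is simply added. The names `route_top_k` and `moe_output` are made up for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring routed experts and renormalize their gates."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_output(x, router_logits, routed_experts, shared_expert, k=2):
    """Shared expert always runs; only the k selected routed experts run."""
    gates = route_top_k(router_logits, k)
    y = shared_expert(x)  # assumption: shared output is added unweighted
    for i, gate in gates.items():
        y += gate * routed_experts[i](x)
    return y

# Toy setup: 4 routed experts + 1 shared expert, all scalar functions.
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x

gates = route_top_k([0.0, 2.0, 1.0, -1.0], k=2)
print(sorted(gates))  # indices of the two active experts
print(moe_output(1.0, [0.0, 2.0, 1.0, -1.0], routed, shared))
```

Only two of the four routed experts do any work for a given token, which is where the "capacity without the compute" argument for MoE comes from.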

## What can you fine-tune it for?

TinyV4 is a blank canvas. Some ideas:

- **Translation** (ID ↔ EN): it already has bilingual foundations
- **Text classification**: sentiment, topic, intent
- **Story generation**: fine-tune on your own narrative dataset
- **Chat / instruction following**: add conversation data
- **Code generation**: yes, even at 11M, it can learn patterns
- **Domain-specific tasks**: medical, legal, technical (your data, your model)

The point is: **you control the final model**. TinyV4 just gives you a running start.

## Quick Start

```bash
pip install transformers safetensors torch
```

### Load the base model

```python
from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")
```

### Generate text (zero-shot)

```python
import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        logits = logits[:, -1, :] / temperature

        v, _ = torch.topk(logits, top_k)
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)

        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```

### Fine-tune for your task

```python
from torch.optim import AdamW

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model.
# save_model handles the tied head/embed weights; save_file raises on shared tensors.
from safetensors.torch import save_model
save_model(model, "my-finetuned-model.safetensors")
```
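
The loop above leaves `compute_your_loss` up to you. For plain language-model fine-tuning, one common choice is shifted next-token cross-entropy, sketched below. It assumes `logits` has shape `(batch, seq, vocab)` and that the batch is a tensor of token ids, which matches how the generation example indexes the output. The function name `next_token_loss` and the `bal_weight` value are hypothetical, not part of TinyV4.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids, bal_loss=None, bal_weight=0.01):
    """Cross-entropy between position t's logits and the token at t + 1."""
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the following tokens
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    # Optionally add the router balancing term returned by the model
    # (bal_weight is an illustrative value, not a tuned one).
    if bal_loss is not None:
        loss = loss + bal_weight * bal_loss
    return loss

# Toy check with uniform logits: the loss equals log(vocab_size).
toy_logits = torch.zeros(2, 5, 8)
toy_ids = torch.randint(0, 8, (2, 5))
print(next_token_loss(toy_logits, toy_ids))
```

In the loop, this would be called as `loss = next_token_loss(logits, batch, bal_loss)`. Whether to include the balancing term, and with what weight, depends on how the model was pre-trained, which the card does not specify.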

## Comparison: Sub-100M Base Models

Let's be honest: most base models under 100M parameters are either:

- **Distilled** from larger models (not truly small)
- **Overly specialized** (can't adapt to new tasks)
- **Poorly architected** (waste parameters on the wrong things)

TinyV4 is different. At **11M parameters**, it delivers:

- **Real bilingual understanding**: not just token overlap
- **MoE efficiency**: 4 experts, 2 active, more capacity per parameter
- **Proven adaptability**: fine-tunes well across diverse tasks
- **Zero-shot generation**: coherent output without any task-specific training

We're not saying 11M beats 1B. We're saying that at this size, **nothing else gives you this much to work with**.

## Pre-training Details

| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
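
For readers unfamiliar with the schedule named in the table, this is a minimal sketch of cosine decay with linear warmup. Only the 5,000-step total comes from the table. The warmup length, peak learning rate, and floor are placeholder values, since the card does not list them, and `lr_at` is a made-up helper name.

```python
import math

def lr_at(step, total_steps=5000, warmup_steps=200, peak_lr=3e-4, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 199, 2600, 5000):
    print(step, lr_at(step))
```

The rate climbs to its peak during warmup, then follows half a cosine wave down to the floor by the final step.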

## Limitations

Be realistic about what 11M parameters can do:

- **Zero-shot output** will be basic: this is a base model, not a finished product
- **Long-form coherence** requires fine-tuning with appropriate data
- **Domain expertise** needs your data: it won't magically know medical terms or legal jargon
- **Reasoning** is limited: complex logical chains need more parameters

Think of TinyV4 as **the best possible starting point at 11M**. Not the finish line.

## License

MIT: use it, modify it, ship it. No attribution required (but appreciated).

## Citation

```bibtex
@misc{tinyv4-11m,
  title  = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
  year   = {2025},
  url    = {https://huggingface.co/ukung/tinyv4}
}
```