---
license: apache-2.0
---
# MiniLingua-1b-Instruct

**MiniLingua-1b-Instruct** is an instruction-tuned multilingual model based on the [MiniLingua-1b](https://huggingface.co/minilingua-ai/MiniLingua-1b) base model. It supports a diverse set of European languages as well as programming code, making it suitable for instruction following, multilingual generation, and downstream tasks such as question answering and summarisation.

## Supported Languages

- Bulgarian  
- Czech  
- Dutch  
- English  
- Finnish  
- French  
- German  
- Greek  
- Italian  
- Polish  
- Portuguese  
- Spanish  
- Swedish  
- Programming code  

## Instruction Tuning

This preview instruction-tuned version of MiniLingua-1b was trained over 1 epoch on 1.2 million instructions from the following high-quality datasets:

- [CohereLabs/aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split)  
- [MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)  
- [GAIR/lima](https://huggingface.co/datasets/GAIR/lima)  
- [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k)  
- [minilingua-ai/mcqa-minilingua-sft](https://huggingface.co/datasets/minilingua-ai/mcqa-minilingua-sft)  

The supervised fine-tuning (SFT) was performed on the [Triton Aalto cluster](https://scicomp.aalto.fi/triton/) using 4 H200 GPUs.

## Intended Use

This model is a **preview release** intended for:

- Multilingual instruction following  
- Evaluation and benchmarking  
- Research in low- and high-resource European languages  


## Use with transformers

Quick start with `transformers`, for both GPU and CPU environments:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_name = "minilingua-ai/MiniLingua-1b-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # places the model on GPU if available, else CPU
    torch_dtype=torch.float16,  # use torch.float32 on CPU-only machines
)
gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Translate from Bulgarian: Здравейте! Как сте? Translation:"
out = gen(prompt, max_new_tokens=128, do_sample=False)
print(out[0]["generated_text"])
```
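The quick-start above passes a single plain-text prompt to the pipeline. As a minimal sketch, the same prompt style can be built programmatically for any of the supported languages; the helper name `build_translation_prompt` and the input validation are illustrative assumptions, not part of the released model or package:

```python
# Hypothetical helper reproducing the plain-text prompt style shown in the
# quick-start above; the function name and wording are illustrative only.
SUPPORTED_LANGUAGES = {
    "Bulgarian", "Czech", "Dutch", "English", "Finnish", "French",
    "German", "Greek", "Italian", "Polish", "Portuguese", "Spanish",
    "Swedish",
}

def build_translation_prompt(source_lang: str, text: str) -> str:
    """Build a translation prompt in the style used in the quick-start."""
    if source_lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language: {source_lang}")
    return f"Translate from {source_lang}: {text} Translation:"

prompt = build_translation_prompt("Bulgarian", "Здравейте! Как сте?")
# `prompt` can then be passed to the text-generation pipeline exactly as above.
```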

## Limitations

- This version is a first-stage SFT release; no alignment steps (e.g. preference tuning) have been applied.
- Some languages may show uneven instruction-following ability depending on resource availability and instruction diversity.

---

**License**: Apache-2.0