---
license: apache-2.0
---
|
|
# MiniLingua-1b-Instruct |
|
|
|
|
|
**MiniLingua-1b-Instruct** is an instruction-tuned multilingual model based on the [MiniLingua-1b](https://huggingface.co/minilingua-ai/MiniLingua-1b) base model. It supports a diverse set of European languages as well as programming code, making it suitable for instruction following, multilingual generation, and downstream tasks such as question answering and summarisation.
|
|
|
|
|
## Supported Languages |
|
|
|
|
|
- Bulgarian |
|
|
- Czech |
|
|
- Dutch |
|
|
- English |
|
|
- Finnish |
|
|
- French |
|
|
- German |
|
|
- Greek |
|
|
- Italian |
|
|
- Polish |
|
|
- Portuguese |
|
|
- Spanish |
|
|
- Swedish |
|
|
- Programming code |
|
|
|
|
|
## Instruction Tuning |
|
|
|
|
|
This preview instruction-tuned version of MiniLingua-1b was trained for one epoch on 1.2 million instructions drawn from the following high-quality datasets:
|
|
|
|
|
- [CohereLabs/aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) |
|
|
- [MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X) |
|
|
- [GAIR/lima](https://huggingface.co/datasets/GAIR/lima) |
|
|
- [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) |
|
|
- [minilingua-ai/mcqa-minilingua-sft](https://huggingface.co/datasets/minilingua-ai/mcqa-minilingua-sft) |
|
|
|
|
|
The supervised fine-tuning (SFT) was performed on the [Triton Aalto cluster](https://scicomp.aalto.fi/triton/) using 4 H200 GPUs. |
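
As a rough illustration of how such a mixture can be assembled with the `datasets` library (a sketch only: the exact sampling weights, configs, and column handling used for training are not published here, and the column names below are assumptions):

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical sketch, not the published recipe: normalise two of the
# listed sources to a shared {"prompt", "response"} schema before mixing.
# Column names and mixing weights below are assumptions.
lima = load_dataset("GAIR/lima", split="train")
lima = lima.map(
    lambda ex: {"prompt": ex["conversations"][0], "response": ex["conversations"][1]},
    remove_columns=lima.column_names,
)

oss = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train")
oss = oss.map(
    lambda ex: {"prompt": ex["instruction"], "response": ex["response"]},
    remove_columns=oss.column_names,
)

mixture = interleave_datasets([lima, oss], probabilities=[0.5, 0.5], seed=42)
print(mixture)
```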
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is a **preview release** intended for: |
|
|
|
|
|
- Multilingual instruction following |
|
|
- Evaluation and benchmarking (see the harness sketch after this list)
|
|
- Research in low- and high-resource European languages |
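
One possible benchmarking route, not an official recipe for this model, is EleutherAI's `lm-evaluation-harness` (`pip install lm-eval`); the task selection below is an arbitrary placeholder:

```python
# Hedged sketch using lm-evaluation-harness; the tasks listed are
# arbitrary examples, not this model's official benchmark suite.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=minilingua-ai/MiniLingua-1b-Instruct,dtype=float16",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])
```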
|
|
|
|
|
|
|
|
## Use with transformers |
|
|
|
|
|
Quick start with the `transformers` library, for both GPU and CPU-only environments:
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_name = "minilingua-ai/MiniLingua-1b-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" places the model on a GPU when one is available
# (requires the `accelerate` package). On a CPU-only machine, drop
# device_map and load with torch_dtype=torch.float32 instead.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Translate from Bulgarian: Здравейте! Как сте? Translation:"
out = gen(prompt, max_new_tokens=128, do_sample=False)  # greedy decoding
print(out[0]["generated_text"])
```
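
Instruction-tuned checkpoints typically ship a chat template with their tokenizer. Assuming this repository provides one (an assumption, not verified here), prompts can be formatted with `apply_chat_template` instead of a raw string, reusing the `tokenizer` and `model` loaded above:

```python
# Sketch assuming the tokenizer defines a chat template; if
# tokenizer.chat_template is None, use the raw-prompt form above instead.
messages = [
    {"role": "user", "content": "Summarise in one sentence: MiniLingua is a 1B multilingual model."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```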
|
|
|
|
|
## Limitations |
|
|
|
|
|
- This version is a first-stage SFT release; alignment steps have not been applied.
|
|
- Some languages may show uneven instruction-following ability depending on resource availability and instruction diversity. |
|
|
|
|
|
--- |
|
|
|
|
|
**License**: Apache-2.0 |
|
|
|