---
license: apache-2.0
---

# MiniLingua-1b-Instruct

**MiniLingua-1b-Instruct** is an instruction-tuned multilingual model based on the [MiniLingua-1b](https://huggingface.co/minilingua-ai/MiniLingua-1b) base model. It supports 13 European languages and programming code, making it suitable for instruction following, multilingual generation, and downstream tasks such as question answering and summarisation.

## Supported Languages

- Bulgarian
- Czech
- Dutch
- English
- Finnish
- French
- German
- Greek
- Italian
- Polish
- Portuguese
- Spanish
- Swedish
- Programming code

## Instruction Tuning

This preview instruction-tuned version of MiniLingua-1b was trained for 1 epoch on 1.2 million instructions from the following high-quality datasets:

- [CohereLabs/aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split)
- [MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X)
- [GAIR/lima](https://huggingface.co/datasets/GAIR/lima)
- [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k)
- [minilingua-ai/mcqa-minilingua-sft](https://huggingface.co/datasets/minilingua-ai/mcqa-minilingua-sft)

The supervised fine-tuning (SFT) was performed on the [Triton Aalto cluster](https://scicomp.aalto.fi/triton/) using 4 H200 GPUs.

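
When scripting against the supported language list above, it can help to validate the language before building a prompt. The following is a minimal sketch, not a documented prompt template: `SUPPORTED_LANGUAGES` and `build_translation_prompt` are hypothetical names introduced here, and the prompt wording simply mirrors the translation prompt used in the quick-start example in this card.

```python
# Natural languages listed in this model card (programming code is left out).
SUPPORTED_LANGUAGES = frozenset({
    "Bulgarian", "Czech", "Dutch", "English", "Finnish", "French", "German",
    "Greek", "Italian", "Polish", "Portuguese", "Spanish", "Swedish",
})


def build_translation_prompt(source_language: str, text: str) -> str:
    """Hypothetical helper: build a translation prompt for a supported language.

    The wording copies the quick-start prompt in this card; it is an
    assumption, not an official template for the model.
    """
    # Normalise input such as " bulgarian " to "Bulgarian".
    language = source_language.strip().capitalize()
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"{source_language!r} is not in the supported language list")
    return f"Translate from {language}: {text.strip()} Translation:"


if __name__ == "__main__":
    print(build_translation_prompt("bulgarian", "Здравейте! Как сте?"))
```

The returned string can be passed directly as the `prompt` argument of a text-generation pipeline. Rejecting unlisted languages early avoids silently prompting the model in a language it was not tuned on.
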
## Intended Use

This model is a **preview release** intended for:

- Multilingual instruction following
- Evaluation and benchmarking
- Research in low- and high-resource European languages

## Use with transformers

Quick start with `transformers`, for both GPU and CPU environments:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_name = "minilingua-ai/MiniLingua-1b-Instruct"
# Use half precision on GPU; fall back to float32 on CPU-only machines.
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" requires the `accelerate` package.
# On older transformers releases, `dtype` is spelled `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", dtype=dtype)
gen = pipeline("text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True)

prompt = "Translate from Bulgarian: Здравейте! Как сте? Translation:"
out = gen(prompt, max_new_tokens=128, do_sample=False)
# The pipeline returns a list of dicts; print only the generated text.
print(out[0]["generated_text"])
```

## Limitations

- This version is a first-stage SFT release; no alignment steps have been applied.
- Some languages may show uneven instruction-following ability depending on resource availability and instruction diversity.

---

**License**: Apache-2.0