---
license: apache-2.0
---
|
|
# MiniLingua-1b-Instruct |
|
|
|
|
|
**MiniLingua-1b-Instruct** is an instruction-tuned multilingual model based on the [MiniLingua-1b](https://huggingface.co/minilingua-ai/MiniLingua-1b) base model. It supports a diverse set of European languages as well as programming code, making it suitable for instruction following, multilingual generation, and downstream tasks such as question answering and summarisation.
|
|
|
|
|
## Supported Languages |
|
|
|
|
|
- Bulgarian |
|
|
- Czech |
|
|
- Dutch |
|
|
- English |
|
|
- Finnish |
|
|
- French |
|
|
- German |
|
|
- Greek |
|
|
- Italian |
|
|
- Polish |
|
|
- Portuguese |
|
|
- Spanish |
|
|
- Swedish |
|
|
- Programming code |
|
|
|
|
|
## Instruction Tuning |
|
|
|
|
|
This preview instruction-tuned version of MiniLingua-1b was trained for one epoch on 1.2 million instructions drawn from the following high-quality datasets:
|
|
|
|
|
- [CohereLabs/aya_collection_language_split](https://huggingface.co/datasets/CohereLabs/aya_collection_language_split) |
|
|
- [MBZUAI/Bactrian-X](https://huggingface.co/datasets/MBZUAI/Bactrian-X) |
|
|
- [GAIR/lima](https://huggingface.co/datasets/GAIR/lima) |
|
|
- [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) |
|
|
- [minilingua-ai/mcqa-minilingua-sft](https://huggingface.co/datasets/minilingua-ai/mcqa-minilingua-sft) |
|
|
|
|
|
The supervised fine-tuning (SFT) was performed on the [Triton Aalto cluster](https://scicomp.aalto.fi/triton/) using 4 H200 GPUs. |
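
As a rough illustration of how such a mixture can be assembled with the `datasets` library (a sketch only: the exact sampling weights, configs, and column handling used for training are not published here, and the column names below are assumptions):

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical sketch, not the published recipe: normalise two of the
# listed sources to a shared {"prompt", "response"} schema before mixing.
# Column names and mixing weights below are assumptions.
lima = load_dataset("GAIR/lima", split="train")
lima = lima.map(
    lambda ex: {"prompt": ex["conversations"][0], "response": ex["conversations"][1]},
    remove_columns=lima.column_names,
)

oss = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train")
oss = oss.map(
    lambda ex: {"prompt": ex["instruction"], "response": ex["response"]},
    remove_columns=oss.column_names,
)

mixture = interleave_datasets([lima, oss], probabilities=[0.5, 0.5], seed=42)
print(mixture)
```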
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is a **preview release** intended for: |
|
|
|
|
|
- Multilingual instruction following |
|
|
- Evaluation and benchmarking (see the harness sketch after this list)
|
|
- Research in low- and high-resource European languages |
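
One possible benchmarking route, not an official recipe for this model, is EleutherAI's `lm-evaluation-harness` (`pip install lm-eval`); the task selection below is an arbitrary placeholder:

```python
# Hedged sketch using lm-evaluation-harness; the tasks listed are
# arbitrary examples, not this model's official benchmark suite.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=minilingua-ai/MiniLingua-1b-Instruct,dtype=float16",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])
```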
|
|
|
|
|
|
|
|
## Use with transformers |
|
|
|
|
|
Quick start with the `transformers` library, for both GPU and CPU-only environments:
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_name = "minilingua-ai/MiniLingua-1b-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" places the model on a GPU when one is available
# (requires the `accelerate` package). On a CPU-only machine, drop
# device_map and load with torch_dtype=torch.float32 instead.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Translate from Bulgarian: Здравейте! Как сте? Translation:"
out = gen(prompt, max_new_tokens=128, do_sample=False)  # greedy decoding
print(out[0]["generated_text"])
```
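
Instruction-tuned checkpoints typically ship a chat template with their tokenizer. Assuming this repository provides one (an assumption, not verified here), prompts can be formatted with `apply_chat_template` instead of a raw string, reusing the `tokenizer` and `model` loaded above:

```python
# Sketch assuming the tokenizer defines a chat template; if
# tokenizer.chat_template is None, use the raw-prompt form above instead.
messages = [
    {"role": "user", "content": "Summarise in one sentence: MiniLingua is a 1B multilingual model."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```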
|
|
|
|
|
## Limitations |
|
|
|
|
|
- This version is a first-stage SFT release; alignment steps have not been applied.
|
|
- Some languages may show uneven instruction-following ability depending on resource availability and instruction diversity. |
|
|
|
|
|
--- |
|
|
|
|
|
**License**: Apache-2.0 |
|
|
|