---
license: apache-2.0
datasets:
- faur-ai/fulg
language:
- ro
---

# LLMic Model Card

[LLMic: Romanian Foundation Language Model](https://arxiv.org/abs/2501.07721)

## Model Summary

LLMic is a bilingual Romanian-English foundation model: a 3B-parameter dense decoder-only Transformer based on the Llama 2 architecture.

## Architecture

| Parameter | Value |
|-----------|-------|
| Sequence Length | 2048 |
| Number of Layers | 24 |
| Embedding Size | 2,560 |
| FFN Hidden Size | 10,240 |
| Number of Heads | 20 |
| Number of KV Heads | 5 |
| Activation Function | SiLU |
| Position Encodings | RoPE (Θ=500,000) |
| Layer Norm | RMSNorm (ε=10⁻⁵) |
| Tied Embeddings | No |
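
A few quantities follow directly from the table: with 20 query heads over a 2,560-dimensional embedding, each head is 128-dimensional, and grouped-query attention with 5 KV heads shrinks the per-token KV cache fourfold compared to full multi-head attention. A quick sketch of that arithmetic (derived from the table only; the 3B total parameter count is the paper's figure):

```python
# Derived quantities from the LLMic architecture table above.
embedding_size = 2560
num_heads = 20      # query heads
num_kv_heads = 5    # grouped-query attention (GQA)
ffn_hidden = 10240

head_dim = embedding_size // num_heads             # 128 dims per head

# Per-token KV-cache entries per layer: 2 tensors (K and V) per KV head.
kv_entries_gqa = 2 * num_kv_heads * head_dim       # 1,280 values
kv_entries_mha = 2 * num_heads * head_dim          # 5,120 values without GQA
cache_reduction = kv_entries_mha / kv_entries_gqa  # 4x smaller cache

# The FFN uses the common 4x expansion of the embedding size.
assert ffn_hidden == 4 * embedding_size
```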

## Intended Use

Our model is designed to accelerate research on Romanian language models, serving as a building block for generative AI applications.

## Use with transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "faur-ai/LLMic"
prompt = "Capitala României este"  # "The capital of Romania is"

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer)

inputs = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=64,  # cap the generation length
    temperature=0.8,
    do_sample=True,
)
```

## Data Overview

### Training Datasets

| Source | Size |
|--------|------|
| *Romanian (300B tokens)* | |
| Web Sources | 621 GB |
| Discussions, Curated & Parallel | 10 GB |
| *English (700B tokens)* | |
| FineWebEdu | -- |
| Dolma Subset | 109 GB |

#### Benchmark results

We evaluated LLMic on the WMT16 English-to-Romanian machine translation benchmark.

| Model | Score |
|-------|-------|
| LLMic | 41.01 |
| mBART | 38.50 |
| Llama-3.1-8B-Instruct | 29.02 |
| RoMistral-7b-Instruct | 27.70 |
| RoLlama3-8b-Instruct | 27.31 |
| Mistral-7B-Instruct-v0.2 | 26.19 |
| RoGemma-7b-Instruct | 25.96 |
| Gemma-1.1-7b-it | 25.48 |
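
Translation quality on WMT16 is typically reported as corpus-level BLEU. As an illustration only (the scores above come from the paper's own evaluation pipeline, whose exact scorer and tokenization settings are not specified here), a minimal BLEU implementation looks like:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Toy corpus-level BLEU: modified n-gram precision + brevity penalty.

    Assumes whitespace tokenization and a single reference per hypothesis.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect translation scores 100; real evaluations normally use a standard scorer such as sacrebleu so that tokenization is reproducible across papers.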

## Citation

**BibTeX:**

```bibtex
@misc{bădoiu2025llmicromanianfoundationlanguage,
      title={LLMic: Romanian Foundation Language Model},
      author={Vlad-Andrei Bădoiu and Mihai-Valentin Dumitru and Alexandru M. Gherghescu and Alexandru Agache and Costin Raiciu},
      year={2025},
      eprint={2501.07721},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.07721},
}
```