---
license: apache-2.0
datasets:
- faur-ai/fulg
language:
- ro
---

# LLMic Model Card

[LLMic: Romanian Foundation Language Model](https://arxiv.org/abs/2501.07721)

## Model Summary

LLMic is a bilingual Romanian-English foundation model: a 3B-parameter dense decoder-only Transformer based on the Llama 2 architecture.

## Architecture

| Parameter | Value |
|-----------|-------|
| Sequence Length | 2048 |
| Number of Layers | 24 |
| Embedding Size | 2,560 |
| FFN Hidden Size | 10,240 |
| Number of Heads | 20 |
| Number of KV Heads | 5 |
| Activation Function | SiLU |
| Position Encodings | RoPE (Θ=500,000) |
| Layer Norm | RMSNorm (ε=10⁻⁵) |
| Tied Embeddings | No |
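
A few quantities follow directly from the table: with 20 query heads over a 2,560-dimensional embedding, each head is 128-dimensional, and grouped-query attention with 5 KV heads shrinks the per-token KV cache fourfold compared to full multi-head attention. A quick sketch of that arithmetic (derived from the table only; the 3B total parameter count is the paper's figure):

```python
# Derived quantities from the LLMic architecture table above.
embedding_size = 2560
num_heads = 20      # query heads
num_kv_heads = 5    # grouped-query attention (GQA)
ffn_hidden = 10240

head_dim = embedding_size // num_heads             # 128 dims per head

# Per-token KV-cache entries per layer: 2 tensors (K and V) per KV head.
kv_entries_gqa = 2 * num_kv_heads * head_dim       # 1,280 values
kv_entries_mha = 2 * num_heads * head_dim          # 5,120 values without GQA
cache_reduction = kv_entries_mha / kv_entries_gqa  # 4x smaller cache

# The FFN uses the common 4x expansion of the embedding size.
assert ffn_hidden == 4 * embedding_size
```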

## Intended Use

Our model is designed to accelerate research on Romanian language models, serving as a building block for generative AI applications.

## Use with transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "faur-ai/LLMic"
prompt = "Capitala României este"  # "The capital of Romania is"

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer)

inputs = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=64,  # cap the generation length
    temperature=0.8,
    do_sample=True,
)
```

## Data Overview

### Training Datasets

| Source | Size |
|--------|------|
| *Romanian (300B tokens)* | |
| Web Sources | 621 GB |
| Discussions, Curated & Parallel | 10 GB |
| *English (700B tokens)* | |
| FineWebEdu | -- |
| Dolma Subset | 109 GB |

#### Benchmark results

We evaluated LLMic on the WMT16 English-to-Romanian machine translation benchmark.

| Model | Score |
|-------|-------|
| LLMic | 41.01 |
| mBART | 38.50 |
| Llama-3.1-8B-Instruct | 29.02 |
| RoMistral-7b-Instruct | 27.70 |
| RoLlama3-8b-Instruct | 27.31 |
| Mistral-7B-Instruct-v0.2 | 26.19 |
| RoGemma-7b-Instruct | 25.96 |
| Gemma-1.1-7b-it | 25.48 |
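
Translation quality on WMT16 is typically reported as corpus-level BLEU. As an illustration only (the scores above come from the paper's own evaluation pipeline, whose exact scorer and tokenization settings are not specified here), a minimal BLEU implementation looks like:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Toy corpus-level BLEU: modified n-gram precision + brevity penalty.

    Assumes whitespace tokenization and a single reference per hypothesis.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect translation scores 100; real evaluations normally use a standard scorer such as sacrebleu so that tokenization is reproducible across papers.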

## Citation

**BibTeX:**

```bibtex
@misc{bădoiu2025llmicromanianfoundationlanguage,
      title={LLMic: Romanian Foundation Language Model},
      author={Vlad-Andrei Bădoiu and Mihai-Valentin Dumitru and Alexandru M. Gherghescu and Alexandru Agache and Costin Raiciu},
      year={2025},
      eprint={2501.07721},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.07721},
}
```