# Cirilla 0.3B 4E

Cirilla 0.3B 4E is an efficient tiny language model from the Cirilla family, trained specifically on lore from the Witcher franchise. To learn more, visit my GitHub repository.
| parameters | precision |
|---|---|
| 229.12 M | BF16 |
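As a rough sanity check on the table above, the raw weight storage at BF16 (2 bytes per parameter) can be estimated as follows. This is an illustrative back-of-the-envelope calculation, not an official checkpoint size, and it excludes optimizer state and any file-format overhead:

```python
# Approximate raw weight size for 229.12 M parameters stored in BF16
params = 229.12e6
bytes_per_param = 2  # BF16 = 16 bits
size_mib = params * bytes_per_param / 2**20
print(f"{size_mib:.0f} MiB")  # ≈ 437 MiB of raw weights
```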
> ⚠️ **Usage Note:** This model is not compatible with the standard `transformers` library. It uses a custom architecture and requires the `Cirilla` package to run.
## Key Features

Cirilla 0.3B 4E is based on architectural ideas from Mistral 7B, Mixtral, and Llama 2. It consists of the following components:
- **Sparse Mixture of Experts**: scales parameter count efficiently by activating only a subset of experts per token, reducing computational cost while maintaining capacity.
- **Sliding Window Attention**: handles sequences effectively by limiting the attention scope, reducing memory usage.
- **Grouped-Query Attention (GQA)**: optimizes inference speed and reduces memory-bandwidth usage.
- **Large (Enough) Context Window**: supports a 2048-token context window.
- **License**: released under the MIT License.
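To make the first two features concrete, here is a minimal NumPy sketch of top-k expert routing and a sliding-window causal mask. This is an illustrative toy, not the actual Cirilla implementation; all function and variable names are hypothetical:

```python
import numpy as np

def top_k_routing(router_logits, k=2):
    """Pick the top-k experts per token and softmax-normalize their gates."""
    idx = np.argsort(router_logits, axis=-1)[:, -k:]          # (tokens, k)
    g = np.take_along_axis(router_logits, idx, axis=-1)
    g = np.exp(g - g.max(axis=-1, keepdims=True))
    return idx, g / g.sum(axis=-1, keepdims=True)

def moe_layer(x, expert_weights, router_w, k=2):
    """Route each token to k experts; only those k experts do any work."""
    idx, gates = top_k_routing(x @ router_w, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += gates[t, j] * (x[t] @ expert_weights[idx[t, j]])
    return out

def sliding_window_mask(seq_len, window):
    """Causal mask where each position sees at most `window` recent tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 5
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
expert_weights = rng.normal(size=(n_experts, d, d))
y = moe_layer(x, expert_weights, router_w)     # same shape as x
mask = sliding_window_mask(seq_len=6, window=3)
```

Note how the MoE layer's total parameter count grows with `n_experts`, while the per-token compute only depends on `k`.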
## Cirilla Model Family
| Model Name | Type | Precision | Link |
|---|---|---|---|
| Cirilla 0.3B 4E | Instruct post-trained | BF16 | Hugging Face |
| Cirilla 0.3B 4E GRPO | GRPO post-trained | BF16 | Hugging Face |
| Cirilla 0.3B 4E GRPO ICL | GRPO-ICL post-trained | BF16 | Hugging Face |
## Training Data

### General Pretraining

The model was initialized on a curated mix of high-quality synthetic and instruction-following datasets. This foundation phase focused on establishing grammar, coherence, and basic reasoning skills using subsets of TinyStories, TinyStoriesInstruct, SimpleStories, and GLUE (MNLI).
(133 MiB, 221K data points)
### Mid-Training

Following the foundation phase, the model was adapted to the target domain using comprehensive summaries of the Witcher Fandom Wiki. 7,506 wiki pages were processed and summarized using open models, including Llama 3.1 8B, Llama 3.2 3B, Granite 3.1 8B, Granite 3.2 2B, Mistral Small 3 24B, Phi 4 14B, Qwen 2.5 7B, and Qwen 3 8B. This phase also incorporated the Reasoning Gym dataset to enhance logical-deduction capabilities alongside lore retention.
(21 MiB, 50.5K data points)
### Domain-Specific Fine-Tuning

The final stage focused on activating the model's conversational abilities and aligning it with the ingested lore. It involved training on 185K synthetic question-answer pairs, generated with a subset of the models used in the mid-training phase by transforming the static lore summaries and extracted facts into dynamic, multi-turn dialogues that simulate natural interactions.
(83 MiB, 78K multi-turn conversations)
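The exact record schema of these dialogues is not documented here, but a plausible shape, matching the role/content message format that `apply_chat_template` consumes in the usage example, might look like this. The questions and answers below are made up for illustration:

```python
# Hypothetical multi-turn fine-tuning record in role/content message format
record = [
    {"role": "user", "content": "Who trained Ciri at Kaer Morhen?"},
    {"role": "assistant", "content": "Ciri was trained by the witchers of the "
                                     "Wolf School, chiefly Geralt and Vesemir."},
    {"role": "user", "content": "Where is Kaer Morhen located?"},
    {"role": "assistant", "content": "Kaer Morhen lies in the mountains of Kaedwen."},
]
# A well-formed dialogue strictly alternates user/assistant turns
assert all(m["role"] == ("user" if i % 2 == 0 else "assistant")
           for i, m in enumerate(record))
```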
## Usage

Install the package:

```shell
uv add Cirilla
```

### CLI Usage

You can run the model directly from the command line:

```shell
uv run python -m cirilla.cli
```
### Python Usage

```python
from cirilla.Cirilla_model import Cirilla, Args
from cirilla.Cirilla_model import CirillaTokenizer

hf_model_id = 'AnthonyPa57/Cirilla-0.3B-4E'

# You can materialize directly on CPU instead:
# args = Args()
# args.device = 'cpu'
# model = Cirilla(args)
model = Cirilla()
model.pull_model_from_hub(hf_model_id, inference_mode=True)  # , map_device='cpu'
tokenizer = CirillaTokenizer(hub_url=hf_model_id)

prompts = [
    "Which two kings did Dethmold serve in The Witcher 2: Assassins of Kings?",
    "How much does Geralt's inventory capacity increase with the Ofieri saddlebags?",
    "In which book does the story of Ciri entering a portal and becoming trapped in a different world first appear?"
]

for p in prompts:
    # You can generate with the KV cache:
    # x = tokenizer.apply_chat_template([{"role": "user", "content": p}],
    #                                   padding='do_not_pad', add_generation_prompt=True)
    # out = model.generate_kv_cache([x], termination_tokens=[tokenizer.convert_tokens_to_ids('<eos>'),
    #                                                        tokenizer.convert_tokens_to_ids('<|user|>')])
    # or in eager mode:
    x = tokenizer.apply_chat_template([{"role": "user", "content": p}],
                                      return_tensors='pt', padding='do_not_pad',
                                      add_generation_prompt=True)
    out = model.generate_naive(x.to(model.args.device), top_k=3, n_beams=3,
                               termination_tokens=[tokenizer.convert_tokens_to_ids('<eos>'),
                                                  tokenizer.convert_tokens_to_ids('<|user|>')])
    print(tokenizer.decode(out[0]))

# Batched generation with the KV cache
batch_prompts = [[{"role": "user", "content": p}] for p in prompts]
x = tokenizer.apply_chat_template(batch_prompts, padding='do_not_pad', add_generation_prompt=True)
out = model.generate_kv_cache(x, termination_tokens=[tokenizer.convert_tokens_to_ids('<eos>'),
                                                     tokenizer.convert_tokens_to_ids('<|user|>')])
for o in out:
    print(tokenizer.decode(o).replace('<pad>', ''))
model.clear_cache()  # clears the KV cache

# Generate as a parallel search with the KV cache
batch_prompts = [[{"role": "user", "content": "Who is Geralt?"}] for _ in range(3)]
x = tokenizer.apply_chat_template(batch_prompts, padding='do_not_pad', add_generation_prompt=True)
out = model.generate_kv_cache(x, termination_tokens=[tokenizer.convert_tokens_to_ids('<eos>'),
                                                     tokenizer.convert_tokens_to_ids('<|user|>')],
                              beam_search=True, top_p=0.3)
print(tokenizer.decode(out).replace('<pad>', ''))
```
## Evaluation Results

| Metric | Dataset | Value |
|---|---|---|
| Cross Entropy Loss (self-reported) | domain_training | 1.980 |