Cirilla 0.3B 4E

Cirilla 0.3B 4E is an efficient tiny language model from the Cirilla family, trained specifically on lore from the Witcher franchise.

To learn more, visit the project's GitHub repository.

| Parameters | Precision |
| --- | --- |
| 229.12 M | BF16 |

⚠️ Usage Note

This model is not compatible with the standard `transformers` library. It uses a custom architecture and requires the Cirilla package to run.

Key Features

Cirilla 0.3B 4E draws on the architectures of Mistral 7B, Mixtral, and Llama 2.

It consists of the following components:

  • Sparse Mixture of Experts: Enables the model to scale parameters efficiently by only activating a subset of experts per token, reducing computational cost while maintaining capacity.
  • Sliding Window Attention: Allows the model to handle sequences effectively by limiting the attention scope, reducing memory usage.
  • Grouped-Query Attention (GQA): Optimizes inference speed and reduces memory bandwidth usage.
  • Large (Enough) Context Window: Supports a 2048-token context window.
  • License: Released under the MIT License.
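
To make the first bullet concrete, here is a minimal sketch of top-k expert routing in a sparse Mixture of Experts layer: a router scores every expert per token, but only the k best experts are run, with their weights renormalized. The function names are illustrative, and k=2 is an assumption for the example, not necessarily the routing used by Cirilla.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_token(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in topk)
    # Only these k experts' feed-forward blocks would actually be evaluated.
    return [(i, probs[i] / total) for i in topk]

# 4 experts (matching the "4E" in the model name), 2 active per token
weights = route_token([0.1, 2.0, -1.0, 1.5], k=2)
```

Because only k of the E expert feed-forward blocks run per token, total parameter count grows with E while per-token compute grows only with k.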

Cirilla Model Family

| Model Name | Type | Precision | Link |
| --- | --- | --- | --- |
| Cirilla 0.3B 4E Instruct | post-trained | BF16 | Hugging Face |
| Cirilla 0.3B 4E GRPO | GRPO post-trained | BF16 | Hugging Face |
| Cirilla 0.3B 4E GRPO ICL | GRPO-ICL post-trained | BF16 | Hugging Face |

Training Data

General Pretraining

The model was first pretrained on a curated mix of high-quality synthetic and instruction-following datasets. This foundation phase established grammar, coherence, and basic reasoning skills using subsets of TinyStories, TinyStoriesInstruct, SimpleStories, and GLUE (MNLI).

(133 MiB, 221K data points)

Mid-Training

Following the foundation phase, the model was adapted to the target domain using comprehensive summaries of the Witcher Fandom Wiki. 7,506 wiki pages were processed and summarized using open models, including Llama 3.1 8B, Llama 3.2 3B, Granite 3.1 8B, Granite 3.2 2B, Mistral Small 3 24B, Phi 4 14B, Qwen 2.5 7B, and Qwen 3 8B. This phase also incorporated the Reasoning Gym dataset to strengthen logical deduction alongside lore retention.

(21 MiB, 50.5K data points)
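
The summarization step above can be sketched as packaging each wiki page into a chat-style request for one of the open models listed. The prompt wording, message layout, and function name here are illustrative assumptions, not the actual pipeline:

```python
def build_summary_request(title, page_text, max_chars=4000):
    """Truncate a wiki page and wrap it in a chat-style summarization request."""
    body = page_text[:max_chars]  # keep the request within a small context budget
    return [
        {"role": "system",
         "content": "Summarize Witcher wiki articles faithfully and concisely."},
        {"role": "user",
         "content": f"Article: {title}\n\n{body}\n\nWrite a comprehensive summary."},
    ]

# One such request would be sent per page to a summarizer model
request = build_summary_request("Geralt of Rivia", "Geralt of Rivia was a witcher...")
```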

Domain-Specific Fine-Tuning

The final stage focused on activating the model's conversational abilities and aligning it with the ingested lore. This involved training on 185K synthetic question-answer pairs. These pairs were generated (with a subset of models used in the mid-training phase) by transforming the static lore summaries and extracted facts into dynamic, multi-turn dialogues to simulate natural interactions.

(83 MiB, 78K multi-turn conversations)
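
The transformation described above, from static question-answer pairs into multi-turn dialogues, can be sketched as follows. The record layout mirrors the chat format used in the usage examples below, but is an assumption rather than the exact Cirilla training format:

```python
def to_multi_turn(qa_pairs):
    """Flatten (question, answer) pairs into one chat-style conversation."""
    conversation = []
    for question, answer in qa_pairs:
        conversation.append({"role": "user", "content": question})
        conversation.append({"role": "assistant", "content": answer})
    return conversation

# Two lore-derived QA pairs become one four-turn training conversation
record = to_multi_turn([
    ("Who trained Ciri at Kaer Morhen?",
     "The witchers of the Wolf School, led by Vesemir."),
    ("Which school does Geralt belong to?",
     "The School of the Wolf."),
])
```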

Usage

Install the Cirilla package:

```shell
uv add Cirilla
```

CLI Usage

You can run the model directly from the command line:

```shell
uv run python -m cirilla.cli
```

Python Usage

```python
from cirilla.Cirilla_model import Cirilla, Args
from cirilla.Cirilla_model import CirillaTokenizer

hf_model_id = 'AnthonyPa57/Cirilla-0.3B-4E'

# You can materialize directly on CPU instead:
# args = Args()
# args.device = 'cpu'
# model = Cirilla(args)

model = Cirilla()

model.pull_model_from_hub(hf_model_id, inference_mode=True)  # , map_device='cpu'
tokenizer = CirillaTokenizer(hub_url=hf_model_id)

prompts = [
    "Which two kings did Dethmold serve in The Witcher 2: Assassins of Kings?",
    "How much does Geralt's inventory capacity increase with the Ofieri saddlebags?",
    "In which book does the story of Ciri entering a portal and becoming trapped in a different world first appear?"
]

# Stop generating at the end-of-sequence or next-user-turn token
termination_tokens = [tokenizer.convert_tokens_to_ids('<eos>'),
                      tokenizer.convert_tokens_to_ids('<|user|>')]

for p in prompts:
    # You can generate with the KV cache:
    # x = tokenizer.apply_chat_template([{"role": "user", "content": p}],
    #                                   padding='do_not_pad', add_generation_prompt=True)
    # out = model.generate_kv_cache([x], termination_tokens=termination_tokens)

    # or in eager mode:
    x = tokenizer.apply_chat_template([{"role": "user", "content": p}],
                                      return_tensors='pt', padding='do_not_pad',
                                      add_generation_prompt=True)
    out = model.generate_naive(x.to(model.args.device), top_k=3, n_beams=3,
                               termination_tokens=termination_tokens)
    print(tokenizer.decode(out[0]))

# Batched generation with the KV cache
batch_prompts = [[{"role": "user", "content": p}] for p in prompts]
x = tokenizer.apply_chat_template(batch_prompts, padding='do_not_pad', add_generation_prompt=True)
out = model.generate_kv_cache(x, termination_tokens=termination_tokens)
for o in out:
    print(tokenizer.decode(o).replace('<pad>', ''))

model.clear_cache()  # clears the KV cache

# Generate as a parallel search with the KV cache
batch_prompts = [[{"role": "user", "content": "Who is Geralt?"}] for _ in range(3)]
x = tokenizer.apply_chat_template(batch_prompts, padding='do_not_pad', add_generation_prompt=True)
out = model.generate_kv_cache(x, termination_tokens=termination_tokens,
                              beam_search=True, top_p=0.3)
print(tokenizer.decode(out).replace('<pad>', ''))
```
