| --- |
| language: |
| - en |
| license: apache-2.0 |
| base_model: unsloth/Mistral-Small-3.2-24B-Instruct-2506 |
| library_name: transformers |
| tags: |
| - roleplay |
| - creative-writing |
| - chat |
| - mistral3 |
| - vllm |
| - transformers |
| - lora |
| - trl |
| - peft |
| --- |
| |
| # Umbra |
|
|
| Umbra is a **roleplay-first** chat model fine-tuned from **unsloth/Mistral-Small-3.2-24B-Instruct-2506**. It is optimized for **immersive narration**, strong **character voice**, and **scene momentum**. |
|
|
| > TL;DR: This is a creative RP model. If you want a general assistant, consider the base model instead. |
|
|
| ## What’s in this repo |
|
|
| This repository contains a **merged checkpoint** where LoRA weights were merged into the base model weights. The repository also includes the tokenizer snapshot and configuration files used during training. |
|
|
| Key artifacts included: |
|
|
| * Model weight shards (`model-00001-of-00010.safetensors` … `model-00010-of-00010.safetensors`) |
| * `model.safetensors.index.json` |
| * Tokenizer snapshot (`tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`) |
| * Generation config (`generation_config.json`) |
| * Model configuration snapshot (`config.json`) |
|
|
| The weights are provided in **safetensors format** and are compatible with **Transformers and vLLM**. |
|
|
| --- |
|
|
| # Intended use |
|
|
| Umbra is designed for: |
|
|
| * Immersive roleplay |
| * Creative writing / character dialogue |
| * Narrative scene continuation |
|
|
| --- |
|
|
| # Not recommended for |
|
|
| Umbra is **not intended** for: |
|
|
| * High‑stakes domains (medical, legal, financial) |
| * Factual Q&A requiring citations or browsing |
| * Safety‑critical use cases |
|
|
| --- |
|
|
| # Content warning |
|
|
| Umbra is trained on roleplay‑style conversational data and may produce **mature or intense themes** depending on prompts. Use appropriate moderation and filtering if deploying publicly. |
|
|
| --- |
|
|
| # Prompting |
|
|
| Umbra follows a **Mistral‑style instruction format** and works well with short system prompts. It can be served via **vLLM’s OpenAI‑compatible API** or used directly with **Transformers**. |
|
|
| ### Roleplay system prompt (starter) |
|
|
| Use a short system prompt and put character/world constraints in the user message or in your UI’s lorebook system. |
|
|
| Example: |
|
|
| **System** |
|
|
| “You are Umbra. Stay in‑character. Do not write the user’s dialogue or actions. Keep responses vivid and scene‑grounded.” |
|
|
| **User** |
|
|
| Provide scene description, character context, and formatting rules. |
|
|
| ### Avoid common RP failure modes |
|
|
| **Repetition / copy‑paste loops** |
|
|
| * reduce `temperature` |
| * reduce `max_tokens` |
| * add an explicit constraint such as: |
|
|
| "Do not repeat phrases or paraphrase the previous paragraph." |
|
|
| **Writing for the user** |
|
|
| Add a hard constraint: |
|
|
| "Never write my character’s dialogue or actions." |
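The starter system prompt and the two hard constraints above can be assembled into a chat `messages` list before each request. The following is a minimal sketch only: the helper name `build_messages`, its parameters, and the sample scene text are hypothetical and not part of any Umbra tooling.

```python
# Hypothetical helper: combine a short system prompt with hard constraints
# into an OpenAI-style messages list. All names here are illustrative.
SYSTEM_PROMPT = (
    "You are Umbra. Stay in-character. "
    "Keep responses vivid and scene-grounded."
)

HARD_CONSTRAINTS = (
    "Never write my character's dialogue or actions.",
    "Do not repeat phrases or paraphrase the previous paragraph.",
)


def build_messages(scene, constraints=HARD_CONSTRAINTS, system=SYSTEM_PROMPT):
    """Return a two-message chat: short system prompt, then scene plus rules."""
    rules = "\n".join(f"- {c}" for c in constraints)
    user_content = f"{scene.strip()}\n\nRules:\n{rules}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("The tavern door creaks open as rain lashes the street.")
for message in messages:
    print(f"[{message['role']}] {message['content']}")
```

Keeping the constraints in the user turn, rather than growing the system prompt, follows the short-system-prompt advice in this section.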
|
|
| --- |
|
|
| # Recommended generation settings |
|
|
| These are stable defaults for roleplay workloads: |
|
|
| * `temperature`: 0.65–0.9 |
| * `top_p`: 0.85–0.95 |
| * `repetition_penalty`: 1.03–1.10 |
| * `max_tokens`: tuned to your UI’s desired reply length |
|
|
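| If your stack supports **top_k**, keep it moderate (for example, `top_k` up to about 100). Some stacks, such as Transformers, treat `top_k = 0` as "disabled". Overly aggressive repetition penalties can destabilize sampling. |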
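One way to keep client requests inside the ranges above is to clamp user-supplied values before sending them. This is a minimal sketch: the `clamp_sampling` helper and the choice of the range midpoint as a default are assumptions made for illustration, not settings published with the model.

```python
# Ranges copied from the recommendations above; the helper is hypothetical.
RECOMMENDED = {
    "temperature": (0.65, 0.9),
    "top_p": (0.85, 0.95),
    "repetition_penalty": (1.03, 1.10),
}


def clamp_sampling(**requested):
    """Clamp each requested value into its recommended range."""
    params = {}
    for name, (low, high) in RECOMMENDED.items():
        # Assumption: fall back to the midpoint when a value is not supplied.
        value = requested.get(name, (low + high) / 2)
        params[name] = min(max(value, low), high)
    return params


params = clamp_sampling(temperature=1.4, top_p=0.92)
print(params)
```

When calling an OpenAI-compatible server, `repetition_penalty` and `top_k` are not part of the standard OpenAI request schema, so clients usually need to pass them as extra request fields. Check your serving stack's documentation for the exact mechanism.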
| |
| --- |
| |
| # Context length |
| |
| The underlying model family supports **long‑context inference**, but practical limits depend on KV‑cache memory and serving infrastructure. |
| |
| Recommended starting range: **8k–16k tokens** |
|
|
| Increase context length gradually depending on GPU memory availability and KV‑cache limits in your serving stack. |
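To get a rough sense of how KV-cache memory grows with context length, the per-sequence cache size can be estimated from the model's architecture. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed values that do not come from this card, so check them against `config.json` before relying on the numbers.

```python
# Assumed architecture values (NOT stated in this card); verify in config.json.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bfloat16 cache entries


def kv_cache_bytes(context_tokens, batch_size=1):
    """Bytes needed to store keys and values for every layer and token."""
    # Factor 2 accounts for storing both a key and a value tensor.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens * batch_size


for ctx in (8192, 16384):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx} tokens: {gib:.2f} GiB per sequence")
```

The estimate covers only the cache for one sequence. Weights, activations, and any concurrent sequences in the serving stack need their own memory on top of it.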
|
|
| --- |
|
|
| # Training details |
|
|
| ## Base model |
|
|
| * **unsloth/Mistral-Small-3.2-24B-Instruct-2506** |
|
|
| The Unsloth variant provides optimized loading and training compatibility with the **Transformers / TRL / PEFT** stack. |
|
|
| ## Fine‑tuning method |
|
|
| Umbra was trained using **LoRA supervised fine‑tuning (SFT)**, and the LoRA weights were then **merged into the base model** for distribution. |
|
|
| Typical LoRA configuration: |
|
|
| ``` |
| r = 16 |
| alpha = 32 |
| dropout = 0.05 |
| ``` |
|
|
| Target modules: |
|
|
| ``` |
| q_proj |
| k_proj |
| v_proj |
| o_proj |
| gate_proj |
| up_proj |
| down_proj |
| ``` |
|
|
| These modules correspond to the primary attention and MLP projection layers of the Mistral architecture. |
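For a sense of scale, the number of trainable LoRA parameters implied by `r = 16` over these seven projections can be estimated from the layer shapes. This is a back-of-the-envelope sketch: the hidden size, intermediate size, head counts, and layer count are assumptions that do not come from this card (verify them in `config.json`), and it assumes adapters on the text decoder only.

```python
# Assumed text-decoder dimensions (NOT from this card); verify in config.json.
HIDDEN = 5120
INTERMEDIATE = 32768
NUM_LAYERS = 40
Q_OUT = 32 * 128   # attention heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim

R, ALPHA = 16, 32  # values from the LoRA configuration above

# (in_features, out_features) for each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(r=R):
    """Each adapted linear layer adds A (r x in) and B (out x r)."""
    return sum(r * (n_in + n_out) for n_in, n_out in SHAPES.values())


total = lora_params_per_layer() * NUM_LAYERS
print(f"scaling (alpha / r): {ALPHA / R}")
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters add roughly 92M trainable parameters, a small fraction of the 24B base model. The `alpha / r` scaling factor shown applies to standard LoRA scaling.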
|
|
| --- |
|
|
| # SFT training run (observed) |
|
|
| ``` |
| epochs: 6 |
| max_seq_len: 4096 |
| per_device_batch_size: 1 |
| grad_accumulation: 4 |
| total_steps: 13374 |
| ``` |
|
|
| Approximate training tokens processed: |
|
|
| ``` |
| ~166M tokens |
| ``` |
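As a sanity check, the reported token count can be compared with the ceiling implied by the run settings above. The sketch assumes a single training device, which the card does not state. With more devices the ceiling scales up proportionally.

```python
# Values from the SFT run summary above.
TOTAL_STEPS = 13374
MAX_SEQ_LEN = 4096
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
REPORTED_TOKENS = 166e6

NUM_DEVICES = 1  # assumption: the device count is not given in this card

# Sequences consumed per optimizer step, then the token ceiling if every
# sequence filled the full 4096-token window.
sequences_per_step = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
max_tokens = TOTAL_STEPS * sequences_per_step * MAX_SEQ_LEN
mean_fill = REPORTED_TOKENS / max_tokens

print(f"token ceiling: {max_tokens:,}")
print(f"mean window fill: {mean_fill:.0%}")
```

Under the single-device assumption the ceiling is about 219M tokens, so the reported ~166M tokens corresponds to sequences filling roughly three quarters of the 4096-token window on average.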
|
|
| Training was performed using the **Transformers + TRL + PEFT** stack. |
|
|
| --- |
|
|
| # DPO (planned / optional) |
|
|
| A preference dataset has been prepared in **{prompt, chosen, rejected}** format for future **Direct Preference Optimization (DPO)** training. |
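A single record in the **{prompt, chosen, rejected}** format might look like the sketch below, stored as one JSON object per line. The example text and the `is_valid_pair` helper are invented for illustration and are not taken from the actual dataset or tooling.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

# Invented example record; not drawn from the real preference dataset.
record = {
    "prompt": "Continue the scene as the innkeeper. Do not speak for my character.",
    "chosen": "The innkeeper wipes down the bar and nods toward the corner table.",
    "rejected": "You walk to the corner table and say, 'I'll take the room.'",
}


def is_valid_pair(rec):
    """Check all three fields are non-empty strings and the pair differs."""
    for key in REQUIRED_KEYS:
        value = rec.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    # A pair with identical responses carries no preference signal.
    return rec["chosen"] != rec["rejected"]


line = json.dumps(record, ensure_ascii=False)  # one JSONL line
print(line)
print(is_valid_pair(json.loads(line)))
```

In this invented pair, the rejected response takes control of the user's character, which is one of the behaviors the DPO stage is meant to reduce.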
|
|
| Goals of the DPO stage: |
|
|
| * reduce repetition |
| * improve instruction adherence |
| * reduce user‑character hijacking |
|
|
| Future releases may include DPO‑refined checkpoints. |
|
|
| --- |
|
|
| # Data |
|
|
| Umbra was trained on a mixture of: |
|
|
| 1. **Roleplay SFT data** in multi‑turn conversation format (character cards + scene turns) |
| 2. **Instruction‑style SFT data** mixed in at roughly **10–30% of tokens** to preserve instruction‑following behavior |
| 3. **Preference pairs** prepared for the planned DPO refinement stage (intended for future training rather than the SFT run described here) |
|
|
| ### Synthetic teacher generation |
|
|
| Preference pairs and instruct samples may be generated using a **teacher model** (for example via OpenRouter). |
|
|
| Teacher models may run with internal reasoning enabled, but **only final responses are stored** in the dataset. No chain‑of‑thought traces are retained. |
|
|
| --- |
|
|
| # Evaluation |
|
|
| This release is evaluated primarily through **qualitative roleplay testing**. |
|
|
| Evaluation criteria: |
|
|
| * character consistency |
| * scene grounding |
| * multi‑turn narrative coherence |
| * adherence to out‑of‑character constraints |
|
|
| Known failure modes: |
|
|
| * repetition during very long generations |
| * occasional attempts to control the user character |
| * weaker formatting for strict multi‑character dialogue unless explicitly prompted |
|
|
| These issues are typical targets for **DPO refinement**. |
|
|
| --- |
|
|
| # Usage |
|
|
| ## vLLM (recommended) |
|
|
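| Serve locally. The three `mistral` flags below make vLLM look for Mistral-format files (`params.json`, consolidated safetensors weights, and a `tekken.json` tokenizer). If your copy of the repository contains only the Hugging Face-format files listed above, omit those flags and let vLLM use its defaults: |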
|
|
| ```bash |
| vllm serve voidai-research/umbra \ |
| --tokenizer_mode mistral \ |
| --config_format mistral \ |
| --load_format mistral \ |
| --dtype bfloat16 \ |
| --max-model-len 8192 \ |
| --host 0.0.0.0 --port 8000 \ |
| --served-model-name umbra |
| ``` |
|
|
| Example request: |
|
|
| ```bash |
| curl http://localhost:8000/v1/chat/completions \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "model": "umbra", |
| "messages": [ |
| {"role": "system", "content": "You are Umbra. Stay in-character. Do not write the user’s dialogue or actions."}, |
| {"role": "user", "content": "Write a vivid RP response to this scene: ..."} |
| ], |
| "temperature": 0.8, |
| "top_p": 0.92, |
| "max_tokens": 500 |
| }' |
| ``` |
|
|
| --- |
|
|
| ## Transformers (Python) |
|
|
| > Depending on your Transformers version, `AutoModelForCausalLM` may not recognize the Mistral3 configuration. In that case, import the Mistral3 model class directly. |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer |
| from transformers.models.mistral3.modeling_mistral3 import Mistral3ForConditionalGeneration |
| |
| # Replace the placeholder with the actual repository ID of the checkpoint. |
| model_id = "<YOUR_HF_USERNAME>/umbra" |
| |
| tok = AutoTokenizer.from_pretrained(model_id, use_fast=True) |
| model = Mistral3ForConditionalGeneration.from_pretrained( |
|     model_id, |
|     torch_dtype=torch.bfloat16, |
|     device_map="auto", |
| ) |
| |
| # The prompt already starts with the BOS token (<s>), so automatic special |
| # tokens are disabled to avoid a duplicated BOS. |
| prompt = "<s>[INST]You are Umbra.\n\nWrite a vivid RP reply: ...[/INST]" |
| inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device) |
| |
| out = model.generate( |
|     **inputs, |
|     max_new_tokens=400, |
|     temperature=0.8, |
|     top_p=0.92, |
|     do_sample=True, |
| ) |
| |
| # Decode only the newly generated tokens, not the echoed prompt. |
| new_tokens = out[0][inputs["input_ids"].shape[1]:] |
| print(tok.decode(new_tokens, skip_special_tokens=True)) |
| ``` |
|
|
| --- |
|
|
| # License |
|
|
| Umbra is released under **Apache‑2.0**, consistent with the base model license. |
|
|
| --- |
|
|
| # Acknowledgements |
|
|
| * Base model: **unsloth/Mistral-Small-3.2-24B-Instruct-2506** |
| * Training stack: **Transformers / TRL / PEFT** |
| * Serving stack: **vLLM + mistral_common tokenizer stack** |
| |
| --- |
| |
| # Citation |
| |
| If you reference this model in a project, please cite the repository and the base model. |
| |
| --- |
| |
| # API Access |
| |
| Umbra can also be integrated through external API gateways. |
| |
| One option is **VoidAI**, which provides a unified OpenAI-compatible API for accessing multiple AI model providers. |
| |
| https://voidai.app |
| |
| Example: |
| |
| ```python |
| from openai import OpenAI |
| |
| client = OpenAI( |
| api_key="sk-voidai-your_key_here", |
| base_url="https://api.voidai.app/v1" |
| ) |
| |
| response = client.chat.completions.create( |
| model="umbra", |
| messages=[ |
| {"role": "user", "content": "Write a fantasy RP scene."} |
| ] |
| ) |
| |
| print(response.choices[0].message.content) |
| ``` |
| |
| Documentation: |
| [https://docs.voidai.app](https://docs.voidai.app) |