---
language:
- en
license: apache-2.0
base_model: unsloth/Mistral-Small-3.2-24B-Instruct-2506
library_name: transformers
tags:
- roleplay
- creative-writing
- chat
- mistral3
- vllm
- transformers
- lora
- trl
- peft
---

# Umbra

Umbra is a **roleplay-first** chat model fine-tuned from **unsloth/Mistral-Small-3.2-24B-Instruct-2506**. It is optimized for **immersive narration**, strong **character voice**, and **scene momentum**.

> TL;DR: This is a creative RP model. If you want a general assistant, consider the base model instead.

## What’s in this repo

This repository contains a **merged checkpoint**: the LoRA weights were merged into the base model weights. It also includes the tokenizer snapshot and the configuration files used during training.

Key artifacts:

* Model weight shards (`model-00001-of-00010.safetensors` … `model-00010-of-00010.safetensors`)
* `model.safetensors.index.json`
* Tokenizer snapshot (`tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`)
* Generation config (`generation_config.json`)
* Model configuration (`config.json`)

The weights are provided in **safetensors** format and are compatible with **Transformers** and **vLLM**.

---

# Intended use

Umbra is designed for:

* Immersive roleplay
* Creative writing / character dialogue
* Narrative scene continuation

---

# Not recommended for

Umbra is **not intended** for:

* High-stakes domains (medical, legal, financial)
* Factual Q&A requiring citations or browsing
* Safety-critical use cases

---

# Content warning

Umbra is trained on roleplay-style conversational data and may produce **mature or intense themes** depending on prompts. Use appropriate moderation and filtering if deploying publicly.

---

# Prompting

Umbra follows a **Mistral-style instruction format** and works well with short system prompts. It can be served via **vLLM’s OpenAI-compatible API** or used directly with **Transformers**.
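To make the raw format concrete, here is a minimal sketch of the Mistral-style `[INST] … [/INST]` wrapping. The `build_prompt` helper is hypothetical and for illustration only; for actual serving, prefer the tokenizer's own chat template (`apply_chat_template`) so that special tokens match exactly.

```python
def build_prompt(system: str, user: str) -> str:
    """Approximate the Mistral-style instruction format.

    Hypothetical helper for illustration; production code should use the
    tokenizer's chat template so special tokens match exactly.
    """
    return f"[INST]{system}\n\n{user}[/INST]"


prompt = build_prompt(
    "You are Umbra. Stay in-character. Do not write the user's dialogue or actions.",
    "Write a vivid RP reply: ...",
)
print(prompt)
```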
### Roleplay system prompt (starter)

Use a short system prompt and put character/world constraints in the user message or in your UI’s lorebook system. Example:

**System**

“You are Umbra. Stay in-character. Do not write the user’s dialogue or actions. Keep responses vivid and scene-grounded.”

**User**

Provide scene description, character context, and formatting rules.

### Avoid common RP failure modes

**Repetition / copy-paste loops**

* reduce `temperature`
* reduce `max_tokens`
* add an explicit constraint such as: "Do not repeat phrases or paraphrase the previous paragraph."

**Writing for the user**

Add a hard constraint: "Never write my character’s dialogue or actions."

---

# Recommended generation settings

These are stable defaults for roleplay workloads:

* `temperature`: 0.65–0.9
* `top_p`: 0.85–0.95
* `repetition_penalty`: 1.03–1.10
* `max_tokens`: tuned to your UI’s desired reply length

If your stack supports `top_k`, keep it moderate (`top_k` ≈ 0–100). Very aggressive penalties can destabilize sampling.

---

# Context length

The underlying model family supports **long-context inference**, but practical limits depend on KV-cache memory and serving infrastructure.

Recommended starting range: **8k–16k tokens**. Increase context length gradually depending on GPU memory availability and KV-cache limits in your serving stack.

---

# Training details

## Base model

* **unsloth/Mistral-Small-3.2-24B-Instruct-2506**

The Unsloth variant provides optimized loading and training compatibility with the **Transformers / TRL / PEFT** stack.

## Fine-tuning method

Umbra was trained using **LoRA supervised fine-tuning (SFT)**, and the LoRA weights were **merged into the base model** for inference distribution.

Typical LoRA configuration:

```
r = 16
alpha = 32
dropout = 0.05
```

Target modules:

```
q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj
```

These modules correspond to the primary attention and MLP projection layers of the Mistral architecture.
---

# SFT training run (observed)

```
epochs: 6
max_seq_len: 4096
per_device_batch_size: 1
grad_accumulation: 4
total_steps: 13374
```

Approximate training tokens processed: **~166M tokens**.

Training was performed using the **Transformers + TRL + PEFT** stack.

---

# DPO (planned / optional)

A preference dataset has been prepared in **{prompt, chosen, rejected}** format for future **Direct Preference Optimization (DPO)** training.

Goals of the DPO stage:

* reduce repetition
* improve instruction adherence
* reduce user-character hijacking

Future releases may include DPO-refined checkpoints.

---

# Data

Umbra was trained on a mixture of:

1. **Roleplay SFT data** in multi-turn conversation format (character cards + scene turns)
2. **Instruction-style SFT data** mixed in at roughly **10–30% of tokens** to preserve instruction-following behavior
3. **Preference pairs** generated for DPO refinement

### Synthetic teacher generation

Preference pairs and instruct samples may be generated using a **teacher model** (for example via OpenRouter). Teacher models may run with internal reasoning enabled, but **only final responses are stored** in the dataset. No chain-of-thought traces are retained.

---

# Evaluation

This release is evaluated primarily through **qualitative roleplay testing**.

Evaluation criteria:

* character consistency
* scene grounding
* multi-turn narrative coherence
* adherence to out-of-character constraints

Known failure modes:

* repetition during very long generations
* occasional attempts to control the user character
* weaker formatting for strict multi-character dialogue unless explicitly prompted

These issues are typical targets for **DPO refinement**.
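The preference pairs mentioned above follow the standard `{prompt, chosen, rejected}` layout that TRL's `DPOTrainer` expects. A minimal sketch of writing one record to a JSONL file (the text values are invented placeholders, not actual training data):

```python
import json

# One preference record in the {prompt, chosen, rejected} layout.
# The field contents below are invented placeholders.
record = {
    "prompt": "Continue the scene: the door creaks open...",
    "chosen": "A reply that stays in-character and advances the scene.",
    "rejected": "A reply that repeats itself or writes the user's actions.",
}

with open("dpo_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```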
---

# Usage

## vLLM (recommended)

Serve locally:

```bash
vllm serve voidai-research/umbra \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name umbra
```

Example request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "umbra",
    "messages": [
      {"role": "system", "content": "You are Umbra. Stay in-character. Do not write the user’s dialogue or actions."},
      {"role": "user", "content": "Write a vivid RP response to this scene: ..."}
    ],
    "temperature": 0.8,
    "top_p": 0.92,
    "max_tokens": 500
  }'
```

---

## Transformers (Python)

> Depending on your Transformers version, `AutoModelForCausalLM` may not recognize the Mistral3 configuration. In that case, import the Mistral3 model class directly.

```python
import torch
from transformers import AutoTokenizer
from transformers.models.mistral3.modeling_mistral3 import Mistral3ForConditionalGeneration

model_id = "voidai-research/umbra"  # or a local path to the downloaded checkpoint

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "[INST]You are Umbra.\n\nWrite a vivid RP reply: ...[/INST]"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.8,
    top_p=0.92,
    do_sample=True,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

# License

Umbra is released under **Apache-2.0**, consistent with the base model license.

---

# Acknowledgements

* Base model: **unsloth/Mistral-Small-3.2-24B-Instruct-2506**
* Training stack: **Transformers / TRL / PEFT**
* Serving stack: **vLLM + mistral_common tokenizer stack**

---

# Citation

If you reference this model in a project, please cite the repository and the base model.

---

# API Access

Umbra can also be integrated through external API gateways.
One option is **VoidAI**, which provides a unified OpenAI-compatible API for accessing multiple AI model providers: https://voidai.app

Example:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-voidai-your_key_here",
    base_url="https://api.voidai.app/v1",
)

response = client.chat.completions.create(
    model="umbra",
    messages=[
        {"role": "user", "content": "Write a fantasy RP scene."}
    ],
)

print(response.choices[0].message.content)
```

Documentation: [https://docs.voidai.app](https://docs.voidai.app)