---
language:
  - en
license: apache-2.0
base_model: unsloth/Mistral-Small-3.2-24B-Instruct-2506
library_name: transformers
tags:
  - roleplay
  - creative-writing
  - chat
  - mistral3
  - vllm
  - transformers
  - lora
  - trl
  - peft
---

# Umbra

Umbra is a **roleplay-first** chat model fine-tuned from **unsloth/Mistral-Small-3.2-24B-Instruct-2506**. It is optimized for **immersive narration**, strong **character voice**, and **scene momentum**.

> TL;DR: This is a creative RP model. If you want a general assistant, consider the base model instead.

## What’s in this repo

This repository contains a **merged checkpoint** where LoRA weights were merged into the base model weights. The repository also includes the tokenizer snapshot and configuration files used during training.

Key artifacts included:

* Model weight shards (`model-00001-of-00010.safetensors` … `model-00010-of-00010.safetensors`)
* `model.safetensors.index.json`
* Tokenizer snapshot (`tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`)
* Generation config (`generation_config.json`)
* Model configuration snapshot (`config.json`)

The weights are provided in **safetensors format** and are compatible with **Transformers and vLLM**.

---

# Intended use

Umbra is designed for:

* Immersive roleplay
* Creative writing / character dialogue
* Narrative scene continuation

---

# Not recommended for

Umbra is **not intended** for:

* High‑stakes domains (medical, legal, financial)
* Factual Q&A requiring citations or browsing
* Safety‑critical use cases

---

# Content warning

Umbra is trained on roleplay‑style conversational data and may produce **mature or intense themes** depending on prompts. Use appropriate moderation and filtering if deploying publicly.

---

# Prompting

Umbra follows a **Mistral‑style instruction format** and works well with short system prompts. It can be served via **vLLM’s OpenAI‑compatible API** or used directly with **Transformers**.

### Roleplay system prompt (starter)

Use a short system prompt and put character/world constraints in the user message or in your UI’s lorebook system.

Example:

**System**

“You are Umbra. Stay in‑character. Do not write the user’s dialogue or actions. Keep responses vivid and scene‑grounded.”

**User**

Provide scene description, character context, and formatting rules.
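The starter prompt above can be assembled into an OpenAI-style `messages` list. The following is a minimal sketch, not an official helper: the function name `build_messages`, its parameters, and the section labels inside the user message are all hypothetical and chosen only for illustration.

```python
# Hypothetical helper: wraps the starter system prompt and the scene
# details into the message list expected by chat-completions endpoints.
SYSTEM_PROMPT = (
    "You are Umbra. Stay in-character. "
    "Do not write the user's dialogue or actions. "
    "Keep responses vivid and scene-grounded."
)


def build_messages(scene, character_context, formatting_rules=()):
    """Return a [system, user] message list for one roleplay turn."""
    parts = [
        f"Character context:\n{character_context}",
        f"Scene:\n{scene}",
    ]
    if formatting_rules:
        rules = "\n".join(f"- {rule}" for rule in formatting_rules)
        parts.append(f"Formatting rules:\n{rules}")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "\n\n".join(parts)},
    ]


messages = build_messages(
    scene="Rain hammers the tavern roof as a stranger enters.",
    character_context="Mira, a wary innkeeper with a hidden past.",
    formatting_rules=["Two paragraphs.", "Present tense."],
)
```

Keeping the system prompt fixed and moving everything scene-specific into the user message follows the advice above to keep the system prompt short.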

### Avoid common RP failure modes

**Repetition / copy‑paste loops**

* reduce `temperature`
* reduce `max_tokens`
* add an explicit constraint such as:

"Do not repeat phrases or paraphrase the previous paragraph."

**Writing for the user**

Add a hard constraint:

"Never write my character’s dialogue or actions."

---

# Recommended generation settings

These are stable defaults for roleplay workloads:

* `temperature`: 0.65–0.9
* `top_p`: 0.85–0.95
* `repetition_penalty`: 1.03–1.10
* `max_tokens`: tuned to your UI’s desired reply length

If your stack supports **top_k**, keep it moderate (`top_k` up to about 100; in many stacks `top_k` = 0 disables the filter). Very aggressive repetition penalties can destabilize sampling.
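As a rough sketch, the settings above can be bundled into one request payload. `temperature`, `top_p`, and `max_tokens` are standard chat-completions fields, while `repetition_penalty` and `top_k` are server-specific extensions that the `openai` Python client forwards through `extra_body`. The names `RP_DEFAULTS` and `build_request` are hypothetical, and the values are simply points picked from inside the recommended ranges.

```python
# Values picked from inside the recommended ranges (an assumption, not
# tuned results). max_tokens should follow your UI's reply length.
RP_DEFAULTS = {
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.05,
    "top_k": 50,
    "max_tokens": 500,
}

# Fields defined by the OpenAI chat-completions schema; everything else
# is treated as a server-specific extension.
OPENAI_FIELDS = {"temperature", "top_p", "max_tokens"}


def build_request(model, messages, **overrides):
    """Split sampling settings into standard fields and `extra_body`."""
    settings = {**RP_DEFAULTS, **overrides}
    standard = {k: v for k, v in settings.items() if k in OPENAI_FIELDS}
    extra = {k: v for k, v in settings.items() if k not in OPENAI_FIELDS}
    return {"model": model, "messages": messages, **standard, "extra_body": extra}


request = build_request(
    "umbra",
    [{"role": "user", "content": "Continue the scene."}],
    temperature=0.7,  # lower the temperature if replies start to loop
)
# With the `openai` client: client.chat.completions.create(**request)
```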

---

# Context length

The underlying model family supports **long‑context inference**, but practical limits depend on KV‑cache memory and serving infrastructure.

Recommended starting range: **8k–16k tokens**.

Increase context length gradually depending on GPU memory availability and KV‑cache limits in your serving stack.
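To make the KV-cache constraint concrete, here is a back-of-the-envelope estimate. The layer count, KV-head count, and head dimension below are assumptions about the text decoder, not values stated in this card; check them against `config.json` before relying on the numbers. The function name `kv_cache_bytes` is hypothetical.

```python
# Assumed text-decoder dimensions; verify against this repo's config.json.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # bfloat16


def kv_cache_bytes(context_tokens, concurrent_sequences=1):
    """Bytes needed to cache keys and values for the given context length."""
    # Factor 2 accounts for storing both a key and a value per position.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens * concurrent_sequences


for tokens in (8192, 16384):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens} tokens: {gib:.2f} GiB per sequence")
```

Under these assumed dimensions the cache grows linearly with both context length and the number of concurrent sequences, which is why raising `--max-model-len` on a busy server can exhaust memory long before the model's nominal context limit is reached.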

---

# Training details

## Base model

* **unsloth/Mistral-Small-3.2-24B-Instruct-2506**

The Unsloth variant provides optimized loading and training compatibility with the **Transformers / TRL / PEFT** stack.

## Fine‑tuning method

Umbra was trained using **LoRA supervised fine‑tuning (SFT)**, and the LoRA weights were then **merged into the base model** so the checkpoint can be loaded for inference without PEFT.

Typical LoRA configuration:

```
r = 16
alpha = 32
dropout = 0.05
```

Target modules:

```
q_proj
k_proj
v_proj
o_proj
gate_proj
up_proj
down_proj
```

These modules correspond to the primary attention and MLP projection layers of the Mistral architecture.
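The configuration above maps onto PEFT's argument names (`r`, `lora_alpha`, `lora_dropout`, `target_modules`). The sketch below also estimates how many adapter parameters such a setup would add. The projection shapes and layer count are assumptions about the text decoder rather than values from this card, and `lora_param_count` is a hypothetical helper.

```python
# Hyperparameters from the listing above, using PEFT's argument names.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Assumed text-decoder shapes as (in_features, out_features);
# verify against config.json before relying on the totals.
HIDDEN, INTERMEDIATE, Q_DIM, KV_DIM, NUM_LAYERS = 5120, 32768, 4096, 1024, 40
SHAPES = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, target_modules, **_unused):
    """Each adapted module adds A (r x in) and B (out x r)."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * NUM_LAYERS


scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]  # update scaled by alpha / r
trainable = lora_param_count(**lora_kwargs)
print(f"scaling={scaling}, adapter parameters={trainable:,}")
```

With PEFT installed, the same dictionary can be passed as `LoraConfig(**lora_kwargs, task_type="CAUSAL_LM")`. Name-based matching may also hit identically named modules outside the text decoder; the count above ignores those.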

---

# SFT training run (observed)

```
epochs: 6
max_seq_len: 4096
per_device_batch_size: 1
grad_accumulation: 4
total_steps: 13374
```

Approximate training tokens processed:

```
~166M tokens
```

Training was performed using the **Transformers + TRL + PEFT** stack.
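The run figures above can be cross-checked with simple arithmetic. This sketch assumes a single training device, since the device count is not stated; with more devices the sequence count and token ceiling scale accordingly.

```python
EPOCHS = 6
MAX_SEQ_LEN = 4096
PER_DEVICE_BATCH_SIZE = 1
GRAD_ACCUMULATION = 4
TOTAL_STEPS = 13374
NUM_DEVICES = 1  # assumption: the device count is not stated in this card
REPORTED_TOKENS = 166_000_000  # the "~166M tokens" figure

sequences_per_step = PER_DEVICE_BATCH_SIZE * GRAD_ACCUMULATION * NUM_DEVICES
sequences_seen = TOTAL_STEPS * sequences_per_step
token_ceiling = sequences_seen * MAX_SEQ_LEN  # if every sequence hit max length
steps_per_epoch = TOTAL_STEPS // EPOCHS
mean_tokens_per_sequence = REPORTED_TOKENS / sequences_seen

print(f"sequences seen:        {sequences_seen:,}")
print(f"token ceiling:         {token_ceiling:,}")
print(f"steps per epoch:       {steps_per_epoch:,}")
print(f"mean tokens/sequence:  {mean_tokens_per_sequence:,.0f}")
```

Under the single-device assumption, the reported token count sits below the ceiling implied by `max_seq_len`, which is what one would expect when most conversations are shorter than 4096 tokens.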

---

# DPO (planned / optional)

A preference dataset has been prepared in **{prompt, chosen, rejected}** format for future **Direct Preference Optimization (DPO)** training.

Goals of the DPO stage:

* reduce repetition
* improve instruction adherence
* reduce user‑character hijacking

Future releases may include DPO‑refined checkpoints.
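For readers unfamiliar with the **{prompt, chosen, rejected}** layout, the sketch below shows one record and a basic sanity check. The example text is invented for illustration and is not taken from the actual dataset; `validate_pair` is a hypothetical helper. The rejected reply deliberately speaks for the user's character, which is the hijacking behavior the DPO stage aims to reduce.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

# Invented example record, not taken from the actual dataset.
example_pair = {
    "prompt": (
        "You are Mira, an innkeeper. The user plays Kael.\n\n"
        "Kael: *sets a coin on the bar* Any rooms left?"
    ),
    "chosen": 'Mira slides the coin back with one finger. "One. Top of the stairs."',
    "rejected": 'Mira nods. Kael smiles and says, "I\'ll take it," then walks upstairs.',
}


def validate_pair(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


print(validate_pair(example_pair))
```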

---

# Data

Umbra's data falls into three groups:

1. **Roleplay SFT data** in multi‑turn conversation format (character cards + scene turns)
2. **Instruction‑style SFT data** mixed in at roughly **10–30% of tokens** to preserve instruction‑following behavior
3. **Preference pairs** prepared for the planned DPO stage (not used to train this SFT release)

### Synthetic teacher generation

Preference pairs and instruct samples may be generated using a **teacher model** (for example via OpenRouter).

Teacher models may run with internal reasoning enabled, but **only final responses are stored** in the dataset. No chain‑of‑thought traces are retained.

---

# Evaluation

This release has been evaluated primarily through **qualitative roleplay testing**; no benchmark scores are reported.

Evaluation criteria:

* character consistency
* scene grounding
* multi‑turn narrative coherence
* adherence to out‑of‑character constraints

Known failure modes:

* repetition during very long generations
* occasional attempts to control the user character
* weaker formatting for strict multi‑character dialogue unless explicitly prompted

These issues are typical targets for **DPO refinement**.

---

# Usage

## vLLM (recommended)

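Serve locally. The three `mistral` format flags assume the repository also ships Mistral‑format files (`params.json`, `tekken.json`, consolidated weights); if only the Hugging Face‑format files listed above are present, omit them: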

```bash
vllm serve voidai-research/umbra \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8000 \
  --served-model-name umbra
```

Example request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "umbra",
    "messages": [
      {"role": "system", "content": "You are Umbra. Stay in-character. Do not write the user’s dialogue or actions."},
      {"role": "user", "content": "Write a vivid RP response to this scene: ..."}
    ],
    "temperature": 0.8,
    "top_p": 0.92,
    "max_tokens": 500
  }'
```

---

## Transformers (Python)

> Depending on your Transformers version, `AutoModelForCausalLM` may not recognize the Mistral3 configuration. In that case, import the Mistral3 model class directly.

```python
import torch
from transformers import AutoTokenizer
from transformers.models.mistral3.modeling_mistral3 import Mistral3ForConditionalGeneration

model_id = "voidai-research/umbra"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The prompt already starts with the BOS token (<s>), so tokenize with
# add_special_tokens=False to avoid a second BOS being prepended.
prompt = "<s>[INST]You are Umbra.\n\nWrite a vivid RP reply: ...[/INST]"
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.8,
    top_p=0.92,
    do_sample=True,
)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))
```

---

# License

Umbra is released under **Apache‑2.0**, consistent with the base model license.

---

# Acknowledgements

* Base model: **unsloth/Mistral-Small-3.2-24B-Instruct-2506**
* Training stack: **Transformers / TRL / PEFT**
* Serving stack: **vLLM + mistral_common tokenizer stack**

---

# Citation

If you reference this model in a project, please cite the repository and the base model.

---

# API Access

Umbra can also be integrated through external API gateways.

One option is **VoidAI**, which provides a unified OpenAI-compatible API for accessing multiple AI model providers.

https://voidai.app

Example:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-voidai-your_key_here",
    base_url="https://api.voidai.app/v1"
)

response = client.chat.completions.create(
    model="umbra",
    messages=[
        {"role": "user", "content": "Write a fantasy RP scene."}
    ]
)

print(response.choices[0].message.content)
```

Documentation:
[https://docs.voidai.app](https://docs.voidai.app)