	---
	language:
	- en
	license: apache-2.0
	base_model: unsloth/Mistral-Small-3.2-24B-Instruct-2506
	library_name: transformers
	tags:
	- roleplay
	- creative-writing
	- chat
	- mistral3
	- vllm
	- transformers
	- lora
	- trl
	- peft
	---

	# Umbra

	Umbra is a roleplay-first chat model fine-tuned from unsloth/Mistral-Small-3.2-24B-Instruct-2506. It is optimized for immersive narration, strong character voice, and scene momentum.

	> TL;DR: This is a creative RP model. If you want a general assistant, consider the base model instead.

	## What’s in this repo

	This repository contains a merged checkpoint: the LoRA weights have already been folded into the base model weights, so no separate adapter needs to be loaded. The repository also includes the tokenizer snapshot and configuration files used during training.

	Key artifacts included:

	* Model weight shards (`model-00001-of-00010.safetensors` … `model-00010-of-00010.safetensors`)
	* `model.safetensors.index.json`
	* Tokenizer snapshot (`tokenizer.json`, `special_tokens_map.json`, `tokenizer_config.json`)
	* Generation config (`generation_config.json`)
	* Model configuration snapshot (`config.json`)

	The weights are provided in safetensors format and are compatible with Transformers and vLLM.

	---

	# Intended use

	Umbra is designed for:

	* Immersive roleplay
	* Creative writing / character dialogue
	* Narrative scene continuation

	---

	# Not recommended for

	Umbra is not intended for:

	* High‑stakes domains (medical, legal, financial)
	* Factual Q&A requiring citations or browsing
	* Safety‑critical use cases

	---

	# Content warning

	Umbra is trained on roleplay‑style conversational data and may produce mature or intense themes depending on prompts. Use appropriate moderation and filtering if deploying publicly.

	---

	# Prompting

	Umbra follows a Mistral‑style instruction format and works well with short system prompts. It can be served via vLLM’s OpenAI‑compatible API or used directly with Transformers.

	### Roleplay system prompt (starter)

	Use a short system prompt and put character/world constraints in the user message or in your UI’s lorebook system.

	Example:

	System

	“You are Umbra. Stay in‑character. Do not write the user’s dialogue or actions. Keep responses vivid and scene‑grounded.”

	User

	Provide scene description, character context, and formatting rules.
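
As a minimal sketch, the system/user split above maps onto an OpenAI-style `messages` list as follows. The helper name `build_messages`, its parameters, and the sample scene text are illustrative assumptions, not part of this repository.

```python
# Hypothetical helper: assembles a chat payload in the layout described above.
SYSTEM_PROMPT = (
    "You are Umbra. Stay in-character. "
    "Do not write the user's dialogue or actions. "
    "Keep responses vivid and scene-grounded."
)


def build_messages(scene, character_context, formatting_rules):
    """Return an OpenAI-style messages list with a short system prompt."""
    # Scene, character, and formatting constraints all go in the user turn.
    user_content = "\n\n".join([
        f"Scene: {scene}",
        f"Character context: {character_context}",
        f"Formatting rules: {formatting_rules}",
    ])
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Sample values are made up for illustration.
messages = build_messages(
    scene="A rain-soaked harbor at midnight.",
    character_context="Mara, a wary smuggler who distrusts strangers.",
    formatting_rules="Third person, past tense, two paragraphs at most.",
)
```

The resulting list can be passed as the `messages` field of a chat completion request.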

	### Avoid common RP failure modes

	Repetition / copy‑paste loops

	* reduce `temperature`
	* reduce `max_tokens`
	* add an explicit constraint such as:

	"Do not repeat phrases or paraphrase the previous paragraph."

	Writing for the user

	Add a hard constraint:

	"Never write my character’s dialogue or actions."

	---

	# Recommended generation settings

	These are stable defaults for roleplay workloads:

	* `temperature`: 0.65–0.9
	* `top_p`: 0.85–0.95
	* `repetition_penalty`: 1.03–1.10
	* `max_tokens`: tuned to your UI’s desired reply length

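	If your stack supports `top_k`, keep it moderate (up to about 100; in most stacks a value of 0 disables top‑k filtering). Very aggressive repetition penalties, well above the range listed here, can destabilize sampling.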
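
One way to keep a frontend inside these ranges is to clamp user-supplied values before sending a request. The sketch below is only an illustration: the function name `clamp_sampling` and the choice to default missing values to the midpoint of each range are assumptions, not behavior of Umbra or of any serving stack.

```python
# Recommended ranges copied from the list above.
RANGES = {
    "temperature": (0.65, 0.9),
    "top_p": (0.85, 0.95),
    "repetition_penalty": (1.03, 1.10),
}


def clamp_sampling(**requested):
    """Clamp each requested value into its recommended range.

    Settings that are not supplied default to the midpoint of the range
    (an arbitrary choice made for this sketch).
    """
    settings = {}
    for name, (low, high) in RANGES.items():
        midpoint = (low + high) / 2
        value = requested.get(name, midpoint)
        settings[name] = min(max(value, low), high)
    return settings


# A temperature of 1.4 is pulled down to 0.9; top_p is already in range.
settings = clamp_sampling(temperature=1.4, top_p=0.92)
print(settings)
```

Note that `repetition_penalty` is not part of the standard OpenAI request schema, so check that your server accepts it as an extra field before relying on it.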

	---

	# Context length

	The underlying model family supports long‑context inference, but practical limits depend on KV‑cache memory and serving infrastructure.

	Recommended starting range: 8k–16k tokens.

	Increase context length gradually depending on GPU memory availability and KV‑cache limits in your serving stack.
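
To get a rough sense of how KV‑cache memory grows with context length, the following sketch estimates the cache size for a single sequence. The default architecture values (40 layers, 8 KV heads, head dimension 128, 2 bytes per value for bf16) are assumptions about the base model and should be checked against the values in `config.json`; paged or quantized KV caches in your serving stack will change the numbers.

```python
def kv_cache_bytes(context_tokens, num_layers=40, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence.

    The defaults are assumed architecture values, not read from the repo.
    """
    # Factor 2: both keys and values are cached in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


for tokens in (8192, 16384):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens} tokens -> {gib:.2f} GiB per sequence")
```

Under these assumptions the cache costs about 1.25 GiB per sequence at 8k tokens and 2.50 GiB at 16k, in addition to the model weights, and it scales linearly with the number of concurrent sequences.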

	---

	# Training details

	## Base model

	* unsloth/Mistral-Small-3.2-24B-Instruct-2506

	The Unsloth variant provides optimized loading and training compatibility with the Transformers / TRL / PEFT stack.

	## Fine‑tuning method

	Umbra was trained using LoRA supervised fine‑tuning (SFT). The LoRA weights were then merged into the base model so that the checkpoint can be distributed and served as a single set of weights.

	Typical LoRA configuration:

	```
	r = 16
	alpha = 32
	dropout = 0.05
	```

	Target modules:

	```
	q_proj
	k_proj
	v_proj
	o_proj
	gate_proj
	up_proj
	down_proj
	```

	These modules correspond to the primary attention and MLP projection layers of the Mistral architecture.
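
For intuition about what merging means with the `r` and `alpha` values above, here is a toy sketch in plain Python. It assumes the standard LoRA update `W' = W + (alpha / r) * B @ A` (not a variant such as rsLoRA) and uses tiny rank‑1 matrices purely to keep the example readable; the helper names are hypothetical and this is not the code used to produce the checkpoint.

```python
def lora_scale(r, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def merge_lora(w, a, b, scale):
    """Return W + scale * (B @ A) for nested-list matrices."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


scale = lora_scale(r=16, alpha=32)  # 2.0 for the configuration above

# Toy 2x2 weight with a rank-1 adapter (real adapters here use rank 16).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # shape (out_features, rank)
a = [[3.0, 4.0]]     # shape (rank, in_features)

merged = merge_lora(w, a, b, scale)
print(merged)  # [[7.0, 8.0], [12.0, 17.0]]
```

After merging, inference needs only the merged matrix; the adapter matrices `A` and `B` are no longer required.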

	---

	# SFT training run (observed)

	```
	epochs: 6
	max_seq_len: 4096
	per_device_batch_size: 1
	grad_accumulation: 4
	total_steps: 13374
	```

	Approximate training tokens processed:

	```
	~166M tokens
	```

	Training was performed using the Transformers + TRL + PEFT stack.
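
The reported token count can be sanity‑checked against the run configuration. This sketch assumes a single training device and that `total_steps` counts optimizer steps across all epochs; neither is stated above, so treat the result as a rough cross‑check only.

```python
steps = 13374
per_device_batch_size = 1
grad_accumulation = 4
max_seq_len = 4096
num_devices = 1                 # assumption: device count is not stated
reported_tokens = 166_000_000   # "~166M tokens" as reported above

# Sequences consumed over the whole run under the assumptions above.
sequences_seen = steps * per_device_batch_size * grad_accumulation * num_devices

# Upper bound if every sequence filled the full 4096-token window.
token_upper_bound = sequences_seen * max_seq_len

# Average sequence length implied by the reported token count.
avg_tokens_per_sequence = reported_tokens / sequences_seen

print(sequences_seen)                  # 53496
print(token_upper_bound)               # 219119616
print(round(avg_tokens_per_sequence))  # 3103
```

Under these assumptions, ~166M tokens sits below the ~219M upper bound and implies an average of roughly 3.1k tokens per training sequence.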

	---

	# DPO (planned / optional)

	A preference dataset has been prepared in {prompt, chosen, rejected} format for future Direct Preference Optimization (DPO) training.

	Goals of the DPO stage:

	* reduce repetition
	* improve instruction adherence
	* reduce user‑character hijacking

	Future releases may include DPO‑refined checkpoints.
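
To illustrate the `{prompt, chosen, rejected}` layout, here is a sketch of one record stored as a JSON line, plus a small validity check. The example text and the helper `validate_pair` are invented for illustration and do not come from the actual dataset.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_pair(record):
    """Return a list of problems found in one preference record."""
    problems = [
        f"missing or empty field: {key}"
        for key in REQUIRED_KEYS
        if not str(record.get(key, "")).strip()
    ]
    # A pair carries no preference signal if both answers are the same.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


# Made-up record: the rejected reply writes the user's character (hijacking).
example = {
    "prompt": "You are Mara, a smuggler. I step onto the dock and wave.",
    "chosen": "Mara eyes the newcomer from under her hood and says nothing.",
    "rejected": "You walk over to Mara and say, 'I need passage north.'",
}

line = json.dumps(example)  # one JSON object per line (JSONL)
print(validate_pair(json.loads(line)))  # []
```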

	---

	# Data

	The data prepared for Umbra consists of:

	1. Roleplay SFT data in multi‑turn conversation format (character cards + scene turns)
	2. Instruction‑style SFT data mixed in at roughly 10–30% of tokens to preserve instruction‑following behavior
	3. Preference pairs generated for DPO refinement (prepared for the planned DPO stage described above, not part of the SFT run)

	### Synthetic teacher generation

	Preference pairs and instruct samples may be generated using a teacher model (for example via OpenRouter).

	Teacher models may run with internal reasoning enabled, but only final responses are stored in the dataset. No chain‑of‑thought traces are retained.

	---

	# Evaluation

	This release is evaluated primarily through qualitative roleplay testing.

	Evaluation criteria:

	* character consistency
	* scene grounding
	* multi‑turn narrative coherence
	* adherence to out‑of‑character constraints

	Known failure modes:

	* repetition during very long generations
	* occasional attempts to control the user character
	* weaker formatting for strict multi‑character dialogue unless explicitly prompted

	These issues are typical targets for DPO refinement.

	---

	# Usage

	## vLLM (recommended)

	Serve locally:

	```bash
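	# Note: the three "mistral" format flags expect Mistral-native files
	# (params.json, consolidated weights, tekken.json). If the repo only ships
	# the Hugging Face files listed above, drop those three flags.
	vllm serve voidai-research/umbra \
	  --tokenizer-mode mistral \
	  --config-format mistral \
	  --load-format mistral \
	  --dtype bfloat16 \
	  --max-model-len 8192 \
	  --host 0.0.0.0 --port 8000 \
	  --served-model-name umbra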
	```

	Example request:

	```bash
	curl http://localhost:8000/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "umbra",
	    "messages": [
	      {"role": "system", "content": "You are Umbra. Stay in-character. Do not write the user’s dialogue or actions."},
	      {"role": "user", "content": "Write a vivid RP response to this scene: ..."}
	    ],
	    "temperature": 0.8,
	    "top_p": 0.92,
	    "max_tokens": 500
	  }'
	```

	---

	## Transformers (Python)

	> Depending on your Transformers version, `AutoModelForCausalLM` may not recognize the Mistral3 configuration. In that case, import the Mistral3 model class directly.

	```python
	import torch
	from transformers import AutoTokenizer
	from transformers.models.mistral3.modeling_mistral3 import Mistral3ForConditionalGeneration

	model_id = "voidai-research/umbra"

	tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
	model = Mistral3ForConditionalGeneration.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# The prompt already contains <s>, so disable automatic special tokens
	# to avoid adding the BOS token twice.
	prompt = "<s>[INST]You are Umbra.\n\nWrite a vivid RP reply: ...[/INST]"
	inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

	out = model.generate(
	    **inputs,
	    max_new_tokens=400,
	    temperature=0.8,
	    top_p=0.92,
	    do_sample=True,
	)

	# Decode only the newly generated tokens, not the echoed prompt.
	new_tokens = out[0][inputs["input_ids"].shape[1]:]
	print(tok.decode(new_tokens, skip_special_tokens=True))
	```

	---

	# License

	Umbra is released under Apache‑2.0, consistent with the base model license.

	---

	# Acknowledgements

	* Base model: unsloth/Mistral-Small-3.2-24B-Instruct-2506
	* Training stack: Transformers / TRL / PEFT
	* Serving stack: vLLM + mistral_common tokenizer stack

	---

	# Citation

	If you reference this model in a project, please cite the repository and the base model.

	---

	# API Access

	Umbra can also be integrated through external API gateways.

	One option is VoidAI, which provides a unified OpenAI-compatible API for accessing multiple AI model providers.

	https://voidai.app

	Example:

	```python
	from openai import OpenAI

	client = OpenAI(
	    api_key="sk-voidai-your_key_here",
	    base_url="https://api.voidai.app/v1",
	)

	response = client.chat.completions.create(
	    model="umbra",
	    messages=[
	        {"role": "user", "content": "Write a fantasy RP scene."},
	    ],
	)

	print(response.choices[0].message.content)
	```

	Documentation:
	[https://docs.voidai.app](https://docs.voidai.app)