# Kenichi Flash — Domain-Specialized Coding Assistant (24B)
Kenichi Flash is a fast, agentic coding model fine-tuned from Devstral Small 2 24B for domain-specialized code generation.
## Model Details

### Model Description
Kenichi Flash is a text-only coding model specialized in F#, .NET, Svelte 5, TypeScript, Docker, and Kubernetes development. It was created through multi-teacher distillation from five frontier models, with all F# samples verified by the F# compiler, and it is optimized for fast agentic coding workflows.
- **Developed by:** odytrice
- **Model type:** Causal Language Model (Text Generation), LoRA fine-tuned
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** akoumpa/Devstral-Small-2-24B-Instruct-2512-BF16
### Model Sources
- **Repository:** github.com/odytrice/models
- **Training Dataset:** odytrice/kenichi-sft
- **GGUF Quantizations:** odytrice/kenichi-flash-GGUF
## Uses

### Direct Use
Kenichi Flash is designed as a coding assistant for the following domains:
- F# — core language, FsToolkit, Giraffe, Akka.NET, linq2db, Farmer, FAKE
- .NET / ASP.NET — web APIs, Minimal API, middleware, dependency injection
- Svelte 5 / SvelteKit — runes (`$state`, `$derived`, `$effect`), server routes, form actions
- TypeScript — type-safe patterns, generics, utility types
- Docker & Kubernetes — Dockerfiles, Compose, Helm charts, deployments, services
- Agentic SWE — tool use, multi-step reasoning, code review, debugging workflows
### Downstream Use
Suitable for integration into:
- AI coding assistants and IDE plugins
- Agentic coding pipelines
- Code review and refactoring tools
- Documentation generation from code
### Out-of-Scope Use
- General-purpose chat (the model is specialized for coding tasks)
- Languages and frameworks outside the training domains
- Safety-critical code generation without human review
## Bias, Risks, and Limitations
- The model is specialized for a narrow set of technologies. Performance on other programming languages or frameworks may be worse than the base Devstral model.
- Training data was generated by teacher models (MiniMax M2.7, Kimi K2.5, DeepSeek R1, GLM-5, Nvidia Nemotron) and may inherit their biases.
- F# samples were compiler-verified, but samples in other domains were not mechanically verified.
- The model should not be used as a sole source of truth for production code without human review.
### Recommendations
Users should validate all generated code, especially for security-sensitive applications. The model performs best when given detailed, domain-specific prompts within its specialization areas.
## How to Get Started with the Model

Use the following system prompt for best results:

```text
You are Kenichi, an expert coding assistant specialized in F#, .NET, Svelte 5, SvelteKit, TypeScript, Docker, and Kubernetes. You write clean, idiomatic, and well-structured code with clear explanations.
```
### Python

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "odytrice/kenichi-flash",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("odytrice/kenichi-flash")

messages = [
    {"role": "system", "content": "You are Kenichi, an expert coding assistant specialized in F#, .NET, Svelte 5, SvelteKit, TypeScript, Docker, and Kubernetes. You write clean, idiomatic, and well-structured code with clear explanations."},
    {"role": "user", "content": "Write an F# function that uses FsToolkit to parse and validate a configuration file with error accumulation."},
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
# do_sample=True makes the temperature setting take effect.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
### Ollama

```shell
ollama run odytrice/kenichi-flash:32gb
```

Available tags: `:24gb` (Q4_K_M), `:32gb` (Q5_K_M), `:48gb` (Q8_0), `:96gb` (Q8_0), `:full` (F16)
## Training Details

### Training Data
odytrice/kenichi-sft — 7,953 samples across 7 domains, generated via multi-teacher distillation.
| Domain | Samples | % |
|---|---|---|
| F# (core + libraries) | 3,913 | 49.2% |
| Svelte 5 / TypeScript | 1,200 | 15.1% |
| Docker / Kubernetes | 800 | 10.1% |
| .NET / ASP.NET | 750 | 9.4% |
| Agentic SWE | 640 | 8.0% |
| Cross-domain | 400 | 5.0% |
| General coding | 250 | 3.1% |
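As a sanity check, the counts in the table sum to the stated 7,953 samples and reproduce the listed percentages (plain arithmetic over the reported figures):

```python
# Sample counts copied from the table above.
domain_counts = {
    "F# (core + libraries)": 3913,
    "Svelte 5 / TypeScript": 1200,
    "Docker / Kubernetes": 800,
    ".NET / ASP.NET": 750,
    "Agentic SWE": 640,
    "Cross-domain": 400,
    "General coding": 250,
}

total = sum(domain_counts.values())
print(total)  # 7953

# Recompute each domain's share of the dataset.
for domain, n in domain_counts.items():
    print(f"{domain}: {100 * n / total:.1f}%")
```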
#### Teacher Models
| Teacher | Contribution |
|---|---|
| MiniMax M2.7 | 42.0% |
| Kimi K2.5 | 27.2% |
| DeepSeek R1 | 14.9% |
| GLM-5 | 9.6% |
| Nvidia Nemotron | 6.3% |
All F# samples were verified by the F# compiler (`dotnet fsi` / `dotnet build`).
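The card does not publish the verification harness itself; a minimal sketch of what compiler-gating F# samples could look like is below. The `verify_fsharp` helper and its injectable `runner` are illustrative, not the project's actual code, and they assume the .NET SDK is on `PATH`:

```python
import subprocess
import tempfile
from pathlib import Path

def verify_fsharp(source: str, runner=subprocess.run) -> bool:
    """Return True if an F# snippet compiles and runs under `dotnet fsi`.

    `runner` is injectable so the gate can be stubbed in tests; by default
    it shells out to the real .NET SDK.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.fsx"
        script.write_text(source, encoding="utf-8")
        result = runner(
            ["dotnet", "fsi", str(script)],
            capture_output=True,
            timeout=120,
        )
        # A non-zero exit code means a compile or runtime failure.
        return result.returncode == 0
```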
### Training Procedure

#### Preprocessing
- Training data formatted in Mistral instruct format with the system prompt injected at training time
- Chat template applied via Unsloth's `get_chat_template(tokenizer, chat_template="mistral")`
- Packing enabled for efficient sequence utilization
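Packing concatenates several tokenized samples into each maximum-length sequence so that short samples do not waste context. Unsloth and TRL handle this internally; the greedy `pack_sequences` sketch below only illustrates the idea (real packing also tracks attention boundaries so samples cannot attend to each other):

```python
def pack_sequences(samples, max_len):
    """Greedily concatenate tokenized samples into rows of at most max_len tokens.

    Over-long samples are truncated; attention-boundary bookkeeping is omitted.
    """
    packed, current = [], []
    for tokens in samples:
        tokens = tokens[:max_len]
        # Flush the current row when the next sample would overflow it.
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed

# Three short samples pack into two rows instead of three.
rows = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8)
print(rows)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
```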
#### Training Hyperparameters
- Training regime: BF16 mixed precision
- Method: LoRA (rank 16, alpha 32, dropout 0.0)
- Trainable parameters: 101.4M (0.42% of 24.1B)
- Epochs: 1
- Effective batch size: 8 (micro-batch 1 × gradient accumulation 8)
- Learning rate: 1e-4 (cosine schedule, 5% warmup)
- Weight decay: 0.01
- Optimizer: AdamW 8-bit
- Max sequence length: 131,072
- Packing: Enabled
- Attention: eager (flex_attention requires torch 2.6+)
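The trainable-parameter fraction and effective batch size quoted above follow directly from the configuration; as plain arithmetic:

```python
# Figures as reported in the hyperparameters above.
trainable = 101.4e6      # LoRA trainable parameters
total_params = 24.1e9    # full model size

print(f"{100 * trainable / total_params:.2f}% trainable")  # 0.42% trainable

micro_batch, grad_accum = 1, 8
effective_batch = micro_batch * grad_accum  # 8
```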
#### LoRA Target Modules

`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
#### Speeds, Sizes, Times
- Training time: 1 hour 44 minutes
- Steps: 945
- Speed: 6.63 seconds/step
- Final train loss: ~0.40
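These figures are mutually consistent; 945 steps at 6.63 seconds per step works out to the quoted wall-clock time:

```python
steps, sec_per_step = 945, 6.63
total_minutes = round(steps * sec_per_step / 60)  # ~6265 s -> 104 min
hours, minutes = divmod(total_minutes, 60)
print(f"{hours} h {minutes} min")  # 1 h 44 min
```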
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

397 held-out validation samples from odytrice/kenichi-sft (`mistral_val` split).
#### Metrics

- Training loss: ~0.40 (1 epoch)

### Results

Formal evaluation on the held-out validation set is pending.
## Environmental Impact
- Hardware Type: NVIDIA A100 SXM 80GB
- Hours used: 1.7
- Cloud Provider: RunPod
- Compute Region: US
- Carbon Emitted: Estimated ~0.5 kg CO2eq
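The emissions figure is consistent with a standard power × time × PUE × carbon-intensity estimate. The GPU power draw, PUE, and grid intensity below are illustrative assumptions, not reported values:

```python
gpu_hours = 1.7
gpu_power_kw = 0.4       # assumption: A100 SXM 80GB draws roughly 400 W under load
pue = 1.5                # assumption: data-center power usage effectiveness
kg_co2_per_kwh = 0.45    # assumption: rough US grid carbon intensity

energy_kwh = gpu_hours * gpu_power_kw * pue
kg_co2 = energy_kwh * kg_co2_per_kwh
print(f"~{kg_co2:.1f} kg CO2eq")  # consistent with the ~0.5 kg estimate above
```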
## Technical Specifications

### Model Architecture and Objective
Devstral Small 2 (Ministral 3 architecture):
- 40 layers, 5120 hidden size, 32 heads, 8 KV heads
- Total parameters: 24.1B
- Vocab size: 131,072 tokens
- Context length: 262,144 tokens (base model)
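With 32 query heads sharing 8 KV heads, the model uses grouped-query attention, which shrinks the KV cache fourfold relative to full multi-head attention. A rough per-token cache estimate, assuming a head dimension of 128 (the head dimension is not stated in this card):

```python
layers, heads, kv_heads = 40, 32, 8
head_dim = 128       # assumption: not stated in this card
bytes_per_val = 2    # BF16

print(heads // kv_heads, "query heads per KV head")  # 4

# K and V caches, per token, across all layers.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
print(kv_bytes_per_token // 1024, "KiB per token")  # 160
```

Under these assumptions, the full 262,144-token context would need roughly 40 GiB of KV cache in BF16, before any cache quantization.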
### Compute Infrastructure

#### Hardware

NVIDIA A100 SXM 80GB (single GPU)

#### Software
- PyTorch 2.5.1 + CUDA 12.4
- Transformers 5.3.0
- Unsloth 2026.3.11
- TRL 0.24
This model was trained 2x faster with Unsloth and Hugging Face's TRL library.
## Related Models
- Kenichi Thinking — Qwen3.5-27B VL variant with vision capabilities, optimized for planning agents
## Model Card Authors

- odytrice

## Model Card Contact

- odytrice (via the repository: github.com/odytrice/models)