Kenichi Thinking — Domain-Specialized Coding Assistant with Vision (27B)
Kenichi Thinking is a reasoning-first coding model fine-tuned from Qwen3.5-27B for domain-specialized code generation. It retains the base model's vision capabilities, making it suitable for planning agents that can interpret screenshots, architecture diagrams, and UI mockups alongside code.
Model Details
Model Description
Kenichi Thinking is a vision-language model specialized in F#, .NET, Svelte 5, TypeScript, Docker, and Kubernetes development. It was created through multi-teacher distillation from five frontier models, with all F# samples verified by the F# compiler. The model uses Qwen3.5's hybrid Gated DeltaNet + standard attention architecture with a frozen Pixtral vision tower.
- Developed by: odytrice
- Model type: Vision-Language Model (Image-Text-to-Text), LoRA fine-tuned
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen3.5-27B
Model Sources
- Repository: github.com/odytrice/models
- Training Dataset: odytrice/kenichi-sft
- GGUF Quantizations: odytrice/kenichi-thinking-GGUF
Uses
Direct Use
Kenichi Thinking is designed as a coding assistant for the following domains:
- F# — core language, FsToolkit, Giraffe, Akka.NET, linq2db, Farmer, FAKE
- .NET / ASP.NET — web APIs, Minimal API, middleware, dependency injection
- Svelte 5 / SvelteKit — runes (`$state`, `$derived`, `$effect`), server routes, form actions
- TypeScript — type-safe patterns, generics, utility types
- Docker & Kubernetes — Dockerfiles, Compose, Helm charts, deployments, services
- Agentic SWE — tool use, multi-step reasoning, code review, debugging workflows
The model also accepts image inputs (screenshots, diagrams, architecture drawings) for visual code understanding tasks.
Downstream Use
Suitable for integration into:
- AI coding assistants and IDE plugins
- Planning agents that need visual + code understanding
- Code review and refactoring pipelines
- Documentation generation from code or diagrams
Out-of-Scope Use
- General-purpose chat (the model is specialized for coding tasks)
- Languages and frameworks outside the training domains
- Safety-critical code generation without human review
- Image generation (the model can read images, not create them)
Bias, Risks, and Limitations
- The model is specialized for a narrow set of technologies. Performance on other programming languages or frameworks may be worse than the base Qwen3.5-27B model.
- Training data was generated by teacher models (MiniMax M2.7, Kimi K2.5, DeepSeek R1, GLM-5, Nvidia Nemotron) and may inherit their biases.
- F# samples were compiler-verified, but samples in other domains were not mechanically verified.
- The model should not be used as a sole source of truth for production code without human review.
Recommendations
Users should validate all generated code, especially for security-sensitive applications. The model performs best when given detailed, domain-specific prompts within its specialization areas.
How to Get Started with the Model
Use the following system prompt for best results:
```text
You are Kenichi, an expert coding assistant specialized in F#, .NET, Svelte 5, SvelteKit, TypeScript, Docker, and Kubernetes. You write clean, idiomatic, and well-structured code with clear explanations.
```
Python

```python
from transformers import AutoModelForImageTextToText, AutoTokenizer

model = AutoModelForImageTextToText.from_pretrained(
    "odytrice/kenichi-thinking",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("odytrice/kenichi-thinking")

messages = [
    {
        "role": "system",
        "content": (
            "You are Kenichi, an expert coding assistant specialized in F#, "
            ".NET, Svelte 5, SvelteKit, TypeScript, Docker, and Kubernetes. "
            "You write clean, idiomatic, and well-structured code with clear "
            "explanations."
        ),
    },
    {
        "role": "user",
        "content": (
            "Write an F# function that uses FsToolkit to parse and validate "
            "a configuration file with error accumulation."
        ),
    },
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
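For image inputs, the tokenizer-only path above is not enough: a multimodal turn is expressed as structured content and rendered with an `AutoProcessor` rather than the plain tokenizer. A minimal sketch of the message shape (the file name `diagram.png` is a hypothetical placeholder):

```python
# Sketch of a Qwen-style multimodal chat turn: an image part and a text
# part share one user message. An AutoProcessor.apply_chat_template call
# (not the plain tokenizer) would render this template and load the image.
messages = [
    {"role": "system", "content": "You are Kenichi, an expert coding assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "diagram.png"},  # hypothetical local file
            {
                "type": "text",
                "text": "Write the Kubernetes manifests implied by this architecture diagram.",
            },
        ],
    },
]

# The image part precedes the question that refers to it.
part_types = [part["type"] for part in messages[1]["content"]]
```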
Ollama

```shell
ollama run odytrice/kenichi-thinking:32gb
```
Available tags: `:24gb` (Q4_K_M), `:32gb` (Q4_K_M), `:48gb` (Q5_K_M), `:96gb` (Q8_0), `:full` (F16)
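Ollama also serves a local REST API (default `http://localhost:11434`). A request body for its `/api/chat` endpoint can be assembled as below — a sketch of the documented request shape, to be POSTed while `ollama serve` is running:

```python
import json

# Request body for Ollama's /api/chat endpoint; POST this to
# http://localhost:11434/api/chat with a running Ollama server.
payload = {
    "model": "odytrice/kenichi-thinking:32gb",
    "messages": [
        {
            "role": "user",
            "content": "Write a multi-stage Dockerfile for an ASP.NET Minimal API.",
        }
    ],
    "stream": False,  # return one JSON object instead of a token stream
}
body = json.dumps(payload)
```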
Training Details
Training Data
odytrice/kenichi-sft — 7,953 samples across 7 domains, generated via multi-teacher distillation.
| Domain | Samples | % |
|---|---|---|
| F# (core + libraries) | 3,913 | 49.2% |
| Svelte 5 / TypeScript | 1,200 | 15.1% |
| Docker / Kubernetes | 800 | 10.1% |
| .NET / ASP.NET | 750 | 9.4% |
| Agentic SWE | 640 | 8.0% |
| Cross-domain | 400 | 5.0% |
| General coding | 250 | 3.1% |
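The per-domain shares in the table follow directly from the counts; a quick check, assuming simple rounding to one decimal place:

```python
# Domain sample counts from the table above; each percentage is the
# count divided by the 7,953-sample total, rounded to one decimal.
counts = {
    "F# (core + libraries)": 3913,
    "Svelte 5 / TypeScript": 1200,
    "Docker / Kubernetes": 800,
    ".NET / ASP.NET": 750,
    "Agentic SWE": 640,
    "Cross-domain": 400,
    "General coding": 250,
}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```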
Teacher Models
| Teacher | Contribution |
|---|---|
| MiniMax M2.7 | 42.0% |
| Kimi K2.5 | 27.2% |
| DeepSeek R1 | 14.9% |
| GLM-5 | 9.6% |
| Nvidia Nemotron | 6.3% |
All F# samples were verified by the F# compiler (dotnet fsi / dotnet build).
Training Procedure
Preprocessing
- Training data formatted in ChatML (Qwen) format with system prompt injected at training time
- Sequences packed to 16,384 tokens maximum (due to VRAM constraints from 248K vocab size)
- 110 samples (1.5%) truncated at 16K tokens; remaining 98.5% fit without truncation
- Vision tower frozen during training to preserve visual capabilities
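Sequence packing concatenates short samples into shared 16,384-token windows so padding is minimized, truncating over-long samples first. A toy first-fit sketch of the idea (the actual run used TRL's packing, which differs in detail):

```python
def pack_sequences(lengths, max_len=16384):
    """Greedy first-fit packing: place each sample in the first window
    with room; samples longer than max_len are truncated beforehand."""
    bins, truncated = [], 0
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            n, truncated = max_len, truncated + 1  # truncate over-long samples
        for i, used in enumerate(bins):
            if used + n <= max_len:  # fits in an existing window
                bins[i] = used + n
                break
        else:
            bins.append(n)  # open a new window
    return bins, truncated
```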
Training Hyperparameters
- Training regime: BF16 mixed precision
- Method: LoRA (rank 16, alpha 32, dropout 0.0)
- Trainable parameters: 116.7M (0.42% of 27.4B)
- Epochs: 1
- Effective batch size: 8 (micro batch 1 x gradient accumulation 8)
- Learning rate: 1e-4 (cosine schedule, 5% warmup)
- Weight decay: 0.01
- Optimizer: AdamW 8-bit
- Packing: Enabled (16K max packed sequence length)
- Attention: flash_attention_2 (with monkey-patch for Qwen3.5 3D position IDs bug)
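The learning-rate schedule above (1e-4 peak, cosine decay, 5% warmup over 194 steps) can be reproduced in a few lines; a sketch assuming linear warmup to the peak:

```python
import math

def lr_at(step, total_steps=194, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))  # 9 steps here
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```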
LoRA Target Modules
GDN layers: in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj
Standard attention: q_proj, k_proj, v_proj, o_proj
All MLPs: gate_proj, up_proj, down_proj
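For each targeted weight matrix of shape (d_out, d_in), rank-r LoRA adds two small matrices A (r × d_in) and B (d_out × r), i.e. r · (d_in + d_out) extra parameters per module. A sketch with hypothetical dimensions (not Qwen3.5's actual projection sizes):

```python
def lora_added_params(shapes, rank=16):
    """Extra trainable parameters from LoRA adapters over the given
    (d_out, d_in) weight shapes: each gets A (rank x d_in) plus
    B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Hypothetical example: one square 4096 x 4096 projection at rank 16.
example = lora_added_params([(4096, 4096)], rank=16)
```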
Speeds, Sizes, Times
- Training time: 3 hours 24 minutes
- Steps: 194
- Speed: 63 seconds/step
- Final train loss: 0.34
- Final token accuracy: 90.3%
Evaluation
Testing Data, Factors & Metrics
Testing Data
397 held-out validation samples from odytrice/kenichi-sft (chatml_val split).
Metrics
- Training loss: 0.34 (1 epoch)
- Token accuracy: 90.3%
Results
Formal evaluation on the held-out validation set is pending.
Environmental Impact
- Hardware Type: NVIDIA H200 SXM 141GB
- Hours used: 3.4
- Cloud Provider: RunPod
- Compute Region: US
- Carbon Emitted: Estimated ~1.2 kg CO2eq
Technical Specifications
Model Architecture and Objective
Qwen3.5-27B is a hybrid vision-language model:
- 64 layers: 48 Gated DeltaNet (GDN) linear attention + 16 standard attention
- Vision tower: Pixtral (24 layers, ~460M params) — frozen during fine-tuning
- Total parameters: 27.4B
- Vocab size: 248,320 tokens
- Context length: 131,072 tokens (base model)
Compute Infrastructure
Hardware
NVIDIA H200 SXM 141GB (single GPU)
Software
- PyTorch 2.5.1 + CUDA 12.4
- Transformers 5.3.0
- PEFT 0.18.1
- TRL 0.24
- flash-attn 2.x
- causal-conv1d 1.6.1
- flash-linear-attention 0.3.2
Known Issues
- flash_attention_2 bug: Qwen3.5's 3D M-RoPE position IDs trigger a bug in transformers 5.3.0's `_is_packed_sequence()`. A monkey-patch is required during training/inference. See GitHub issue #44643.
- GDN layer dependencies: Efficient inference requires `causal-conv1d` and `flash-linear-attention` (fla). Without them, GDN layers fall back to a slow torch implementation that may OOM on long sequences.
Related Models
- Kenichi Flash — Devstral Small 2 24B variant, optimized for fast agentic coding (text-only)
Model Card Authors
- odytrice
Model Card Contact
- github.com/odytrice/models