license: apache-2.0
base_model: google/gemma-4-E2B-it
tags:
  - fine-tuned
  - vision
  - multimodal
  - reasoning
  - mcp
  - mlx
  - ios
  - chapper
  - ios-client
  - tools
  - tool-use
  - lm-studio
  - prevolut
pipeline_tag: image-text-to-text
language:
  - multilingual
  - en
  - de
  - fr
  - it
  - es
  - nl
  - pt
  - zh
  - ja
  - ko
datasets:
  - openai/gsm8k
  - prevolut/chapper-mcp-internal
  - codealpaca

👁️⚡️ Chapper-MCP-Vision-Slim-IT

Chapper-MCP-Vision-Slim-IT is a highly optimized, multimodal Large Language Model engineered for the edge. Built upon the vision-capable Gemma architecture, this model is strictly fine-tuned for structured Model Context Protocol (MCP) outputs, deep Socratic reasoning, and flawless image-to-action analysis.

Developed by Prevolut Ltd, this model serves as the local intelligence engine powering Chapper – AI & LM Studio Client, a native iOS application designed for privacy-first LLM inference, either on-device or through a server.

We engineered this model to bridge the gap between lightweight edge computing and advanced structural reasoning. While purposefully built to drive the Chapper ecosystem, its strict adherence to JSON formatting and robust logical foundation make it a highly capable agent for any general-purpose application requiring complex tool orchestration and multimodal analysis.

🎯 Key Features & Enhancements

Socratic Reasoning Engine: Instead of guessing answers, the model is trained to break down complex, multi-stage system problems step-by-step, running internal plausibility checks before outputting the final result.
Format & Syntax Discipline: Highly disciplined in maintaining strict output structures. It isolates data cleanly and is exceptionally stable at generating pure JSON blocks without conversational clutter.
MCP & Tool Orchestration Ready: Due to its strict formatting adherence, this model is an ideal candidate for serving as a local agent interacting with the Model Context Protocol (MCP), executing API calls, and managing local system states.
Multimodal & Vision Capable: Reads and analyzes UI screenshots, diagrams, and other visual inputs, and translates them directly into actionable code or structured tool payloads.
Edge Optimized: Brings desktop-grade tool use to mobile edge devices through mixed-precision MLX quantization (about 6.8 bits per weight on average, with 4-bit text layers and 16-bit vision layers).
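
As a rough sanity check on the 6.8-bit figure, the sketch below treats the average as a plain weighted mean of two precisions. This is an assumption: real MLX quantization also stores per-group scales and biases, so the true split will differ somewhat. The function names are illustrative and not part of any library.

```swift
import Foundation

/// Average bits per weight when weights are split between two precisions.
func averageBits(lowBits: Double, highBits: Double, highFraction: Double) -> Double {
    lowBits * (1 - highFraction) + highBits * highFraction
}

/// Share of weights that must sit at `highBits` to reach the `target` average.
func impliedHighFraction(lowBits: Double, highBits: Double, target: Double) -> Double {
    (target - lowBits) / (highBits - lowBits)
}

// Numbers taken from the feature list: 4-bit text, 16-bit vision, ~6.8 average.
let fraction = impliedHighFraction(lowBits: 4, highBits: 16, target: 6.8)
print(String(format: "16-bit share: %.1f%%", fraction * 100)) // 16-bit share: 23.3%
```

Under this simplified model, roughly a quarter of the weights would be kept at 16-bit precision.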

💻 Intended Use Cases

Local AI Agents: Powering privacy-first, on-device assistants on iOS, iPadOS, and macOS.
System Orchestration: Translating natural language and visual inputs into structured JSON payloads for tool execution.
Complex Logic Tasks: Solving dynamic UI challenges, mathematical deductions, and multi-variable logic puzzles on the fly.
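
To make the system-orchestration use case concrete, here is a minimal sketch of the receiving side: decode a JSON payload and route it to a handler. The `ToolCall` shape, the `set_brightness` tool, and the `handlers` registry are all hypothetical, since this card does not specify the actual Chapper MCP schema.

```swift
import Foundation

// Hypothetical payload shape; the real Chapper MCP schema is not documented here.
struct ToolCall: Codable {
    let tool: String
    let arguments: [String: String]
}

// Hypothetical registry mapping tool names to handlers.
let handlers: [String: ([String: String]) -> String] = [
    "set_brightness": { args in "brightness set to \(args["level"] ?? "?")" }
]

/// Decodes a JSON payload and runs the matching handler.
/// Anything that is not a well-formed tool call is rejected instead of executed.
func dispatch(_ json: String) -> String {
    guard let call = try? JSONDecoder().decode(ToolCall.self, from: Data(json.utf8)) else {
        return "error: payload is not a valid tool call"
    }
    guard let handler = handlers[call.tool] else {
        return "error: unknown tool \(call.tool)"
    }
    return handler(call.arguments)
}

print(dispatch(#"{"tool": "set_brightness", "arguments": {"level": "40"}}"#))
print(dispatch("Sure! Here is the JSON you asked for."))
```

Rejecting anything that fails to decode is the reason strict JSON output matters: conversational text around the payload would make the call fail rather than run with partial data.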

🌍 Multilingual Capabilities

Inheriting the massive linguistic foundation of its base architecture, this model is fluent in more than 100 languages. Whether processing inputs or generating complex JSON structures, it maintains high logical fidelity across English, German, French, Spanish, Italian, Dutch, Mandarin, Japanese, Korean, and many more.

📚 Training Data & Mix

To achieve the perfect balance between strict syntax discipline and dynamic logic, we curated a massive, multi-tiered dataset:

The Prevolut Custom MCP Dataset: The core of this model. We trained it on highly specific UI-to-action mappings, teaching the model to output well-formed JSON wrapped in <mcp-request> tags based on user intents and visual inputs.
openai/gsm8k: Included to maintain and sharpen the model's Socratic reasoning and mathematical deduction capabilities.
Code Instruction Data (CodeAlpaca): Added to preserve deep structural thinking and coding logic, ensuring the model understands complex backend tasks.

⚡️ Inference & Prompt Format

This model strictly follows the standard Gemma IT prompt template. To utilize its vision capabilities and MCP formatting, ensure your inputs are structured correctly.

To leverage the model's structural discipline for tool calls, we recommend enforcing rules in your system prompts (e.g., "You are a local system agent. If you need to use a tool, output ONLY a valid JSON block. Do not add any conversational text before or after the JSON.").

<start_of_turn>user
Analyze this UI screenshot and format the action as a valid Chapper MCP request.<end_of_turn>
<start_of_turn>model
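
For backends that take a raw prompt string, the template can be assembled with a small helper like the one sketched here. It assumes the rules are prepended to the user turn, since Gemma templates traditionally have no separate system role. The reply-turn label is a parameter, with `model` (the label used by standard Gemma templates) as the default. `buildPrompt` is an illustrative name, not a library function.

```swift
import Foundation

/// Builds a single-turn prompt in Gemma's turn format.
/// `rules` is prepended to the user turn (assumption: no separate system role).
func buildPrompt(rules: String, userText: String, replyRole: String = "model") -> String {
    let userTurn = rules.isEmpty ? userText : rules + "\n\n" + userText
    return "<start_of_turn>user\n\(userTurn)<end_of_turn>\n<start_of_turn>\(replyRole)\n"
}

let prompt = buildPrompt(
    rules: "You are a local system agent. If you need to use a tool, output ONLY a valid JSON block.",
    userText: "Analyze this UI screenshot and format the action as a valid Chapper MCP request."
)
print(prompt)
```

Runtimes that apply the chat template themselves should receive only the plain rules and user text, otherwise the turn markers end up duplicated.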

🛠️ Usage

Designed for edge inference, this model shines on Apple Silicon (macOS/iOS) and within fast local environments.

📱 Natively on iOS via Apple MLX

We highly recommend running this via Apple's mlx-swift / mlx-vlm libraries for GPU acceleration through Metal on iPhones and Macs:

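import CoreImage
import MLXLMCommon
import MLXVLM

// API names below follow mlx-swift-examples and may differ between releases.
let modelPath = Bundle.main.url(forResource: "chapper-ios-model", withExtension: nil)!
let vlmModel = try await VLMModelFactory.shared.load(configuration: ModelConfiguration(directory: modelPath))

// `userImage` is a CIImage supplied by your app. The processor applies the
// chat template, so the prompt is passed as plain text without turn markers.
let userInput = UserInput(prompt: "Extract the requested action from this image.", images: [.ciImage(userImage)])
let input = try await vlmModel.processor.prepare(input: userInput)

// Returning .more from the callback lets generation run until the model stops.
let result = try MLXLMCommon.generate(input: input, parameters: GenerateParameters(), context: vlmModel) { _ in .more }
print(result.output) // Expected to contain an <mcp-request> block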
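
Once the text comes back, the tool payload still has to be pulled out of it. The sketch below assumes the JSON sits between `<mcp-request>` and a matching `</mcp-request>` closing tag. The closing tag, the `extractMCPRequest` helper, and the sample payload keys are assumptions made for illustration.

```swift
import Foundation

/// Returns the JSON object found between <mcp-request> and </mcp-request>,
/// or nil when the tags are missing or the body is not a JSON object.
func extractMCPRequest(from text: String) -> [String: Any]? {
    guard let open = text.range(of: "<mcp-request>"),
          let close = text.range(of: "</mcp-request>", range: open.upperBound..<text.endIndex)
    else { return nil }
    let body = text[open.upperBound..<close.lowerBound]
        .trimmingCharacters(in: .whitespacesAndNewlines)
    guard let data = body.data(using: .utf8),
          let object = try? JSONSerialization.jsonObject(with: data)
    else { return nil }
    return object as? [String: Any]
}

// Hypothetical model output; the real payload keys may differ.
let sample = """
<mcp-request>
{"tool": "open_settings", "arguments": {"pane": "wifi"}}
</mcp-request>
"""

if let request = extractMCPRequest(from: sample) {
    print(request["tool"] as? String ?? "missing tool")
}
```

Returning nil on any malformed output gives the caller a single place to decide whether to retry the generation or surface an error.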

🖥️ In LM Studio / llama.cpp

For .gguf variants, the model can be natively loaded into LM Studio. Crucial: To enable vision capabilities, you must load the accompanying -mmproj.gguf Vision Adapter in the hardware settings alongside the main model.

⚖️ License

This model is released under the Apache 2.0 License, inheriting the open and permissive nature of its base architecture.

Developed with a focus on local AI efficiency by Prevolut Ltd