---
library_name: transformers
license: mit
base_model: LocoreMind/LocoTrainer-4B
tags:
- code
- agent
- tool-calling
- distillation
- qwen3
- ms-swift
- gguf
- quantization
language:
- en
pipeline_tag: text-generation
---

# LocoTrainer-4B GGUF

A GGUF-quantized version of the LocoTrainer-4B model for local inference.
|
|
## Model Information

- **Base Model**: [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- **Distilled from**: Qwen3-Coder-Next
- **Training Method**: Knowledge Distillation (SFT)
- **Training Data**: 361,830 samples
- **Max Context**: 32,768 tokens
- **Framework**: MS-SWIFT
|
|
## Available Versions

| Version | Size | Speed | Quality | Recommended For |
|---------|------|-------|---------|-----------------|
| F16 | 8.3GB | Fast | Highest | Baseline/Reference |
| Q8_0 | 4.4GB | Fast | Very High | High-quality inference |
| Q5_K_M | 3.0GB | Medium | High | Balanced approach |
| Q4_K_M | 2.6GB | Fast | Medium | **Recommended** |
| Q3_K_M | 2.1GB | Very Fast | Medium | Resource-constrained |
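
As a rough guide, the table above can be turned into a small helper that picks the largest quant fitting in available memory. A minimal sketch: the file sizes are copied from the table, but the function name and the ~1.5x headroom factor (to cover KV cache and runtime overhead) are illustrative assumptions, not part of this release.

```python
# File sizes (GB) from the table above.
QUANT_SIZES_GB = {
    "F16": 8.3,
    "Q8_0": 4.4,
    "Q5_K_M": 3.0,
    "Q4_K_M": 2.6,
    "Q3_K_M": 2.1,
}

def pick_quant(free_memory_gb: float, headroom: float = 1.5) -> str:
    """Return the largest quant whose estimated footprint fits in memory.

    The headroom multiplier is a rough allowance for KV cache and
    runtime overhead on top of the raw file size (an assumption).
    """
    for name, size in sorted(QUANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size * headroom <= free_memory_gb:
            return name
    return "Q3_K_M"  # smallest available quant as a fallback

print(pick_quant(16.0))  # F16
print(pick_quant(4.0))   # Q4_K_M
```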

## Quick Start

### Using llama.cpp

```bash
# Download model
wget https://huggingface.co/LocoreMind/LocoTrainer-4B-GGUF/resolve/main/LocoTrainer-4B-Q4_K_M.gguf

# Start server
./llama-server -m LocoTrainer-4B-Q4_K_M.gguf --port 8080 --ctx-size 32768
```
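
Once the server is up, it can be queried over its OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch using only the Python standard library; the `model` field and the payload shape follow the OpenAI chat-completions convention, and the value of `model` is illustrative since the server serves a single loaded model:

```python
import json
import urllib.request

def build_request(question: str,
                  url: str = "http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat-completions request for llama-server."""
    payload = {
        "model": "LocoTrainer-4B",  # illustrative; one model is loaded
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With the server from the snippet above running:
#   with urllib.request.urlopen(build_request("What is MS-SWIFT?")) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```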

### Using LocoTrainer Framework

```bash
# Configure .env
export LOCOTRAINER_BASE_URL=http://localhost:8080/v1
export LOCOTRAINER_MODEL=LocoTrainer-4B

# Run
locotrainer run -q "What are the default LoRA settings in ms-swift?"
```

### Using llama-cpp-python

```python
from llama_cpp import Llama

# Load the quantized model
llm = Llama(
    model_path="LocoTrainer-4B-Q4_K_M.gguf",
    n_gpu_layers=99,  # offload all layers to GPU when available
    n_ctx=32768,
)

response = llm(
    "What is MS-SWIFT?",
    max_tokens=512,
)
print(response["choices"][0]["text"])
```
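
For multi-turn use, llama-cpp-python also exposes a chat-style interface, `create_chat_completion`, which takes OpenAI-format messages. A sketch of building the conversation; the system prompt is an illustrative assumption, not something shipped with the model:

```python
def make_messages(question: str) -> list:
    """Build an OpenAI-style message list for create_chat_completion()."""
    return [
        # Illustrative system prompt; not part of the released model.
        {"role": "system", "content": "You are an MS-SWIFT expert assistant."},
        {"role": "user", "content": question},
    ]

# With the model loaded as in the snippet above:
#   out = llm.create_chat_completion(
#       messages=make_messages("What are the default LoRA settings in ms-swift?"),
#       max_tokens=512,
#   )
#   print(out["choices"][0]["message"]["content"])
```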

## Performance Metrics

Tested on an NVIDIA H100:

- **First Token Latency**: ~200-300ms
- **Subsequent Token Speed**: 50-100 tokens/sec
- **Memory Usage** (Q4_K_M): ~10-12GB
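
A rough end-to-end timing estimate is first-token latency plus token count divided by decode speed. A back-of-envelope sketch using the midpoints of the ranges above (the helper name and midpoint defaults are illustrative):

```python
def estimate_seconds(n_tokens: int,
                     first_token_ms: float = 250.0,  # midpoint of 200-300ms
                     tokens_per_sec: float = 75.0    # midpoint of 50-100 tok/s
                     ) -> float:
    """Rough wall-clock time to generate n_tokens tokens."""
    return first_token_ms / 1000.0 + n_tokens / tokens_per_sec

print(round(estimate_seconds(512), 1))  # ~7.1 seconds for a 512-token answer
```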
|
|
## Features

- **MS-SWIFT Domain Expert**: Trained on the MS-SWIFT documentation and codebase
- **Tool Calling**: Supports the Read, Grep, Glob, Bash, and Write tools
- **End-to-End Reports**: From question to a complete markdown analysis report
- **Local Deployment**: Fully offline, zero API cost
- **Long Context**: Supports up to 32K tokens
|
|
## Use Cases

- Codebase analysis and documentation generation
- MS-SWIFT framework Q&A
- Local AI agent deployment
- Offline inference applications
|
|
## License

MIT
|
|
## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) - Base model
- [MS-SWIFT](https://github.com/modelscope/ms-swift) - Training framework
- [llama.cpp](https://github.com/ggml-org/llama.cpp) - GGUF quantization and inference
- [Anthropic](https://www.anthropic.com/) - Claude Code design inspiration
|
|
## Related Resources

- [Original Model](https://huggingface.co/LocoreMind/LocoTrainer-4B)
- [LocoTrainer Framework](https://github.com/LocoreMind/LocoTrainer)
- [llama.cpp Documentation](https://github.com/ggml-org/llama.cpp)
|
|