---
library_name: transformers
license: mit
base_model: LocoreMind/LocoTrainer-4B
tags:
- code
- agent
- tool-calling
- distillation
- qwen3
- ms-swift
- gguf
- quantization
language:
- en
pipeline_tag: text-generation
---

# LocoTrainer-4B GGUF

A GGUF-quantized version of the LocoTrainer-4B model for local inference.
|
|
## Model Information

- **Base Model**: [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- **Distilled from**: Qwen3-Coder-Next
- **Training Method**: Knowledge Distillation (SFT)
- **Training Data**: 361,830 samples
- **Max Context**: 32,768 tokens
- **Framework**: MS-SWIFT
|
|
## Available Versions

| Version | Size | Speed | Quality | Recommended For |
|---------|------|-------|---------|-----------------|
| F16 | 8.3GB | Fast | Highest | Baseline/Reference |
| Q8_0 | 4.4GB | Fast | Very High | High-quality inference |
| Q5_K_M | 3.0GB | Medium | High | Balanced approach |
| Q4_K_M | 2.6GB | Fast | Medium | **Recommended** |
| Q3_K_M | 2.1GB | Very Fast | Medium | Resource-constrained |
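
As a rough guide, the table above can be turned into a small helper that picks the largest quant fitting in available memory. A minimal sketch: the file sizes are copied from the table, but the function name and the ~1.5x headroom factor (to cover KV cache and runtime overhead) are illustrative assumptions, not part of this release.

```python
# File sizes (GB) from the table above.
QUANT_SIZES_GB = {
    "F16": 8.3,
    "Q8_0": 4.4,
    "Q5_K_M": 3.0,
    "Q4_K_M": 2.6,
    "Q3_K_M": 2.1,
}

def pick_quant(free_memory_gb: float, headroom: float = 1.5) -> str:
    """Return the largest quant whose estimated footprint fits in memory.

    The headroom multiplier is a rough allowance for KV cache and
    runtime overhead on top of the raw file size (an assumption).
    """
    for name, size in sorted(QUANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size * headroom <= free_memory_gb:
            return name
    return "Q3_K_M"  # smallest available quant as a fallback

print(pick_quant(16.0))  # F16
print(pick_quant(4.0))   # Q4_K_M
```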

## Quick Start

### Using llama.cpp

```bash
# Download model
wget https://huggingface.co/LocoreMind/LocoTrainer-4B-GGUF/resolve/main/LocoTrainer-4B-Q4_K_M.gguf

# Start server
./llama-server -m LocoTrainer-4B-Q4_K_M.gguf --port 8080 --ctx-size 32768
```
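
Once the server is up, it can be queried over its OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch using only the Python standard library; the `model` field and the payload shape follow the OpenAI chat-completions convention, and the value of `model` is illustrative since the server serves a single loaded model:

```python
import json
import urllib.request

def build_request(question: str,
                  url: str = "http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat-completions request for llama-server."""
    payload = {
        "model": "LocoTrainer-4B",  # illustrative; one model is loaded
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With the server from the snippet above running:
#   with urllib.request.urlopen(build_request("What is MS-SWIFT?")) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
```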

### Using LocoTrainer Framework

```bash
# Configure .env
export LOCOTRAINER_BASE_URL=http://localhost:8080/v1
export LOCOTRAINER_MODEL=LocoTrainer-4B

# Run
locotrainer run -q "What are the default LoRA settings in ms-swift?"
```

### Using llama-cpp-python

```python
from llama_cpp import Llama

# Load the quantized model
llm = Llama(
    model_path="LocoTrainer-4B-Q4_K_M.gguf",
    n_gpu_layers=99,  # offload all layers to GPU when available
    n_ctx=32768,
)

response = llm(
    "What is MS-SWIFT?",
    max_tokens=512,
)
print(response["choices"][0]["text"])
```
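
For multi-turn use, llama-cpp-python also exposes a chat-style interface, `create_chat_completion`, which takes OpenAI-format messages. A sketch of building the conversation; the system prompt is an illustrative assumption, not something shipped with the model:

```python
def make_messages(question: str) -> list:
    """Build an OpenAI-style message list for create_chat_completion()."""
    return [
        # Illustrative system prompt; not part of the released model.
        {"role": "system", "content": "You are an MS-SWIFT expert assistant."},
        {"role": "user", "content": question},
    ]

# With the model loaded as in the snippet above:
#   out = llm.create_chat_completion(
#       messages=make_messages("What are the default LoRA settings in ms-swift?"),
#       max_tokens=512,
#   )
#   print(out["choices"][0]["message"]["content"])
```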

## Performance Metrics

Tested on an NVIDIA H100:

- **First Token Latency**: ~200-300ms
- **Subsequent Token Speed**: 50-100 tokens/sec
- **Memory Usage** (Q4_K_M): ~10-12GB
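
A rough end-to-end timing estimate is first-token latency plus token count divided by decode speed. A back-of-envelope sketch using the midpoints of the ranges above (the helper name and midpoint defaults are illustrative):

```python
def estimate_seconds(n_tokens: int,
                     first_token_ms: float = 250.0,  # midpoint of 200-300ms
                     tokens_per_sec: float = 75.0    # midpoint of 50-100 tok/s
                     ) -> float:
    """Rough wall-clock time to generate n_tokens tokens."""
    return first_token_ms / 1000.0 + n_tokens / tokens_per_sec

print(round(estimate_seconds(512), 1))  # ~7.1 seconds for a 512-token answer
```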
|
|
## Features

- **MS-SWIFT Domain Expert**: Trained on the MS-SWIFT documentation and codebase
- **Tool Calling**: Supports the Read, Grep, Glob, Bash, and Write tools
- **End-to-End Reports**: From question to a complete markdown analysis report
- **Local Deployment**: Fully offline, zero API cost
- **Long Context**: Supports up to 32K tokens
|
|
## Use Cases

- Codebase analysis and documentation generation
- MS-SWIFT framework Q&A
- Local AI agent deployment
- Offline inference applications
|
|
## License

MIT
|
|
## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) - Base model
- [MS-SWIFT](https://github.com/modelscope/ms-swift) - Training framework
- [llama.cpp](https://github.com/ggml-org/llama.cpp) - GGUF quantization and inference
- [Anthropic](https://www.anthropic.com/) - Claude Code design inspiration
|
|
## Related Resources

- [Original Model](https://huggingface.co/LocoreMind/LocoTrainer-4B)
- [LocoTrainer Framework](https://github.com/LocoreMind/LocoTrainer)
- [llama.cpp Documentation](https://github.com/ggml-org/llama.cpp)
|
|