Instructions to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF", filename="Qwen3-Desert.Coder.MoE-8X0.6B.Q4_K_M.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with Ollama:
ollama run hf.co/WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
- Unsloth Studio new
How to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF to start chatting
- Docker Model Runner
How to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with Docker Model Runner:
docker model run hf.co/WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
- Lemonade
How to use WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3-Desert.Coder.MoE-8X0.6B-GGUF-Q4_K_M
List all available models
lemonade list
output = llm(
"Once upon a time,",
max_tokens=512,
echo=True
)
print(output)Qwen3-Desert.Coder.MoE-8X0.6B
📌 Model Overview
Model Name: WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B Organization: Within Us AI Model Type: Mixture-of-Experts (MoE) Code LLM Architecture: Qwen 3 (MoE) Expert Configuration: 8 × 0.6B experts Active Parameters (per token): ~0.6B–1.2B (estimated routing) Total Parameters: ~2B–4B class (sparse MoE structure) Primary Focus: Efficient agentic coding + sparse reasoning
This model is a Mixture-of-Experts coding system, designed to deliver high capability at low compute cost by activating only a subset of its network per token.
It’s part of the Within Us AI push toward:
“Sparse intelligence: bigger thinking, smaller runtime.”
The model appears in the WithinUsAI lineup as a MoE-based coding variant alongside dense and nano models. 
⸻
🧬 Architecture & Lineage
Base Foundation
- Built on Qwen 3 architecture, a strong open LLM family known for multilingual understanding and coding capability
- Qwen models are widely used for efficient, high-performance reasoning and coding systems 
MoE Design (8×0.6B)
This model uses a Mixture-of-Experts (MoE) structure:
- 8 specialized expert subnetworks (~0.6B each)
- A router dynamically selects which experts activate per token
- Only a subset runs → reducing compute cost
Why MoE Matters
Instead of one monolithic brain 🧠 this model is more like a team of specialists:
- One expert for syntax
- One for logic
- One for debugging
- One for reasoning patterns
Only the needed “experts” wake up per task.
⸻
🧠 Core Design Philosophy
Don’t make one model smarter… make many small ones collaborate.
Design Goals:
- High coding performance per FLOP
- Sparse activation for efficiency
- Agent-compatible reasoning
- Local + scalable deployment
⸻
⚙️ Key Capabilities
💻 Coding
- Multi-language support (Python, JS, C++, etc.)
- Function generation and debugging
- Algorithm reasoning
🤖 Agentic Behavior
- Task decomposition
- Tool-use compatibility
- Structured outputs (JSON, steps)
🧠 Sparse Reasoning
- Expert specialization improves efficiency
- Handles diverse coding tasks with targeted computation
⸻
📦 Deployment Characteristics
Runtime Behavior
- Activates only part of the network → lower compute cost
- Faster inference than dense models of similar total size
- Scales well across CPU and GPU environments
Supported Environments
- Hugging Face Transformers
- vLLM (if MoE supported)
- Custom inference pipelines
- GGUF possible if converted
⸻
🚀 Intended Use
✅ Ideal Use Cases
- Coding agents (multi-step workflows)
- Efficient local deployments
- Multi-agent systems (many small models)
- Research into MoE architectures
- Cost-sensitive AI systems
⚠️ Limitations
- MoE routing can be unstable in edge cases
- Requires proper inference support (not all runtimes handle MoE well)
- Smaller active parameter size limits deep reasoning vs large dense models
⸻
🧪 Training & Methodology
Within Us AI pipeline includes:
- Code-focused instruction tuning
- Agentic workflow datasets
- Reasoning trace integration
- Evaluation-driven refinement
Data Sources
- Proprietary Within Us AI datasets
- Third-party datasets (no ownership claimed)
- Focus on:
- Coding tasks
- Debugging workflows
- Structured reasoning
⸻
📊 Expected Performance Profile
Capability Strength Coding High Efficiency Very High Reasoning depth Moderate Scalability High Agent readiness High
⸻
📜 License
License Type: Inherits from Qwen / base model ecosystem
Attribution Notes:
- Base architecture: Qwen (Alibaba ecosystem)
- MoE + training methodology: Within Us AI
- Third-party datasets used without ownership claims
- Credit belongs to original creators
⸻
🙏 Acknowledgements
- Alibaba Qwen team
- Open-source MoE research community
- Hugging Face ecosystem
- Dataset contributors
⸻
🔗 Links
- Model: https://huggingface.co/WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B
- Organization: https://huggingface.co/WithinUsAI
⸻
🧩 Closing Note
This model feels like a desert outpost of specialists 🏜️
Quiet. Efficient. Each expert waiting…
…and when the problem arrives, only the right minds step forward.
- Downloads last month
- 584
4-bit
5-bit
6-bit
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="WithinUsAI/Qwen3-Desert.Coder.MoE-8X0.6B-GGUF", filename="", )