Instructions to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="WithinUsAI/Llama-Coyote.Coder-4B.gguf", filename="Llama-Coyote.Coder-4B-Q4_K_M.gguf", )
output = llm( "Once upon a time,", max_tokens=512, echo=True ) print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
Use Docker
docker model run hf.co/WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with Ollama:
ollama run hf.co/WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
- Unsloth Studio new
How to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for WithinUsAI/Llama-Coyote.Coder-4B.gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for WithinUsAI/Llama-Coyote.Coder-4B.gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for WithinUsAI/Llama-Coyote.Coder-4B.gguf to start chatting
- Docker Model Runner
How to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with Docker Model Runner:
docker model run hf.co/WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
- Lemonade
How to use WithinUsAI/Llama-Coyote.Coder-4B.gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull WithinUsAI/Llama-Coyote.Coder-4B.gguf:Q4_K_M
Run and chat with the model
lemonade run user.Llama-Coyote.Coder-4B.gguf-Q4_K_M
List all available models
lemonade list
| datasets: | |
| - bigcode/the-stack | |
| - bigcode/the-stack-v2 | |
| - bigcode/starcoderdata | |
| - bigcode/commitpack | |
| Llama-Coyote.Coder-4B (GGUF) | |
| 📌 Model Overview | |
| Model Name: WithinUsAI/Llama-Coyote.Coder-4B.gguf | |
| Organization: Within Us AI | |
| Model Type: Code LLM (Instruction-Tuned, Agentic-Oriented) | |
| Parameter Size: 4B | |
| Format: GGUF (quantized for local inference) | |
| Primary Focus: Efficient coding + reasoning for local deployment | |
| This model is part of the Within Us AI ecosystem of compact, high-performance coding models, designed to run locally while still delivering structured reasoning and practical software engineering output.  | |
| ⸻ | |
| 🧬 Architecture & Lineage | |
| * Base Family: LLaMA-derived architecture (inferred from naming and ecosystem patterns) | |
| * Model Class: Dense transformer (~4B parameters) | |
| * Optimization Strategy: | |
| * Instruction tuning for coding tasks | |
| * Reasoning-aware outputs | |
| * GGUF quantization for edge deployment | |
| Ecosystem Position | |
| This model sits alongside: | |
| * Other 4B coding models | |
| * Agentic coders | |
| * Reasoning-distilled systems | |
| WithinUsAI focuses on agentic AI, tool use, and evaluation-driven training pipelines.  | |
| ⸻ | |
| 🧠 Core Design Philosophy | |
| Think of this model like a desert-hardened code hunter 🐺💻 | |
| Lean, efficient, and tuned to track down solutions without wasting compute. | |
| Design Goals: | |
| * Maximize coding performance per parameter | |
| * Encourage structured, step-by-step reasoning | |
| * Enable local-first AI development | |
| * Support agent-style workflows | |
| ⸻ | |
| ⚙️ Key Capabilities | |
| 💻 Coding | |
| * Multi-language support (Python, JS, C++, etc.) | |
| * Function generation and refactoring | |
| * Debugging assistance | |
| * Algorithm design | |
| 🤖 Agentic Behavior | |
| * Task decomposition | |
| * Instruction-following | |
| * Compatible with tool-calling frameworks | |
| 🧠 Reasoning | |
| * Step-by-step logic chains | |
| * Problem breakdown | |
| * Lightweight analytical reasoning | |
| ⸻ | |
| 📦 GGUF Format & Deployment | |
| Optimized for local inference environments: | |
| Supported Runtimes: | |
| * llama.cpp | |
| * LM Studio | |
| * Ollama (GGUF-compatible builds) | |
| Typical Quantization Options (4B): | |
| Quant RAM Needed Notes | |
| Q4_K_M ~3–4 GB Best balance | |
| Q5_K_M ~4–5 GB Higher quality | |
| Q8_0 ~6–8 GB Maximum fidelity | |
| ⸻ | |
| 🚀 Intended Use | |
| ✅ Ideal Use Cases | |
| * Local coding assistants | |
| * AI-powered IDE integrations | |
| * Autonomous coding agents | |
| * Script generation & debugging | |
| * Offline development workflows | |
| ⚠️ Limitations | |
| * Smaller parameter size limits deep reasoning vs larger models | |
| * Performance depends on prompt clarity | |
| * Tool use requires external orchestration | |
| ⸻ | |
| 🛠️ Usage Example (llama.cpp) | |
| ./main -m Llama-Coyote.Coder-4B.Q4_K_M.gguf \ | |
| -p "Write a Python script that monitors file changes and logs them." \ | |
| -n 512 | |
| ⸻ | |
| 🧪 Training & Methodology | |
| Within Us AI training approach includes: | |
| * Code-focused instruction tuning | |
| * Reasoning trace exposure | |
| * Evaluation-driven dataset design | |
| * Agentic workflow alignment | |
| Data Sources | |
| * Proprietary datasets created by Within Us AI | |
| * Third-party datasets used without ownership claims | |
| * Focus on: | |
| * Code reasoning | |
| * Debugging patterns | |
| * Structured outputs | |
| ⸻ | |
| 📊 Expected Performance Profile | |
| Capability Strength | |
| Coding High | |
| Efficiency Very High | |
| Reasoning depth Moderate | |
| General knowledge Moderate | |
| Agent readiness High | |
| ⸻ | |
| 📜 License | |
| License Type: Custom / Other (Within Us AI License Approach)** | |
| Terms: | |
| * Base architecture derived from third-party LLM ecosystems (e.g., LLaMA family) | |
| * Within Us AI developed: | |
| * Fine-tuning process | |
| * Model merging techniques | |
| * Training methodology | |
| * Third-party datasets may be used without ownership claims | |
| * Credit belongs to original creators | |
| ⸻ | |
| 🙏 Acknowledgements | |
| * Meta (LLaMA architecture inspiration) | |
| * Open-source GGUF / llama.cpp ecosystem | |
| * Hugging Face community | |
| * Dataset creators and contributors | |
| ⸻ | |
| 🔗 Links | |
| * Model: https://huggingface.co/WithinUsAI/Llama-Coyote.Coder-4B.gguf | |
| * Organization: https://huggingface.co/WithinUsAI | |
| ⸻ | |
| 🧩 Closing Note | |
| This one feels like a quiet operator in the sand 🏜️ | |
| Not loud. Not oversized. | |
| Just tracks the problem… and delivers code that works. | |