⚡ Apex-1-flash
Fast. Sharp. Thinks Before It Speaks.
A chain-of-thought reasoning model by OrbitAI
Built by a 13-year-old developer from Slovakia — because curiosity has no age limit.
🔍 Overview
Apex-1-flash is a supervised fine-tune of Qwen/Qwen3-4B-Thinking-2507, built to produce structured chain-of-thought reasoning at the 4B-parameter scale.
Trained on the Open-CoT-Reasoning-Mini dataset, Apex-1-flash is designed to think through problems step by step, which makes it well suited to logical reasoning, multi-step problem solving, and coherent explanations, while staying lean enough to run on consumer hardware.
This model was created by Matias Mikle (age 13, Slovakia 🇸🇰) alongside the OrbitAI team.
📋 Model Details
| Property | Value |
|---|---|
| Model Name | Apex-1-flash |
| Developer | Matias Mikle / OrbitAI |
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Architecture | Transformer — Causal Language Model (Decoder-Only) |
| Parameters | ~4.02 Billion |
| Fine-tuning Type | Supervised Fine-Tuning (SFT) |
| Dataset | Raymond-dev-546730/Open-CoT-Reasoning-Mini |
| Language | English (primary) |
| License | Apache 2.0 |
🧠 What Makes apex-1-flash Different
The name says it all — Apex for reaching the top, flash for speed and precision.
The flash philosophy shapes how the model was built:
- ⚡ Fast — At only ~4B parameters, it's lightweight enough to run on a single consumer GPU without sacrificing reasoning depth
- 🎯 Sharp — Fine-tuned specifically on structured chain-of-thought data, it breaks down problems cleanly before producing answers
- 💡 Thoughtful — Inherits the built-in thinking architecture from Qwen3, extended through CoT fine-tuning for more reliable step-by-step logic
Best suited for
- Logical and mathematical reasoning
- Step-by-step problem decomposition
- Structured explanation generation
- Research and educational tasks
- Multi-step Q&A
🚀 Quickstart
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "OrbitAIEU/Apex-1-flash"

# Load the tokenizer and the model in bfloat16, placing layers automatically
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain step by step how to solve: 3x + 7 = 22"
    }
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)
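Because this is a thinking model, the decoded response usually contains the reasoning trace followed by the final answer. The sketch below shows one way to separate the two. It assumes the fine-tune keeps the `</think>` closing tag used by the Qwen3 thinking checkpoints and that the tag survives decoding; `split_reasoning` is a hypothetical helper, not part of any library.

```python
def split_reasoning(decoded: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumption: the reasoning trace ends with `close_tag`; an opening
    <think> tag may or may not be present in the decoded text.
    """
    head, sep, tail = decoded.rpartition(close_tag)
    if not sep:
        # No closing tag: nothing to split, so return the text unchanged as
        # the answer. This can also mean generation stopped mid-reasoning.
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\n3x = 22 - 7 = 15, so x = 5.\n</think>\n\nx = 5"
reasoning, answer = split_reasoning(sample)
print(answer)
```

If the closing tag never appears, the token budget may have run out before the model finished reasoning; raising `max_new_tokens` is one thing to try.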
💾 Hardware Requirements
| Precision | Min. VRAM | Recommended For |
|---|---|---|
| Full precision (fp32) | ~16 GB | Not recommended |
| Half precision (bf16/fp16) | ~8 GB | RTX 3070 / RTX 4060 Ti and above |
| 4-bit quantized (GGUF/GPTQ) | ~3–4 GB | RTX 3060 / consumer-grade GPUs |
apex-1-flash is intentionally built at the 4B scale so it can run on everyday hardware — no enterprise cluster required.
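The figures in the table can be sanity-checked with a back-of-the-envelope estimate of weight memory: parameter count times bytes per parameter. This is a rough sketch, assuming the ~4.02B parameter count from the Model Details table and ignoring quantization metadata; `weight_gb` is a hypothetical helper.

```python
PARAMS = 4.02e9  # approximate parameter count from the Model Details table

# Bytes per parameter for each storage format (4-bit ignores quantization metadata)
BYTES_PER_PARAM = {"fp32": 4.0, "bf16/fp16": 2.0, "4-bit": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>10}: {weight_gb(PARAMS, size):.2f} GB of weights")
```

This covers the weights only, giving roughly 16, 8, and 2 GB. The KV cache, activations, and framework overhead come on top, which is why the 4-bit row lists more than 2 GB and why a card with exactly 8 GB may need some CPU offloading at half precision.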
🏋️ Training
The model was fine-tuned using Supervised Fine-Tuning (SFT) on top of the Qwen3-4B thinking checkpoint.
| Property | Value |
|---|---|
| Method | Supervised Fine-Tuning (SFT) |
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Dataset | Raymond-dev-546730/Open-CoT-Reasoning-Mini |
The Open-CoT-Reasoning-Mini dataset provides carefully structured reasoning traces and chain-of-thought examples, enabling the model to build stronger habits around multi-step logical inference.
⚠️ Limitations
- No safety alignment — Apex-1-flash has not undergone RLHF or safety tuning. It is not recommended for production use without additional safety layers.
- Domain scope — Performance is optimized for reasoning-heavy tasks; general-purpose capabilities are inherited from the base model.
- Inherited biases — The model may carry biases and limitations present in the Qwen3-4B base model.
- Benchmarks pending — Formal benchmark evaluations are currently in progress and will be published in a future update.
👤 About the Creator
"You don't need a PhD to train an AI model, you just need intelligence and GPU ofc."
🛰️ About OrbitAI
OrbitAI is an independent AI development team focused on building open, efficient, and accessible language models.
The team believes that AI research should not be limited to large corporations and well-funded labs. By working in the open — releasing models, sharing experiments, and collaborating with the community — OrbitAI aims to make frontier-style AI work accessible to anyone willing to put in the effort.
apex-1-flash is OrbitAI's first public model release.
📄 License
This model is released under the Apache License 2.0, in accordance with the license of the base model Qwen/Qwen3-4B-Thinking-2507.
| Permission | Allowed |
|---|---|
| Commercial use | ✅ Yes |
| Modification & distribution | ✅ Yes |
| Further fine-tuning | ✅ Yes |
| Research & academic use | ✅ Yes |
See the full Apache 2.0 License for complete terms.
🙏 Acknowledgements
- Qwen Team @ Alibaba Cloud — for releasing the powerful Qwen3 model family under an open license
- Raymond-dev-546730 — for creating and sharing the Open-CoT-Reasoning-Mini dataset
- The open-source AI community — for making all of this possible
Apex-1-flash · Made with ❤️ by Matias Mikle & OrbitAI · Slovakia 🇸🇰
If this project inspired you — download it, fork it, and build something even better.