Instructions for using OrbitAIEU/Apex-1-flash with libraries, inference providers, notebooks, and local apps.

Libraries

How to use OrbitAIEU/Apex-1-flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="OrbitAIEU/Apex-1-flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("OrbitAIEU/Apex-1-flash")
model = AutoModelForCausalLM.from_pretrained("OrbitAIEU/Apex-1-flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use OrbitAIEU/Apex-1-flash with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="OrbitAIEU/Apex-1-flash",
	filename="apex-1-flash-q4_k_m.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
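
`create_chat_completion` returns an OpenAI-style response dictionary rather than a plain string. The sketch below shows one way to pull the reply text out of it. `extract_reply` is a hypothetical helper name, and the sample dictionary is hand-written for illustration, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices", [])
    if not choices:
        return ""
    # The reply lives under choices[0]["message"]["content"].
    return choices[0].get("message", {}).get("content") or ""


# Hand-written stand-in; a real response comes from llm.create_chat_completion(...).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample_response))
```

In practice you would pass the return value of `llm.create_chat_completion(...)` straight to the helper.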

llama.cpp

How to use OrbitAIEU/Apex-1-flash with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf OrbitAIEU/Apex-1-flash:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf OrbitAIEU/Apex-1-flash:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf OrbitAIEU/Apex-1-flash:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf OrbitAIEU/Apex-1-flash:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf OrbitAIEU/Apex-1-flash:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf OrbitAIEU/Apex-1-flash:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf OrbitAIEU/Apex-1-flash:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf OrbitAIEU/Apex-1-flash:Q4_K_M

Use Docker

docker model run hf.co/OrbitAIEU/Apex-1-flash:Q4_K_M

vLLM

How to use OrbitAIEU/Apex-1-flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OrbitAIEU/Apex-1-flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OrbitAIEU/Apex-1-flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
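
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the command above is listening on `localhost:8000`. `build_chat_request` and `ask` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

# Defaults copied from the curl command above.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "OrbitAIEU/Apex-1-flash"


def build_chat_request(
    prompt: str, base_url: str = BASE_URL, model: str = MODEL_ID
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the assistant reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the server is up, `print(ask("What is the capital of France?"))` should return the reply text. Passing a different `base_url` points the helper at any other OpenAI-compatible endpoint.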

Use Docker

docker model run hf.co/OrbitAIEU/Apex-1-flash:Q4_K_M

SGLang

How to use OrbitAIEU/Apex-1-flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OrbitAIEU/Apex-1-flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OrbitAIEU/Apex-1-flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OrbitAIEU/Apex-1-flash" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OrbitAIEU/Apex-1-flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use OrbitAIEU/Apex-1-flash with Ollama:
```
ollama run hf.co/OrbitAIEU/Apex-1-flash:Q4_K_M
```

Unsloth Studio

How to use OrbitAIEU/Apex-1-flash with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for OrbitAIEU/Apex-1-flash to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for OrbitAIEU/Apex-1-flash to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for OrbitAIEU/Apex-1-flash to start chatting

Pi

How to use OrbitAIEU/Apex-1-flash with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf OrbitAIEU/Apex-1-flash:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OrbitAIEU/Apex-1-flash:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use OrbitAIEU/Apex-1-flash with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf OrbitAIEU/Apex-1-flash:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default OrbitAIEU/Apex-1-flash:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use OrbitAIEU/Apex-1-flash with Docker Model Runner:
```
docker model run hf.co/OrbitAIEU/Apex-1-flash:Q4_K_M
```

Lemonade

How to use OrbitAIEU/Apex-1-flash with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull OrbitAIEU/Apex-1-flash:Q4_K_M

Run and chat with the model

lemonade run user.Apex-1-flash-Q4_K_M

List all available models

lemonade list

⚡ Apex-1-flash

Fast. Sharp. Thinks Before It Speaks.

A chain-of-thought reasoning model by OrbitAI

Built by a 13-year-old developer from Slovakia — because curiosity has no age limit.

🔍 Overview

Apex-1-flash is a supervised fine-tune of Qwen/Qwen3-4B-Thinking-2507, built for structured chain-of-thought reasoning at the 4B-parameter scale.

Trained on the Open-CoT-Reasoning-Mini dataset, Apex-1-flash is designed to think through problems step by step. That makes it well suited to logical reasoning, multi-step problem solving, and coherent explanations, while staying lean enough to run on consumer hardware.

This model was created by Matias Mikle (age 13, Slovakia 🇸🇰) alongside the OrbitAI team.

📋 Model Details

| Property | Value |
| --- | --- |
| Model Name | Apex-1-flash |
| Developer | Matias Mikle / OrbitAI |
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Architecture | Transformer (decoder-only causal language model) |
| Parameters | ~4.02 billion |
| Fine-tuning Type | Supervised Fine-Tuning (SFT) |
| Dataset | Raymond-dev-546730/Open-CoT-Reasoning-Mini |
| Language | English (primary) |
| License | Apache 2.0 |

🧠 What Makes Apex-1-flash Different

The name says it all — Apex for reaching the top, flash for speed and precision.

The flash philosophy shapes how the model was built:

⚡ Fast — At only ~4B parameters, it's lightweight enough to run on a single consumer GPU without sacrificing reasoning depth
🎯 Sharp — Fine-tuned specifically on structured chain-of-thought data, it breaks down problems cleanly before producing answers
💡 Thoughtful — Inherits the built-in thinking architecture from Qwen3, extended through CoT fine-tuning for more reliable step-by-step logic

Best suited for

Logical and mathematical reasoning
Step-by-step problem decomposition
Structured explanation generation
Research and educational tasks
Multi-step Q&A

🚀 Quickstart

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "OrbitAIEU/Apex-1-flash"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Explain step by step how to solve: 3x + 7 = 22"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True
    )

response = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)
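
The base model, Qwen3-4B-Thinking-2507, writes its reasoning first and closes it with a `</think>` tag before the final answer. Assuming Apex-1-flash keeps that output format (this card does not confirm it), a small helper can separate the two parts. `split_reasoning` is a hypothetical name, and the sample string is hand-written rather than real model output.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer) at the last end_tag."""
    reasoning, sep, answer = text.rpartition(end_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Drop an opening tag if the output happens to include one.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Hand-written sample in the assumed format.
sample = "Subtract 7 from both sides: 3x = 15. Divide by 3.</think>\n\nx = 5"
reasoning, answer = split_reasoning(sample)
print(answer)
```

Applied to the Quickstart, `split_reasoning(response)` would let you show only the final answer while keeping the reasoning trace for inspection.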

💾 Hardware Requirements

| Precision | Min. VRAM | Recommended For |
| --- | --- | --- |
| Full precision (fp32) | ~16 GB | Not recommended |
| Half precision (bf16/fp16) | ~8 GB | RTX 3070 / RTX 4060 Ti and above |
| 4-bit quantized (GGUF/GPTQ) | ~3–4 GB | RTX 3060 / consumer-grade GPUs |

Apex-1-flash is intentionally built at the 4B scale so it can run on everyday hardware, with no enterprise cluster required.
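
The figures in the table can be sanity-checked with a back-of-the-envelope calculation: parameter count times bytes per parameter. The sketch below counts weights only. The KV cache, activations, and quantization metadata come on top, which is presumably why the 4-bit row lists more than the raw weight size. The 0.5 bytes per parameter for 4-bit is an idealized assumption; real GGUF/GPTQ files store slightly more.

```python
PARAMS = 4.02e9  # approximate parameter count from the Model Details table

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16/fp16": 2.0, "4-bit": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(PARAMS, size):.1f} GB")
```

Leave headroom above these numbers when choosing a GPU, especially for long reasoning traces.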

🏋️ Training

The model was fine-tuned using Supervised Fine-Tuning (SFT) on top of the Qwen3-4B thinking checkpoint.

| Property | Value |
| --- | --- |
| Method | Supervised Fine-Tuning (SFT) |
| Base Model | Qwen/Qwen3-4B-Thinking-2507 |
| Dataset | Raymond-dev-546730/Open-CoT-Reasoning-Mini |

The Open-CoT-Reasoning-Mini dataset provides carefully structured reasoning traces and chain-of-thought examples, enabling the model to build stronger habits around multi-step logical inference.
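
This card does not describe the training pipeline in detail. As a generic illustration of what SFT on chain-of-thought data commonly involves (not the author's actual code), the sketch below builds training labels that ignore the prompt tokens, so the loss is computed only on the reasoning trace and answer. The token IDs are made up, `build_sft_example` is a hypothetical name, and `-100` is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_sft_example(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX so they contribute no loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


# Made-up token IDs: three prompt tokens, two response tokens.
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])
print(example["labels"])
```

Whether Apex-1-flash was trained with prompt masking is not stated here; the sketch only shows the general idea.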

⚠️ Limitations

No safety alignment — Apex-1-flash has not undergone RLHF or safety tuning. It is not recommended for production use without additional safety layers.
Domain scope — Performance is optimized for reasoning-heavy tasks; general-purpose capabilities are inherited from the base model.
Inherited biases — The model may carry biases and limitations present in the Qwen3-4B base model.
Benchmarks pending — Formal benchmark evaluations are currently in progress and will be published in a future update.

👤 About the Creator

Matias Mikle

Age: 13 · Country: Slovakia 🇸🇰

Independent developer, AI researcher, and founder of OrbitAI. Matias started building AI projects from scratch, exploring fine-tuning, language model architecture, and full-stack development — proving that great work can come from anywhere, at any age.

"You don't need a Phd to train an AI model, you just need intelligence and GPU ofc."

🛰️ About OrbitAI

OrbitAI is an independent AI development team focused on building open, efficient, and accessible language models.

The team believes that AI research should not be limited to large corporations and well-funded labs. By working in the open — releasing models, sharing experiments, and collaborating with the community — OrbitAI aims to make frontier-style AI work accessible to anyone willing to put in the effort.

Apex-1-flash is OrbitAI's first public model release.

📄 License

This model is released under the Apache License 2.0, in accordance with the license of the base model Qwen/Qwen3-4B-Thinking-2507.

| Permission | Allowed |
| --- | --- |
| Commercial use | ✅ Yes |
| Modification & distribution | ✅ Yes |
| Further fine-tuning | ✅ Yes |
| Research & academic use | ✅ Yes |

See the full Apache 2.0 License for complete terms.

🙏 Acknowledgements

Qwen Team @ Alibaba Cloud — for releasing the powerful Qwen3 model family under an open license
Raymond-dev-546730 — for creating and sharing the Open-CoT-Reasoning-Mini dataset
The open-source AI community — for making all of this possible

Apex-1-flash · Made with ❤️ by Matias Mikle & OrbitAI · Slovakia 🇸🇰

If this project inspired you — download it, fork it, and build something even better.
