How to use Aquiles-ai/Athenea-4B-Coding with common libraries, inference servers, and local apps.

Libraries

How to use Aquiles-ai/Athenea-4B-Coding with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Aquiles-ai/Athenea-4B-Coding")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Coding")
model = AutoModelForCausalLM.from_pretrained("Aquiles-ai/Athenea-4B-Coding")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Aquiles-ai/Athenea-4B-Coding with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Aquiles-ai/Athenea-4B-Coding"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Aquiles-ai/Athenea-4B-Coding",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
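
The same request can be sent from Python without extra dependencies. The sketch below mirrors the curl call using only the standard library. `build_chat_request` and `send_chat_request` are hypothetical helper names, and the response shape (`choices[0].message.content`) is an assumption based on the server following the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "Aquiles-ai/Athenea-4B-Coding",
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant message text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `send_chat_request(build_chat_request("What is the capital of France?"))` would return the reply text. For the SGLang server, pass `base_url="http://localhost:30000/v1"` instead.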

Use Docker

docker model run hf.co/Aquiles-ai/Athenea-4B-Coding

SGLang

How to use Aquiles-ai/Athenea-4B-Coding with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Aquiles-ai/Athenea-4B-Coding" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Aquiles-ai/Athenea-4B-Coding",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Aquiles-ai/Athenea-4B-Coding" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Aquiles-ai/Athenea-4B-Coding with Docker Model Runner:
```
docker model run hf.co/Aquiles-ai/Athenea-4B-Coding
```
To use this model in llama.cpp, Ollama, LM Studio, or any compatible app, browse its quantizations on the Hugging Face Hub.

Athenea-4B-Coding

Athenea-4B-Coding is a fine-tuned version of huihui-ai/Huihui-Qwen3-4B-Thinking-2507-abliterated, specialized in code reasoning, debugging, and problem solving. Trained on high-quality programming data with explicit reasoning traces using <think> and </think> tags, the model is designed to perform detailed step-by-step reasoning for software development, algorithm design, and code comprehension tasks.

⚠️ Important Note: This model uses an abliterated (uncensored) base version, providing full expressive freedom and unrestricted output generation. Users are fully responsible for any use or content produced by the model. It is intended exclusively for research and experimentation purposes.

🎯 Model Description

Athenea-4B-Coding extends Huihui-Qwen3’s structured reasoning capabilities into programming-related domains, showing strong performance on logical problem-solving, code completion, and debugging scenarios.

Key features:

- Step-by-step code reasoning within <think> blocks
- Specialization in algorithmic and debugging tasks
- Uncensored output generation for full reasoning visibility
- Improved logical consistency through focused fine-tuning
- Compatible with open inference frameworks (Transformers, vLLM, etc.)
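
Because the reasoning trace and the final answer arrive in the same output, downstream code usually needs to separate them. The following is a minimal sketch, assuming the decoded text keeps the tags as literal strings. `split_reasoning` is a hypothetical helper, not part of the model or of Transformers. It also tolerates a missing opening tag, on the assumption that the chat template may already emit `<think>` as part of the prompt.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    If no closing tag is present, the whole text is returned as the
    answer and the reasoning is empty. This is a design choice of this
    sketch; truncated generations may need different handling.
    """
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        return "", text.strip()
    # Drop the opening tag if it is present; keep the text otherwise.
    reasoning = head.split(THINK_OPEN, 1)[-1].strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    sample = "<think>Base case is n <= 1.</think>\n\ndef fact(n): ..."
    reasoning, answer = split_reasoning(sample)
    print("REASONING:", reasoning)
    print("ANSWER:", answer)
```

Only the first `</think>` is treated as the boundary, so an answer that itself mentions the tag is left intact.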

The model was fine-tuned using the dataset Aquiles-ai/Athenea-Coding-100k, which includes diverse programming challenges, structured reasoning chains, and natural language explanations across multiple programming languages.

Note: Fine-tuning was performed using Kronos, Aquiles-ai’s proprietary enterprise fine-tuning system.

💻 Usage

Installation

uv pip install transformers torch accelerate

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained("Aquiles-ai/Athenea-4B-Coding",
        dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        attn_implementation="flash_attention_2") # Requires flash-attn

# Without flash-attn:
# model = AutoModelForCausalLM.from_pretrained("Aquiles-ai/Athenea-4B-Coding",
#     dtype="auto",
#     device_map="auto"
# )

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Coding", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, write a Python function that calculates the factorial of a number recursively."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto"

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=8092,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

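# Decode and print only the newly generated tokens (the prompt is skipped)
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))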

Streaming Inference

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model = AutoModelForCausalLM.from_pretrained("Aquiles-ai/Athenea-4B-Coding",
        dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
        attn_implementation="flash_attention_2")

tokenizer = AutoTokenizer.from_pretrained("Aquiles-ai/Athenea-4B-Coding", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hey, write a Python function that implements the binary search algorithm recursively."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto"

# Create the streamer
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Build kwargs for generate
generate_kwargs = dict(
    **inputs,
    max_new_tokens=8092,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)

def _generate_thread(model, kwargs):
    with torch.no_grad():
        model.generate(**kwargs)

thread = Thread(target=_generate_thread, args=(model, generate_kwargs))
thread.start()

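# Print text as soon as the generation thread produces it
for chunk in streamer:
    print(chunk, end="", flush=True)

thread.join()  # wait for the generation thread to finish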

Production Deployment with vLLM

Start server:

vllm serve Aquiles-ai/Athenea-4B-Coding \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key dummyapikey \
  --max-model-len=16384 \
  --async-scheduling \
  --gpu-memory-utilization=0.90

Send a request to the server with the OpenAI Python client:

from openai import OpenAI
client = OpenAI(api_key="dummyapikey", base_url="http://127.0.0.1:8000/v1")
stream = client.chat.completions.create(
    model="Aquiles-ai/Athenea-4B-Coding",
    messages=[{
        "role": "user",
        "content": "Hey, write a Python function that determines if a string is a palindrome, ignoring case, spaces, and punctuation."
    }],
    max_tokens=8092,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
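
When the full reply is needed as a single string, for example to post-process it after streaming, the chunks can be accumulated instead of printed. This is a minimal sketch: `collect_stream` is a hypothetical helper that relies only on the `choices[0].delta.content` attribute path used in the loop above, and `fake_chunk` is a stand-in for local testing, not an OpenAI client type.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def collect_stream(stream: Iterable[Any]) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            # Skip chunks that carry no choices at all.
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Hypothetical stand-in mimicking the shape of a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk("def is_pal"), fake_chunk(None), fake_chunk("indrome(s): ...")]
    print(collect_stream(demo))
```

With a running server, `collect_stream(stream)` can replace the printing loop. The stream can only be consumed once, so choose either printing or collecting for a given request.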