| datasets: | |
| - Aquiles-ai/Athenea-Coding-100k | |
| - roskosmos19/Rhea-Coding | |
| license: apache-2.0 | |
| tags: | |
| - code | |
| - agent | |
| - merge | |
| - uncensored | |
| - Rhea | |
| - multi-pass | |
| - reasoning | |
| language: | |
| - en | |
| pipeline_tag: text-generation | |
| library_name: transformers | |
| <h1 align="center">Rhea 4B Coding</h1> | |
|  | |
| **Rhea-4B-Coding** is an optimized version of [Aquiles-ai/Athenea-4B-Coding](https://huggingface.co/Aquiles-ai/Athenea-4B-Coding), specialized in **code reasoning, debugging, agentic tools and multi-pass problem solving**. | |
| Trained on high-quality programming data with explicit reasoning traces delimited by `<think>` and `</think>` tags, the model is designed to perform detailed **3-pass reasoning** for software development, algorithm design, and code comprehension tasks: | |
| 1. **Pass 1**: First implementation | |
| 2. **Pass 2**: Self-review for bugs, edge cases, security, performance | |
| 3. **Pass 3**: Final optimized version with identical functionality | |
| > ⚠️ **Important Note:** This model uses an *uncensored* base version, providing full expressive freedom and unrestricted output generation. Users are fully responsible for any use or content produced by the model. It is intended exclusively for research and experimentation purposes. | |
| ## 🎯 Model Description | |
| Rhea-4B-Coding extends Athenea-4B-Coding's structured reasoning capabilities into programming-related domains with **multi-pass processing**, showing strong performance on logical problem-solving, code completion, debugging scenarios, and iterative code refinement. | |
| Key features: | |
| * **Multi-Pass Processing**: 3-step reasoning with `<think>`, `<review>`, and `<final>` tags | |
| * **Agentic Tools** for AI Agents | |
| * **Step-by-step code reasoning** within `<think>` blocks | |
| * **Self-review capabilities** for bug detection and optimization | |
| * **Specialization in algorithmic and debugging tasks** | |
| * **Uncensored output generation** for full reasoning visibility | |
| * **Improved logical consistency** through focused fine-tuning | |
| * **Compatible with open inference frameworks** (Transformers, vLLM, etc.) | |
| The model was fine-tuned using the dataset [Aquiles-ai/Athenea-Coding-100k](https://huggingface.co/datasets/Aquiles-ai/Athenea-Coding-100k), which includes diverse programming challenges, structured reasoning chains, and natural language explanations across multiple programming languages. | |
| ## 🔄 Multi-Pass Architecture | |
| The model uses special tokens for structured reasoning: | |
| | Token | Purpose | | |
| |-------|---------| | |
| | `<think>` | Start of self-review phase (Pass 2) | | |
| | `</think>` | End of self-review phase | | |
| | `<review>` | Start of review results documentation | | |
| | `</review>` | End of review results | | |
| | `<final>` | Start of final optimized version (Pass 3) | | |
| | `</final>` | End of final version | | |
| This structure ensures **identical functionality** across all passes while improving **code structure, comments, and robustness**. | |
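
The tags in the table above can be pulled out of a raw completion with a small parser. The following is a minimal sketch, assuming the model emits each tag as a matched open/close pair; the `split_passes` helper and the sample string are hypothetical and not part of the model's tooling.

```python
import re

# Tag names come from the token table; the parsing approach is an assumption.
TAGS = ("think", "review", "final")


def split_passes(text: str) -> dict:
    """Return the text inside each <tag>...</tag> pair, or None if a pair is missing."""
    sections = {}
    for tag in TAGS:
        # DOTALL lets a section span several lines; non-greedy stops at the first close tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


# Hypothetical completion, used only to exercise the parser.
sample = (
    "def add(a, b): return a + b\n"
    "<think>No input validation.</think>\n"
    "<review>Add type hints.</review>\n"
    "<final>def add(a: int, b: int) -> int:\n    return a + b</final>"
)
parts = split_passes(sample)
print(parts["review"])
```

Decoding with `skip_special_tokens=False`, as the usage examples do, keeps the tags in the text so a parser like this has something to match.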
| ## 💻 Usage | |
| ### Installation | |
| ```bash | |
| uv pip install transformers torch accelerate | |
| ``` | |
| ### Basic Inference | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| import torch | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     "Roskosmos19/Rhea-4B-Coding", | |
|     dtype=torch.bfloat16, | |
|     trust_remote_code=True, | |
|     device_map="auto", | |
|     attn_implementation="flash_attention_2",  # Requires flash-attn | |
| ) | |
| # Without flash-attn: | |
| # model = AutoModelForCausalLM.from_pretrained( | |
| #     "Roskosmos19/Rhea-4B-Coding", | |
| #     dtype="auto", | |
| #     device_map="auto", | |
| # ) | |
| tokenizer = AutoTokenizer.from_pretrained("Roskosmos19/Rhea-4B-Coding", trust_remote_code=True) | |
| messages = [ | |
|     {"role": "user", "content": "Hey, write a Python function that calculates the factorial of a number recursively."} | |
| ] | |
| # Build the prompt with the chat template and move it to the model's device | |
| inputs = tokenizer.apply_chat_template( | |
|     messages, | |
|     add_generation_prompt=True, | |
|     tokenize=True, | |
|     return_dict=True, | |
|     return_tensors="pt", | |
| ).to(model.device)  # Follows the placement chosen by device_map="auto" | |
| with torch.no_grad(): | |
|     output = model.generate( | |
|         **inputs, | |
|         max_new_tokens=16384,  # Increased for multi-pass output | |
|         pad_token_id=tokenizer.pad_token_id, | |
|         eos_token_id=tokenizer.eos_token_id, | |
|     ) | |
| # Decode and print the output; special tokens are kept so the pass tags stay visible | |
| print(tokenizer.decode(output[0], skip_special_tokens=False)) | |
| ``` | |
| ### Multi-Pass Inference (Recommended) | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| import torch | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     "Roskosmos19/Rhea-4B-Coding", | |
|     dtype=torch.bfloat16, | |
|     trust_remote_code=True, | |
|     device_map="auto", | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained("Roskosmos19/Rhea-4B-Coding", trust_remote_code=True) | |
| def generate_multi_pass(prompt, max_tokens_per_pass=4096): | |
|     """ | |
|     Generate code with 3-pass reasoning: | |
|     Pass 1: First implementation | |
|     Pass 2: Self-review | |
|     Pass 3: Final optimized version | |
|     Each returned string contains the full text so far (prompt plus earlier passes). | |
|     """ | |
|     # Pass 1: First implementation | |
|     messages = [{"role": "user", "content": prompt}] | |
|     inputs = tokenizer.apply_chat_template( | |
|         messages, | |
|         add_generation_prompt=True, | |
|         tokenize=True, | |
|         return_dict=True, | |
|         return_tensors="pt", | |
|     ).to(model.device) | |
|     with torch.no_grad(): | |
|         output1 = model.generate( | |
|             **inputs, | |
|             max_new_tokens=max_tokens_per_pass, | |
|             do_sample=True,  # temperature only takes effect when sampling | |
|             temperature=0.4, | |
|             pad_token_id=tokenizer.pad_token_id, | |
|             eos_token_id=tokenizer.eos_token_id, | |
|         ) | |
|     pass1 = tokenizer.decode(output1[0], skip_special_tokens=False) | |
|     # Pass 2: Self-review, continued from the pass 1 text with an opening <think> tag | |
|     review_prompt = pass1 + "\n<think>\n### PASS 2 - Self-Review:\n" | |
|     inputs2 = tokenizer(review_prompt, return_tensors="pt").to(model.device) | |
|     with torch.no_grad(): | |
|         output2 = model.generate( | |
|             **inputs2, | |
|             max_new_tokens=2048, | |
|             do_sample=True, | |
|             temperature=0.3, | |
|             pad_token_id=tokenizer.pad_token_id, | |
|             eos_token_id=tokenizer.eos_token_id, | |
|         ) | |
|     review = tokenizer.decode(output2[0], skip_special_tokens=False) | |
|     # Pass 3: Final version, continued from the review text with an opening <final> tag | |
|     final_prompt = review + "\n<final>\n### PASS 3 - Final Version:\n" | |
|     inputs3 = tokenizer(final_prompt, return_tensors="pt").to(model.device) | |
|     with torch.no_grad(): | |
|         output3 = model.generate( | |
|             **inputs3, | |
|             max_new_tokens=max_tokens_per_pass, | |
|             do_sample=True, | |
|             temperature=0.2, | |
|             pad_token_id=tokenizer.pad_token_id, | |
|             eos_token_id=tokenizer.eos_token_id, | |
|         ) | |
|     final = tokenizer.decode(output3[0], skip_special_tokens=False) | |
|     return { | |
|         "pass1": pass1, | |
|         "review": review, | |
|         "pass3": final, | |
|     } | |
| # Example usage | |
| result = generate_multi_pass("Write a Python function for binary search") | |
| print("=== PASS 1 ===") | |
| print(result["pass1"]) | |
| print("\n=== REVIEW ===") | |
| print(result["review"]) | |
| print("\n=== FINAL ===") | |
| print(result["pass3"]) | |
| ``` | |
| ### Streaming Inference | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer | |
| import torch | |
| from threading import Thread | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     "Roskosmos19/Rhea-4B-Coding", | |
|     dtype=torch.bfloat16, | |
|     trust_remote_code=True, | |
|     device_map="auto", | |
|     attn_implementation="flash_attention_2",  # Requires flash-attn | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained("Roskosmos19/Rhea-4B-Coding", trust_remote_code=True) | |
| messages = [ | |
|     {"role": "user", "content": "Hey, write a Python function that implements the binary search algorithm recursively."} | |
| ] | |
| inputs = tokenizer.apply_chat_template( | |
|     messages, | |
|     add_generation_prompt=True, | |
|     tokenize=True, | |
|     return_dict=True, | |
|     return_tensors="pt", | |
| ).to(model.device) | |
| # Create the streamer; skip_prompt=True yields only newly generated text | |
| streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False) | |
| # Build kwargs for generate | |
| generate_kwargs = dict( | |
|     **inputs, | |
|     max_new_tokens=16384,  # Increased for multi-pass output | |
|     pad_token_id=tokenizer.pad_token_id, | |
|     eos_token_id=tokenizer.eos_token_id, | |
|     streamer=streamer, | |
| ) | |
| def _generate_thread(model, kwargs): | |
|     with torch.no_grad(): | |
|         model.generate(**kwargs) | |
| # Run generation in a background thread so the main thread can consume the stream | |
| thread = Thread(target=_generate_thread, args=(model, generate_kwargs)) | |
| thread.start() | |
| for chunk in streamer: | |
|     print(chunk, end="", flush=True) | |
| thread.join()  # Wait for generation to finish before exiting | |
| ``` | |
| ### Production Deployment with vLLM | |
| **Start server:** | |
| ```bash | |
| vllm serve roskosmos19/Rhea-4B-Coding \ | |
| --host 0.0.0.0 \ | |
| --port 8000 \ | |
| --api-key dummyapikey \ | |
| --max-model-len=262144 \ | |
| --async-scheduling \ | |
| --gpu-memory-utilization=0.90 | |
| ``` | |
| **Request to the server from the OpenAI client:** | |
| ```python | |
| from openai import OpenAI | |
| client = OpenAI(api_key="dummyapikey", base_url="http://127.0.0.1:8000/v1") | |
| stream = client.chat.completions.create( | |
|     model="roskosmos19/Rhea-4B-Coding",  # Must match the name passed to `vllm serve` | |
|     messages=[{ | |
|         "role": "user", | |
|         "content": "Hey, write a Python function that determines if a string is a palindrome, ignoring case, spaces, and punctuation." | |
|     }], | |
|     max_tokens=16384,  # Increased for multi-pass output | |
|     stream=True, | |
| ) | |
| for chunk in stream: | |
|     # Skip chunks with no choices or no text delta | |
|     if chunk.choices and chunk.choices[0].delta.content: | |
|         print(chunk.choices[0].delta.content, end="", flush=True) | |
| ``` | |
| **vLLM Benefits:** 20-30x faster inference, OpenAI-compatible API, continuous batching, async scheduling. | |
| ## 📝 Model Configuration | |
| | Parameter | Value | Description | | |
| |-----------|-------|-------------| | |
| | `temperature` | 0.4 | Balanced creativity and consistency | | |
| | `max_new_tokens` | 32768 | Full multi-pass output capacity | | |
| | `repetition_penalty` | 1.0 | No penalty for intentional code repetition | | |
| | `no_repeat_ngram_size` | 0 | Allows code structure repetition | | |
| | `use_cache` | true | Faster inference for long outputs | | |
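
The table values can be collected into keyword arguments for `model.generate`. This is a minimal sketch: the values are copied from the table, while `do_sample=True` is an added assumption (in Transformers, `temperature` only has an effect when sampling is enabled) and `with_overrides` is a hypothetical helper.

```python
# Values copied from the configuration table; do_sample=True is an assumption
# added so that the temperature setting is actually applied.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.4,
    "max_new_tokens": 32768,
    "repetition_penalty": 1.0,
    "no_repeat_ngram_size": 0,
    "use_cache": True,
}


def with_overrides(**overrides):
    """Return a copy of the defaults with per-call overrides applied."""
    unknown = set(overrides) - set(GENERATION_KWARGS)
    if unknown:
        raise KeyError(f"unexpected generation arguments: {sorted(unknown)}")
    return {**GENERATION_KWARGS, **overrides}


# Example: a shorter, lower-temperature setting such as a review pass might use.
review_kwargs = with_overrides(temperature=0.3, max_new_tokens=2048)
print(review_kwargs)
```

The resulting dictionary could then be passed as `model.generate(**inputs, **review_kwargs)`.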
| ## ⚙️ Files Modified for Multi-Pass | |
| | File | Changes | | |
| |------|---------| | |
| | `generation_config.json` | Extended tokens, optimized for multi-pass | | |
| | `config.json` | Enabled caching, full context window | | |
| | `tokenizer_config.json` | Added `<think>`, `<review>`, `<final>` tokens | | |
| | `special_tokens_map.json` | Registered new special tokens | | |
| | `chat_template.jinja` | 3-pass prompt structure | | |
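
Whether a loaded tokenizer actually contains the tags listed above can be checked against its vocabulary. The sketch below relies only on `get_vocab()`, which Transformers tokenizers provide; `missing_special_tokens` and `StubTokenizer` are hypothetical names, and the stub exists only so the example runs without downloading the model.

```python
# Tag list taken from the token table in the Multi-Pass Architecture section.
EXPECTED_TOKENS = ["<think>", "</think>", "<review>", "</review>", "<final>", "</final>"]


def missing_special_tokens(tokenizer, expected=EXPECTED_TOKENS):
    """List the expected tokens that are absent from the tokenizer vocabulary."""
    vocab = tokenizer.get_vocab()  # token -> id mapping, including added tokens
    return [tok for tok in expected if tok not in vocab]


class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer, used only for this demo."""

    def get_vocab(self):
        return {"<think>": 1, "</think>": 2, "<final>": 3, "</final>": 4}


missing = missing_special_tokens(StubTokenizer())
print(missing)  # → ['<review>', '</review>']
```

With the real model, the same function would be called on the object returned by `AutoTokenizer.from_pretrained(...)`; an empty list would indicate that all six tags are registered.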
| ## 🤝 Credits | |
| - Base model: [Aquiles-ai/Athenea-4B-Coding](https://huggingface.co/Aquiles-ai/Athenea-4B-Coding) | |
| - Dataset: [Aquiles-ai/Athenea-Coding-100k](https://huggingface.co/datasets/Aquiles-ai/Athenea-Coding-100k) | |
| - Architecture: Qwen3 4B | |
| <p align="center"> | |
| Roskosmos19 | |
| </p> |