Circe-1.5B is a single-checkpoint, 1.5 B-parameter language model that asks a simple question:
“How far can you push tiny models on a tiny budget?”
| ⚙️ Spec | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| Trainable params | 4 M (LoRA) |
| Post-training cost | ≈ US $12 on 1×L40S |
| Training recipe | 8 h SFT → 4 h GRPO |
| Context length | up to 4 k tokens (tested) |
| RAM @ bf16 | ~9 GB (≤ 3 GB 4-bit GPTQ) |
| Throughput | ~55 tok / s on 1×A6000 (fp16, no compile) |
It keeps DeepSeek-R1’s strong reasoning depth but adds fluent bilingual chat (English & Spanish) in a checkpoint that fits on a laptop GPU.
We intend to use it as a reproducible waypoint on the road to real-time speech-to-speech reasoning systems.
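The figures in the spec table can be cross-checked with some quick arithmetic. The sketch below uses only the table's own numbers. The helper `weights_gb` is hypothetical and counts weights only, so it is a lower bound rather than a measurement. That the table's ~9 GB figure also covers activations, KV cache, and framework overhead is an assumption, not something the card states.

```python
# Back-of-the-envelope checks on the spec table. Inputs are the card's own
# figures; everything derived from them is an estimate, not a measurement.

PARAMS = 1.5e9            # nominal parameter count from the card
LORA_PARAMS = 4e6         # trainable LoRA parameters
SFT_HOURS, GRPO_HOURS = 8, 4
COST_USD = 12.0           # approximate post-training cost


def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Weights-only size in decimal gigabytes (no activations, no KV cache)."""
    return n_params * bits_per_param / 8 / 1e9


total_hours = SFT_HOURS + GRPO_HOURS
usd_per_gpu_hour = COST_USD / total_hours
trainable_pct = 100 * LORA_PARAMS / PARAMS

print(f"GPU hours: {total_hours}")
print(f"implied rate: ${usd_per_gpu_hour:.2f}/h")
print(f"trainable share: {trainable_pct:.2f}%")
print(f"bf16 weights: {weights_gb(PARAMS, 16):.2f} GB")
print(f"4-bit weights: {weights_gb(PARAMS, 4):.2f} GB")
```

Under these inputs the recipe comes to 12 GPU hours at roughly US $1 per hour, and the LoRA adapters touch well under 1 % of the parameters.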
🔭 Intended Use
- Base for new LoRAs — domain adaptation, longer-context studies.
- Research into cost-efficient RL for reasoning.
- Not for high-stakes or production tasks.
See the ⚙️ Limitations section before use.
⚡ Quickstart
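# Load the checkpoint in bfloat16, plus its tokenizer.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("PaletLabs/Circe-1.5B", torch_dtype="bfloat16")
tok = AutoTokenizer.from_pretrained("PaletLabs/Circe-1.5B")
# Single-turn prompt: user tag, question, then the assistant tag to start the reply.
prompt = "<|user|>¿Cómo se dice “tiny model” en español?<|assistant|>"
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=64)
# The decoded text contains the prompt followed by the generated continuation.
print(tok.decode(out[0], skip_special_tokens=True))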
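The Quickstart decodes the whole sequence, so the printed text repeats the prompt. A minimal sketch of returning only the reply is shown below. `build_prompt` and `generate_reply` are hypothetical helpers, not part of the Circe repository, and the tag format is copied from the Quickstart prompt rather than from any documented template.

```python
# Sketch only: build_prompt and generate_reply are hypothetical helpers,
# not part of the Circe repository.

def build_prompt(user_text: str) -> str:
    """Wrap one user turn in the tag format used by the Quickstart."""
    return f"<|user|>{user_text}<|assistant|>"


def generate_reply(model, tok, user_text: str, max_new_tokens: int = 64) -> str:
    """Generate a reply and return only the newly generated text."""
    inputs = tok(build_prompt(user_text), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[-1]   # number of prompt tokens
    # Slice off the prompt tokens so only the continuation is decoded.
    return tok.decode(out[0][prompt_len:], skip_special_tokens=True)
```

With `model` and `tok` loaded as in the Quickstart, `generate_reply(model, tok, "¿Qué es un LoRA?")` would return just the assistant's text.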
🛠️ Installation
git clone https://github.com/palet-global/circe
cd circe
python -m venv venv && source venv/bin/activate
pip install .
🏗️ Re-Training Pipeline
Data
python data/fetch_datasets.py --out data/processed
Supervised LoRA
accelerate config default # one-time
accelerate launch train/sft.py \
--data_dir data/processed \
--output_dir checkpoints/sft
RL (GRPO)
accelerate launch train/rl_grpo.py \
--data_dir data/processed \
--output_dir checkpoints/grpo \
--init_ckpt checkpoints/sft/checkpoint-13000 \
--num_steps 3000 --save_steps 500 --group 4
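For readers new to GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, so no separate value model is needed. The sketch below illustrates that group-relative normalization in plain Python. It is not the code in train/rl_grpo.py. Reading `--group 4` as four completions per prompt is an assumption, the reward values are made up, and the choice of population standard deviation is one common option.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a hypothetical reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat the group average get a positive advantage and the rest get a negative one. If every completion in a group scores the same, all advantages are zero and that prompt contributes no gradient signal.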
Merge and Tokenizer
python train/merge_lora.py \
--ckpt_dir checkpoints/grpo \
--base deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
SQuAD Sanity Checks
python eval/quick_squad_eval.py --model ./merged --dataset squad
python eval/quick_squad_eval.py --model ./merged --dataset squad_es
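SQuAD-style checks are usually scored with exact match and token-level F1 after light text normalization. The sketch below shows those two metrics. It is an illustration of the standard scoring scheme, not the contents of eval/quick_squad_eval.py, and its normalization only strips English articles and ASCII punctuation, so Spanish articles and inverted punctuation would need extra handling for squad_es.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris, France", "Paris"), 3))
```

F1 gives partial credit when the prediction contains the answer plus extra words, which is why it is typically reported alongside exact match.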
Upload
python train/upload_to_hub.py \
--model_dir merged \
--repo PaletLabs/Circe-1.5B \
--token "$HF_TOKEN"
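The steps above can also be chained from a single script. The sketch below is a hypothetical driver, not a file in the repository. The command lines are copied from this section and only the orchestration is new. The one-time `accelerate config default` and the upload step are left out, the latter because it needs a credential. With `dry_run=True` nothing is executed.

```python
import shlex
import subprocess

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Commands copied from the steps above; only the orchestration is new.
STEPS = [
    ("data", "python data/fetch_datasets.py --out data/processed"),
    ("sft", "accelerate launch train/sft.py --data_dir data/processed "
            "--output_dir checkpoints/sft"),
    ("grpo", "accelerate launch train/rl_grpo.py --data_dir data/processed "
             "--output_dir checkpoints/grpo "
             "--init_ckpt checkpoints/sft/checkpoint-13000 "
             "--num_steps 3000 --save_steps 500 --group 4"),
    ("merge", f"python train/merge_lora.py --ckpt_dir checkpoints/grpo --base {BASE}"),
    ("eval_en", "python eval/quick_squad_eval.py --model ./merged --dataset squad"),
    ("eval_es", "python eval/quick_squad_eval.py --model ./merged --dataset squad_es"),
]


def run_pipeline(dry_run: bool = True) -> list:
    """Print each step in order and, unless dry_run, execute it."""
    commands = [shlex.split(cmd) for _, cmd in STEPS]
    for (name, _), argv in zip(STEPS, commands):
        print(f"[{name}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, check=True)   # raises CalledProcessError on failure
    return commands


commands = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would run the steps for real and stop at the first one that fails.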
💻 Hardware & Inference Tips
- bf16 / fp16: Needs ~9 GB VRAM.
- 4-bit GPTQ: < 3 GB. bitsandbytes works out-of-the-box.
- Compile once (torch.compile) for +10–15 % throughput.
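The tips above can be turned into a small loading helper. The sketch below is one possible arrangement, not code from the repository: `pick_mode` and `load_model` are hypothetical names, the VRAM thresholds are the figures quoted above, and the 4-bit path uses bitsandbytes NF4 quantization, whose footprint is assumed to be close to the GPTQ figure. `device_map="auto"` requires the accelerate package.

```python
# Sketch only: pick_mode and load_model are hypothetical helpers.

def pick_mode(free_vram_gb: float) -> str:
    """Choose a loading mode from the VRAM figures quoted above."""
    if free_vram_gb >= 9:
        return "bf16"
    if free_vram_gb >= 3:
        return "4bit"
    return "cpu"


def load_model(model_id: str, mode: str, compile_model: bool = False):
    """Load the checkpoint in the requested mode (imports are deferred)."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    if mode == "4bit":
        quant = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=quant, device_map="auto"
        )
    elif mode == "bf16":
        model = AutoModelForCausalLM.from_pretrained(
            model_id, torch_dtype=torch.bfloat16, device_map="auto"
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(model_id)  # CPU, default dtype
    if compile_model:
        # The first forward pass pays a one-time compilation cost.
        model.forward = torch.compile(model.forward)
    return model
```

For example, `load_model("PaletLabs/Circe-1.5B", pick_mode(6))` would take the 4-bit path. Whether compilation helps, and by how much, depends on the GPU, the PyTorch version, and the generation settings.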
✍️ Current Evaluation Status
Formal lighteval / MMLU / GSM-8K runs are queued. Preliminary spot-checks show Circe retains DeepSeek-R1’s chain-of-thought depth on reasoning-heavy QA while adding smooth bilingual generation.
⚙️ Limitations & Bias
- No reward-model alignment.
- Long-context (> 4 k) stability untested.
- Training data carries bias from public QA pairs. Spanish coverage favors Latin American variants.
- Minimal safety filters: add your own guardrails before any production use.
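As a rough illustration of what wrapping the model with guardrails can look like, the sketch below puts simple input and output checks around any text-in, text-out function. Everything in it is hypothetical (`guarded`, `REFUSAL`, the blocked term, the length cap), and a keyword blocklist is only a placeholder for a real moderation layer.

```python
from typing import Callable, Iterable

REFUSAL = "Sorry, I can't help with that."


def guarded(generate_fn: Callable[[str], str],
            blocked_terms: Iterable[str],
            max_prompt_chars: int = 4000) -> Callable[[str], str]:
    """Wrap a text-in/text-out function with simple input and output checks."""
    blocked = [t.lower() for t in blocked_terms]

    def contains_blocked(text: str) -> bool:
        lowered = text.lower()
        return any(term in lowered for term in blocked)

    def wrapper(prompt: str) -> str:
        # Reject over-long or flagged prompts before calling the model.
        if len(prompt) > max_prompt_chars or contains_blocked(prompt):
            return REFUSAL
        reply = generate_fn(prompt)
        # Check the model's output as well as the input.
        return REFUSAL if contains_blocked(reply) else reply

    return wrapper


# Stand-in for a real model call, used here only to exercise the wrapper.
echo = guarded(lambda p: f"echo: {p}", blocked_terms=["password"])
print(echo("hello"))
print(echo("what is the admin PASSWORD?"))
```

In practice `generate_fn` would be a function that calls the model, and the checks would be replaced by whatever policy the deployment requires.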
🔮 Roadmap
- Publish full reasoning benchmark suite & eval scripts.
- Release code-reasoning and doc-QA adapters.
- Attach a 24 kHz neural codec → real-time, full-duplex voice chat without ASR → TTS hops.
🪪 License
This project is licensed under the MIT License. Attribution appreciated but not required.