---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- bilingual
- lora
- rl
- cost-efficient
- tiny-models
language:
- en
- es
---

# 🪐 Circe-1.5B

<!-- center-aligned, capped at 420 px wide × 240 px tall -->
<p align="center">
<img
src="https://cdn-uploads.huggingface.co/production/uploads/657e1ad01e3e9c41a49b732e/8IsJaxuOwuqBN0GctRUUe.png"
alt="Circe-1.5B schematic"
width="420"
height="240"
/>
</p>
**Circe-1.5B** is a single-checkpoint, 1.5B-parameter language model that asks a simple question:

> _“How far can you push tiny models on a tiny budget?”_

| ⚙️ Spec | Value |
|---------|-------|
| Base model | `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` |
| Trainable params | 4 M (LoRA) |
| Post-training cost | **≈ US $12** on 1×L40S |
| Training recipe | 8 h SFT → 4 h GRPO |
| Context length | up to **4 k tokens** (tested) |
| VRAM @ bf16 | ~9 GB (≤ 3 GB with 4-bit GPTQ) |
| Throughput | ~55 tok/s on 1×A6000 (fp16, no compile) |

It keeps DeepSeek-R1’s strong reasoning depth but adds **fluent bilingual chat** (English & Spanish) in a checkpoint that fits on a laptop GPU.
We intend to use it as a reproducible waypoint on the road to real-time speech-to-speech reasoning systems.
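
For a sense of how small the trainable footprint is: the 4 M figure refers to the LoRA adapter, not the base weights. Below is a minimal `peft` sketch of a comparable setup; the rank, alpha, and target modules are illustrative assumptions, not the released training config.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
)

# Hypothetical adapter config: a rank-16 adapter on the attention
# projections lands in the same ~4 M trainable-parameter ballpark.
cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, cfg)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```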
---
# 🔭 Intended Use
* **Base for new LoRAs** — domain adaptation, longer-context studies.
* **Research** into cost-efficient RL for reasoning.
* **Not** for high-stakes or production tasks.

See the [⚙️ Limitations](#️-limitations--bias) section before use.

---
# ⚡ Quickstart
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load weights in bf16 (~9 GB VRAM; see the 4-bit tips below for smaller footprints)
model = AutoModelForCausalLM.from_pretrained(
    "PaletLabs/Circe-1.5B", torch_dtype=torch.bfloat16
)
tok = AutoTokenizer.from_pretrained("PaletLabs/Circe-1.5B")

prompt = "<|user|>¿Cómo se dice “tiny model” en español?<|assistant|>"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```
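
The example above writes the `<|user|>` / `<|assistant|>` control tokens by hand. If the shipped tokenizer also carries a chat template (check `tok.chat_template`), you can let `transformers` build the prompt instead; a sketch, assuming the template exists:

```python
messages = [{"role": "user", "content": "¿Cómo se dice “tiny model” en español?"}]

# Renders the conversation with the model's own control tokens
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```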
---
# 🛠️ Installation
```bash
git clone https://github.com/palet-global/circe
cd circe
python -m venv venv && source venv/bin/activate
pip install .
```
## 🏗️ Re-Training Pipeline
### Data
```bash
python data/fetch_datasets.py --out data/processed
```
### Supervised LoRA
```bash
accelerate config default # one-time
accelerate launch train/sft.py \
--data_dir data/processed \
--output_dir checkpoints/sft
```
### RL (GRPO)
```bash
accelerate launch train/rl_grpo.py \
--data_dir data/processed \
--output_dir checkpoints/grpo \
--init_ckpt checkpoints/sft/checkpoint-13000 \
--num_steps 3000 --save_steps 500 --group 4
```
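
For intuition about the `--group 4` flag: GRPO samples a group of completions per prompt and scores each one against its own group's mean reward, so no separate value network is needed. A minimal sketch of that advantage computation (illustrative only; the actual objective lives in `train/rl_grpo.py`):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, group_size: int = 4) -> torch.Tensor:
    """rewards: flat tensor of rewards, shape (num_prompts * group_size,)."""
    grouped = rewards.view(-1, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    # Each completion is judged relative to its siblings from the same prompt.
    return ((grouped - mean) / (std + 1e-4)).view(-1)

# Two prompts, four sampled completions each:
adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0, 0.2, 0.8, 0.5, 0.5]))
```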
### Merge and Tokenizer
```bash
python train/merge_lora.py \
--ckpt_dir checkpoints/grpo \
--base deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```
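
Under the hood, folding a LoRA back into the base weights is typically a one-liner in `peft`; a minimal sketch of what a merge step like this does (paths are illustrative, and the script above also handles the tokenizer):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
)

# Load the trained adapter on top of the frozen base, then bake it in.
merged = PeftModel.from_pretrained(base, "checkpoints/grpo").merge_and_unload()
merged.save_pretrained("merged")
```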
### SQuAD Sanity Checks
```bash
python eval/quick_squad_eval.py --model ./merged --dataset squad
python eval/quick_squad_eval.py --model ./merged --dataset squad_es
```
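
If you want to score outputs yourself, the standard SQuAD EM/F1 metric is available through the `evaluate` library; a minimal sketch of the prediction/reference format it expects:

```python
import evaluate

squad = evaluate.load("squad")
predictions = [{"id": "0", "prediction_text": "modelo diminuto"}]
references = [{"id": "0", "answers": {"text": ["modelo diminuto"], "answer_start": [0]}}]
print(squad.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```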
### Upload
```bash
python train/upload_to_hub.py \
--model_dir merged \
--repo PaletLabs/Circe-1.5B \
--token $HF_TOKEN
```
---
# 💻 Hardware & Inference Tips
- **bf16 / fp16**: needs ~9 GB VRAM.
- **4-bit GPTQ**: < 3 GB; `bitsandbytes` 4-bit loading also works out-of-the-box (see the sketch below).
- Compile once (`torch.compile`) for **+10–15 %** throughput.
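
A minimal sketch of 4-bit loading plus compilation, assuming `bitsandbytes` and `accelerate` are installed (the NF4 settings below are common defaults, not a tuned config):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "PaletLabs/Circe-1.5B", quantization_config=quant, device_map="auto"
)

# One-time compile; the first generate() call pays the warm-up cost.
model = torch.compile(model)
```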
---
# ✍️ Current Evaluation Status
Formal **lighteval / MMLU / GSM-8K** runs are queued. Preliminary spot-checks show Circe retains DeepSeek-R1’s chain-of-thought depth on reasoning-heavy QA while adding smooth bilingual generation.

---
# ⚙️ Limitations & Bias
- No reward-model alignment.
- Long-context (> 4 k) stability untested.
- Training data bias from public QA pairs. Spanish coverage favors Latin American variants.
- Minimal safety filtering, so **you** must wrap the model with your own guardrails before any production use.
---
# 🔮 Roadmap
- Publish full reasoning benchmark suite & eval scripts.
- Release code-reasoning and doc-QA adapters.
- Attach a **24 kHz neural codec** → real-time, full-duplex voice chat without ASR → TTS hops.
---
# 🪪 License
This project is licensed under the [MIT](https://opensource.org/licenses/MIT) License. Attribution appreciated but not required.