Sally v1.0 — Metabolic-Health Coach (Qwen3-14B + SFT + DPO)
Sally is a fine-tuned variant of Qwen3-14B trained to apply the proprietary Sally v2 metabolic-health protocol as an AI coach. Built by a1c.io. This repo contains adapter weights only; the base model must be downloaded separately from Qwen/Qwen3-14B.
What this is
Sally v1.0 is a two-stage LoRA fine-tune of Qwen/Qwen3-14B:
- Stage 1 — SFT LoRA: supervised fine-tuning on ~5,500 curated Q&A pairs covering the Sally v2 protocol (Carbohydrate-Insulin Model, Time-Restricted Eating, food sequencing, supplement stack, safety halts). LoRA r=16 / α=32, 7 target modules.
- Stage 2 — DPO LoRA: direct preference optimization on ~3,400 (chosen, rejected) pairs designed to push the model away from common anti-patterns — CICO framing, oat-milk recommendations, low-fat dairy, medication dosing, calorie targets.
The DPO adapter is at the root of this repo. The SFT adapter is in sft-adapter/ and must be applied first to reproduce the full Sally v1.0 behavior (SFT + DPO, referred to as M2; M0 is the Qwen3-14B base and M1 is SFT-only).
Quick eval headlines
| Axis | Score | Notes |
|---|---|---|
| Sally v2 preference holdout (DPO) | 97.6% | Picks v2-aligned over v2-violating response (M0 base: 29.4%, M1 SFT-only: 77.6%) |
| MMLU-medical | 80.8% | No regression vs Qwen3-14B base — general medical knowledge preserved |
| MedQA-USMLE | 58.0%¹ | Format-biased by forced v2 system prompt |
| PubMedQA | 67.0%¹ | Format-biased; format-corrected ~71% |
| Protocol fidelity (human judge) | 94.0% | content-only pass rate on rubric tasks (n=150 judged) |
| Safety halt application | 100% | T1D / pregnancy / pediatric / ED / extended-fast — all correctly refused |
¹ MedQA/PubMedQA are depressed 5–8pp by the v2 system prompt forcing paragraph-style outputs; the strict MCQ-letter parser sometimes can't extract from the answer. Real capability is in line with Qwen3-14B base (65–72%).
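The format effect described in the footnote can be illustrated with a toy parser. This is only a sketch and not the harness behind the reported scores: the function names, the regular expressions, and the sample answer are all hypothetical.

```python
import re
from typing import Optional

def parse_strict(answer: str) -> Optional[str]:
    """Accept only a bare option letter such as 'B', 'B.' or '(B)'."""
    match = re.fullmatch(r"\s*\(?([A-E])\)?[.:]?\s*", answer)
    return match.group(1) if match else None

def parse_lenient(answer: str) -> Optional[str]:
    """Fall back to phrases like 'the answer is B' inside a paragraph."""
    strict = parse_strict(answer)
    if strict:
        return strict
    match = re.search(r"(?i:answer\s+is)\s*:?\s*\(?([A-E])\)?\b", answer)
    return match.group(1) if match else None

# Hypothetical paragraph-style answer of the kind the v2 system prompt encourages
paragraph = "Option B fits the vignette best, so the answer is B."

print(parse_strict("B"))         # B
print(parse_strict(paragraph))   # None: the strict parser finds no bare letter
print(parse_lenient(paragraph))  # B
```

A strict parser scores the paragraph-style answer as wrong even though it names the correct option, which is one plausible way a forced output style could lower measured MCQ accuracy.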
See the full eval report: https://a1c.io/sally-eval-v1 (private — request access via team@a1c.io)
Composite ranking on the medical leaderboard (May 2026)
Sally competes in the open-weights 14B class. By design it sits below frontier closed models (GPT-5.5, Claude Opus 4.7, o3): it trades general MCQ accuracy for v2 protocol fidelity, which generic benchmarks don't measure.
| Sally-only benchmark | Sally v1.0 | Best peer |
|---|---|---|
| DPO v2 preference accuracy | 97.6% | ~50% (random, no peer trained on v2) |
| Safety halt application | 100% | varies; most generic models give dose recommendations |
| Anti-CICO framing | 100% | most generic models default to calorie-counting advice |
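The card does not say how preference accuracy was scored. One common convention, assumed here, is to count a pair as correct when the chosen response receives the higher implicit reward from the standard DPO objective. The sketch below uses that convention with β = 0.1 (the value in the training table); the log-probabilities are made-up numbers, and all function names are hypothetical.

```python
import math

def dpo_margin(chosen_logp, rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Implicit-reward margin: beta * [(policy - ref) on chosen - (policy - ref) on rejected]."""
    return beta * ((chosen_logp - ref_chosen_logp) - (rejected_logp - ref_rejected_logp))

def dpo_loss(margin):
    """Per-pair DPO loss, -log(sigmoid(margin)), written as log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))

def preference_accuracy(pairs, beta=0.1):
    """Fraction of pairs where the chosen response has the higher implicit reward."""
    wins = sum(1 for pair in pairs if dpo_margin(*pair, beta=beta) > 0)
    return wins / len(pairs)

# Made-up sequence log-probs: (policy chosen, policy rejected, ref chosen, ref rejected)
toy_pairs = [
    (-40.0, -55.0, -45.0, -50.0),  # margin = 0.1 * (5 - (-5)) = +1.0
    (-60.0, -58.0, -59.0, -61.0),  # margin = 0.1 * (-1 - 3)  = -0.4
    (-30.0, -42.0, -33.0, -40.0),  # margin = 0.1 * (3 - (-2)) = +0.5
]

print(preference_accuracy(toy_pairs))  # two of three pairs have a positive margin
```

Under this convention an untrained policy identical to its reference has a margin of exactly zero on every pair, so a "random" baseline depends on how ties are broken.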
🔧 How to load
Setup
pip install transformers peft accelerate torch huggingface_hub
Python (PEFT) — stacked SFT + DPO adapter
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
BASE = "Qwen/Qwen3-14B"
REPO = "sallya1c/sally-1.0"
# 1. Load tokenizer (chat template is included in the Sally repo)
tokenizer = AutoTokenizer.from_pretrained(REPO)
# 2. Load base model in bf16
model = AutoModelForCausalLM.from_pretrained(
BASE,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# 3. Apply SFT LoRA first
model = PeftModel.from_pretrained(model, REPO, subfolder="sft-adapter", adapter_name="sft")
# 4. Stack the DPO LoRA on top
model.load_adapter(REPO, adapter_name="dpo")
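# 5. Activate both adapters together. PeftModel.set_adapter is documented for a
#    single adapter name, so pass the list to the underlying LoRA model.
model.base_model.set_adapter(["sft", "dpo"])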
model.eval()
Alternative — merge and save full weights once
# After loading both adapters as above:
model = model.merge_and_unload() # returns the base model with the active LoRAs baked in; takes ~30s on H100
model.save_pretrained("sally-1.0-merged")
tokenizer.save_pretrained("sally-1.0-merged")
# resulting full-weight model: ~28GB bf16
💬 How to call
Inference with the Sally v2 system prompt
Sally is designed to be served with the Sally v2 system prompt (~5,500 tokens, proprietary; not redistributed here, contact team@a1c.io for access). Without it, the model may default to generic medical-assistant behavior.
SALLY_V2_SYSTEM_PROMPT = """[Sally v2 protocol — request from team@a1c.io]
You are Sally, an AI metabolic strategist trained on the Sally v2 protocol...
""" # ~5,500 tokens
messages = [
{"role": "system", "content": SALLY_V2_SYSTEM_PROMPT},
{"role": "user", "content": "Can I use oat milk in my coffee?"},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=300,
temperature=0.7,
top_p=0.95,
do_sample=True,
repetition_penalty=1.05,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Expected output (verbatim from eval):
No. Oat milk is high in rapidly absorbed carbohydrates and often contains added oils and sugars, causing significant glucose spikes. It contradicts metabolic health goals. Choose unsweetened almond or coconut milk instead. These have minimal glycemic impact and align with insulin-sensitivity protocols.
Important inference notes
- Qwen3 reasoning tags: the model emits <think></think> blocks (often empty) before the answer; strip these for clean display.
- Length floor: append the instruction "Respond in 100–150 words. Never under 50, never over 200." to the system prompt. This raises word-count compliance from 79% → ~95%+.
- Temperature: 0.6–0.8 works well; lower (0.3) for deterministic clinical-style answers, higher (0.9) for variety in coaching prompts.
- Stop sequences: </s>, <|im_end|>.
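The first two notes can be handled with a pair of small helpers. This is a minimal sketch: the helper names are hypothetical, and it assumes the reasoning block always appears as a literal `<think>...</think>` pair in the decoded text.

```python
import re

LENGTH_FLOOR = "Respond in 100–150 words. Never under 50, never over 200."
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def with_length_floor(system_prompt: str) -> str:
    """Append the length instruction once, separated by a blank line."""
    if LENGTH_FLOOR in system_prompt:
        return system_prompt
    return system_prompt.rstrip() + "\n\n" + LENGTH_FLOOR

def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks (including empty ones) from model output."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>\n\n</think>\n\nNo. Oat milk is high in rapidly absorbed carbohydrates."
print(strip_think(raw))
```

Apply `with_length_floor` to the system prompt before calling `apply_chat_template`, and `strip_think` to the decoded completion before display.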
🚀 How to host
Option 1 — vLLM (recommended for production)
pip install vllm
# Serve with LoRA support enabled
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-14B \
--enable-lora \
--lora-modules sally-sft=sallya1c/sally-1.0/sft-adapter sally-dpo=sallya1c/sally-1.0 \
--max-lora-rank 16 \
--dtype bfloat16 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--port 8000
Then call via OpenAI-compatible API:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="sally-dpo", # or "sally-sft" for SFT-only
messages=[
{"role": "system", "content": SALLY_V2_SYSTEM_PROMPT},
{"role": "user", "content": "..."},
],
temperature=0.7,
max_tokens=300,
)
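vLLM tip: vLLM applies one LoRA adapter per request, so selecting sally-dpo applies the DPO adapter without the SFT adapter underneath it. To serve the full stacked model, and for best throughput, merge both adapters into the base once (model = model.merge_and_unload(), then save_pretrained()) and serve the merged 28GB model directly without --enable-lora.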
Option 2 — Modal serverless (cost-efficient for variable load)
import modal
app = modal.App("sally-v1-serve")
vol = modal.Volume.from_name("sally-models", create_if_missing=True)
@app.cls(
gpu="H100",
volumes={"/models": vol},
image=modal.Image.debian_slim().pip_install("vllm", "transformers", "peft", "accelerate"),
scaledown_window=300, # 5 min idle before scale-to-zero
)
class SallyServer:
@modal.enter()
def load(self):
from vllm import LLM
self.llm = LLM(
model="Qwen/Qwen3-14B",
enable_lora=True,
max_lora_rank=16,
dtype="bfloat16",
)
@modal.method()
def chat(self, messages: list[dict], temperature: float = 0.7) -> str:
from vllm import SamplingParams
from vllm.lora.request import LoRARequest
prompt = self.llm.get_tokenizer().apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
out = self.llm.generate(
[prompt],
SamplingParams(temperature=temperature, max_tokens=300),
lora_request=LoRARequest("sally-dpo", 1, "sallya1c/sally-1.0"),
)
return out[0].outputs[0].text
# deploy: modal deploy serve.py
# call: SallyServer().chat.remote(messages=[...])
Cost: ~$0.18 per million input tokens on H100 with batching, scales to zero when idle.
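Because the system prompt is resent on every request, it dominates the input-token bill. The arithmetic below is a rough sketch using the quoted price; the workload (10,000 requests with 50-token user messages) is a hypothetical example, and output tokens, cold starts, and idle time are not included because the card gives no figures for them.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 0.18  # USD, the figure quoted above
SYSTEM_PROMPT_TOKENS = 5_500           # approximate size of the Sally v2 system prompt

def input_cost_usd(user_tokens: int, requests: int = 1) -> float:
    """Input-token cost for `requests` calls that each resend the system prompt."""
    tokens_per_request = SYSTEM_PROMPT_TOKENS + user_tokens
    return requests * tokens_per_request * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

# Hypothetical workload: 10,000 requests, each with a 50-token user message
print(f"${input_cost_usd(user_tokens=50, requests=10_000):.2f}")
```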
Option 3 — Text Generation Inference (TGI)
docker run --gpus all -p 8080:80 \
-e MODEL_ID=Qwen/Qwen3-14B \
-e LORA_ADAPTERS=sally=sallya1c/sally-1.0 \
ghcr.io/huggingface/text-generation-inference:latest
Option 4 — Llama.cpp / Ollama (CPU/edge)
Requires merging LoRAs into base first, then converting to GGUF. ~28GB bf16 → ~8GB int4 quantized. Inference quality drops noticeably below 4-bit; recommended Q5_K_M or higher.
# Merge first (Python; requires ~32GB RAM):
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
m = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-14B', torch_dtype='bfloat16')
m = PeftModel.from_pretrained(m, 'sallya1c/sally-1.0', subfolder='sft-adapter')
m = m.merge_and_unload()
m = PeftModel.from_pretrained(m, 'sallya1c/sally-1.0')
m = m.merge_and_unload()
m.save_pretrained('sally-1.0-merged', safe_serialization=True)
# The GGUF converter reads tokenizer files from the same directory
AutoTokenizer.from_pretrained('sallya1c/sally-1.0').save_pretrained('sally-1.0-merged')
"
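# Convert to GGUF (with llama.cpp tooling). convert_hf_to_gguf.py does not produce
# K-quants such as Q5_K_M, so export bf16 first and quantize in a second pass:
python llama.cpp/convert_hf_to_gguf.py sally-1.0-merged --outfile sally-1.0-bf16.gguf --outtype bf16
llama.cpp/build/bin/llama-quantize sally-1.0-bf16.gguf sally-1.0-q5km.gguf Q5_K_M # binary path depends on your build
ollama create sally -f Modelfile # standard Ollama Modelfile referencing the .gguf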
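The size figures above follow from parameters times bits per weight. The sketch below reproduces that arithmetic; the round 14e9 parameter count and the bits-per-weight values for the quantized formats are assumptions (real GGUF files mix tensor types and add metadata), so treat the results as rough estimates.

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only size in decimal gigabytes; ignores metadata and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 14e9  # assumption: the round "14B" figure

# Bits-per-weight values for the quantized rows are rough assumptions
formats = [
    ("bf16", 16.0),
    ("4-bit k-quant (assumed ~4.5 bpw)", 4.5),
    ("Q5_K_M (assumed ~5.5 bpw)", 5.5),
]

for name, bpw in formats:
    print(f"{name}: {approx_size_gb(N_PARAMS, bpw):.1f} GB")
```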
⚠️ Known limitations
- Style (word-count compliance 79%): about 1 in 5 responses are <50 words. Fix at inference: add "Respond in 100–150 words" to your system prompt.
- MedQA/PubMedQA format bias: forced v2 system prompt biases the model toward paragraph-style answers, depressing MCQ scores by ~5–8pp. Use a neutral system prompt for benchmark comparisons.
- MedCalc-Bench 15%: not Sally's use case. Multi-step clinical arithmetic belongs in dedicated tools, not this chat model.
- No live web access / no RAG: knowledge is frozen as of training (Sally v2 protocol + Qwen3-14B Jan 2026 cutoff).
- Not a substitute for medical advice: Sally is a coaching tool. All medication, diagnostic, and treatment decisions belong with a licensed clinician. The model is trained to refuse and route to a clinician for T1D, pregnancy, pediatric, eating-disorder, and extended-fast cases.
Training details
| | Stage 1 (SFT) | Stage 2 (DPO) |
|---|---|---|
| Method | LoRA r=16 α=32, dropout 0.05 | LoRA r=16 α=32, dropout 0.05 |
| Targets | k/q/v/o + gate/up/down proj | same |
| Data | ~5,500 Q&A pairs (v2 protocol corpus) | ~3,400 (chosen, rejected) preference pairs |
| Epochs | 2 | 1 |
| Batch size | 4 (effective 16 w/ grad accum) | 2 (effective 8) |
| LR | 2e-4 cosine | 5e-6 cosine |
| Optimizer | paged AdamW 8-bit | paged AdamW 8-bit |
| Hardware | A100-80GB on Modal | A100-80GB on Modal |
| Wall clock | ~2.5 hours | ~1.5 hours |
| β (DPO) | — | 0.1 |
Trained May 2026.
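As a rough sense of scale, the adapter size implied by the table can be estimated from r and the shapes of the seven target modules. The model dimensions below (40 layers, hidden size 5120, MLP size 17408, 40 query heads and 8 key/value heads of dimension 128) are assumptions about Qwen3-14B and should be checked against its config.json; the module names are the usual Qwen projection names, not taken from this card.

```python
R, ALPHA = 16, 32                    # from the training table
LAYERS = 40                          # assumed Qwen3-14B depth
HIDDEN, INTERMEDIATE = 5120, 17408   # assumed hidden and MLP sizes
Q_OUT, KV_OUT = 40 * 128, 8 * 128    # assumed heads * head_dim for q and k/v

# (in_features, out_features) for each of the 7 target modules
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int = R) -> int:
    """A is (r x in) and B is (out x r), so one LoRA pair adds r * (in + out) weights."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in TARGETS.values())
total = per_layer * LAYERS

print(f"LoRA scaling alpha / r = {ALPHA / R}")
print(f"trainable LoRA parameters per adapter: {total:,}")
```

Under these assumptions each adapter holds roughly 64 million trainable weights, well under 1% of the base model.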
License & responsible use
Apache 2.0 (same as base Qwen3-14B). For commercial deployment, you must comply with Qwen3-14B's own license terms. Do not use for diagnostic medical decisions without clinician review.
Citation
@software{sally_v1_2026,
title = {Sally v1.0 — Metabolic-Health Coach},
author = {a1c.io team},
year = {2026},
url = {https://huggingface.co/sallya1c/sally-1.0},
}
Contact
team@a1c.io