Sally v1.0 — Metabolic-Health Coach (Qwen3-14B + SFT + DPO)

Sally is a fine-tuned variant of Qwen3-14B trained to apply the proprietary Sally v2 metabolic-health protocol as an AI coach. Built by a1c.io. Adapter weights only — base model must be downloaded separately from Qwen/Qwen3-14B.

What this is

Sally v1.0 is a two-stage LoRA fine-tune of Qwen/Qwen3-14B:

  1. Stage 1 — SFT LoRA: supervised fine-tuning on ~5,500 curated Q&A pairs covering the Sally v2 protocol (Carbohydrate-Insulin Model, Time-Restricted Eating, food sequencing, supplement stack, safety halts). LoRA r=16 / α=32, 7 target modules.
  2. Stage 2 — DPO LoRA: direct preference optimization on ~3,400 (chosen, rejected) pairs designed to push the model away from common anti-patterns — CICO framing, oat-milk recommendations, low-fat dairy, medication dosing, calorie targets.

The DPO adapter is at the root of this repo. The SFT adapter is in sft-adapter/ and must be applied first to reproduce the full Sally v1.0 behavior (labeled M2 in the evals, where M0 is the base model and M1 is SFT-only).
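
For intuition about what stacking two adapters means numerically, here is a minimal sketch in plain Python. It assumes both adapters are standard LoRA updates on the same frozen base weight, each contributing (α/r)·B·A, so with r=16 and α=32 the scale is 2.0. The tiny matrices and helper names are invented for illustration and are not taken from the Sally weights.

```python
# Toy illustration of stacking two LoRA adapters on one weight matrix.
# Assumption: standard LoRA scaling alpha / r; all numbers below are made up.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

def lora_delta(B, A, r=16, alpha=32):
    """Weight update contributed by one adapter: (alpha / r) * B @ A."""
    return scale(matmul(B, A), alpha / r)

W = [[1.0, 0.0], [0.0, 1.0]]                  # frozen base weight (toy 2x2)
B_sft, A_sft = [[0.5], [0.0]], [[1.0, 2.0]]   # rank-1 stand-ins for the SFT adapter
B_dpo, A_dpo = [[0.0], [0.25]], [[4.0, 0.0]]  # rank-1 stand-ins for the DPO adapter

# Effective weight with both adapters active at once.
stacked = add(add(W, lora_delta(B_sft, A_sft)), lora_delta(B_dpo, A_dpo))

# Effective weight after merging SFT into the base, then merging DPO into the result.
after_sft = add(W, lora_delta(B_sft, A_sft))
sequential = add(after_sft, lora_delta(B_dpo, A_dpo))

print(stacked)     # [[2.0, 2.0], [2.0, 1.0]]
print(sequential)  # same values
```

Under that assumption, activating both adapters and merging them one after the other give the same effective weights, which is why the stacked-adapter and merged-model loading paths are expected to behave alike.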

Quick eval headlines

| Axis | Score | Notes |
|---|---|---|
| Sally v2 preference holdout (DPO) | 97.6% | Picks v2-aligned over v2-violating response (M0 base: 29.4%, M1 SFT-only: 77.6%) |
| MMLU-medical | 80.8% | No regression vs Qwen3-14B base; general medical knowledge preserved |
| MedQA-USMLE | 58.0%¹ | Format-biased by forced v2 system prompt |
| PubMedQA | 67.0%¹ | Format-biased; format-corrected ~71% |
| Protocol fidelity (human judge) | 94.0% | Content-only pass rate on rubric tasks (n=150 judged) |
| Safety halt application | 100% | T1D / pregnancy / pediatric / ED / extended-fast: all correctly refused |

¹ MedQA/PubMedQA scores are depressed by 5–8pp because the v2 system prompt forces paragraph-style outputs, and the strict MCQ-letter parser sometimes cannot extract an answer letter from them. Real capability is in line with the Qwen3-14B base (65–72%).
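
The parsing failure described in the footnote can be illustrated with a small sketch. The two functions below are hypothetical stand-ins, not the evaluation harness used for Sally: a strict parser that accepts only a bare option letter, and a more lenient one that also looks for common answer phrasings inside a paragraph.

```python
import re

def strict_letter(answer: str):
    """Accept only a bare option letter such as 'B' or '(B).' (hypothetical strict rule)."""
    m = re.fullmatch(r"\s*\(?([A-E])\)?[.:]?\s*", answer)
    return m.group(1) if m else None

def lenient_letter(answer: str):
    """Fall back to phrasings like 'the answer is (C)' inside a longer paragraph."""
    strict = strict_letter(answer)
    if strict:
        return strict
    # The keyword is matched case-insensitively; the option letter must be uppercase.
    m = re.search(r"(?i:answer|option|choice)\s*(?:is|:)?\s*\(?([A-E])\)?\b", answer)
    return m.group(1) if m else None

paragraph = "Metformin lowers hepatic glucose output, so the correct answer is (C)."
print(strict_letter("C."))        # C
print(strict_letter(paragraph))   # None
print(lenient_letter(paragraph))  # C
```

A paragraph-style reply can contain the right option and still score as a miss under the strict rule, which is the kind of gap the format-corrected figures try to account for.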

See the full eval report: https://a1c.io/sally-eval-v1 (private — request access via team@a1c.io)

Composite ranking on the medical leaderboard (May 2026)

Sally is positioned in the open-weights 14B class. It ranks below frontier closed models (GPT-5.5, Claude Opus 4.7, o3) by design: Sally trades general MCQ accuracy for v2 protocol fidelity, which generic benchmarks don't measure.

| Sally-only benchmark | Sally v1.0 | Best peer |
|---|---|---|
| DPO v2 preference accuracy | 97.6% | ~50% (random, no peer trained on v2) |
| Safety halt application | 100% | varies; most generic models give dose recommendations |
| Anti-CICO framing | 100% | most generic models default to calorie-counting advice |
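
As a rough sketch of how a preference-accuracy figure like the one above can be computed: a pair counts as correct when the model scores the chosen response above the rejected one. The function name, the use of summed log-probabilities as the score, and the sample numbers are assumptions for illustration; the actual Sally evaluation may score pairs differently.

```python
def preference_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs where chosen wins.

    Scores are assumed to be sequence log-probabilities (higher is better).
    Ties count as incorrect.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return wins / len(pairs)

# Made-up log-probabilities for four held-out pairs.
holdout = [(-12.1, -15.4), (-9.8, -9.9), (-20.3, -18.7), (-7.5, -11.0)]
print(f"{preference_accuracy(holdout):.1%}")  # 75.0%
```

A model with no systematic preference between the two responses lands near 50% on a metric of this shape, which is the baseline quoted for peers not trained on v2.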

🔧 How to load

Setup

pip install transformers peft accelerate torch huggingface_hub

Python (PEFT) — stacked SFT + DPO adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "Qwen/Qwen3-14B"
REPO = "sallya1c/sally-1.0"

# 1. Load tokenizer (chat template is included in the Sally repo)
tokenizer = AutoTokenizer.from_pretrained(REPO)

# 2. Load base model in bf16
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Apply SFT LoRA first
model = PeftModel.from_pretrained(model, REPO, subfolder="sft-adapter", adapter_name="sft")

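# 4. Stack the DPO LoRA on top
model.load_adapter(REPO, adapter_name="dpo")
# PeftModel.set_adapter() expects a single adapter name in the PEFT releases we know of;
# activating both adapters at once goes through the underlying LoRA model instead.
model.base_model.set_adapter(["sft", "dpo"])
model.eval()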

Alternative — merge and save full weights once

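# After loading both adapters as above:
# merge_and_unload() returns the merged base model; keep the return value and save that.
merged = model.merge_and_unload(adapter_names=["sft", "dpo"])  # bake both LoRAs into base; takes ~30s on H100
merged.save_pretrained("sally-1.0-merged")
tokenizer.save_pretrained("sally-1.0-merged")
# resulting full-weight model: ~28GB bf16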

💬 How to call

Inference with the Sally v2 system prompt

Sally is designed to be served with the Sally v2 system prompt (~5,500 tokens; proprietary and not redistributed here; contact team@a1c.io for access). Without it, the model may default to generic medical-assistant behavior.

SALLY_V2_SYSTEM_PROMPT = """[Sally v2 protocol — request from team@a1c.io]
You are Sally, an AI metabolic strategist trained on the Sally v2 protocol...
"""  # ~5,500 tokens

messages = [
    {"role": "system", "content": SALLY_V2_SYSTEM_PROMPT},
    {"role": "user", "content": "Can I use oat milk in my coffee?"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.05,
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Example output (verbatim from eval; sampled generations will vary in wording):

No. Oat milk is high in rapidly absorbed carbohydrates and often contains added oils and sugars, causing significant glucose spikes. It contradicts metabolic health goals. Choose unsweetened almond or coconut milk instead. These have minimal glycemic impact and align with insulin-sensitivity protocols.

Important inference notes

  • Qwen3 reasoning tags: model emits <think></think> blocks (often empty) before the answer — strip these for clean display.
  • Length floor: append the instruction "Respond in 100–150 words. Never under 50, never over 200." to the system prompt. This raises word-count compliance from 79% → ~95%+.
  • Temperature: 0.6–0.8 works well; lower (0.3) for deterministic clinical-style answers, higher (0.9) for variety in coaching prompts.
  • Stop sequences: </s>, <|im_end|>.
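
The first two notes can be wired together with a few small helpers. This is a minimal sketch: the helper names are hypothetical, the length instruction is copied from the note above, and the word count uses a plain whitespace split, which may differ from how compliance was measured in the eval.

```python
import re

LENGTH_FLOOR = "Respond in 100–150 words. Never under 50, never over 200."

# Matches a <think>...</think> block, including empty ones, plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def with_length_floor(system_prompt: str) -> str:
    """Append the length instruction to the end of the system prompt."""
    return system_prompt.rstrip() + "\n\n" + LENGTH_FLOOR

def clean_reply(raw: str) -> str:
    """Remove reasoning blocks and surrounding whitespace for display."""
    return THINK_BLOCK.sub("", raw).strip()

def within_word_limits(text: str, low: int = 50, high: int = 200) -> bool:
    """Check the 50-200 word bounds using a simple whitespace split."""
    return low <= len(text.split()) <= high

raw = "<think>\n\n</think>\n\nNo. Oat milk is high in rapidly absorbed carbohydrates."
print(clean_reply(raw))  # No. Oat milk is high in rapidly absorbed carbohydrates.
```

In a serving path, `with_length_floor` would be applied to the system prompt before the chat template, and `clean_reply` to the decoded generation before it is shown to the user.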

🚀 How to host

Option 1 — vLLM (recommended for production)

pip install vllm

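# Serve with LoRA support enabled.
# --lora-modules takes name=path pairs, where path is a local directory or a Hub repo id;
# the sft-adapter subfolder may need to be downloaded first and passed as a local path.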
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-14B \
  --enable-lora \
  --lora-modules sally-sft=sallya1c/sally-1.0/sft-adapter sally-dpo=sallya1c/sally-1.0 \
  --max-lora-rank 16 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --port 8000

Then call via OpenAI-compatible API:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="sally-dpo",  # or "sally-sft" for SFT-only
    messages=[
        {"role": "system", "content": SALLY_V2_SYSTEM_PROMPT},
        {"role": "user", "content": "..."},
    ],
    temperature=0.7,
    max_tokens=300,
)

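vLLM tip: vLLM applies one LoRA adapter per request, so "sally-dpo" above runs the DPO adapter on the base model without the SFT adapter underneath. To serve the full stacked model, and for best throughput, merge both adapters into the base once (merge_and_unload(), then save_pretrained() on the returned model) and serve the merged 28GB model directly without --enable-lora.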

Option 2 — Modal serverless (cost-efficient for variable load)

import modal

app = modal.App("sally-v1-serve")
vol = modal.Volume.from_name("sally-models", create_if_missing=True)

@app.cls(
    gpu="H100",
    volumes={"/models": vol},
    image=modal.Image.debian_slim().pip_install("vllm", "transformers", "peft", "accelerate"),
    scaledown_window=300,  # 5 min idle before scale-to-zero
)
class SallyServer:
    @modal.enter()
    def load(self):
        from vllm import LLM
        self.llm = LLM(
            model="Qwen/Qwen3-14B",
            enable_lora=True,
            max_lora_rank=16,
            dtype="bfloat16",
        )

    @modal.method()
    def chat(self, messages: list[dict], temperature: float = 0.7) -> str:
        from vllm import SamplingParams
        from vllm.lora.request import LoRARequest
        prompt = self.llm.get_tokenizer().apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        out = self.llm.generate(
            [prompt],
            SamplingParams(temperature=temperature, max_tokens=300),
            lora_request=LoRARequest("sally-dpo", 1, "sallya1c/sally-1.0"),
        )
        return out[0].outputs[0].text

# deploy:  modal deploy serve.py
# call:    SallyServer().chat.remote(messages=[...])

Cost: ~$0.18 per million input tokens on H100 with batching, scales to zero when idle.

Option 3 — Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
  -e MODEL_ID=Qwen/Qwen3-14B \
  -e LORA_ADAPTERS=sally=sallya1c/sally-1.0 \
  ghcr.io/huggingface/text-generation-inference:latest

Option 4 — Llama.cpp / Ollama (CPU/edge)

Requires merging the LoRAs into the base model first, then converting to GGUF and quantizing. Size goes from ~28GB in bf16 to ~8GB at 4-bit. Inference quality drops noticeably below 4-bit; Q5_K_M or higher is recommended.

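# Merge first (Python; requires ~32GB RAM):
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
m = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-14B', torch_dtype='bfloat16')
# Merge the SFT adapter, then the DPO adapter on top of the result
m = PeftModel.from_pretrained(m, 'sallya1c/sally-1.0', subfolder='sft-adapter')
m = m.merge_and_unload()
m = PeftModel.from_pretrained(m, 'sallya1c/sally-1.0')
m = m.merge_and_unload()
m.save_pretrained('sally-1.0-merged', safe_serialization=True)
# The GGUF converter also needs the tokenizer files next to the weights
AutoTokenizer.from_pretrained('sallya1c/sally-1.0').save_pretrained('sally-1.0-merged')
"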

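# Convert to GGUF (with llama.cpp tooling). convert_hf_to_gguf.py does not emit K-quants,
# so convert to bf16 first and quantize in a second step.
python llama.cpp/convert_hf_to_gguf.py sally-1.0-merged --outfile sally-1.0-bf16.gguf --outtype bf16
# Path to llama-quantize depends on how llama.cpp was built
./llama.cpp/build/bin/llama-quantize sally-1.0-bf16.gguf sally-1.0-q5km.gguf Q5_K_M
ollama create sally -f Modelfile  # standard Ollama Modelfile referencing the .gguf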
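
The size figures quoted for this option follow from simple arithmetic on parameter count and bits per weight. The sketch below assumes a round 14e9 parameters (taken from the model name) and uses rough average bits-per-weight values for the 4-bit and 5-bit classes; real GGUF files carry extra metadata and mixed-precision tensors, so actual sizes will differ somewhat.

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (1e9 bytes) for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 14e9  # assumed round figure taken from the "14B" model name

# The 4.5 and 5.5 bits-per-weight values are rough assumptions, not measured averages.
for name, bits in [("bf16", 16.0), ("4-bit class", 4.5), ("5-bit class", 5.5)]:
    print(f"{name}: ~{approx_size_gb(N_PARAMS, bits):.1f} GB")
```

This reproduces the ~28GB bf16 figure and lands close to the ~8GB quoted for 4-bit, with a 5-bit quantization a couple of GB larger.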

⚠️ Known limitations

  1. Style — word-count compliance 79%: about 1 in 5 responses are <50 words. Fix at inference: add "Respond in 100–150 words" to your system prompt.
  2. MedQA/PubMedQA format bias: forced v2 system prompt biases the model toward paragraph-style answers, depressing MCQ scores by ~5–8pp. Use a neutral system prompt for benchmark comparisons.
  3. MedCalc-Bench 15%: not Sally's use case. Multi-step clinical arithmetic belongs in dedicated tools, not this chat model.
  4. No live web access / no RAG: knowledge is frozen as of training (Sally v2 protocol + Qwen3-14B Jan 2026 cutoff).
  5. Not a substitute for medical advice: Sally is a coaching tool. All medication, diagnostic, and treatment decisions belong with a licensed clinician. The model is trained to refuse and route to a clinician for T1D, pregnancy, pediatric, eating-disorder, and extended-fast cases.

Training details

| | Stage 1: SFT | Stage 2: DPO |
|---|---|---|
| Method | LoRA r=16, α=32, dropout 0.05 | LoRA r=16, α=32, dropout 0.05 |
| Targets | k/q/v/o + gate/up/down proj | same |
| Data | ~5,500 Q&A pairs (v2 protocol corpus) | ~3,400 (chosen, rejected) preference pairs |
| Epochs | 2 | 1 |
| Batch size | 4 (effective 16 w/ grad accum) | 2 (effective 8) |
| LR | 2e-4 cosine | 5e-6 cosine |
| Optimizer | paged AdamW 8-bit | paged AdamW 8-bit |
| Hardware | A100-80GB on Modal | A100-80GB on Modal |
| Wall clock | ~2.5 hours | ~1.5 hours |
| β (DPO) | n/a | 0.1 |
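
To make the β and batch-size rows concrete, here is a minimal sketch of the standard per-pair DPO objective and the effective-batch arithmetic. The function is a generic textbook form, not the training code used for Sally. The accumulation step counts are inferred from the table under the assumption of a single GPU, and the log-probabilities are made up.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    All four inputs are summed log-probabilities of a full response.
    """
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch on sign for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Effective batch size = per-device batch size x gradient-accumulation steps.
# The step counts (4 and 4) are inferred, assuming one GPU per run.
sft_effective = 4 * 4  # 16, matching the SFT column
dpo_effective = 2 * 4  # 8, matching the DPO column

# Made-up example: the policy prefers the chosen answer more strongly than the reference does.
print(round(dpo_loss(-40.0, -55.0, -45.0, -50.0), 4))  # 0.3133
```

With β=0.1, a 10-nat improvement in the chosen-vs-rejected margin over the reference model brings the loss from log 2 (about 0.693) down to about 0.313.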

Trained May 2026.

License & responsible use

Apache 2.0 (same as base Qwen3-14B). For commercial deployment, you must comply with Qwen3-14B's own license terms. Do not use for diagnostic medical decisions without clinician review.

Citation

@software{sally_v1_2026,
  title = {Sally v1.0 — Metabolic-Health Coach},
  author = {a1c.io team},
  year = {2026},
  url = {https://huggingface.co/sallya1c/sally-1.0},
}

Contact

team@a1c.io · a1c.io
