Shrew LoRA Adapters

LoRA adapters for Qwen/Qwen3.5-2B fine-tuned for structured extraction as part of a production RAG application. These power Shrew's structured extraction pipeline.

Adapters

| Adapter | Tasks | LoRA rank / alpha | Status |
|---|---|---|---|
| doc_processing/ | extract_metadata, semantic_chunk, summarize_document (3 tasks, single adapter) | r128 / α256 | Recommended: unified, supersedes the 3 per-task adapters below |
| extract_metadata/ | Extract structured metadata (title, authors, dates, document type) | r32 / α64 | Superseded by doc_processing/ |
| summarize_document/ | Generate document summaries | r32 / α64 | Superseded by doc_processing/ |
| semantic_chunk/ | Split documents into semantically coherent sections | r128 / α256 | Superseded by doc_processing/ |

The unified doc_processing/ adapter routes by system prompt: the system prompt consists solely of the task name (extract_metadata, semantic_chunk, or summarize_document). One adapter, three tasks.
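The routing rule can be captured in a small helper. This is a minimal sketch: `TASKS` and `build_messages` are hypothetical names introduced for illustration, not part of the adapter repository.

```python
# Hypothetical helper: builds the chat messages for one call to the unified adapter.
TASKS = ("extract_metadata", "semantic_chunk", "summarize_document")


def build_messages(task: str, document_text: str) -> list:
    """Return chat messages whose system prompt is exactly the task name."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return [
        {"role": "system", "content": task},        # the task name selects the behavior
        {"role": "user", "content": document_text},  # the document to process
    ]
```

Rejecting unknown task names early is a design choice of this sketch: a misspelled system prompt would otherwise reach the model without any error.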

doc_processing/ metrics

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| LoRA rank / alpha | 128 / 256 |
| Adapter size | ~256 MB (f16) |
| Training corpus | ~103k examples across 3 tasks |
| Epochs | 3 |
| Final eval loss | 0.7103 |
| Hardware | 4× AMD MI100 (gfx908), bf16, DeepSpeed ZeRO-2 |
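Two quick sanity checks follow from these numbers. The sketch below assumes the standard LoRA scaling factor alpha / r (the card does not mention rank-stabilized scaling) and 2 bytes per f16 parameter; both function names are hypothetical.

```python
def lora_scale(rank: int, alpha: int) -> float:
    """Standard LoRA multiplies the low-rank update by alpha / rank (assumed here)."""
    return alpha / rank


def params_from_f16_size(size_mb: float) -> float:
    """Approximate parameter count for a file of `size_mb` megabytes stored in f16."""
    bytes_per_param = 2  # f16 uses 2 bytes per value
    return size_mb * 1_000_000 / bytes_per_param


print(lora_scale(128, 256))              # unified adapter and semantic_chunk/
print(lora_scale(32, 64))                # r32 per-task adapters
print(params_from_f16_size(256) / 1e6)   # millions of adapter parameters, roughly
```

Under these assumptions every adapter listed uses the same scale of 2.0, and a ~256 MB f16 file corresponds to roughly 128 million adapter parameters.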

Usage

The unified doc_processing/ adapter routes by system prompt. Same adapter, different task per call.

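from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the unified adapter from its subfolder.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-2B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = PeftModel.from_pretrained(base, "btbtyler09/shrew-2b", subfolder="doc_processing")

document_text = "..."  # placeholder: replace with the document to process
messages = [
    {"role": "system", "content": "extract_metadata"},   # or "semantic_chunk", "summarize_document"
    {"role": "user", "content": document_text},
]
# Render the chat template as a string, with thinking disabled.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

# Generate with the sampling parameters from this card; max_new_tokens=1024 is an arbitrary cap.
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
print(tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))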

extract_metadata and semantic_chunk produce JSON; summarize_document produces prose.
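Because the output format depends on the task, downstream code may want to decode it accordingly. The following is a minimal sketch; `JSON_TASKS` and `parse_output` are hypothetical names, and the card does not specify the JSON schema, so no fields are assumed.

```python
import json

# Tasks whose output is JSON, per the note above.
JSON_TASKS = {"extract_metadata", "semantic_chunk"}


def parse_output(task: str, text: str):
    """Decode JSON for the JSON tasks; return stripped prose for summarize_document."""
    if task in JSON_TASKS:
        # json.loads raises json.JSONDecodeError if the model emitted malformed JSON.
        return json.loads(text)
    return text.strip()
```

Letting `json.JSONDecodeError` propagate keeps malformed generations visible to the caller, which can then retry or log the raw text.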

llama.cpp (GGUF)

GGUF versions can be used as LoRA adapters with llama-cli / llama-server:

llama-server -m Qwen3.5-2B.gguf --lora doc_processing.gguf

vLLM

vLLM loads Qwen3.5 as Qwen3_5ForConditionalGeneration (a VLM class), which nests the language model under a language_model. prefix. The adapters here are saved in standard PEFT format (trained with AutoModelForCausalLM), so the weight keys must be renamed before serving with vLLM's --enable-lora. Without the rename, vLLM silently zeros all LoRA weights with no error.
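The idea behind the rename can be sketched as a prefix rewrite over the adapter's weight names. This is not the contents of fix_lora_keys.py, which is not reproduced here: the function name `insert_prefix` and the `anchor` / `nested` key fragments are assumptions for illustration, and the real key layout should be checked against the adapter file and the vLLM model class.

```python
def insert_prefix(keys, anchor="model.layers.", nested="model.language_model.layers."):
    """Map each weight key to a key with the language_model. segment inserted.

    `anchor` and `nested` are illustrative guesses at the key layout,
    not values taken from fix_lora_keys.py.
    """
    renamed = {}
    for key in keys:
        if nested in key:
            renamed[key] = key  # already nested: leave unchanged so reruns are safe
        else:
            renamed[key] = key.replace(anchor, nested, 1)
    return renamed


# Example with a hypothetical PEFT-style key:
example = "base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight"
print(insert_prefix([example])[example])
```

The explicit check for an already-nested key matters because the nested form still contains the anchor substring; without the check, running the rewrite twice would insert the prefix twice.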

Apply the rename with fix_lora_keys.py from the shrew repo:

python fix_lora_keys.py path/to/adapter
vllm serve Qwen/Qwen3.5-2B \
  --enable-lora \
  --lora-modules doc_processing=path/to/adapter \
  --max-lora-rank 128 \
  --max-loras 1

Set --max-lora-rank 128 for the unified adapter: vLLM's default of 16 is lower than the adapter's rank of 128.

Sampling parameters

Use Qwen 3.5 instruct-general parameters with enable_thinking=False:

| param | value |
|---|---|
| temperature | 0.7 |
| top_p | 0.8 |
| top_k | 20 |
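As a sketch of how these values might be sent to a vLLM server launched with the `vllm serve` command from the vLLM section, the block below builds a request for the OpenAI-compatible /v1/chat/completions endpoint. It assumes the default port 8000 and the LoRA name doc_processing from --lora-modules; `build_request` is a hypothetical helper, and `top_k` and `chat_template_kwargs` are vLLM extensions to the OpenAI request schema.

```python
import json
import urllib.request

# Sampling parameters from the table above.
SAMPLING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}


def build_request(task, document_text, base_url="http://localhost:8000"):
    """Build (but do not send) a chat completion request for the LoRA adapter."""
    payload = {
        "model": "doc_processing",  # LoRA name registered via --lora-modules
        "messages": [
            {"role": "system", "content": task},
            {"role": "user", "content": document_text},
        ],
        "chat_template_kwargs": {"enable_thinking": False},
        **SAMPLING,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending requires a running server, for example:
# with urllib.request.urlopen(build_request("summarize_document", "...")) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.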

License

Same as base model (Qwen/Qwen3.5-2B).
