Shrew LoRA Adapters

LoRA adapters for Qwen/Qwen3.5-2B, fine-tuned for structured extraction as part of a production RAG application. They power Shrew's structured extraction pipeline.

Adapters

Adapter	Tasks	LoRA rank / alpha	Status
`doc_processing/`	`extract_metadata`, `semantic_chunk`, `summarize_document` (3 tasks, single adapter)	r128 / α256	Recommended — unified, supersedes the 3 per-task adapters below
`extract_metadata/`	Extract structured metadata (title, authors, dates, document type)	r32 / α64	Superseded by `doc_processing/`
`summarize_document/`	Generate document summaries	r32 / α64	Superseded by `doc_processing/`
`semantic_chunk/`	Split documents into semantically coherent sections	r128 / α256	Superseded by `doc_processing/`

The unified doc_processing/ adapter selects its task from the system prompt, which is just the task name (extract_metadata, semantic_chunk, or summarize_document). One adapter serves all three tasks.
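The routing convention can be wrapped in a small helper that rejects unknown task names before they reach the model. This is a minimal sketch: build_messages and VALID_TASKS are hypothetical names for illustration, not part of the shrew repo.

```python
# Task names accepted by the unified adapter, as listed in the table above.
VALID_TASKS = ("extract_metadata", "semantic_chunk", "summarize_document")


def build_messages(task: str, document_text: str) -> list[dict[str, str]]:
    """Return chat messages whose system prompt is the bare task name."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {VALID_TASKS}")
    return [
        {"role": "system", "content": task},
        {"role": "user", "content": document_text},
    ]
```

Validating up front is a design choice rather than a requirement of the adapter; it avoids sending a misspelled task name that the adapter was never trained on.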

`doc_processing/` metrics


Base model	`Qwen/Qwen3.5-2B`
LoRA rank / alpha	128 / 256
Adapter size	~256 MB (f16)
Training corpus	~103k examples across 3 tasks
Epochs	3
Final eval loss	0.7103
Hardware	4× AMD MI100 (gfx908), bf16, DeepSpeed ZeRO-2

Usage

Set the system prompt to the task name to pick the task. The same loaded adapter can handle a different task on each call.

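from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the unified adapter.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-2B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = PeftModel.from_pretrained(base, "btbtyler09/shrew-2b", subfolder="doc_processing")

document_text = "Paste the document text here."  # placeholder input

# The system prompt is the bare task name.
messages = [
    {"role": "system", "content": "extract_metadata"},   # or "semantic_chunk", "summarize_document"
    {"role": "user", "content": document_text},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

# Sample with the parameters from "Sampling parameters"; max_new_tokens is an illustrative value.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)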

extract_metadata and semantic_chunk produce JSON; summarize_document produces prose.
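Because the output type depends on the task, downstream code may want to parse the completion accordingly. The sketch below assumes the model returns bare JSON with no surrounding text for the two JSON tasks; parse_output and JSON_TASKS are hypothetical names, and the JSON schema itself is not specified here.

```python
import json

# Tasks documented above as producing JSON; summarize_document produces prose.
JSON_TASKS = {"extract_metadata", "semantic_chunk"}


def parse_output(task: str, text: str):
    """Parse the completion as JSON for JSON tasks; return prose unchanged."""
    text = text.strip()
    if task in JSON_TASKS:
        # Raises json.JSONDecodeError if the model emitted malformed JSON.
        return json.loads(text)
    return text
```

Letting json.JSONDecodeError propagate keeps malformed generations visible to the caller, who can then retry or log the raw text.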

llama.cpp (GGUF)

GGUF versions of the adapters can be loaded on top of a GGUF base model with llama-cli or llama-server via --lora:

llama-server -m Qwen3.5-2B.gguf --lora doc_processing.gguf

vLLM

vLLM loads Qwen3.5 as Qwen3_5ForConditionalGeneration (a VLM class), which nests the language model under a `language_model.` prefix. The adapters here are saved in standard PEFT format (trained with AutoModelForCausalLM), so their weight keys lack that prefix and must be renamed before serving with vLLM's --enable-lora. Without the rename, vLLM zeros all LoRA weights without raising an error, so requests are effectively served by the base model.
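To illustrate the kind of rename involved, here is a minimal sketch that operates on a plain dict of weight names. It is not the logic of fix_lora_keys.py: the function name and both prefix strings are assumptions for illustration, so inspect your adapter's actual keys before relying on them.

```python
# ASSUMED prefixes, for illustration only; not taken from fix_lora_keys.py.
OLD_PREFIX = "base_model.model.model."
NEW_PREFIX = "base_model.model.model.language_model."


def rename_lora_keys(weights: dict, old: str = OLD_PREFIX, new: str = NEW_PREFIX) -> dict:
    """Return a copy of `weights` with the `old` key prefix replaced by `new`.

    Keys already under `new`, or not under `old`, are left unchanged, so the
    function can be applied twice without double-prefixing.
    """
    renamed = {}
    for key, value in weights.items():
        if key.startswith(old) and not key.startswith(new):
            key = new + key[len(old):]
        renamed[key] = value
    return renamed
```

The idempotence guard matters because the new prefix begins with the old one; without it, a second run would nest the prefix again.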

Apply the rename with fix_lora_keys.py from the shrew repo:

python fix_lora_keys.py path/to/adapter
vllm serve Qwen/Qwen3.5-2B \
  --enable-lora \
  --lora-modules doc_processing=path/to/adapter \
  --max-lora-rank 128 \
  --max-loras 1

Set --max-lora-rank 128 for the unified adapter; vLLM's default of 16 is too low for a rank-128 adapter.
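When serving several adapters, the value for --max-lora-rank has to cover the largest rank among them. The sketch below reads the r field from each adapter's adapter_config.json, the config file PEFT writes alongside the weights. The function names are hypothetical, and the sketch assumes a single uniform rank per adapter.

```python
import json
from pathlib import Path


def adapter_rank(adapter_dir: str) -> int:
    """Read the LoRA rank (`r`) from a PEFT adapter's adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    return int(config["r"])


def required_max_lora_rank(adapter_dirs: list[str]) -> int:
    """Return the largest rank across adapters, i.e. the minimum rank to allow."""
    return max(adapter_rank(d) for d in adapter_dirs)
```

For the adapters in this repo, the table above gives ranks of 32 and 128, so 128 covers all of them.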

Sampling parameters

Use Qwen 3.5 instruct-general parameters with enable_thinking=False:

Parameter	Value
temperature	0.7
top_p	0.8
top_k	20
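When calling a vLLM server through its OpenAI-compatible chat endpoint, these settings can be sent in the request body. The sketch below only builds the JSON payload and sends nothing. It assumes the adapter was registered as doc_processing via --lora-modules; top_k and chat_template_kwargs are vLLM extensions to the OpenAI schema, and build_chat_request is a hypothetical helper name.

```python
def build_chat_request(task: str, document_text: str, model: str = "doc_processing") -> dict:
    """Build a chat-completions request body with the sampling settings above."""
    return {
        "model": model,  # LoRA name registered with --lora-modules (assumed)
        "messages": [
            {"role": "system", "content": task},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,  # vLLM extension; not part of the standard OpenAI schema
        # Forwarded to the chat template to turn off thinking mode.
        "chat_template_kwargs": {"enable_thinking": False},
    }
```

The resulting dict can be serialized with json.dumps and posted to the server's /v1/chat/completions route with any HTTP client.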

License

Same as base model (Qwen/Qwen3.5-2B).
