Shrew LoRA Adapters
LoRA adapters for Qwen/Qwen3.5-2B fine-tuned for structured extraction as part of a production RAG application. These power Shrew's structured extraction pipeline.
Adapters
| Adapter | Tasks | LoRA rank / alpha | Status |
|---|---|---|---|
| doc_processing/ | extract_metadata, semantic_chunk, summarize_document (3 tasks, single adapter) | r128 / α256 | Recommended: unified, supersedes the 3 per-task adapters below |
| extract_metadata/ | Extract structured metadata (title, authors, dates, document type) | r32 / α64 | Superseded by doc_processing/ |
| summarize_document/ | Generate document summaries | r32 / α64 | Superseded by doc_processing/ |
| semantic_chunk/ | Split documents into semantically coherent sections | r128 / α256 | Superseded by doc_processing/ |
The unified doc_processing/ adapter routes by system prompt — the prompt is just the task name (extract_metadata, semantic_chunk, or summarize_document). One adapter, three tasks.
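The routing convention above can be sketched as a small helper that builds the chat messages for a given task. The names `TASKS` and `build_messages` are illustrative assumptions and are not part of the shrew codebase; only the three task names come from this card.

```python
# Task names from the adapter table; the system prompt is just the task name.
TASKS = ("extract_metadata", "semantic_chunk", "summarize_document")


def build_messages(task: str, document_text: str) -> list[dict[str, str]]:
    """Build chat messages that route the unified adapter to one task.

    Hypothetical helper for illustration only.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return [
        {"role": "system", "content": task},
        {"role": "user", "content": document_text},
    ]
```

Rejecting unknown task names up front is a design choice in this sketch: an unrecognized system prompt would otherwise reach the adapter with no trained behaviour behind it.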
doc_processing/ metrics
| | |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| LoRA rank / alpha | 128 / 256 |
| Adapter size | ~256 MB (f16) |
| Training corpus | ~103k examples across 3 tasks |
| Epochs | 3 |
| Final eval loss | 0.7103 |
| Hardware | 4× AMD MI100 (gfx908), bf16, DeepSpeed ZeRO-2 |
Usage
The unified doc_processing/ adapter routes by system prompt. Same adapter, different task per call.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer, then attach the unified adapter from its subfolder.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-2B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = PeftModel.from_pretrained(base, "btbtyler09/shrew-2b", subfolder="doc_processing")

document_text = "..."  # placeholder: replace with the document to process

messages = [
    {"role": "system", "content": "extract_metadata"},  # or "semantic_chunk", "summarize_document"
    {"role": "user", "content": document_text},
]
# Render the chat template with thinking disabled.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
extract_metadata and semantic_chunk produce JSON; summarize_document produces prose.
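Because the output format depends on the task, downstream code will probably want to branch on it. The following is a minimal sketch, assuming the model returns bare JSON with no surrounding text for the two JSON tasks; `JSON_TASKS` and `parse_task_output` are hypothetical names introduced here for illustration.

```python
import json

# Tasks that this card says produce JSON; summarize_document produces prose.
JSON_TASKS = {"extract_metadata", "semantic_chunk"}


def parse_task_output(task: str, text: str):
    """Return parsed JSON for the JSON tasks and stripped prose otherwise."""
    text = text.strip()
    if task not in JSON_TASKS:
        return text
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Surface malformed generations instead of passing them downstream.
        raise ValueError(f"{task} output was not valid JSON: {exc}") from exc
```

The JSON schema for each task is not described in this card, so the sketch only checks that the text parses and makes no assumption about its fields.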
llama.cpp (GGUF)
GGUF versions can be used as LoRA adapters with llama-cli / llama-server:
llama-server -m Qwen3.5-2B.gguf --lora doc_processing.gguf
vLLM
vLLM loads Qwen3.5 as Qwen3_5ForConditionalGeneration (a VLM class), which nests the language model under a language_model. prefix. The adapters here are saved in standard PEFT format (trained with AutoModelForCausalLM), so the weight keys must be renamed before serving with vLLM's --enable-lora. Without the rename, vLLM silently zeros all LoRA weights with no error.
Apply the rename with fix_lora_keys.py from the shrew repo:
python fix_lora_keys.py path/to/adapter
vllm serve Qwen/Qwen3.5-2B \
--enable-lora \
--lora-modules doc_processing=path/to/adapter \
--max-lora-rank 128 \
--max-loras 1
Set --max-lora-rank 128 for the unified adapter; vLLM's default of 16 is too low for a rank-128 adapter.
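Since a missing rename fails silently, it may be worth checking an adapter's weight keys before serving. The sketch below is a heuristic that only looks for the `language_model.` prefix mentioned above; it does not reproduce fix_lora_keys.py, whose exact key layout is not shown in this card. `has_vlm_prefix` and `adapter_keys` are hypothetical helpers, and the adapter file name `adapter_model.safetensors` assumes the standard PEFT save format.

```python
from typing import Iterable

VLM_PREFIX = "language_model."  # prefix named in the vLLM note above


def has_vlm_prefix(keys: Iterable[str]) -> bool:
    """True if there is at least one key and every key contains the prefix."""
    keys = list(keys)
    return bool(keys) and all(VLM_PREFIX in key for key in keys)


def adapter_keys(path: str) -> list[str]:
    """List tensor names in a PEFT adapter_model.safetensors file."""
    # Imported lazily so the pure check above has no dependencies.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

A possible use is `has_vlm_prefix(adapter_keys("path/to/adapter/adapter_model.safetensors"))`, run once before and once after the rename script.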
Sampling parameters
Use Qwen 3.5 instruct-general parameters with enable_thinking=False:
| param | value |
|---|---|
| temperature | 0.7 |
| top_p | 0.8 |
| top_k | 20 |
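When the adapter is served through vLLM's OpenAI-compatible API, these settings can be sent per request. The sketch below only builds the request body; it assumes the server was started with `--lora-modules doc_processing=...` as in the vLLM section, so the adapter is addressed by the model name `doc_processing`. `top_k` and `chat_template_kwargs` are vLLM extensions to the OpenAI schema, and `build_chat_request` is a hypothetical helper name.

```python
import json


def build_chat_request(task: str, document_text: str,
                       lora_name: str = "doc_processing") -> dict:
    """Build a /v1/chat/completions body using the sampling values above."""
    return {
        "model": lora_name,  # name registered via --lora-modules
        "messages": [
            {"role": "system", "content": task},
            {"role": "user", "content": document_text},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,  # vLLM-specific sampling field
        "chat_template_kwargs": {"enable_thinking": False},
    }


body = json.dumps(build_chat_request("summarize_document", "Some document text."))
```

The resulting `body` would be sent as a POST with `Content-Type: application/json` to the server's `/v1/chat/completions` endpoint; the host and port depend on how the server was launched.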
License
Same as base model (Qwen/Qwen3.5-2B).