qwen2.5-7b-rellm
A distilled chunk→concept tagger for the guru comparative-religion pipeline. Fine-tuned from Qwen2.5-7B-Instruct on 2,598 (passage, tag-set) pairs labeled by a larger 27B teacher, this model scores passages from mystical texts against a curated taxonomy of comparative-religion concepts.
Current version: v2 (see Versions for v1).
Training pipeline: github.com/4-R-C-4-N-4/rellm
What it does
Given a passage of mystical text and a list of candidate concepts (each with an ID and a one-sentence definition), the model returns a JSON array rating every present concept on a 0–3 scale:
- 0: not present
- 1: peripherally present
- 2: clearly present
- 3: central theme
Concepts scoring 0 are omitted. The output is strict JSON: no markdown, no prose. The prompt contract matches the production caller in guru exactly, so this model is a drop-in replacement for the teacher in the tagging step.
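A caller will usually want to validate the response before storing it. The sketch below is one way to do that, not the guru caller's actual code. The field names `concept_id` and `score` are assumptions made for illustration; the real output schema is defined by the prompt builder in the training repo.

```python
import json


def parse_tags(raw, allowed_ids):
    """Parse a model response into {concept_id: score}.

    Returns None when the response is not a JSON array. Entries with an
    unknown ID or a score outside 1..3 are silently dropped.
    NOTE: the keys "concept_id" and "score" are hypothetical field names.
    """
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None

    tags = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        cid = item.get("concept_id")
        score = item.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        valid_score = isinstance(score, int) and not isinstance(score, bool)
        if isinstance(cid, str) and cid in allowed_ids and valid_score and 1 <= score <= 3:
            tags[cid] = score
    return tags
```

Dropping out-of-taxonomy IDs at parse time keeps invented concepts out of the index even when the model occasionally emits one.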
Why it exists
The guru pipeline indexes a multi-tradition corpus of mystical texts by tagging each passage against a working taxonomy of ~88 (and growing) comparative-religion concepts (e.g. theosis, paradox_as_teaching, divine_marriage, archons). Two upstream options had problems:
- The 27B teacher produces high-quality labels but is too slow to re-tag the full corpus on every taxonomy revision.
- The off-the-shelf 7B base model is fast enough but is unreliable: it under-tags (recall 0.16 on v2 eval), invents out-of-taxonomy IDs (9 in 130 chunks), and misjudges severity.
This model closes most of that gap at the 7B compute budget.
Evaluation (v2)
Held-out test split from the same data distribution. base = Qwen2.5-7B-Instruct (no fine-tuning), v2 = this model. Both queried at temperature 0 with identical system + user prompts via llama-server.
vs teacher labels (130 chunks, 88 concepts)
| Model | Precision | Recall | F1 | Macro-F1 | MAE | Parse rate | Out-of-taxonomy IDs | Latency (s/chunk) |
|---|---|---|---|---|---|---|---|---|
| base | 0.328 | 0.162 | 0.217 | 0.182 | 0.53 | 96.2% | 9 | 4.00 |
| v2 | 0.597 | 0.602 | 0.599 | 0.525 | 0.42 | 99.2% | 2 | 4.72 |
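For readers who want to score their own runs, here is a minimal sketch of how metrics of this kind can be computed. It is not the project's scorer, and the definitions are assumptions: a (chunk, concept) pair counts as positive when its score is at least 1, macro-F1 averages per-concept F1 over concepts with at least one gold or predicted positive, and MAE is taken over pairs that both sides tagged.

```python
def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(gold, pred, concepts):
    """gold / pred map chunk_id -> {concept_id: score in 1..3}; absent means 0."""
    tp = fp = fn = 0
    per_concept = {c: [0, 0, 0] for c in concepts}  # [tp, fp, fn]
    abs_errors = []

    for chunk_id, gold_tags in gold.items():
        pred_tags = pred.get(chunk_id, {})
        for c in concepts:
            g, p = gold_tags.get(c, 0), pred_tags.get(c, 0)
            if g > 0 and p > 0:
                tp += 1
                per_concept[c][0] += 1
                abs_errors.append(abs(g - p))
            elif p > 0:
                fp += 1
                per_concept[c][1] += 1
            elif g > 0:
                fn += 1
                per_concept[c][2] += 1

    supported = [v for v in per_concept.values() if sum(v) > 0]
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": _f1(tp, fp, fn),
        "macro_f1": sum(_f1(*v) for v in supported) / len(supported) if supported else 0.0,
        "mae": sum(abs_errors) / len(abs_errors) if abs_errors else 0.0,
    }
```

Macro-F1 weights every concept equally, which is why it is more sensitive than micro F1 to concepts the model never predicts.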
vs human-graded labels (held-out test chunks, 92 chunks)
Strongest signal: human labels are an independent ground truth.
| Model | Precision | Recall | F1 | Specificity |
|---|---|---|---|---|
| base | 0.750 | 0.158 | 0.261 | 0.963 |
| v2 | 0.600 | 0.474 | 0.529 | 0.778 |
| 27B teacher (reference) | 0.378 | 1.000 | 0.549 | 0.000 |
v2's F1 of 0.529 against human ground truth on the test chunks essentially matches the 27B teacher's own F1 of 0.549 at 7B compute cost, which was the distillation goal.
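As a quick consistency check, F1 can be recomputed from the precision and recall columns as their harmonic mean. Because the table values are already rounded to three decimals, the recomputed figures may differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the human-graded table.
rows = {
    "base": (0.750, 0.158),
    "v2": (0.600, 0.474),
    "27B teacher": (0.378, 1.000),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```

The teacher row also shows why recall alone is misleading here: recall 1.000 with specificity 0.000 means every candidate judged by the humans was tagged as present.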
Where v2 moves the needle vs v1
v2's biggest gain over v1 is uniformity across the 88-concept taxonomy: macro-F1 0.398 → 0.525 (+0.127), driven by previously-blind concepts (prayer, theurgy, detachment_gelassenheit, wu_wei, evil_as_privation, etc.) and previously-underrepresented traditions (mesopotamian +0.32, jewish_mysticism +0.20, buddhism +0.19, sufism +0.14). Trade-off: small regressions on traditions v1 over-specialized in (egyptian −0.06, hermeticism −0.08, western_esoteric −0.03) and on a handful of high-frequency concepts that lose data share under the broader distribution (living_god −0.21, body_as_obstacle −0.17).
Full v1→v2 comparison: docs/v1-vs-v2-comparison.md.
Training (v2)
- Base: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
- Method: Supervised fine-tuning (TRL SFTTrainer) with QLoRA via Unsloth
- LoRA: r=32, α=64, dropout=0, applied to all attention + MLP projections (q, k, v, o, gate, up, down)
- Schedule: 3 epochs, batch 1 × grad-accum 16 (effective 16), paged AdamW-8bit, lr 1.5e-4, cosine, warmup 0.03
- Sequence length: 5632 (88-concept prompts run median 5137 tokens; 5632 is the largest context that fits backward on a 24 GB 3090)
- Chat template: qwen-2.5
- Checkpoint: best-by-val-loss
- Hardware: single 24 GB GPU (NVIDIA RTX 3090)
- Wall-clock: 15h 51m
- Seed: 42
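The hyperparameters above can be collected into one place and sanity-checked with a few lines of arithmetic. This is a sketch, not the training script: the dictionary keys are illustrative names, the `*_proj` module names are the usual Qwen2 projection names, and "warmup 0.03" is assumed to be a warmup ratio. Trainers differ in how they round steps, so the counts are approximate.

```python
import math

HPARAMS = {
    "lora_r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    "epochs": 3,
    "per_device_batch": 1,
    "grad_accum": 16,
    "learning_rate": 1.5e-4,
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,  # assumption: 0.03 is a ratio, not a step count
    "max_seq_length": 5632,
    "seed": 42,
}


def schedule_summary(n_examples, hp=HPARAMS):
    """Approximate optimizer-step counts for a dataset of n_examples."""
    effective_batch = hp["per_device_batch"] * hp["grad_accum"]
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * hp["epochs"]
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * hp["warmup_ratio"]),
        "lora_scaling": hp["lora_alpha"] / hp["lora_r"],
    }
```

With r=32 and α=64 the LoRA update is scaled by α/r = 2.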
Training data (v2)
Source: staged_tags table of a guru.db snapshot, filtered to rows produced by teacher Qwen3.5-27B-UD-Q4_K_XL.gguf with prompt version v1, status ∈ {pending, accepted}.
- 2,598 chunks across 88 concepts (in-export)
- Splits: 2,339 train / 129 val / 130 test (90/5/5, stratified by chunk_id)
- 160 train chunks dropped during training because their tokenized length exceeded max_seq_length=5632 (right-truncation would cut the assistant JSON response; better to drop)
- Tradition mix (largest → smallest): neoplatonism, egyptian, taoism, greek_mystery, western_esoteric, zoroastrianism, jewish_mysticism, gnosticism, christian_mysticism, renaissance_hermeticism, hermeticism, buddhism, sufism, mesopotamian, platonism (plus a handful of single-chunk traditions). Compared to v1, buddhism, sufism, and several Hindu lineages now have nonzero training signal.
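The drop-rather-than-truncate rule for over-length examples can be sketched as a simple filter. This is an illustration under assumptions, not the training repo's code: `count_tokens` is a hypothetical callable standing in for whatever tokenizer-based length function the pipeline uses.

```python
def drop_overlength(examples, count_tokens, max_seq_length=5632):
    """Split examples into (kept, dropped) by tokenized length.

    Over-length examples are dropped whole: truncating on the right would
    cut off the assistant's JSON answer, leaving a prompt with no target.
    """
    kept, dropped = [], []
    for example in examples:
        if count_tokens(example) <= max_seq_length:
            kept.append(example)
        else:
            dropped.append(example)
    return kept, dropped


# Usage with a whitespace "tokenizer" as a stand-in for a real one.
kept, dropped = drop_overlength(
    ["a b c", "a b c d e f"],
    count_tokens=lambda text: len(text.split()),
    max_seq_length=4,
)
```

Returning the dropped examples as well makes it easy to log how many were lost and from which traditions.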
Files in this repo
- adapter/: LoRA adapter (~310 MB). Load on top of the base model; this is the canonical, reproducible artifact.
- merged/: adapter merged into base weights, FP16 (~15 GB). For direct from_pretrained.
- gguf/: quantized for llama.cpp / Ollama / llama-server.
  - qwen2.5-7b-rellm-F16.gguf: full-precision conversion
  - qwen2.5-7b-rellm-Q4_K_M.gguf: 4-bit, ~4.4 GB, recommended for local inference
Usage
llama.cpp / llama-server (recommended for the guru pipeline)
llama-server -m qwen2.5-7b-rellm-Q4_K_M.gguf --jinja --port 8080
The --jinja flag is required so the model's chat template is applied. The guru tagging caller hits the OpenAI-compatible /v1/chat/completions endpoint.
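A client for the running server needs nothing beyond the standard library. The sketch below assumes the server from the command above is listening on localhost:8080; the function names are illustrative and this is not the guru caller itself. Temperature 0 mirrors the evaluation setup described earlier.

```python
import json
import urllib.request


def build_request(system_prompt, user_prompt, base_url="http://localhost:8080"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0,  # greedy decoding, as in the evaluation
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tag_passage(system_prompt, user_prompt, base_url="http://localhost:8080"):
    """Send one tagging request and return the raw assistant text."""
    request = build_request(system_prompt, user_prompt, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

`tag_passage` returns the raw string, so JSON parsing and validation stay in the caller, where parse failures can be counted.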
transformers (merged weights)
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("4rc4n4/qwen2.5-7b-rellm", subfolder="merged")
model = AutoModelForCausalLM.from_pretrained(
    "4rc4n4/qwen2.5-7b-rellm", subfolder="merged", device_map="auto"
)
messages = [
    {"role": "system", "content": "You are a comparative religion scholar..."},  # see prompt below
    {"role": "user", "content": "<passage + concept list>"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tok(prompt, return_tensors="pt").to(model.device), max_new_tokens=1024)
print(tok.decode(out[0], skip_special_tokens=True))
LoRA adapter (on top of the base)
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-7B-Instruct-bnb-4bit", device_map="auto")
model = PeftModel.from_pretrained(base, "4rc4n4/qwen2.5-7b-rellm", subfolder="adapter")
Prompt contract
The model expects the exact prompt structure used at training time. See src/rellm/formats.py in the training repo for the canonical builder; the short version is:
- System: "You are a comparative religion scholar helping to build a concept index of mystical texts. For each passage given, score it against every concept definition provided. Respond ONLY with a valid JSON array (no markdown, no commentary)."
- User: a passage block, the 0–3 scoring rubric, and a JSON list of {id, definition} candidate concepts, ending with the output schema and Return [] if nothing scores >= 1.
Deviating from this format will degrade quality, because the model was trained on a single prompt template.
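To make the shape of the two messages concrete, here is a rough sketch of a builder. The system prompt is quoted from above, but the layout of the user message (section labels, ordering, rubric wording) is a hypothetical stand-in and the output-schema section is omitted. For real tagging, use the canonical builder in src/rellm/formats.py, since the model is sensitive to the exact template.

```python
import json

SYSTEM_PROMPT = (
    "You are a comparative religion scholar helping to build a concept index "
    "of mystical texts. For each passage given, score it against every concept "
    "definition provided. Respond ONLY with a valid JSON array "
    "(no markdown, no commentary)."
)

# Hypothetical rubric wording; the training-time text may differ.
RUBRIC = "0 = not present, 1 = peripherally present, 2 = clearly present, 3 = central theme"


def build_messages(passage, concepts):
    """concepts: list of {"id": ..., "definition": ...} dicts."""
    user_prompt = "\n\n".join([
        "Passage:\n" + passage,
        "Scoring rubric: " + RUBRIC,
        "Concepts:\n" + json.dumps(concepts, ensure_ascii=False),
        "Return [] if nothing scores >= 1.",
    ])
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
```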
Versions
| Tag | Date | F1 (vs teacher, full test) | Macro-F1 | Training data |
|---|---|---|---|---|
| v1 | 2026-05-13 | 0.577 (re-scored on v2-test) / 0.629 (v1-era) | 0.398 / 0.548 | 2,188 chunks, 61 concepts |
| v2 | 2026-05-22 | 0.599 | 0.525 | 2,598 chunks, 88 concepts |
Pin a specific version with revision="v1" or revision="v2" when downloading.
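One way to pin a version when downloading is `snapshot_download` from the `huggingface_hub` package, which accepts a `revision` and an `allow_patterns` filter. The sketch below assumes that package is installed and that the chosen tag contains a file matching the GGUF pattern; the helper names are illustrative.

```python
REPO_ID = "4rc4n4/qwen2.5-7b-rellm"


def download_kwargs(version, pattern="gguf/*Q4_K_M.gguf"):
    """Keyword arguments for snapshot_download, pinned to a release tag."""
    if version not in ("v1", "v2"):
        raise ValueError(f"unknown version: {version!r}")
    return {
        "repo_id": REPO_ID,
        "revision": version,          # git tag on the HF repo
        "allow_patterns": [pattern],  # fetch only the matching files
    }


def fetch(version):
    """Download the pinned files and return the local snapshot directory."""
    # Imported lazily so the helper above works without the package.
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(version))
```

Restricting the download with `allow_patterns` avoids pulling the ~15 GB merged weights when only the quantized file is needed.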
v1 (historical)
The v1 release was trained on the 61-concept taxonomy snapshot from 2026-05-11 (2,188 SFT examples). Its model card reported F1=0.629 / Macro-F1=0.548 against the v1-era teacher labels on 103 test chunks, and F1=0.638 / Macro-F1=0.508 against 360 human-graded chunks. A v2-era sanity rerun against the original v1 snapshot reproduces F1=0.615 (drift due to running against the current 88-concept taxonomy file rather than the v1-era 61-concept one).
v1 is preserved at the v1 git tag on both the HF repo and the rellm GitHub repo. Use it if you need to reproduce earlier results exactly; otherwise prefer v2.
Limitations
- Domain-locked, but broader than v1. v2 added training signal for buddhism, sufism, jewish_mysticism, and mesopotamian, but the corpus is still heavily Mediterranean / Greek-philosophical at the long tail. Calibration on East-Asian, South-Asian, and indigenous traditions remains weak.
- Taxonomy-bound. Scoring is conditioned on the concept list passed in the prompt. The model will faithfully ignore concepts not given to it; if you change the taxonomy meaningfully, retrain.
- Imbalanced concepts. A handful of low-frequency concepts (archons, pleroma, divine_intoxication, demiurge) still have F1 ≈ 0: too few teacher positives to learn a reliable boundary. Filter or boost these in downstream review.
- High-frequency concept drift. v2 traded some precision on a handful of high-frequency concepts that dominated v1's training mass (living_god −0.21, body_as_obstacle −0.17, apophatic_theology −0.09) for broader coverage. If your downstream use is dominated by those concepts, v1 may still be competitive.
- Latency. Greedy decoding at ~4.7 s/chunk on a single 24 GB GPU is fine for batch corpus tagging but not for interactive use. Use the Q4_K_M GGUF for faster local inference.
- Not a chat model anymore. This is a tagging specialist. Don't expect general assistant behavior; it was tuned on a single task and prompt format.
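The advice to filter the unreliable low-frequency concepts in downstream review could look like the sketch below. It is one possible approach, not part of the guru pipeline: predictions for the listed concepts are routed to a manual-review bucket instead of being accepted automatically.

```python
# Concepts listed above as having F1 near zero.
UNRELIABLE = frozenset({"archons", "pleroma", "divine_intoxication", "demiurge"})


def split_for_review(tags, unreliable=UNRELIABLE):
    """Split {concept_id: score} into (auto_accept, needs_review)."""
    auto_accept = {c: s for c, s in tags.items() if c not in unreliable}
    needs_review = {c: s for c, s in tags.items() if c in unreliable}
    return auto_accept, needs_review


accepted, flagged = split_for_review({"theosis": 3, "archons": 2})
```

This only catches false positives. Passages where the model missed one of these concepts still need a separate check, for example a keyword pass over the corpus.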
License
Apache 2.0
Citation
@software{rellm_qwen25_7b,
author = {4rc4n4},
title = {qwen2.5-7b-rellm: a distilled chunk→concept tagger for comparative-religion corpora},
year = {2026},
url = {https://huggingface.co/4rc4n4/qwen2.5-7b-rellm}
}