LLM-OS-Models/Qwen3.5-9B-Graph-Preflexor-ORPO
Merged full model from the ORPO cold-start stage of the Graph-PRefLexOR
reproduction fork gyunggyng/lfm-graph-preflexor (a fork of
lamm-mit/graph-preflexor-grpo, arXiv 2607.00924v1).
- Base model: `principled-intelligence/Qwen3.5-9B-text-only` (`Qwen3_5TextForCausalLM`, `model_type: qwen3_5_text`, hybrid linear + full attention)
- Stage: ORPO cold-start (step 1 of 2; Graph-GRPO refinement pending)
- Architecture: text-only Qwen3.5, 32 layers, 4 full-attention layers (every 4th), 28 linear-attention layers, hidden 4096, vocab 248087.
- Checkpoint: `checkpoint-250` merged into the base.
Training
- Framework: TRL 0.24 `ORPOTrainer`, PEFT LoRA (r=32, alpha=64, dropout=0.05, targets = q/k/v/o/gate/up/down).
- Data: `lamm-mit/graph_reasoning_10K` filtered for graph-reasoning items with structured `<brainstorm>...<synthesis>` reasoning targets.
- Hardware: 4× H200 (CUDA 12.8), `torch 2.12.0.dev20260407+cu128`, `transformers 5.5.4`, bfloat16.
- Hparams: LR 5e-6, effective batch 8 (per_device 1 × accum 2 × world 4), max_prompt 1536, max_completion 4096, eval disabled to fit VRAM (ORPO's `concatenated_forward` OOMs at 9B + seq 5632).
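The batch size and sequence length quoted above follow from simple arithmetic. A minimal sketch, in which the class and field names are illustrative and not taken from the training scripts:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrpoRunShape:
    """Hypothetical container for the hyperparameters listed above."""

    per_device_batch: int = 1
    grad_accum_steps: int = 2
    world_size: int = 4  # 4x H200
    max_prompt_tokens: int = 1536
    max_completion_tokens: int = 4096

    @property
    def effective_batch(self) -> int:
        # Sequences contributing to each optimizer step.
        return self.per_device_batch * self.grad_accum_steps * self.world_size

    @property
    def max_sequence_tokens(self) -> int:
        # Longest prompt plus longest completion.
        return self.max_prompt_tokens + self.max_completion_tokens


shape = OrpoRunShape()
print(shape.effective_batch)      # 8
print(shape.max_sequence_tokens)  # 5632
```

The second value is the sequence length at which `concatenated_forward` is reported to run out of memory during evaluation.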
Results (checkpoint-250)
| Metric | Value |
|---|---|
| ORPO loss | 1.413 → 0.98 (step 295, crashed at final eval) |
| ORPO accuracy | 1.0 (from step 55) |
| Eval score (rq / depth / trace / overall, out of 10) | 5.42 / 6.60 / 6.38 / 6.13 |
| Sentinel hit-rate (100 q) | brainstorm 99, graph 94, graph_json 79, patterns 85, synthesis 84 |
Eval was run with `scripts/05c_eval_transformers.py` (4-GPU shard-parallel
transformers, eager attention, thinking enabled) because vLLM 0.19 and 0.20
do not register `Qwen3_5TextForCausalLM`; see "Known limitations" below.
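The sentinel hit-rate row can be read as the number of generations, out of 100 questions, that contain each tag. A minimal sketch of such a count follows. It assumes a hit means that both the opening and the closing tag appear in the text. The criterion used by the actual evaluation script is not stated here, and `sentinel_hits` is a hypothetical helper.

```python
SENTINELS = ("brainstorm", "graph", "graph_json", "patterns", "synthesis")


def sentinel_hits(generations):
    """Count, per tag, how many generations contain both <tag> and </tag>."""
    hits = {tag: 0 for tag in SENTINELS}
    for text in generations:
        for tag in SENTINELS:
            # "<graph>" does not match "<graph_json>" because of the closing ">".
            if f"<{tag}>" in text and f"</{tag}>" in text:
                hits[tag] += 1
    return hits


# Two made-up generations, for illustration only.
samples = [
    "<brainstorm>a</brainstorm><graph>b</graph>",
    "<brainstorm>c</brainstorm><graph_json>{}</graph_json>",
]
counts = sentinel_hits(samples)
print(counts)
```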
Output format
The model emits a structured reasoning trace inside `<think>`, followed by the final answer:

```
<think>
<brainstorm>... free-form exploration ...</brainstorm>
<graph>... concept graph (natural language) ...</graph>
<graph_json>{"nodes": [...], "edges": [...]}</graph_json>
<patterns>... reusable abstractions ...</patterns>
<synthesis>... integrated reasoning ...</synthesis>
</think>
final answer
```
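Downstream code usually needs the graph and the final answer as separate values. A minimal parsing sketch follows. It assumes each tag appears at most once and that the `<graph_json>` payload is valid JSON when present, so missing or malformed sections come back as `None`. `parse_trace` is a hypothetical helper, and the node and edge shapes in the sample are made up for illustration.

```python
import json
import re

SECTIONS = ("brainstorm", "graph", "graph_json", "patterns", "synthesis")


def parse_trace(text):
    """Split a generation into its tagged sections and the final answer."""
    # Everything after </think> is the answer; without the tag, answer is "".
    think, _, answer = text.partition("</think>")
    parsed = {}
    for name in SECTIONS:
        match = re.search(rf"<{name}>(.*?)</{name}>", think, flags=re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    if parsed["graph_json"] is not None:
        try:
            parsed["graph_json"] = json.loads(parsed["graph_json"])
        except json.JSONDecodeError:
            # Tag present but payload is not valid JSON.
            parsed["graph_json"] = None
    parsed["answer"] = answer.strip()
    return parsed


sample = (
    "<think>\n"
    "<brainstorm>ideas</brainstorm>\n"
    '<graph_json>{"nodes": ["a", "b"], "edges": [["a", "b"]]}</graph_json>\n'
    "</think>\n"
    "final answer"
)
result = parse_trace(sample)
print(result["graph_json"]["nodes"], result["answer"])
```

When decoding with `skip_special_tokens=False`, the answer may still carry end-of-turn tokens, which this sketch does not strip.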
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-OS-Models/Qwen3.5-9B-Graph-Preflexor-ORPO"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Render the chat template to text, then tokenize it.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Your graph-reasoning question here"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Low-temperature sampling; the reasoning trace can be long.
out = model.generate(**inputs, max_new_tokens=3500, do_sample=True, temperature=0.2)

# Decode only the newly generated tokens, keeping special tokens visible.
print(tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=False))
```
Pass `attn_implementation="eager"` to `from_pretrained` if flash-linear-attention is
not installed; with SDPA the model silently returns empty tokens.
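One way to apply this rule automatically is to check whether the kernels are importable before loading the model. A minimal sketch, assuming that flash-linear-attention installs a top-level module named `fla`; `attention_kwargs` is a hypothetical helper:

```python
import importlib.util


def attention_kwargs(module_name="fla"):
    """Return extra from_pretrained kwargs based on kernel availability.

    Assumption: flash-linear-attention is importable as "fla".
    """
    if importlib.util.find_spec(module_name) is None:
        # Kernels missing: request eager attention explicitly.
        return {"attn_implementation": "eager"}
    # Kernels present: leave the attention choice to transformers.
    return {}


extra = attention_kwargs()
print(extra)
```

The returned dict can be unpacked into the `from_pretrained` call, for example `AutoModelForCausalLM.from_pretrained(..., **extra)`.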
Known limitations
- vLLM: `Qwen3_5TextForCausalLM` is not in vLLM's registered architectures as of vLLM 0.20.2 (only `Qwen3_5ForConditionalGeneration` / `Qwen3_5MoeForConditionalGeneration` / `Qwen3_5MTP` are). vLLM's generic `TransformersForCausalLM` wrapper also fails because it expects the multimodal prefix `model.language_model.*`, while text-only weights are flat at `model.layers.*`. Use transformers for inference until vLLM adds native text-only Qwen3.5 support.
- Final eval OOM: `ORPOTrainer` forces a final `evaluate()` after training, which OOMs on 9B + seq 5632 in `concatenated_forward`. Checkpoint-250 (saved before the crash) is what's merged here.
- GRPO stage: not yet completed. The Qwen3.5 text-only arch blocks the vLLM-based rollout server the GRPO config assumes, so the GRPO refinement run was abandoned at the import step.
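The prefix mismatch described in the vLLM item can be checked from parameter names alone. A minimal diagnostic sketch follows. `weight_layout` is a hypothetical helper and the example key names are illustrative. For a sharded safetensors checkpoint, the real names could be read from the `weight_map` in `model.safetensors.index.json`, assuming the repository is stored that way.

```python
def weight_layout(keys):
    """Classify checkpoint parameter names by where the decoder layers live."""
    keys = list(keys)
    if any(k.startswith("model.language_model.") for k in keys):
        # Layout the generic wrapper is described as expecting.
        return "multimodal-prefixed"
    if any(k.startswith("model.layers.") for k in keys):
        # Layout described for the text-only weights.
        return "flat text-only"
    return "unknown"


# Illustrative names only, not read from the actual checkpoint.
flat_keys = ["model.embed_tokens.weight", "model.layers.0.mlp.gate_proj.weight"]
nested_keys = ["model.language_model.layers.0.mlp.gate_proj.weight"]
print(weight_layout(flat_keys))
print(weight_layout(nested_keys))
```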
Citation
```bibtex
@article{graphpreflexor2025,
  title={Graph-PRefLexOR: Graph-based Preference-based Reasoning via Learning},
  doi={10.48550/arXiv.2607.00924}
}
```