---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
datasets:
- OpenMed/Medical-Reasoning-SFT-Mega
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- rim
- reasoning-in-memory
- latent-reasoning
- medical
- reasoning
- qwen3
- openmed
---

# RiM-Qwen3-1.7B: Reasoning in Memory for Medical QA

**Single-pass latent reasoning** for medical multiple-choice QA. Instead of
generating a chain-of-thought, this model reasons inside **fixed memory blocks**
and is read out in **one forward pass**. It matches or beats both a zero-shot
base and an explicit-CoT baseline across in-distribution and two external
medical benchmarks (within noise on MedQA), while answering **~220–630× faster per query**.

This is a research proof-of-concept implementation of **Reasoning in Memory
(RiM)** (Aichberger & Hochreiter) on top of `Qwen/Qwen3-1.7B`, trained on the
`OpenMed/Medical-Reasoning-SFT-Mega` mixture.

> ⚠️ **Medical disclaimer.** Research artifact only. **Not** a medical device and
> **not** for clinical, diagnostic, or treatment use. Outputs can be wrong.

## How it works

A **memory block** is the fixed token sequence `[<rim_b> <rim_m> <rim_m> <rim_eb>]`.
We append `K` blocks after the question; their contextual representations form a
latent workspace. A two-stage curriculum teaches the model to compute through the
blocks: Stage 1 grounds the blocks against reasoning steps, and Stage 2 refines the
final answer across the `K` blocks. At inference the answer is read out after the
blocks in a **single forward pass**; no reasoning tokens are generated.

Only the 3 new special-token embeddings are learned from scratch; the rest of the
transformer is fine-tuned and the pretrained vocabulary embeddings are frozen.
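
The input layout described above can be sketched with plain strings, without loading the model. This is only an illustration of the sequence shape: the helper names `memory_block` and `build_sequence` are hypothetical, and the real model works on token ids from its tokenizer rather than on strings.

```python
from typing import List, Optional

# Placeholder strings for the three RiM special tokens (illustration only).
RIM_B, RIM_M, RIM_EB = "<rim_b>", "<rim_m>", "<rim_eb>"


def memory_block(m: int = 2) -> List[str]:
    """One block: begin marker, m memory slots, end marker."""
    return [RIM_B] + [RIM_M] * m + [RIM_EB]


def build_sequence(question_tokens: List[str], k: int, m: int = 2,
                   answer_prefix: Optional[List[str]] = None) -> List[str]:
    """Question tokens, then k memory blocks, then an optional answer prefix."""
    return list(question_tokens) + memory_block(m) * k + list(answer_prefix or [])


seq = build_sequence(["Which", "vitamin", "?"], k=8)
print(len(seq))  # 3 question tokens + 8 blocks * 4 tokens = 35
```

With `K=8` blocks of 4 tokens each, the latent workspace adds 32 tokens to the prompt regardless of how long the question is.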

## Results

Greedy accuracy (N=1000 per cell; random = 25% on the 4-option OOD sets).

| Model | In-dist (held-out) | MedQA (OOD) | MedMCQA (OOD) | Latency/query† |
|---|---|---|---|---|
| Base Qwen3-1.7B (zero-shot) | 50.9% | 45.7% | 42.8% | ~7.8 s |
| CoT (explicit SFT) | 47.3% | 42.3% | 42.4% | ~22 s |
| **RiM v1 (this model)** | **53.6%** | 45.1% | **47.2%** | **35 ms** |
| RiM v2 (MCQ-weighted Stage 2) | 53.2% | **46.9%** | 47.2% | 35 ms |

- RiM is **best or tied on all three benchmarks** (v2 leads on MedQA, where v1 trails
  the zero-shot base by 0.6 points, within noise) while answering **~220× faster than
  the base and ~630× faster than CoT** per query, because it reads the answer out of
  the memory blocks instead of autoregressively generating a reasoning trace.
- In-distribution **pass@8 ≈ 85%** (vs ~54% greedy), and accuracy is **stable across
  memory budgets** K∈{1,2,4,8}.
- Honest notes: differences on MedQA are within noise (~±1.5%); the explicit-CoT SFT
  baseline slightly *underperforms* the zero-shot base here (fine-tuning on the
  mixed-quality, 91%-open-ended traces modestly hurt the strong base instruct model).
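
As a rough sanity check on the "within noise" note, the binomial standard error of an accuracy measured on N=1000 questions can be computed directly. This sketch assumes independent per-question outcomes and treats the quoted ~±1.5% as roughly one standard error; both are assumptions, not statements from the evaluation code.

```python
import math


def binomial_se(acc: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent trials."""
    return math.sqrt(acc * (1.0 - acc) / n)


n = 1000
base_medqa, rim_v1_medqa = 0.457, 0.451  # MedQA column of the table above

se = binomial_se(base_medqa, n)
gap = base_medqa - rim_v1_medqa
print(f"SE: {100 * se:.2f} pts, gap: {100 * gap:.1f} pts")  # SE: 1.58 pts, gap: 0.6 pts
```

The 0.6-point gap between the base and RiM v1 on MedQA is well under one standard error, which is consistent with calling that difference noise.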

†**Latency methodology.** Single-request (batch=1) answer generation on one RTX PRO
6000, bf16, warmed up, mean over 32 samples. RiM takes 35 ms to generate the answer (the
pure forward-pass readout is **12 ms**); base/CoT must generate ~520 / ~1460 tokens
(~7.8 s / ~22 s). Under large-batch serving the per-sample *throughput* gap is smaller
(≈8 ms vs ≈1 s), but the single-query latency above is what a user waits for one answer.
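
The headline speedups follow from the latency figures above by simple division. The short calculation below uses only the numbers quoted in the footnote; the variable names are illustrative.

```python
# Single-query latencies in seconds, as quoted in the methodology note.
rim_s, base_s, cot_s = 0.035, 7.8, 22.0
# Approximate number of tokens the two baselines generate per answer.
base_tokens, cot_tokens = 520, 1460

speedup_vs_base = base_s / rim_s                 # about 223x
speedup_vs_cot = cot_s / rim_s                   # about 629x
ms_per_token_base = 1000 * base_s / base_tokens  # about 15.0 ms per token
ms_per_token_cot = 1000 * cot_s / cot_tokens     # about 15.1 ms per token

print(f"{speedup_vs_base:.0f}x vs base, {speedup_vs_cot:.0f}x vs CoT")
print(f"{ms_per_token_base:.1f} / {ms_per_token_cot:.1f} ms per generated token")
```

Both baselines decode at about 15 ms per token, so their latency difference comes from how many tokens they generate rather than from per-token cost.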

## Usage (single forward pass, no generated reasoning)

```python
import re
from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "NeuroDiscoveryAI/OpenMed-qwen3-1.7b-RIM"
K, M = 8, 2  # K memory blocks; M <rim_m> tokens per block

tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(
    REPO, dtype=torch.bfloat16, attn_implementation="sdpa").cuda().eval()

# Ids of the three RiM special tokens, and one memory block built from them.
b, m, eb = (tok.convert_tokens_to_ids(t) for t in ("<rim_b>", "<rim_m>", "<rim_eb>"))
block = [b] + [m] * M + [eb]
# The answer prefix is pre-filled so the model only has to emit the option letter.
PREFIX = tok.encode("The final answer is \\boxed{", add_special_tokens=False)

@torch.no_grad()
def answer(question: str) -> Optional[str]:
    """Return the predicted option letter, or None if no letter is found."""
    q = tok.apply_chat_template([{"role": "user", "content": question}],
                                tokenize=True, add_generation_prompt=True,
                                enable_thinking=False,
                                return_dict=False)  # plain list of token ids
    ids = q + block * K + PREFIX  # question, then K memory blocks, then answer prefix
    out = model.generate(torch.tensor([ids]).cuda(), max_new_tokens=8,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    gen = tok.decode(out[0, len(ids):], skip_special_tokens=True)
    mtch = re.search(r"([A-J])", gen)  # first option letter in the continuation
    return mtch.group(1) if mtch else None

q = ("Which vitamin deficiency causes scurvy?\n"
     "A: Vitamin A\nB: Vitamin B12\nC: Vitamin C\nD: Vitamin D")
print(answer(q))  # -> "C"
```

Use `attn_implementation="sdpa"` (not flash-attention) if you ever need the custom
masked training path; for this single-pass inference plain causal attention is fine.
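
To build prompts in the same shape as the example question above, a small formatting helper can be used. This is a minimal sketch: `format_mcq` is a hypothetical name, and the `A: ...` line layout is inferred from the usage example rather than from a documented prompt specification, so treat it as an assumption about what the model expects.

```python
from string import ascii_uppercase
from typing import List


def format_mcq(question: str, options: List[str]) -> str:
    """Render a question and its options as 'A: ...' lines, one per option."""
    # The usage example extracts letters A-J, so at most 10 options are supported.
    if not 2 <= len(options) <= 10:
        raise ValueError("expected between 2 and 10 options (letters A-J)")
    lines = [question.strip()]
    lines += [f"{letter}: {text}" for letter, text in zip(ascii_uppercase, options)]
    return "\n".join(lines)


prompt = format_mcq("Which vitamin deficiency causes scurvy?",
                    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"])
print(prompt)
```

The resulting string can be passed to the `answer` function from the usage example.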

## Training

- Base: `Qwen/Qwen3-1.7B` (dense, full-attention). Data: `OpenMed/Medical-Reasoning-SFT-Mega`
  (mixture of multiple-choice + open-ended; trained on the full mixture, evaluated on
  the MCQ subset).
- Stage 1: 6 epochs, one memory block per reasoning step, linear-relative supervision
  anneal. Stage 2: 2 epochs, K=8 blocks, anytime-answer objective, lower LR + higher
  dropout. bf16, 8× GPU, custom 4D attention mask (SDPA).
- Code: training/eval/benchmark scripts are released alongside this model.

## Limitations

In-distribution eval uses auto-extracted answer letters from a held-out slice of the
training dataset. Only a single model size (1.7B) and a single seed were evaluated.
English only. The OOD numbers (MedQA/MedMCQA) are 4-option; in-distribution is up to
10-option. Not safe for any real-world medical decision-making.

## Citation

```bibtex
@article{aichberger2026rim,
  title  = {Unlocking the Working Memory of Large Language Models for Latent Reasoning},
  author = {Aichberger, Lukas and Hochreiter, Sepp},
  year   = {2026}
}
```

Also cite `Qwen/Qwen3-1.7B` and `OpenMed/Medical-Reasoning-SFT-Mega` (both Apache-2.0).