# EXAMI Concept Extractor V4 — Qwen3-2B LoRA Adapter
A LoRA adapter for Qwen/Qwen3-2B fine-tuned to extract study-worthy academic concepts from text across 24 academic domains.
## Purpose
Given a text chunk from any academic domain, the model extracts the noun phrases a student must understand to pass an exam on the material. It returns a JSON array of concept strings.
This adapter is part of the EXAMI concept extraction pipeline, designed to power automated study guide generation for educational platforms.
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base model + adapter
tokenizer = AutoTokenizer.from_pretrained("your-username/exami-concept-extractor-v4")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "your-username/exami-concept-extractor-v4")
model.eval()

# System prompt (V4 "study guide")
SYSTEM = """You are building a study guide. From this text, extract every concept \
a student must understand to pass an exam on this material.
Each concept should be a noun phrase (1-5 words) exactly as it appears in the text. \
Skip names of people, organizations, benchmarks, datasets, and software tools.
Return a JSON array of strings. If the text contains no study-worthy concepts, return []."""

# Extract concepts
text = """The innate immune response is the first mechanism for host defense.
Pattern recognition receptors detect pathogen-associated molecular patterns
and trigger inflammatory cascades."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": text},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=512, temperature=0.1, do_sample=True, top_p=0.9
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# ["innate immune response", "pattern recognition receptors",
#  "pathogen-associated molecular patterns", "inflammatory cascades"]
```
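The model returns the concept list as a JSON string, which callers need to parse before use. A minimal, defensive parsing helper is sketched below; `parse_concepts` is a hypothetical name, not part of this repository, and the fallback path (scanning for the first bracketed span) is an assumption about a common LLM failure mode, not documented behavior of this adapter.

```python
import json

def parse_concepts(response: str) -> list[str]:
    """Parse a JSON-array model response into a list of concept strings.
    Hypothetical helper: tolerates stray text around the array and
    returns [] when no valid array can be recovered."""
    # Strict path: the whole response is a JSON array.
    try:
        parsed = json.loads(response)
        if isinstance(parsed, list):
            return [str(c) for c in parsed]
    except json.JSONDecodeError:
        pass
    # Fallback: take the outermost bracketed span, if any.
    start, end = response.find("["), response.rfind("]")
    if start != -1 and end > start:
        try:
            parsed = json.loads(response[start:end + 1])
            if isinstance(parsed, list):
                return [str(c) for c in parsed]
        except json.JSONDecodeError:
            pass
    return []

print(parse_concepts('Sure! ["innate immune response", "inflammatory cascades"]'))
# ['innate immune response', 'inflammatory cascades']
```

Returning `[]` on unparseable output keeps the pipeline's contract intact: downstream stages can treat "no concepts" and "malformed output" identically.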
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-2B (1.9B params) |
| Adapter type | LoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 10.9M (0.58% of total) |
| Learning rate | 2e-5 |
| Training checkpoint | Step 600 / epoch 0.64 (optimal recall) |
| Batch size | 2 (gradient accumulation 8, effective batch 16) |
| Max sequence length | 2048 |
| Training samples | 14,933 (12,246 positives + 2,687 negatives) |
| Eval samples | 785 |
| Scheduler | Cosine with warmup |
| Precision | bfloat16 |
| Hardware | NVIDIA RTX 5090 (single GPU) |
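The adapter hyperparameters from the table can be reconstructed as a PEFT `LoraConfig`. This is a sketch assembled from the table above, not the published training script; the `task_type` value is an assumption consistent with causal-LM fine-tuning.

```python
from peft import LoraConfig

# Reconstructed from the training-details table above (hypothetical:
# the original training script is not published).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: standard for decoder-only fine-tuning
)
```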
### Why epoch 0.64?
The model's extraction recall peaks around epoch 0.64. Beyond this point, the model becomes increasingly conservative and starts returning empty arrays for complex text. Checkpoint-600 was selected after comparing all checkpoints (200, 400, 600, 800, final) on diverse test content including medical textbooks, ML papers, and harvested academic text.
## Training Data
- Positives (12,246): BIO-tagged academic text across 24 domains, manually vetted (90,171 concepts reviewed, 27,497 NOT_CONCEPT labels removed)
- Negatives (2,687): Domain-diverse examples including:
- Benchmark result tables, code snippets, acknowledgment sections
- YouTube sponsor text, author affiliation blocks
- 1,069 real harvested web sentences from Wikipedia containing proper nouns but no study-worthy concepts
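The exact sample schema is not published, but a plausible layout for this data — chat-format messages with the target concept list serialized as the assistant turn — can be sketched as follows. `make_sample` is a hypothetical helper for illustration only.

```python
import json

def make_sample(system: str, chunk: str, concepts: list[str]) -> dict:
    """Hypothetical training-sample layout: chat messages where the
    assistant turn is the JSON-serialized concept list. Negatives
    (sponsor text, affiliation blocks, etc.) get an empty list -> "[]"."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": chunk},
            {"role": "assistant", "content": json.dumps(concepts)},
        ]
    }

positive = make_sample(
    "You are building a study guide. ...",
    "Pattern recognition receptors detect pathogen-associated molecular patterns.",
    ["pattern recognition receptors", "pathogen-associated molecular patterns"],
)
negative = make_sample(
    "You are building a study guide. ...",
    "This video is sponsored by ...",
    [],
)
```

Encoding negatives as an empty JSON array (rather than a refusal string) keeps the output format uniform, which matches the system prompt's "return []" instruction.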
## Domains Covered
Agriculture & Veterinary, Art & Design, Biology, Business, Chemistry, Communications & Journalism, Computer Science, Earth & Environmental Sciences, Economics, Education, Engineering, History, Information Technology, Law, Literature, Mathematics, Medicine, Music, Philosophy, Physics, Psychology, Religion & Theology, Services, Social Sciences
## Pipeline Context
This adapter is Stage 2 of a 4-stage concept extraction pipeline:
| Stage | Model | Purpose |
|---|---|---|
| 1. SciBERT Gate | SciBERT classifier | Filter noisy chunks (bibliography, code, ads) |
| 2. This model | Qwen3-2B + LoRA | Extract candidate concepts from full chunk text |
| 3. RoBERTa Verifier | RoBERTa-large cross-encoder | Score each concept using ~400-char window around concept (F1=83.1%) |
| 4. Embedding Disambiguator | Qwen3-VL-Embedding-2B KNN | Resolve grey-zone cases (100% on tested cases) |
**Important:** This adapter receives the full chunk text (~250 tokens) for extraction. The downstream RoBERTa verifier (Stage 3) receives only a ~400-character rough window centered on each concept — NOT the full chunk. This prevents score inflation on proper nouns embedded in dense technical text.
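The Stage-3 windowing step described above can be sketched as a simple slice centered on the concept's first occurrence. `concept_window` is a hypothetical illustration of the idea; the pipeline's actual window-selection code is not published and may handle tokenization boundaries differently.

```python
def concept_window(chunk: str, concept: str, width: int = 400) -> str:
    """Hypothetical sketch: return a ~`width`-character slice of the
    chunk centered on the concept, instead of the full chunk."""
    pos = chunk.find(concept)
    if pos == -1:
        # Fallback assumption: if the concept isn't found verbatim,
        # hand the verifier the head of the chunk.
        return chunk[:width]
    center = pos + len(concept) // 2
    start = max(0, center - width // 2)
    return chunk[start:start + width]
```

For a ~250-token chunk (roughly 1,000+ characters), this means the verifier scores each concept against local context only, which is what keeps dense surrounding jargon from inflating its score.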
## Evaluation
### Medical Text (StatPearls, NCBI Bookshelf)
Tested on real medical textbook content never seen during training:
| Chunk | Concepts Extracted | Kept After Full Pipeline |
|---|---|---|
| Cardiac cycle (systole/diastole) | 24 | 23 |
| Ion channel mechanisms & pathology | 18 | 16 |
| Innate immune system | 10 | 10 |
Sample extracted concepts: "isovolumetric contraction", "Toll-like receptors", "saltatory conduction", "Eisenmenger syndrome", "voltage-gated ion channels", "pulmonary vascular resistance"
### ML/CS Text (DeepSeek-V3 Technical Report)
| Chunk | Extracted | Kept | Correctly Rejected |
|---|---|---|---|
| Abstract | 6 | 3 | MoE, MLA, DeepSeekMoE |
Kept: "Mixture-of-Experts", "Multi-head Latent Attention", "multi-token prediction"
### Checkpoint Comparison
| Checkpoint | DeepConf (noisy) | Medical | Fourier | Quantum |
|---|---|---|---|---|
| CP-200 (epoch 0.21) | 9 kept | 12-17 kept | 3 kept | 3 kept |
| CP-400 (epoch 0.43) | 9 kept | 13-17 kept | 3 kept | 3 kept |
| CP-600 (epoch 0.64) | 8 kept | 16-23 kept | 3 kept | 3 kept |
| CP-800 (epoch 0.86) | empty | 16-21 kept | 3 kept | 3 kept |
| Final (epoch 1.00) | empty | 15-21 kept | 3 kept | 3 kept |
CP-600 is the sweet spot: strong medical/academic extraction while still handling noisy text (DeepConf paper with embedded figures and code).
## Limitations
- Returns empty on very informal text (YouTube transcripts, casual speech)
- Performs best on clean academic text (textbooks, papers, lecture notes)
- Should be used with the SciBERT gate (Stage 1) to pre-filter noisy chunks
- May over-extract from very dense technical text (downstream verifier catches this)
## Framework Versions
- PEFT: 0.18.1
- Transformers: 5.x
- TRL: 1.x
- PyTorch: 2.x