# EXAMI Concept Extractor V4 — Qwen3-2B LoRA Adapter
A LoRA adapter for Qwen/Qwen3-2B fine-tuned to extract study-worthy academic concepts from text across 24 academic domains.
## Purpose
Given a text chunk from any academic domain, the model extracts the noun phrases a student must understand to pass an exam on the material. It returns a JSON array of concept strings.
This adapter is part of the EXAMI concept extraction pipeline, designed to power automated study guide generation for educational platforms.
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base model + adapter
tokenizer = AutoTokenizer.from_pretrained("your-username/exami-concept-extractor-v4")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "your-username/exami-concept-extractor-v4")
model.eval()

# System prompt (V4 "study guide")
SYSTEM = """You are building a study guide. From this text, extract every concept \
a student must understand to pass an exam on this material.
Each concept should be a noun phrase (1-5 words) exactly as it appears in the text. \
Skip names of people, organizations, benchmarks, datasets, and software tools.
Return a JSON array of strings. If the text contains no study-worthy concepts, return []."""

# Extract concepts
text = """The innate immune response is the first mechanism for host defense.
Pattern recognition receptors detect pathogen-associated molecular patterns
and trigger inflammatory cascades."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": text},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=512, temperature=0.1, do_sample=True, top_p=0.9
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# ["innate immune response", "pattern recognition receptors",
#  "pathogen-associated molecular patterns", "inflammatory cascades"]
```
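The model returns the concept list as a JSON string, which callers need to parse before use. A minimal, defensive parsing helper is sketched below; `parse_concepts` is a hypothetical name, not part of this repository, and the fallback path (scanning for the first bracketed span) is an assumption about a common LLM failure mode, not documented behavior of this adapter.

```python
import json

def parse_concepts(response: str) -> list[str]:
    """Parse a JSON-array model response into a list of concept strings.
    Hypothetical helper: tolerates stray text around the array and
    returns [] when no valid array can be recovered."""
    # Strict path: the whole response is a JSON array.
    try:
        parsed = json.loads(response)
        if isinstance(parsed, list):
            return [str(c) for c in parsed]
    except json.JSONDecodeError:
        pass
    # Fallback: take the outermost bracketed span, if any.
    start, end = response.find("["), response.rfind("]")
    if start != -1 and end > start:
        try:
            parsed = json.loads(response[start:end + 1])
            if isinstance(parsed, list):
                return [str(c) for c in parsed]
        except json.JSONDecodeError:
            pass
    return []

print(parse_concepts('Sure! ["innate immune response", "inflammatory cascades"]'))
# ['innate immune response', 'inflammatory cascades']
```

Returning `[]` on unparseable output keeps the pipeline's contract intact: downstream stages can treat "no concepts" and "malformed output" identically.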
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-2B (1.9B params) |
| Adapter type | LoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 10.9M (0.58% of total) |
| Learning rate | 2e-5 |
| Training checkpoint | Step 600 / epoch 0.64 (optimal recall) |
| Batch size | 2 (gradient accumulation 8, effective batch 16) |
| Max sequence length | 2048 |
| Training samples | 14,933 (12,246 positives + 2,687 negatives) |
| Eval samples | 785 |
| Scheduler | Cosine with warmup |
| Precision | bfloat16 |
| Hardware | NVIDIA RTX 5090 (single GPU) |
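The adapter hyperparameters from the table can be reconstructed as a PEFT `LoraConfig`. This is a sketch assembled from the table above, not the published training script; the `task_type` value is an assumption consistent with causal-LM fine-tuning.

```python
from peft import LoraConfig

# Reconstructed from the training-details table above (hypothetical:
# the original training script is not published).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: standard for decoder-only fine-tuning
)
```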
### Why epoch 0.64?
The model's extraction recall peaks around epoch 0.64. Beyond this point, the model becomes increasingly conservative and starts returning empty arrays for complex text. Checkpoint-600 was selected after comparing all checkpoints (200, 400, 600, 800, final) on diverse test content including medical textbooks, ML papers, and harvested academic text.
## Training Data
- Positives (12,246): BIO-tagged academic text across 24 domains, manually vetted (90,171 concepts reviewed, 27,497 NOT_CONCEPT labels removed)
- Negatives (2,687): Domain-diverse examples including:
- Benchmark result tables, code snippets, acknowledgment sections
- YouTube sponsor text, author affiliation blocks
- 1,069 real harvested web sentences from Wikipedia containing proper nouns but no study-worthy concepts
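The exact sample schema is not published, but a plausible layout for this data — chat-format messages with the target concept list serialized as the assistant turn — can be sketched as follows. `make_sample` is a hypothetical helper for illustration only.

```python
import json

def make_sample(system: str, chunk: str, concepts: list[str]) -> dict:
    """Hypothetical training-sample layout: chat messages where the
    assistant turn is the JSON-serialized concept list. Negatives
    (sponsor text, affiliation blocks, etc.) get an empty list -> "[]"."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": chunk},
            {"role": "assistant", "content": json.dumps(concepts)},
        ]
    }

positive = make_sample(
    "You are building a study guide. ...",
    "Pattern recognition receptors detect pathogen-associated molecular patterns.",
    ["pattern recognition receptors", "pathogen-associated molecular patterns"],
)
negative = make_sample(
    "You are building a study guide. ...",
    "This video is sponsored by ...",
    [],
)
```

Encoding negatives as an empty JSON array (rather than a refusal string) keeps the output format uniform, which matches the system prompt's "return []" instruction.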
## Domains Covered
Agriculture & Veterinary, Art & Design, Biology, Business, Chemistry, Communications & Journalism, Computer Science, Earth & Environmental Sciences, Economics, Education, Engineering, History, Information Technology, Law, Literature, Mathematics, Medicine, Music, Philosophy, Physics, Psychology, Religion & Theology, Services, Social Sciences
## Pipeline Context
This adapter is Stage 2 of a 4-stage concept extraction pipeline:
| Stage | Model | Purpose |
|---|---|---|
| 1. SciBERT Gate | SciBERT classifier | Filter noisy chunks (bibliography, code, ads) |
| 2. This model | Qwen3-2B + LoRA | Extract candidate concepts from full chunk text |
| 3. RoBERTa Verifier | RoBERTa-large cross-encoder | Score each concept using ~400-char window around concept (F1=83.1%) |
| 4. Embedding Disambiguator | Qwen3-VL-Embedding-2B KNN | Resolve grey-zone cases (100% on tested cases) |
**Important:** This adapter receives the full chunk text (~250 tokens) for extraction. The downstream RoBERTa verifier (Stage 3) receives only a ~400-character rough window centered on each concept — NOT the full chunk. This prevents score inflation on proper nouns embedded in dense technical text.
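The Stage-3 windowing step described above can be sketched as a simple slice centered on the concept's first occurrence. `concept_window` is a hypothetical illustration of the idea; the pipeline's actual window-selection code is not published and may handle tokenization boundaries differently.

```python
def concept_window(chunk: str, concept: str, width: int = 400) -> str:
    """Hypothetical sketch: return a ~`width`-character slice of the
    chunk centered on the concept, instead of the full chunk."""
    pos = chunk.find(concept)
    if pos == -1:
        # Fallback assumption: if the concept isn't found verbatim,
        # hand the verifier the head of the chunk.
        return chunk[:width]
    center = pos + len(concept) // 2
    start = max(0, center - width // 2)
    return chunk[start:start + width]
```

For a ~250-token chunk (roughly 1,000+ characters), this means the verifier scores each concept against local context only, which is what keeps dense surrounding jargon from inflating its score.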
## Evaluation
### Medical Text (StatPearls, NCBI Bookshelf)
Tested on real medical textbook content never seen during training:
| Chunk | Concepts Extracted | Kept After Full Pipeline |
|---|---|---|
| Cardiac cycle (systole/diastole) | 24 | 23 |
| Ion channel mechanisms & pathology | 18 | 16 |
| Innate immune system | 10 | 10 |
Sample extracted concepts: "isovolumetric contraction", "Toll-like receptors", "saltatory conduction", "Eisenmenger syndrome", "voltage-gated ion channels", "pulmonary vascular resistance"
### ML/CS Text (DeepSeek-V3 Technical Report)
| Chunk | Extracted | Kept | Correctly Rejected |
|---|---|---|---|
| Abstract | 6 | 3 | MoE, MLA, DeepSeekMoE |
Kept: "Mixture-of-Experts", "Multi-head Latent Attention", "multi-token prediction"
### Checkpoint Comparison
| Checkpoint | DeepConf (noisy) | Medical | Fourier | Quantum |
|---|---|---|---|---|
| CP-200 (epoch 0.21) | 9 kept | 12-17 kept | 3 kept | 3 kept |
| CP-400 (epoch 0.43) | 9 kept | 13-17 kept | 3 kept | 3 kept |
| CP-600 (epoch 0.64) | 8 kept | 16-23 kept | 3 kept | 3 kept |
| CP-800 (epoch 0.86) | empty | 16-21 kept | 3 kept | 3 kept |
| Final (epoch 1.00) | empty | 15-21 kept | 3 kept | 3 kept |
CP-600 is the sweet spot: strong medical/academic extraction while still handling noisy text (DeepConf paper with embedded figures and code).
## Limitations
- Returns empty on very informal text (YouTube transcripts, casual speech)
- Performs best on clean academic text (textbooks, papers, lecture notes)
- Should be used with the SciBERT gate (Stage 1) to pre-filter noisy chunks
- May over-extract from very dense technical text (downstream verifier catches this)
## Framework Versions
- PEFT: 0.18.1
- Transformers: 5.x
- TRL: 1.x
- PyTorch: 2.x