EXAMI Concept Extractor V4 — Qwen3-2B LoRA Adapter

A LoRA adapter for Qwen/Qwen3-2B, fine-tuned to extract study-worthy concepts from text across 24 academic domains.

Purpose

Given a text chunk from any academic domain, the model extracts the noun phrases a student must understand to pass an exam on the material. It returns a JSON array of concept strings.

This adapter is part of the EXAMI concept extraction pipeline, designed to power automated study guide generation for educational platforms.

Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base model + adapter
tokenizer = AutoTokenizer.from_pretrained("your-username/exami-concept-extractor-v4")
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "your-username/exami-concept-extractor-v4")
model.eval()

# System prompt (V4 "study guide")
SYSTEM = """You are building a study guide. From this text, extract every concept \
a student must understand to pass an exam on this material.

Each concept should be a noun phrase (1-5 words) exactly as it appears in the text. \
Skip names of people, organizations, benchmarks, datasets, and software tools.

Return a JSON array of strings. If the text contains no study-worthy concepts, return []."""

# Extract concepts
text = """The innate immune response is the first mechanism for host defense.
Pattern recognition receptors detect pathogen-associated molecular patterns
and trigger inflammatory cascades."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": text},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=512, temperature=0.1, do_sample=True, top_p=0.9
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# ["innate immune response", "pattern recognition receptors",
#  "pathogen-associated molecular patterns", "inflammatory cascades"]
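The model is trained to emit a bare JSON array, but sampled generations can occasionally wrap it in extra text. A small parsing helper with a bracket-matching fallback keeps the pipeline robust (the helper name and fallback logic are our own sketch, not part of the released pipeline):

```python
import json
import re

def parse_concepts(response: str) -> list[str]:
    """Parse the model's response into a list of concept strings.

    Tries strict JSON first, then falls back to the first bracketed
    span in the text; returns [] if nothing parses.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        match = re.search(r"\[.*\]", response, re.DOTALL)
        if match is None:
            return []
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return []
    # Keep only string entries, in case of malformed output
    if isinstance(parsed, list):
        return [c for c in parsed if isinstance(c, str)]
    return []
```

This also normalizes the empty-array case, so downstream stages can treat "no concepts" and "unparseable output" identically.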

Training Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3-2B (1.9B params) |
| Adapter type | LoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 10.9M (0.58% of total) |
| Learning rate | 2e-5 |
| Training checkpoint | Step 600 / epoch 0.64 (optimal recall) |
| Batch size | 2 (gradient accumulation 8, effective batch 16) |
| Max sequence length | 2048 |
| Training samples | 14,933 (12,246 positives + 2,687 negatives) |
| Eval samples | 785 |
| Scheduler | Cosine with warmup |
| Precision | bfloat16 |
| Hardware | NVIDIA RTX 5090 (single GPU) |
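The hyperparameters above correspond to a PEFT `LoraConfig` roughly like the following (a sketch implied by the table; the actual training script is not included in this repository):

```python
from peft import LoraConfig

# Adapter configuration as implied by the training-details table above.
# This is a reconstruction, not the original training code.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```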

Why epoch 0.64?

The model's extraction recall peaks around epoch 0.64. Beyond this point, the model becomes increasingly conservative and starts returning empty arrays for complex text. Checkpoint-600 was selected after comparing all checkpoints (200, 400, 600, 800, final) on diverse test content including medical textbooks, ML papers, and harvested academic text.

Training Data

  • Positives (12,246): BIO-tagged academic text across 24 domains, manually vetted (90,171 concepts reviewed, 27,497 NOT_CONCEPT labels removed)
  • Negatives (2,687): Domain-diverse examples including:
    • Benchmark result tables, code snippets, acknowledgment sections
    • YouTube sponsor text, author affiliation blocks
    • 1,069 real harvested web sentences from Wikipedia containing proper nouns but no study-worthy concepts
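For illustration, a single SFT sample plausibly mirrors the inference chat format, with the gold concept array as the assistant turn. This layout is our assumption; the released dataset schema is not documented here:

```python
import json

# Hypothetical shape of one training sample (assumed, not the published
# schema): same chat layout as inference, with the gold JSON array as
# the assistant turn.
positive_example = {
    "messages": [
        {"role": "system", "content": "You are building a study guide. ..."},
        {"role": "user", "content": (
            "Pattern recognition receptors detect pathogen-associated "
            "molecular patterns and trigger inflammatory cascades."
        )},
        {"role": "assistant", "content": json.dumps([
            "pattern recognition receptors",
            "pathogen-associated molecular patterns",
            "inflammatory cascades",
        ])},
    ]
}

# Negatives (sponsor text, affiliation blocks, etc.) would share the
# same shape, with an empty array as the target.
negative_target = json.dumps([])
```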

Domains Covered

Agriculture & Veterinary, Art & Design, Biology, Business, Chemistry, Communications & Journalism, Computer Science, Earth & Environmental Sciences, Economics, Education, Engineering, History, Information Technology, Law, Literature, Mathematics, Medicine, Music, Philosophy, Physics, Psychology, Religion & Theology, Services, Social Sciences

Pipeline Context

This adapter is Stage 2 of a 4-stage concept extraction pipeline:

| Stage | Model | Purpose |
| --- | --- | --- |
| 1. SciBERT Gate | SciBERT classifier | Filter noisy chunks (bibliography, code, ads) |
| 2. This model | Qwen3-2B + LoRA | Extract candidate concepts from full chunk text |
| 3. RoBERTa Verifier | RoBERTa-large cross-encoder | Score each concept using ~400-char window around concept (F1 = 83.1%) |
| 4. Embedding Disambiguator | Qwen3-VL-Embedding-2B KNN | Resolve grey-zone cases (100% on tested cases) |

Important: This adapter receives the full chunk text (~250 tokens) for extraction. The downstream RoBERTa verifier (Stage 3) receives only a ~400-character rough window centered on each concept, not the full chunk. This prevents score inflation on proper nouns embedded in dense technical text.
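The verifier's input can be built with a simple windowing helper. This sketch is our assumption of the approach; the helper name and exact windowing rule are not from the released code:

```python
def concept_window(chunk: str, concept: str, width: int = 400) -> str:
    """Return a ~`width`-character window centered on the concept's
    first occurrence in the chunk, as fed to the Stage 3 verifier.
    Falls back to the chunk prefix if the concept is not found.
    """
    idx = chunk.find(concept)
    if idx == -1:
        return chunk[:width]
    center = idx + len(concept) // 2
    start = max(0, center - width // 2)
    return chunk[start:start + width]
```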

Evaluation

Medical Text (StatPearls, NCBI Bookshelf)

Tested on real medical textbook content never seen during training:

| Chunk | Concepts Extracted | Kept After Full Pipeline |
| --- | --- | --- |
| Cardiac cycle (systole/diastole) | 24 | 23 |
| Ion channel mechanisms & pathology | 18 | 16 |
| Innate immune system | 10 | 10 |

Sample extracted concepts: "isovolumetric contraction", "Toll-like receptors", "saltatory conduction", "Eisenmenger syndrome", "voltage-gated ion channels", "pulmonary vascular resistance"

ML/CS Text (DeepSeek-V3 Technical Report)

| Chunk | Extracted | Kept | Correctly Rejected |
| --- | --- | --- | --- |
| Abstract | 6 | 3 | MoE, MLA, DeepSeekMoE |

Kept: "Mixture-of-Experts", "Multi-head Latent Attention", "multi-token prediction"

Checkpoint Comparison

| Checkpoint | DeepConf (noisy) | Medical | Fourier | Quantum |
| --- | --- | --- | --- | --- |
| CP-200 (epoch 0.21) | 9 kept | 12-17 kept | 3 kept | 3 kept |
| CP-400 (epoch 0.43) | 9 kept | 13-17 kept | 3 kept | 3 kept |
| CP-600 (epoch 0.64) | 8 kept | 16-23 kept | 3 kept | 3 kept |
| CP-800 (epoch 0.86) | empty | 16-21 kept | 3 kept | 3 kept |
| Final (epoch 1.00) | empty | 15-21 kept | 3 kept | 3 kept |

CP-600 is the sweet spot: strong medical/academic extraction while still handling noisy text (DeepConf paper with embedded figures and code).

Limitations

  • Returns empty on very informal text (YouTube transcripts, casual speech)
  • Performs best on clean academic text (textbooks, papers, lecture notes)
  • Should be used with the SciBERT gate (Stage 1) to pre-filter noisy chunks
  • May over-extract from very dense technical text (downstream verifier catches this)

Framework Versions

  • PEFT: 0.18.1
  • Transformers: 5.x
  • TRL: 1.x
  • PyTorch: 2.x