Clinical Trial Endpoint Classifier: 4B v2 (Qwen3.5-4B LoRA)
This is the v2 update of the endpoint-qwen3.5-4b-lora model, trained on roughly twice as much data (3,906 samples vs 1,948 in v1) with broader source coverage spanning ClinicalTrials.gov, the EU Clinical Trials Register (EU CTR), and the Chinese Clinical Trial Registry (ChiCTR).
The model is a LoRA adapter fine-tuned on Qwen3.5-4B that extracts and classifies clinical trial endpoints from outcome text. It returns structured JSON with standardized endpoint names, measurement types, methods, and the other fields listed under Field Definitions.
What's New in v2
- 2x training data: 3,906 samples (vs 1,948 in v1)
- Multi-source diversity: EU CTR (700) + ChiCTR (700) + ClinicalTrials.gov (600) added on top of v1's CTgov data
- 12 disease categories in the v2 CTgov sample: diabetes, breast cancer, cardiovascular, Alzheimer's, asthma, depression, hepatitis, rheumatoid arthritis, chronic kidney disease, multiple sclerosis, obesity, Parkinson's
- Better generalization to non-US trial registries (EU + China)
- Improved labeling: v2 samples labeled by GPT-OSS-120B (vs v1 by Qwen3.6-plus)
Output Format
{
  "endpoints": [
    {
      "endpoint_name_standardized": "Objective Response Rate",
      "measurement_of": "tumor response",
      "measurement_type": "binary",
      "metric_type": "proportion",
      "timeframe": "Week 24",
      "measurement_method": "RECIST v1.1",
      "evaluation_criteria": "CR or PR",
      "unit": "%",
      "population": null,
      "is_composite": false,
      "components": []
    }
  ]
}
Field Definitions
| Field | Description | Examples |
|---|---|---|
| endpoint_name_standardized | Standardized endpoint name | "Overall Survival", "HbA1c", "PASI 75 Response Rate" |
| measurement_of | What is being measured | "tumor response", "glycated hemoglobin" |
| measurement_type | Type of measurement | continuous, binary, ordinal, time-to-event |
| metric_type | Statistical metric | mean, proportion, hazard ratio, change from baseline |
| timeframe | When measurement occurs | "Week 12", "Up to 36 months" |
| measurement_method | How it is measured | "blood test", "RECIST v1.1", "12-lead ECG" |
| evaluation_criteria | Criteria for evaluation | "PASI 75", "CR or PR" |
| unit | Unit of measurement | "%", "mg/dL", "mm" |
| population | Specific population | "adults aged 18-65", "ITT", "Full analysis set" |
| is_composite | Whether composite endpoint | true / false |
| components | Components if composite | ["MI", "stroke", "cardiovascular death"] |
A single input text can yield multiple endpoints (e.g., safety texts with 10+ sub-endpoints); each one is returned as a separate object in the endpoints array.
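The field list above can be turned into a quick structural check on parsed output. The following is a minimal sketch, not part of the released model code: `EXPECTED_FIELDS` and `validate_endpoints` are hypothetical names, and it assumes every field is always present (set to null when unknown), as in the examples on this card.

```python
# Field names copied from the Field Definitions table.
EXPECTED_FIELDS = {
    "endpoint_name_standardized", "measurement_of", "measurement_type",
    "metric_type", "timeframe", "measurement_method", "evaluation_criteria",
    "unit", "population", "is_composite", "components",
}


def validate_endpoints(payload: dict) -> list:
    """Return a list of problems found; an empty list means the shape looks right."""
    endpoints = payload.get("endpoints")
    if not isinstance(endpoints, list):
        return ["'endpoints' must be a list"]

    problems = []
    for i, ep in enumerate(endpoints):
        if not isinstance(ep, dict):
            problems.append(f"endpoint {i}: not an object")
            continue
        # Assumption: all fields are always emitted, with null for unknown values.
        missing = EXPECTED_FIELDS - ep.keys()
        if missing:
            problems.append(f"endpoint {i}: missing {sorted(missing)}")
        if "is_composite" in ep and not isinstance(ep["is_composite"], bool):
            problems.append(f"endpoint {i}: is_composite must be true/false")
        if "components" in ep and not isinstance(ep["components"], list):
            problems.append(f"endpoint {i}: components must be a list")
    return problems
```

Calling `validate_endpoints(parsed)` on the dictionary returned by `json.loads` gives an empty list when every endpoint object carries the full set of fields.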
Training Details
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Method | LoRA (bf16, rank 16, alpha 16) |
| Training data | 3,906 samples (1,948 v1 + 1,958 v2) |
| Data sources | ClinicalTrials.gov, EU CTR, ChiCTR |
| Epochs | 3 |
| Steps | 735 |
| Training time | ~2 hours on RTX 4090 |
| Framework | Unsloth + TRL SFTTrainer |
Data Composition
| Source | v1 samples | v2 samples | Total |
|---|---|---|---|
| ClinicalTrials.gov | 1,948 | 600 | 2,548 |
| EU CTR | - | 700 | 700 |
| ChiCTR (China) | - | 700 | 700 |
| Total | 1,948 | 1,958 | 3,906 |
Hyperparameters
Method: LoRA (bf16, NOT 4-bit)
LoRA rank: 16, alpha: 16, dropout: 0
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Learning rate: 2e-4 (cosine scheduler)
Batch size: 2 per device (gradient accumulation 8, effective 16)
Epochs: 3
Optimizer: adamw_8bit
Sequence length: 2048
Gradient checkpointing: unsloth
Warmup steps: 10
Weight decay: 0.01
Max grad norm: 1.0
Seed: 3407
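The step count reported under Training Details follows from these settings. As a worked check, the sketch below recomputes it, assuming a single GPU and that the final partial batch of each epoch still counts as one optimizer step (the variable names are illustrative):

```python
import math

num_samples = 3906      # total training samples
per_device_batch = 2    # batch size per device
grad_accum = 8          # gradient accumulation steps
epochs = 3
num_devices = 1         # assumption: one RTX 4090

effective_batch = per_device_batch * grad_accum * num_devices
# Assumption: the last, partially filled batch of an epoch still triggers a step.
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 16 245 735
```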
Usage
With Unsloth (Fastest)
import json
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
import torch

# Load the base model with the LoRA adapter in bf16 (no 4-bit quantization).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Shubh-0789/endpoint-qwen3.5-4b-lora-v2",
    max_seq_length=2048,
    load_in_4bit=False,
    load_in_16bit=True,
    dtype=torch.bfloat16,
)
# A separate text tokenizer is used for chat templating and decoding.
text_tokenizer = AutoTokenizer.from_pretrained("Shubh-0789/endpoint-qwen3.5-4b-lora-v2")
FastLanguageModel.for_inference(model)
model.generation_config.pad_token_id = text_tokenizer.pad_token_id

clinical_text = "Primary endpoints are ORR and progression-free survival (PFS) assessed by RECIST v1.1 | [Time Frame: Up to 24 months]"
messages = [
    {"role": "user", "content": f"Extract and classify the clinical trial endpoint from the following text. Return ONLY a JSON.\nText: {clinical_text}"}
]

# Build the prompt with the chat template and move tensors to the model's device.
inputs = text_tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
result = text_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
endpoints = json.loads(result.strip())
print(json.dumps(endpoints, indent=2))
Output:
{
  "endpoints": [
    {
      "endpoint_name_standardized": "Objective Response Rate",
      "measurement_of": "tumor response",
      "measurement_type": "binary",
      "metric_type": "proportion",
      "timeframe": "Up to 24 months",
      "measurement_method": "RECIST v1.1",
      "evaluation_criteria": null,
      "unit": "%",
      "population": null,
      "is_composite": false,
      "components": []
    },
    {
      "endpoint_name_standardized": "Progression-Free Survival",
      "measurement_of": "disease progression or death",
      "measurement_type": "time-to-event",
      "metric_type": "hazard ratio",
      "timeframe": "Up to 24 months",
      "measurement_method": "RECIST v1.1",
      "evaluation_criteria": null,
      "unit": null,
      "population": null,
      "is_composite": false,
      "components": []
    }
  ]
}
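The `json.loads(result)` call in the example above raises an error if the model wraps its answer in extra text or a Markdown fence. A more forgiving parser might look like the sketch below; `parse_endpoints` is a hypothetical helper, not something shipped with the model, and it simply falls back to the outermost `{...}` span.

```python
import json


def parse_endpoints(raw: str) -> dict:
    """Parse model output, tolerating stray text around the JSON object."""
    text = raw.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: keep only the span from the first '{' to the last '}'.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like a JSON object; re-raise the error
        return json.loads(text[start:end + 1])
```

If the fallback span is still not valid JSON, the resulting `json.JSONDecodeError` propagates to the caller, which is usually the right signal to retry generation.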
With PEFT/Transformers
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B", torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(base_model, "Shubh-0789/endpoint-qwen3.5-4b-lora-v2")
tokenizer = AutoTokenizer.from_pretrained("Shubh-0789/endpoint-qwen3.5-4b-lora-v2")
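The PEFT snippet stops after loading the model and tokenizer. Inference would presumably reuse the prompt from the Unsloth example; the sketch below factors that prompt into a small helper. `PROMPT_TEMPLATE` and `build_messages` are illustrative names, and the prompt wording is copied from the Unsloth example on this card.

```python
# Prompt wording copied from the Unsloth usage example.
PROMPT_TEMPLATE = (
    "Extract and classify the clinical trial endpoint from the following text. "
    "Return ONLY a JSON.\nText: {text}"
)


def build_messages(clinical_text: str) -> list:
    """Wrap outcome text in the single-turn chat format used in the examples."""
    return [{"role": "user", "content": PROMPT_TEMPLATE.format(text=clinical_text)}]
```

The returned list can be passed to `tokenizer.apply_chat_template(...)` with `add_generation_prompt=True`, followed by `model.generate(...)`, in the same way as in the Unsloth example.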
Inference Tip: Disable Thinking
Qwen3.5 supports a thinking mode. For this task, disable thinking for direct JSON output (the model was trained without <think> blocks):
# When using vLLM:
# --reasoning-parser qwen3 --default-chat-template-kwargs '{"enable_thinking": false}'
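If a serving stack cannot disable thinking and a `<think>` block does appear in the output, one fallback is to remove it before parsing. This is a minimal sketch under that assumption; `strip_thinking` is a hypothetical helper and not part of the model or of vLLM.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(raw: str) -> str:
    """Remove any <think> blocks and surrounding whitespace from model output."""
    return THINK_BLOCK.sub("", raw).strip()
```

The cleaned string can then be handed to `json.loads` as usual. Since the model was trained without `<think>` blocks, disabling thinking remains the preferred setup.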
Model Comparison
| Model | Parameters | Training Data | VRAM | Link |
|---|---|---|---|---|
| 0.8B v1 | 856M | 1,948 (CTgov only) | 3 GB | 0.8B v1 |
| 4B v1 | 4.6B | 1,948 (CTgov only) | 10 GB | 4B v1 |
| 4B v2 | 4.6B | 3,906 (CTgov + EU + China) | 10 GB | This model |
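The VRAM column is roughly in line with a back-of-the-envelope estimate of bf16 weight size at 2 bytes per parameter. The sketch below covers weights only; activations, the KV cache, and the LoRA adapter add to it, so treat it as a lower bound and not a measurement.

```python
BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 16 bits


def weight_gb(num_params: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM_BF16 / 1e9


# Parameter counts taken from the Model Comparison table.
for name, params in [("0.8B", 856e6), ("4B", 4.6e9)]:
    print(f"{name}: ~{weight_gb(params):.1f} GB of weights")
```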
Limitations
- Trained primarily on English clinical trial text (ChiCTR data is also in English)
- Complex composite endpoints may need verification
- Minimum hardware for inference: any GPU with 10 GB+ VRAM
- Best inference settings: temperature=0.1, do_sample=True, thinking disabled
Citation
@misc{endpoint-qwen3.5-4b-lora-v2,
author = {Shubh-0789},
title = {Clinical Trial Endpoint Classifier: 4B v2 (Qwen3.5-4B LoRA)},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/Shubh-0789/endpoint-qwen3.5-4b-lora-v2}
}