Hybrid-Summariser Cross-Modal LoRA (Phase 2: Balanced)
LoRA adapter for Mistral-7B-v0.1 from the intermediate phase of a 3-phase curriculum framework. This checkpoint offers the best style protection (10.2% passive voice rate, near the 9.9% zero-shot baseline) while still achieving a strong video-summarization improvement. Recommended when minimal academic style contamination matters more than maximum video quality.
For maximum video quality, use Phase 3 instead.
Repo: github.com/Tushar-9802/Hybrid-Dataset-Summariser
Results (vs zero-shot Mistral-7B, n=75 videos)
| Metric | Baseline | Phase 2 (this) | Phase 3 |
|---|---|---|---|
| Video ROUGE-1 | 0.263 | 0.381 (+45%) | 0.417 (+58%) |
| Video ROUGE-2 | 0.032 | 0.101 (+216%) | 0.119 (+272%) |
| Video PVR (passive voice rate) | 9.9% | 10.2% (near baseline) | 14.1% |
| CMC | 0.356 | 0.473 | 0.531 |
| Paper ROUGE-1 | 0.320 | 0.296 | 0.309 |
Key trade-off: Phase 2 has the cleanest style (10.2% PVR) at the cost of ~9% less video R-1 than Phase 3.
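The relative gains in the table follow directly from the raw ROUGE scores; a quick sanity check (rounding to the nearest whole percent is an assumption about the table's convention):

```python
# Recompute the Phase 2 relative gains from the raw ROUGE scores above.
def rel_gain(baseline, score):
    """Percentage improvement of `score` over `baseline`, rounded to whole percent."""
    return round((score - baseline) / baseline * 100)

r1_gain = rel_gain(0.263, 0.381)  # Video ROUGE-1, Phase 2
r2_gain = rel_gain(0.032, 0.101)  # Video ROUGE-2, Phase 2
print(r1_gain, r2_gain)  # 45 216, matching the +45% / +216% in the table
```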
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization so the 7B base model fits in consumer VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, then attach the Phase 2 LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "Tushar9802/hybrid-summariser-phase2-lora")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Summarization prompt; substitute your input for {your_text}
prompt = "Summarize the following text.\n\nText: {your_text}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4, no_repeat_ngram_size=3)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Training
| Phase | Data Mix | Methods | Status |
|---|---|---|---|
| 1 | 100% papers | LoRA+ (ratio 8x) | completed |
| 2 (this) | 50P/40V/10Pr | +OPLoRA (k=16), +EWC (lambda=200) | this checkpoint |
| 3 | 30P/60V/10Pr | +CrossCLR, EWC lambda=400 | Phase 3 adapter |

Data-mix shorthand: P = papers, V = video transcripts, Pr = cross-modal pairs (see Dataset below).
Hyperparameters
- LoRA: r=32, alpha=64, dropout=0.1, target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Trainable: 83.9M params (1.16% of 7.24B total)
- Optimizer: 8-bit AdamW, effective batch size 24
- Phase 2 LR: eta_A=5e-5, eta_B=4e-4 (ratio 8x)
- OPLoRA: orthogonal projection k=16, rho_k monitored every 100 steps
- EWC: Fisher diagonal (448 param matrices, 83.9M entries), lambda=200
- Replay buffer: 10% of Phase 1 paper training indices
- Hardware: RTX 5070 Ti (16GB VRAM), peak 11.1 GB
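The LoRA+ learning-rate split and EWC penalty in the recipe above can be sketched minimally. `make_param_groups` and `ewc_penalty` are illustrative names, not functions from the repo, and the real training loop layers the OPLoRA projection and replay buffer on top of this:

```python
# Minimal sketch of two Phase 2 training components described above:
# LoRA+ gives the B matrices a learning rate 8x that of the A matrices,
# and EWC adds a quadratic penalty anchored at the Phase 1 weights.
import torch

ETA_A, ETA_B, EWC_LAMBDA = 5e-5, 4e-4, 200.0  # Phase 2 values from this card

def make_param_groups(model):
    """Split trainable LoRA A and B matrices into groups with the LoRA+ LR ratio."""
    a_params, b_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (b_params if "lora_B" in name else a_params).append(p)
    return [{"params": a_params, "lr": ETA_A},
            {"params": b_params, "lr": ETA_B}]

def ewc_penalty(params, anchor, fisher):
    """EWC term: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    loss = 0.0
    for p, p0, f in zip(params, anchor, fisher):
        loss = loss + (f * (p - p0) ** 2).sum()
    return 0.5 * EWC_LAMBDA * loss
```

The groups returned by `make_param_groups` can be passed straight to `bitsandbytes.optim.AdamW8bit`, and `ewc_penalty` is added to the task loss each step.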
Dataset
4,324 CS samples: 2,368 arXiv papers + 738 YouTube lectures + 1,218 SBERT-mined cross-modal pairs.
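The cross-modal pairs were mined with SBERT similarity; the card does not give the mining details, so this is only an illustrative sketch in which `mine_pairs` and the threshold value are assumptions:

```python
# Illustrative cross-modal pair mining: embed paper abstracts and lecture
# transcripts (per the card, with SBERT), then keep pairs whose cosine
# similarity clears a threshold. Threshold and names are assumptions.
import numpy as np

def mine_pairs(paper_embs, video_embs, threshold=0.6):
    """Return (paper_idx, video_idx) pairs with cosine similarity >= threshold."""
    p = paper_embs / np.linalg.norm(paper_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = p @ v.T  # cosine similarity matrix, shape (n_papers, n_videos)
    return [(i, j) for i, j in zip(*np.where(sims >= threshold))]
```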
When to use Phase 2 vs Phase 3
| Use Case | Recommended |
|---|---|
| Balanced paper + video quality | Phase 2 |
| Minimal style contamination | Phase 2 (PVR 10.2%) |
| Maximum video summarization quality | Phase 3 (R-1 0.417) |
| Maximum cross-modal consistency | Phase 3 (CMC 0.531) |
Limitations
- Single model (Mistral-7B) and domain (CS/engineering)
- Video references generated by GPT-4o-mini, not human annotators
- No human evaluation conducted
Citation
```bibtex
@inproceedings{jaju2025crossmodal,
  title={Cross-Modal Transfer Learning in Domain-Adaptive Video Summarization},
  author={Jaju, Tushar and Saharawat, Tanishka and Bhatia, Shruti and Rastogi, Shivansh},
  booktitle={Proc. IMPACT 2025},
  publisher={Springer},
  year={2025}
}
```
Authors
Tushar Jaju (training infrastructure, implementation, experiments), Tanishka Saharawat, Shruti Bhatia, Shivansh Rastogi
Guide: Dr. Neha Yadav, ABES Engineering College, Ghaziabad (AKTU)
Framework Versions
- PEFT: 0.18.1
- Transformers: 5.0.0rc3
- PyTorch: 2.11.0.dev20260214+cu128
- bitsandbytes: 0.49.2
- Python: 3.11