---
license: mit
language:
- en
tags:
- video-classification
- education
- classroom-observation
- copus
- vision-language-model
base_model:
- openbmb/MiniCPM-V-4_5
pipeline_tag: visual-question-answering
---

# COPUS Classifier

The system consists of a lightweight classifier head trained on top of the frozen MiniCPM-V-4.5 vision-language model. The base model stays unchanged during training; only the classification layers are optimized.

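The frozen-backbone setup described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' training script: `backbone` is a small stand-in module (the real base model is MiniCPM-V-4.5), and the optimizer, learning rate, loss function, and random data are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the frozen base model (assumption: any module emitting 4096-d features)
backbone = nn.Linear(32, 4096)
# Trainable head mapping features to 24 COPUS logits
head = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU(), nn.Linear(1024, 24))

# Freeze the backbone so no gradients are computed or applied to it
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()

# Only the head's parameters are handed to the optimizer (AdamW and lr are assumptions)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
# Multi-label objective: one independent sigmoid per code (assumption)
criterion = nn.BCEWithLogitsLoss()

inputs = torch.randn(8, 32)                     # dummy batch
targets = torch.randint(0, 2, (8, 24)).float()  # dummy multi-hot labels

backbone_before = [p.clone() for p in backbone.parameters()]

# One training step: features come from the frozen backbone, gradients reach only the head
with torch.no_grad():
    features = backbone(inputs)
loss = criterion(head(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the backbone weights are identical to `backbone_before`, while every head parameter has received a gradient.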

## COPUS Framework

The model detects 24 classroom activities across two categories:

**Student Actions (13 codes)**: L (Listening), Ind (Individual work), CG (Clicker groups), WG (Worksheet groups), OG (Other groups), AnQ (Answering questions), SQ (Asking questions), WC (Whole class discussion), Prd (Predictions), SP (Presentations), TQ (Test/Quiz), W (Waiting), O (Other)

**Instructor Actions (11 codes)**: Lec (Lecturing), RtW (Real-time writing), FUp (Follow-up), PQ (Posing questions), CQ (Clicker questions), AnQ (Answering questions), MG (Moving/Guiding), 1o1 (One-on-one), D/V (Demo/Video), Adm (Administration), W (Waiting)

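For post-processing it can help to have the 24 codes as one ordered list. The sketch below builds it from the two lists above. The index order (student codes first, then instructor codes) and the `S:`/`I:` prefixes are assumptions, since this card does not state how classifier outputs map to codes; check them against the training configuration before relying on them. The prefixes are needed because `AnQ` and `W` appear in both categories.

```python
STUDENT_CODES = ["L", "Ind", "CG", "WG", "OG", "AnQ", "SQ", "WC", "Prd", "SP", "TQ", "W", "O"]
INSTRUCTOR_CODES = ["Lec", "RtW", "FUp", "PQ", "CQ", "AnQ", "MG", "1o1", "D/V", "Adm", "W"]

# Assumed output order: student codes first, then instructor codes
COPUS_LABELS = [f"S:{c}" for c in STUDENT_CODES] + [f"I:{c}" for c in INSTRUCTOR_CODES]

# Codes that exist in both categories and would collide without a prefix
shared = sorted(set(STUDENT_CODES) & set(INSTRUCTOR_CODES))
```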

## Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from PIL import Image
from decord import VideoReader, cpu


class COPUSClassifier(nn.Module):
    """MLP head mapping pooled base-model embeddings to 24 COPUS logits."""

    def __init__(self, input_dim=4096, num_classes=24):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(x)


# Load base model
base_model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
)

# Load classifier (map_location keeps loading independent of the saving device)
classifier = COPUSClassifier().cuda()
checkpoint = torch.load("classifier.pt", map_location="cpu")
classifier.load_state_dict(checkpoint["classifier_state_dict"])
classifier.eval()


def load_video_frames(video_path, num_frames=30):
    """Sample up to num_frames evenly spaced RGB frames.

    This helper was not defined in the original snippet; uniform sampling
    is an assumption.
    """
    vr = VideoReader(video_path, ctx=cpu(0))
    step = max(len(vr) // num_frames, 1)
    indices = list(range(0, len(vr), step))[:num_frames]
    frames = vr.get_batch(indices).asnumpy()
    return [Image.fromarray(frame).convert("RGB") for frame in frames]


# Process video
def extract_features(frames, prompt):
    with torch.no_grad():
        msgs = [{"role": "user", "content": frames + [prompt]}]
        # Greedy decoding so the generated description is deterministic
        response = base_model.chat(
            msgs=msgs,
            tokenizer=tokenizer,
            max_new_tokens=500,
            sampling=False,
        )
        # Re-embed the generated text and mean-pool over tokens
        tokens = tokenizer(response, return_tensors="pt", max_length=512, truncation=True)
        embeddings = base_model.llm.get_input_embeddings()(tokens["input_ids"].cuda())
        return embeddings.mean(dim=1).float()


# Placeholder: this card does not give the prompt; use the one the classifier was trained with
classification_prompt = "<prompt used when training the classifier>"

# Classify
frames = load_video_frames("classroom.mp4", num_frames=30)
features = extract_features(frames, classification_prompt)
with torch.no_grad():
    logits = classifier(features)
# Multi-label output: one boolean per COPUS code, thresholded at 0.5
predictions = (torch.sigmoid(logits) > 0.5).cpu().numpy()
```

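The final line of the snippet thresholds per-class sigmoid probabilities at 0.5, so several codes can be active for the same clip. The dependency-free sketch below illustrates that decoding step on made-up logits. The helper name `decode_predictions`, the three-label list, and the logit values are hypothetical and exist only for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_predictions(logits, labels, threshold=0.5):
    """Return the labels whose sigmoid probability exceeds the threshold."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    return [label for logit, label in zip(logits, labels) if sigmoid(logit) > threshold]


# Hypothetical example: three codes with made-up logits
labels = ["L", "Ind", "AnQ"]
logits = [2.0, -1.5, 0.3]
active = decode_predictions(logits, labels)  # ["L", "AnQ"]
```

Because each code is thresholded independently, the result can be empty or contain any number of codes. Raising `threshold` trades recall for precision.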

## Citation

```bibtex
@software{copus_classifier_2025,
  title={COPUS Video Evaluation System: Automated Classroom Observation using Vision-Language Models},
  author={Franck, Andy and Ng, Brendan and Derrod, Zane and Fitzgerald, Ben},
  year={2025},
  url={https://huggingface.co/ajfranck/COPUS-analysis}
}
```