---
license: mit
language:
- en
tags:
- video-classification
- education
- classroom-observation
- copus
- vision-language-model
base_model:
- openbmb/MiniCPM-V-4_5
pipeline_tag: visual-question-answering
---
# COPUS Classifier

The system is a lightweight classifier head trained on top of the frozen MiniCPM-V-4.5 vision-language model. The base model's weights stay fixed during training; only the classification layers are optimized.
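The frozen-base setup described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' training script: the `base` module is a hypothetical stand-in for the vision-language model, the head is reduced to a single layer, and the loss (`BCEWithLogitsLoss`), optimizer, and learning rate are assumptions chosen to match the multi-label sigmoid thresholding used at inference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the frozen vision-language model: any module that
# produces 4096-dimensional features is enough for this illustration.
base = nn.Linear(32, 4096)
base.requires_grad_(False)  # freeze: base parameters receive no gradients
base.eval()

# Trainable head (a single layer here to keep the sketch short).
head = nn.Linear(4096, 24)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # ASSUMPTION: multi-label loss, not stated in the card

inputs = torch.randn(8, 32)                     # dummy batch
targets = torch.randint(0, 2, (8, 24)).float()  # dummy multi-hot labels

base_before = base.weight.clone()
head_before = head.weight.clone()

with torch.no_grad():  # features come from the frozen base
    features = base(inputs)

loss = loss_fn(head(features), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After one step the base is untouched and the head has moved.
base_unchanged = torch.equal(base.weight, base_before)
head_updated = not torch.equal(head.weight, head_before)
```

Because the base never receives gradients, its features could also be computed once and cached, which keeps training cost close to that of the small head alone.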

## COPUS Framework

The model detects the 24 classroom activity codes of COPUS (Classroom Observation Protocol for Undergraduate STEM), split into two categories. The codes AnQ and W appear in both categories with different meanings.

Student Actions (13 codes): L (Listening), Ind (Individual work), CG (Clicker groups), WG (Worksheet groups), OG (Other groups), AnQ (Answering questions), SQ (Asking questions), WC (Whole class discussion), Prd (Predictions), SP (Presentations), TQ (Test/Quiz), W (Waiting), O (Other)

Instructor Actions (11 codes): Lec (Lecturing), RtW (Real-time writing), FUp (Follow-up), PQ (Posing questions), CQ (Clicker questions), AnQ (Answering questions), MG (Moving/Guiding), 1o1 (One-on-one), D/V (Demo/Video), Adm (Administration), W (Waiting)
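For working with the classifier output, it can help to have the 24 codes as an explicit label list. The sketch below builds one from the two lists above. The `student/` and `instructor/` prefixes and the ordering (student codes first) are assumptions for illustration; this card does not document the actual order of the classifier's output units, so check it against the training code before relying on it.

```python
# Codes copied from the two category lists in this card.
STUDENT_CODES = ["L", "Ind", "CG", "WG", "OG", "AnQ", "SQ", "WC",
                 "Prd", "SP", "TQ", "W", "O"]
INSTRUCTOR_CODES = ["Lec", "RtW", "FUp", "PQ", "CQ", "AnQ", "MG",
                    "1o1", "D/V", "Adm", "W"]

# AnQ and W occur in both categories, so each label carries a category prefix
# to keep the 24 names unique.
# ASSUMPTION: student codes come first, then instructor codes. The real output
# order of the classifier is not documented in this card.
LABELS = ([f"student/{code}" for code in STUDENT_CODES]
          + [f"instructor/{code}" for code in INSTRUCTOR_CODES])

# Codes that would collide without the prefix.
shared = sorted(set(STUDENT_CODES) & set(INSTRUCTOR_CODES))
```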

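## Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from PIL import Image
from decord import VideoReader, cpu


class COPUSClassifier(nn.Module):
    """MLP head mapping a 4096-d feature vector to 24 COPUS logits."""

    def __init__(self, input_dim=4096, num_classes=24):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(x)


# Load the frozen base model
base_model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
)

# Load the trained classifier head
classifier = COPUSClassifier().cuda()
checkpoint = torch.load("classifier.pt", map_location="cuda")
classifier.load_state_dict(checkpoint["classifier_state_dict"])
classifier.eval()


def load_video_frames(path, num_frames=30):
    """Uniformly sample up to `num_frames` frames as PIL images.

    NOTE: this helper is not defined in the original card; this is an
    assumed implementation based on decord.
    """
    vr = VideoReader(path, ctx=cpu(0))
    step = max(len(vr) // num_frames, 1)
    indices = list(range(0, len(vr), step))[:num_frames]
    frames = vr.get_batch(indices).asnumpy()
    return [Image.fromarray(frame.astype("uint8")) for frame in frames]


def extract_features(frames, prompt):
    """Describe the video with the base model, then embed the description."""
    with torch.no_grad():
        msgs = [{"role": "user", "content": frames + [prompt]}]
        response = base_model.chat(
            msgs=msgs,
            tokenizer=tokenizer,
            max_new_tokens=500,
            sampling=False,
        )
        tokens = tokenizer(
            response, return_tensors="pt", max_length=512, truncation=True
        )
        embeddings = base_model.llm.get_input_embeddings()(
            tokens["input_ids"].cuda()
        )
        # Mean-pool over tokens; cast bfloat16 -> float32 for the classifier
        return embeddings.mean(dim=1).float()


# Placeholder: the prompt used during training is not given in this card.
# Replace it with the training prompt for meaningful predictions.
classification_prompt = "Describe the student and instructor activities in this classroom video."

# Classify: one independent yes/no decision per COPUS code (multi-label)
frames = load_video_frames("classroom.mp4", num_frames=30)
features = extract_features(frames, classification_prompt)
with torch.no_grad():
    logits = classifier(features)
predictions = (torch.sigmoid(logits) > 0.5).cpu().numpy()
```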
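The final step above produces one boolean per code by thresholding sigmoid probabilities at 0.5. The following sketch shows how such logits could be turned into readable labels. The function name `decode_predictions` and the three example labels and logits are hypothetical; in practice the label list must match the order of the classifier's 24 outputs, which this card does not specify.

```python
import math


def decode_predictions(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid probability exceeds the threshold."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    active = []
    for logit, label in zip(logits, labels):
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        if prob > threshold:
            active.append((label, round(prob, 3)))
    return active


# Hypothetical example with three instructor codes and made-up logits.
example_labels = ["Lec", "RtW", "PQ"]
example_logits = [2.0, -1.0, 0.3]
active = decode_predictions(example_logits, example_labels)
# active == [("Lec", 0.881), ("PQ", 0.574)]
```

Since each code is decided independently, several codes can be active for the same clip, and raising or lowering `threshold` trades precision against recall per code.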


## Citation

```bibtex
@software{copus_classifier_2025,
  title={COPUS Video Evaluation System: Automated Classroom Observation using Vision-Language Models},
  author={Franck, Andy and Ng, Brendan and Derrod, Zane and Fitzgerald, Ben},
  year={2025},
  url={https://huggingface.co/ajfranck/COPUS-analysis}
}
```