---
license: mit
language:
- en
tags:
- video-classification
- education
- classroom-observation
- copus
- vision-language-model
base_model:
- openbmb/MiniCPM-V-4_5
pipeline_tag: visual-question-answering
---

# COPUS Classifier

The system consists of a lightweight classifier head trained on top of the frozen MiniCPM-V-4.5 vision-language model. The base model remains unchanged during training; only the classification layers are optimized.

## COPUS Framework

The model detects 24 classroom activities across two categories:

**Student Actions (13 codes)**: L (Listening), Ind (Individual work), CG (Clicker groups), WG (Worksheet groups), OG (Other groups), AnQ (Answering questions), SQ (Asking questions), WC (Whole class discussion), Prd (Predictions), SP (Presentations), TQ (Test/Quiz), W (Waiting), O (Other)

**Instructor Actions (11 codes)**: Lec (Lecturing), RtW (Real-time writing), FUp (Follow-up), PQ (Posing questions), CQ (Clicker questions), AnQ (Answering questions), MG (Moving/Guiding), 1o1 (One-on-one), D/V (Demo/Video), Adm (Administration), W (Waiting)

## Usage

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from PIL import Image
from decord import VideoReader, cpu


class COPUSClassifier(nn.Module):
    def __init__(self, input_dim=4096, num_classes=24):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(x)


# Load the frozen base model
base_model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
)

# Load the trained classifier head
classifier = COPUSClassifier().cuda()
checkpoint = torch.load("classifier.pt", map_location="cuda")
classifier.load_state_dict(checkpoint["classifier_state_dict"])
classifier.eval()


# Process video
def load_video_frames(video_path, num_frames=30):
    # Example implementation: uniformly sample frames as PIL images
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int).tolist()
    batch = vr.get_batch(indices).asnumpy()
    return [Image.fromarray(frame) for frame in batch]


def extract_features(frames, prompt):
    with torch.no_grad():
        msgs = [{"role": "user", "content": frames + [prompt]}]
        response = base_model.chat(
            msgs=msgs,
            tokenizer=tokenizer,
            max_new_tokens=500,
            sampling=False,
        )
        tokens = tokenizer(
            response, return_tensors="pt", max_length=512, truncation=True
        )
        embeddings = base_model.llm.get_input_embeddings()(
            tokens["input_ids"].cuda()
        )
        return embeddings.mean(dim=1).float()


# Classify
# Example prompt; replace with the prompt used during training
classification_prompt = (
    "Describe the classroom activities of the students and the instructor "
    "visible in these frames."
)
frames = load_video_frames("classroom.mp4", num_frames=30)
features = extract_features(frames, classification_prompt)
logits = classifier(features)
predictions = (torch.sigmoid(logits) > 0.5).cpu().numpy()
```

## Citation

```bibtex
@software{copus_classifier_2025,
  title={COPUS Video Evaluation System: Automated Classroom Observation using Vision-Language Models},
  author={Franck, Andy and Ng, Brendan and Derrod, Zane and Fitzgerald, Ben},
  year={2025},
  url={https://huggingface.co/ajfranck/COPUS-analysis}
}
```
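## Interpreting Predictions

The classifier produces a 24-dimensional multi-label vector, one entry per COPUS code. A minimal sketch for mapping a thresholded prediction row back to code names is below. The label order is an assumption based on the listing in the COPUS Framework section (13 student codes followed by 11 instructor codes); verify it against the order used during training. Note that `AnQ` and `W` appear in both categories, which is why the two halves of the vector are decoded separately.

```python
# Assumed label order: student codes first, then instructor codes,
# as listed in the COPUS Framework section. Verify before relying on it.
STUDENT_CODES = ["L", "Ind", "CG", "WG", "OG", "AnQ", "SQ", "WC",
                 "Prd", "SP", "TQ", "W", "O"]
INSTRUCTOR_CODES = ["Lec", "RtW", "FUp", "PQ", "CQ", "AnQ", "MG",
                    "1o1", "D/V", "Adm", "W"]


def decode_predictions(pred_row):
    """Map one 24-element row of 0/1 predictions to
    (student_codes, instructor_codes) name lists."""
    student = [c for c, flag in zip(STUDENT_CODES, pred_row[:13]) if flag]
    instructor = [c for c, flag in zip(INSTRUCTOR_CODES, pred_row[13:]) if flag]
    return student, instructor
```

For a single clip, `decode_predictions(predictions[0])` turns the output of the usage example into human-readable COPUS codes, e.g. `(["L"], ["Lec"])` for a lecture segment.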