---
license: mit
language:
- en
tags:
- video-classification
- education
- classroom-observation
- copus
- vision-language-model
base_model:
- openbmb/MiniCPM-V-4_5
pipeline_tag: visual-question-answering
---

# COPUS Classifier

The system consists of a lightweight classifier head trained on top of the frozen MiniCPM-V-4.5 vision-language model. The base model remains unchanged during training; only the classification layers are optimized.

## COPUS Framework

The model detects 24 classroom activities across two categories:

**Student Actions (13 codes)**: L (Listening), Ind (Individual work), CG (Clicker groups), WG (Worksheet groups), OG (Other groups), AnQ (Answering questions), SQ (Asking questions), WC (Whole class discussion), Prd (Predictions), SP (Presentations), TQ (Test/Quiz), W (Waiting), O (Other)

**Instructor Actions (11 codes)**: Lec (Lecturing), RtW (Real-time writing), FUp (Follow-up), PQ (Posing questions), CQ (Clicker questions), AnQ (Answering questions), MG (Moving/Guiding), 1o1 (One-on-one), D/V (Demo/Video), Adm (Administration), W (Waiting)

## Usage

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from PIL import Image
from decord import VideoReader, cpu


class COPUSClassifier(nn.Module):
    def __init__(self, input_dim=4096, num_classes=24):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(x)


# Load the frozen base model
base_model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-4_5",
    trust_remote_code=True,
)

# Load the trained classifier head
classifier = COPUSClassifier().cuda()
checkpoint = torch.load("classifier.pt", map_location="cuda")
classifier.load_state_dict(checkpoint["classifier_state_dict"])
classifier.eval()


# Process video
def load_video_frames(video_path, num_frames=30):
    # Example implementation: uniformly sample frames as PIL images
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int).tolist()
    batch = vr.get_batch(indices).asnumpy()
    return [Image.fromarray(frame) for frame in batch]


def extract_features(frames, prompt):
    with torch.no_grad():
        msgs = [{"role": "user", "content": frames + [prompt]}]
        response = base_model.chat(
            msgs=msgs,
            tokenizer=tokenizer,
            max_new_tokens=500,
            sampling=False,
        )
        tokens = tokenizer(
            response, return_tensors="pt", max_length=512, truncation=True
        )
        embeddings = base_model.llm.get_input_embeddings()(
            tokens["input_ids"].cuda()
        )
        return embeddings.mean(dim=1).float()


# Classify
# Example prompt; replace with the prompt used during training
classification_prompt = (
    "Describe the classroom activities of the students and the instructor "
    "visible in these frames."
)
frames = load_video_frames("classroom.mp4", num_frames=30)
features = extract_features(frames, classification_prompt)
logits = classifier(features)
predictions = (torch.sigmoid(logits) > 0.5).cpu().numpy()
```

## Citation

```bibtex
@software{copus_classifier_2025,
  title={COPUS Video Evaluation System: Automated Classroom Observation using Vision-Language Models},
  author={Franck, Andy and Ng, Brendan and Derrod, Zane and Fitzgerald, Ben},
  year={2025},
  url={https://huggingface.co/ajfranck/COPUS-analysis}
}
```
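## Interpreting Predictions

The classifier produces a 24-dimensional multi-label vector, one entry per COPUS code. A minimal sketch for mapping a thresholded prediction row back to code names is below. The label order is an assumption based on the listing in the COPUS Framework section (13 student codes followed by 11 instructor codes); verify it against the order used during training. Note that `AnQ` and `W` appear in both categories, which is why the two halves of the vector are decoded separately.

```python
# Assumed label order: student codes first, then instructor codes,
# as listed in the COPUS Framework section. Verify before relying on it.
STUDENT_CODES = ["L", "Ind", "CG", "WG", "OG", "AnQ", "SQ", "WC",
                 "Prd", "SP", "TQ", "W", "O"]
INSTRUCTOR_CODES = ["Lec", "RtW", "FUp", "PQ", "CQ", "AnQ", "MG",
                    "1o1", "D/V", "Adm", "W"]


def decode_predictions(pred_row):
    """Map one 24-element row of 0/1 predictions to
    (student_codes, instructor_codes) name lists."""
    student = [c for c, flag in zip(STUDENT_CODES, pred_row[:13]) if flag]
    instructor = [c for c, flag in zip(INSTRUCTOR_CODES, pred_row[13:]) if flag]
    return student, instructor
```

For a single clip, `decode_predictions(predictions[0])` turns the output of the usage example into human-readable COPUS codes, e.g. `(["L"], ["Lec"])` for a lecture segment.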