Game classifier: MobileNetV3-Small (CS2 / Dota 2 / Valorant)
3-class image classifier used as the §1.1 hot-path detector in coach-api, a real-time AI coaching service for competitive PC games. Decides which game is on screen so the service routes the frame to the correct per-game event-extraction pipeline.
Classes
cs2, dota2, valorant, in that order along the logits axis.
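As a minimal sketch of what that ordering means in practice, the snippet below turns one frame's three logits into a class name and a softmax confidence. The helper names softmax and decode and the example logits are illustrative and not taken from the coach-api code.

```python
import math

# Class order along the logits axis, as stated above.
CLASSES = ("cs2", "dota2", "valorant")


def softmax(logits):
    """Numerically stable softmax over a flat sequence of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (class_name, confidence) for one frame's logits."""
    if len(logits) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Made-up logits: index 2 is the largest, so the label is "valorant".
label, confidence = decode([0.1, -1.2, 3.4])
```

Swapping the order of CLASSES would silently relabel every prediction, so the tuple should stay in sync with the checkpoint.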
Architecture
- torchvision.models.mobilenet_v3_small(weights=None)
- classifier[3] replaced with nn.Linear(in_features, 3)
- Input: 224×224 RGB, ImageNet mean/std normalization
- Output: softmax over 3 classes
Files
| File | Purpose |
|---|---|
| game_classifier_mobilenet_v3_small.pt | Checkpoint. Dict with state_dict, imagenet_mean, imagenet_std. |
| detector_comparison.json | Custom-CNN vs MobileNetV3 head-to-head: per-class P/R/F1, latency p50/p95/p99, training time, size. |
| confusion_matrices.png | Side-by-side confusion matrices for both candidate detectors on the training split. |
Loading
import torch
from torch import nn
from torchvision import models
CLASSES = ("cs2", "dota2", "valorant")
ckpt = torch.load("game_classifier_mobilenet_v3_small.pt", map_location="cpu", weights_only=True)
model = models.mobilenet_v3_small(weights=None)
model.classifier[3] = nn.Linear(model.classifier[3].in_features, len(CLASSES))
model.load_state_dict(ckpt["state_dict"])
model.eval()
Preprocessing: BGR → RGB → resize 224×224 → transforms.ToTensor() →
transforms.Normalize(ckpt["imagenet_mean"], ckpt["imagenet_std"]). See
api/src/coach_api/services/detectors/game_classifier.py
for the exact production preprocessing.
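To make the tensor layout concrete, here is a NumPy-only sketch of the arithmetic behind the channel flip, ToTensor and Normalize steps. It assumes the frame has already been resized to 224×224 (the interpolation mode is not stated here, so the resize is left out). The mean/std constants are the standard ImageNet values used as stand-ins; the checkpoint's own imagenet_mean and imagenet_std should be used in practice. The function name to_model_input is hypothetical.

```python
import numpy as np

# Stand-in values (standard ImageNet statistics). In practice, read
# ckpt["imagenet_mean"] and ckpt["imagenet_std"] from the checkpoint.
ASSUMED_MEAN = (0.485, 0.456, 0.406)
ASSUMED_STD = (0.229, 0.224, 0.225)


def to_model_input(frame_bgr, mean=ASSUMED_MEAN, std=ASSUMED_STD):
    """Convert a resized 224x224 BGR uint8 frame to a 1x3x224x224 float32 array."""
    if frame_bgr.shape != (224, 224, 3) or frame_bgr.dtype != np.uint8:
        raise ValueError("expected a 224x224x3 uint8 frame")
    rgb = frame_bgr[:, :, ::-1]                    # BGR -> RGB
    scaled = rgb.astype(np.float32) / 255.0        # ToTensor scales uint8 to [0, 1]
    mean_arr = np.asarray(mean, dtype=np.float32)
    std_arr = np.asarray(std, dtype=np.float32)
    normalized = (scaled - mean_arr) / std_arr     # per-channel Normalize
    chw = np.transpose(normalized, (2, 0, 1))      # HWC -> CHW
    return np.ascontiguousarray(chw[np.newaxis])   # add the batch dimension


# A solid blue frame in BGR order: B=255, G=0, R=0.
frame = np.zeros((224, 224, 3), dtype=np.uint8)
frame[:, :, 0] = 255
batch = to_model_input(frame)
```

The result can be wrapped with torch.from_numpy(batch) before the forward pass. This sketch is not a substitute for the production preprocessing file referenced above.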
Training
- Data: 6450 frames extracted at 0.4 fps from 10 YouTube gameplay videos (3 cs2 / 3 dota2 / 4 valorant). Same dataset/manifest.csv schema published alongside the dataset repo.
- Split: frame-level random_split (0.70 / 0.15 / 0.15). Same videos appear in train and val, which is useful for hyperparameter search but not a fair generalization estimate. See "Honest eval" below.
- Loss: cross-entropy
- Optimizer / schedule: see the training notebook notebook/Game_Classifier_…ipynb.
- Random seed: 42
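The leakage noted in the Split bullet comes from splitting at frame level. For contrast, here is a sketch of a video-level split in which every frame of a given video lands in exactly one partition. This is not what the training run used. The function name split_by_video and the assumption that each manifest row exposes a video_id key are illustrative.

```python
import random
from collections import defaultdict


def split_by_video(rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Assign whole videos to train/val/test so no video_id spans two splits.

    rows: iterable of dicts with a "video_id" key (assumed manifest column).
    Returns a dict mapping split name to a list of rows.
    """
    by_video = defaultdict(list)
    for row in rows:
        by_video[row["video_id"]].append(row)

    # Sort first so the shuffle is reproducible for a given seed.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    n = len(video_ids)
    train_end = round(fractions[0] * n)
    val_end = train_end + round(fractions[1] * n)
    groups = {
        "train": video_ids[:train_end],
        "val": video_ids[train_end:val_end],
        "test": video_ids[val_end:],  # whatever remains
    }
    return {
        name: [row for vid in ids for row in by_video[vid]]
        for name, ids in groups.items()
    }


# Ten synthetic videos with five frames each.
rows = [{"video_id": f"vid{v:02d}", "frame": i} for v in range(10) for i in range(5)]
splits = split_by_video(rows)
```

With only ten videos across three classes, a plain shuffle like this can leave a class out of val or test, so a real split would also need to stratify by class.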
Training-split metrics (detector_comparison.json)
| Model | Acc | Size | Mean latency (CPU) |
|---|---|---|---|
| Custom CNN | 98.76 % | 1.5 MB | 6.3 ms |
| MobileNetV3-Small (winner) | 99.28 % | 5.8 MB | 10.2 ms |
MobileNetV3-Small was selected for its better F1 on the harder valorant class and its headroom against domain shift, despite the larger size and higher latency.
Honest eval: held-out videos (this is the number to cite)
The training split is video-leaky. To get a fair estimate we extracted
frames from 3 fresh YouTube videos (one per class) with zero
video_id overlap with training:
| Set | Frames | Accuracy | Errors |
|---|---|---|---|
| 150 / class evenly-spaced subsample | 450 | 100 % | 0 |
| Full extraction | 1118 | 99.91 % | 1 |
The single error is a Valorant round-end "WON" overlay misread as CS2 with confidence 0.89. Valorant's orange/red post-round Combat Report panel visually mimics CS2's MVP card. Spectator / round-end frames are under-represented in training.
Eval tool: cli/gamed_classification_eval.py.
Eval set: companion dataset
ybashir/gamed-game-classification-dataset.
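The two checks behind the held-out numbers are simple enough to sketch: confirm that no video_id is shared between the training and eval sets, then derive accuracy from the frame and error counts in the table. The helper names and the placeholder IDs are hypothetical. The counts (450 / 0 and 1118 / 1) come from the table above.

```python
def shared_video_ids(train_ids, eval_ids):
    """Return the video_ids present in both sets; an empty set means no leakage."""
    return set(train_ids) & set(eval_ids)


def accuracy_pct(n_frames, n_errors):
    """Accuracy in percent from a frame count and an error count."""
    if n_frames <= 0 or not 0 <= n_errors <= n_frames:
        raise ValueError("need n_frames > 0 and 0 <= n_errors <= n_frames")
    return 100.0 * (n_frames - n_errors) / n_frames


# Placeholder IDs, purely for illustration.
leak = shared_video_ids({"train_vid_a", "train_vid_b"}, {"eval_vid_x"})

subsample_acc = accuracy_pct(450, 0)   # evenly-spaced subsample
full_acc = accuracy_pct(1118, 1)       # full extraction
```

Rounded to two decimals, full_acc reproduces the 99.91 % in the table.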
Latency
Reported by detector_comparison.json on the training split:
| Percentile | Custom CNN | MobileNetV3-Small |
|---|---|---|
| p50 | 5.96 ms | 9.73 ms |
| p95 | 9.11 ms | 13.67 ms |
| p99 | 10.41 ms | 16.88 ms |
The held-out eval CLI measures 7.83 ms mean end-to-end (cv2 imread + preprocessing + forward pass) on CPU.
In production (coach-api service, CPU container) the per-frame
t_classify_us stamp averages ~21 ms; the difference is service
overhead (frame ingest, payload decode, logging).
All of these figures are well inside the brief's 30 ms p95 budget.
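For readers who want to reproduce latency figures of this kind, the sketch below times a callable and summarizes mean, p50, p95 and p99 with the standard library. How detector_comparison.json computed its percentiles is not stated, so the interpolation method here is an assumption. The names measure_latency_ms and summarize are illustrative, as is the stand-in workload.

```python
import statistics
import time


def measure_latency_ms(fn, n_runs=200, warmup=10):
    """Call fn repeatedly and return per-call wall-clock latencies in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms):
    """Return mean and p50/p95/p99 of a list of latencies."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Stand-in workload; replace the lambda with a call that runs one
# preprocessing + forward pass to time the classifier itself.
stats = summarize(measure_latency_ms(lambda: sum(range(1000))))
```

Comparing the resulting p95 against the 30 ms budget is then a single comparison, e.g. stats["p95"] < 30.0.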
Intended use
Real-time game detection for a coaching service that needs to route frames to per-game event extractors. Designed for CPU inference from 1–4 fps (VLM path) up to 30 fps (cheap CV path).
Limitations
- Trained on YouTube "no-commentary" gameplay videos at 1080p. Frames with heavy streamer overlays, ultrawide aspect ratios, or HDR have not been evaluated.
- Three classes only. Adding a fourth (e.g. LoL, Apex) requires retraining the head.
- Can confidently misclassify round-end / spectator / replay frames; see "Honest eval" above. Mitigation in production: the pipeline gates downstream detectors on confidence and frame-class persistence, so a single misread doesn't propagate.
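
The confidence and persistence gating mentioned in the last limitation can be sketched as a small state machine: the routed game changes only after several consecutive confident frames agree on a new label. The class name GameGate, the 0.9 threshold and the 3-frame persistence window are hypothetical; the production values and logic are not documented here.

```python
class GameGate:
    """Switch the routed game only after `persistence` consecutive confident frames agree."""

    def __init__(self, min_confidence=0.9, persistence=3):
        self.min_confidence = min_confidence  # assumed threshold
        self.persistence = persistence        # assumed window length
        self.current = None                   # game currently routed to, if any
        self._candidate = None
        self._streak = 0

    def update(self, label, confidence):
        """Feed one frame's prediction; return the game to route to (or None)."""
        if confidence < self.min_confidence or label == self.current:
            # Low-confidence frames and frames that confirm the current game
            # both cancel any pending switch.
            self._candidate, self._streak = None, 0
            return self.current
        if label == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = label, 1
        if self._streak >= self.persistence:
            self.current = label
            self._candidate, self._streak = None, 0
        return self.current


gate = GameGate()
history = [
    gate.update("valorant", 0.99),
    gate.update("valorant", 0.99),
    gate.update("valorant", 0.99),  # third agreeing frame: switch to valorant
    gate.update("cs2", 0.95),       # single confident misread: ignored
    gate.update("valorant", 0.99),
]
```

In this trace the lone cs2 frame never reaches the persistence window, so routing stays on valorant. The cost of such a gate is a delay of a few frames when the game really does change.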