---
title: Human Action Recommender
emoji: 🏃
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: "6.15.0" 
python_version: "3.10"
app_file: app.py
pinned: false
---

## 🎬 Video Presentation

<video width="560" height="315" controls>
  <source src="https://huggingface.co/spaces/romi2001/human-action-recommender/resolve/main/proj3.mp4" type="video/mp4">
</video>

## Project Overview

This project builds a visual human action recommendation system based entirely on real photos of people performing everyday activities. Using 18,000 images from the Human Action Recognition dataset, I used the CLIP model to compress each photo into a 512-dimensional embedding vector that captures its visual meaning. I then applied K-Means clustering to group visually similar images and built a Gradio app where users can upload any photo of a person and get the 3 most visually similar images from the dataset, along with a predicted action label.

## Dataset Summary

- **Source:** HuggingFace — [Bingsu/Human_Action_Recognition](https://huggingface.co/datasets/Bingsu/Human_Action_Recognition)
- **Uploaded by:** Dowon Hwang
- **Size:** 18,000 images × 2 features (image, label)
- **Action classes (15):** calling, clapping, cycling, dancing, drinking, eating, fighting, hugging, laughing, listening_to_music, running, sitting, sleeping, texting, using_laptop

### Data Cleaning

**Class imbalance:** The `calling` class had 6,240 images while each of the other 14 classes had only 840. I downsampled `calling` to 840 to create a perfectly balanced dataset of 12,600 images. This keeps `calling` images from dominating the embedding database and the similarity search results.
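
The balancing step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `downsample_class`, the fixed seed, and the synthetic label list are assumptions. Only the class names and counts (6,240 and 840) come from the text above.

```python
import random
from collections import Counter

def downsample_class(labels, target_class, target_count, seed=42):
    """Return sorted indices to keep so `target_class` has at most `target_count` samples."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the subset reproducible
    target_idx = [i for i, lab in enumerate(labels) if lab == target_class]
    other_idx = [i for i, lab in enumerate(labels) if lab != target_class]
    if len(target_idx) > target_count:
        target_idx = rng.sample(target_idx, target_count)
    return sorted(target_idx + other_idx)

# Synthetic labels mirroring the reported counts: 6,240 "calling", 840 for every other class.
other_classes = [
    "clapping", "cycling", "dancing", "drinking", "eating", "fighting", "hugging",
    "laughing", "listening_to_music", "running", "sitting", "sleeping", "texting",
    "using_laptop",
]
labels = ["calling"] * 6240 + [c for c in other_classes for _ in range(840)]

keep = downsample_class(labels, "calling", 840)
counts = Counter(labels[i] for i in keep)
print(len(keep), counts["calling"])
```

Sampling indices instead of copying images keeps the step cheap and makes it easy to store the original dataset indices alongside the embeddings.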

**No outlier analysis needed:** This dataset contains only image and label columns — there are no numerical features that could have extreme values. Every image is a valid photo and every label is one of 15 action classes, so outlier detection is not applicable here.

**Aspect ratios:** Images range from 0.61 (tall and narrow) to 3.33 (very wide). The CLIP preprocessor handles this automatically: with its default settings it resizes the shorter side to 224 pixels and then center-crops to 224×224, so no manual resizing was needed.
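
To make the preprocessing concrete, the sketch below computes what a shortest-side resize followed by a center crop does to an image's geometry. It assumes the default behavior of the Hugging Face image processor for this checkpoint (shorter side resized to 224, then a 224×224 center crop). `resize_then_center_crop` is a hypothetical helper that does not call the library, and the library's exact rounding may differ by a pixel.

```python
def resize_then_center_crop(width, height, target=224):
    """Return (resized_size, crop_box) for a shortest-side resize plus center crop."""
    scale = target / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    crop_box = (left, top, left + target, top + target)  # (left, top, right, bottom)
    return (new_w, new_h), crop_box

# A hypothetical 1000x300 image, close to the widest aspect ratio reported (3.33).
size, box = resize_then_center_crop(1000, 300)
print(size, box)
```

Under this assumption, a very wide image keeps only its central square, so actions that happen near the left or right edge may fall outside the model's input.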

---

## EDA Plots

### Class Distribution — Before & After Balancing

![Screenshot 2026-05-23 at 22.58.36](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/1sm5na1L39aq5uQ01KS-3.png)

There is a severe class imbalance: the 'calling' class contains 6,240 image samples, while the remaining 14 classes contain 840 samples each.

Because 840 images per class is still a reasonable amount of data, I downsampled the `calling` class to 840 samples to match the others. The result is a perfectly balanced dataset, which prevents a prediction bias toward `calling`.

![Screenshot 2026-05-23 at 22.59.57](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/ig4u6qfUiHL2pjvmSDHWl.png)

### Image Aspect Ratio Distribution

![Screenshot 2026-05-23 at 23.01.17](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/7IkFsvDIbIeCIFkewObsQ.png)

Most images are wide rectangles with an aspect ratio around 1.5, so fitting them into a 224×224 square loses some content at the left and right edges (the default CLIP preprocessor center-crops rather than stretches). Because CLIP relies on overall shapes and context (such as a bicycle, a ball, or a human pose) rather than fine edge detail, the action usually remains recognizable.

### Sample Images per Action Class

![Screenshot 2026-05-23 at 23.04.04](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/eMl74l-S2fLoNWUU_GOzG.png)

As the samples show, the system works directly on raw photos of people performing activities.

## Embeddings

### Model Choice

I used **CLIP (openai/clip-vit-base-patch32)** from HuggingFace. CLIP was trained on 400 million image-text pairs, so it captures the visual meaning of photos: not just colors and edges, but high-level concepts like human actions, objects, and scenes. I use CLIP's image encoder via `get_image_features()`, which produces a 512-dimensional vector per image. All embeddings are L2-normalized, so the dot product of two embeddings equals their cosine similarity.
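
The role of L2 normalization can be shown with a small sketch: once vectors have unit length, their dot product equals their cosine similarity. The helper names and the 3-dimensional toy vectors below are illustrative assumptions standing in for 512-dimensional CLIP embeddings.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# Both numbers agree: normalizing once up front turns every later
# similarity lookup into a single dot product.
print(round(dot(a_hat, b_hat), 4), round(cosine_similarity(a, b), 4))
```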

### Why CLIP over ViT?

CLIP is trained to capture semantic meaning across a huge variety of real-world photos, which should make it more effective at distinguishing between different human activities than a plain ViT trained only on image classification.

### Embedding Matrix

- **Shape:** (4,995, 512) — 4,995 balanced images × 512 dimensions
- **Saved as:** `action_embeddings.json` with action labels and original dataset indices

### Elbow Curve — Choosing K=15

![Screenshot 2026-05-23 at 23.23.14](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/ImlJKBKdZgYAYiekiKasO.png)

The elbow curve shows no sharp bend at K=15; inertia keeps decreasing gradually beyond it. This soft elbow is consistent with the PCA analysis: several action classes (texting, using_laptop, listening_to_music, calling) share nearly identical visual features, such as a seated indoor posture, bent arms, and a device near the face. K=15 was chosen because it matches the number of labeled action classes in the dataset, which is a principled choice when the number of categories is known in advance.

### PCA 2D — CLIP Feature Embeddings

![Screenshot 2026-05-23 at 23.24.23](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/kuiFGV3c4UNS8n11seihm.png)

The PCA plot projects all 4,995 embeddings from 512 dimensions down to 2. Even in two dimensions, clear structure is visible: active outdoor actions like cycling and running separate from sedentary ones like sleeping and sitting. This suggests that CLIP encodes genuine semantic differences between actions, not just low-level visual patterns.

### K-Means Cluster Assignments

![Screenshot 2026-05-23 at 23.25.17](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/kvqouT2VUWqYZrCHKEQ1y.png)

K-Means was applied on the full 512D embeddings with K=15, matching the number of action classes. Visually unique actions form tight, pure clusters — cycling is nearly 100% pure due to its distinctive bicycle and outdoor setting, and sleeping separates cleanly as the only class with a horizontal body posture.
The overlap zone in the center reflects genuine visual ambiguity: calling, texting, using_laptop, and listening_to_music all share the same visual template — a person seated indoors with a device near their face. Even humans would struggle to distinguish these from photos alone. The mean cluster purity of 66.2% tells exactly the same story — visually unique classes score near 100%, while device-related classes pull the average down.

### t-SNE Dual Plot — Clusters vs True Labels

![Screenshot 2026-05-23 at 23.29.55](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/ADiWuxxz48xZS886PSOsK.png)

The dual t-SNE plot is the strongest evidence that the CLIP embeddings capture real visual structure. The left panel shows K-Means cluster assignments and the right panel shows the ground-truth action labels for the exact same points. The two panels are remarkably consistent: the cycling island appears in the same position in both, the sleeping cluster is isolated in both, and the device-related actions overlap in both. This indicates that K-Means recovered real structure from the embeddings without ever seeing the labels.

### Cluster Purity

![Screenshot 2026-05-23 at 23.31.20](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/qP8RQL7XCHSfpH_SUic7O.png)

Mean cluster purity: **66.2%**. Cycling clusters at nearly 100% purity, while device-related actions (texting, calling, using_laptop, listening_to_music) mix heavily due to genuine visual similarity.
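
A purity score of this kind could be computed as in the sketch below. It assumes purity means the share of the most common true label inside each cluster, averaged across clusters; the exact definition behind the 66.2% figure is not stated here. The helper name `cluster_purities` and the toy assignments are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purities(cluster_ids, true_labels):
    """Map each cluster id to (dominant label, fraction of members with that label)."""
    members = defaultdict(list)
    for cid, lab in zip(cluster_ids, true_labels):
        members[cid].append(lab)
    purities = {}
    for cid, labs in members.items():
        dominant, count = Counter(labs).most_common(1)[0]
        purities[cid] = (dominant, count / len(labs))
    return purities

# Toy example: cluster 0 is pure, cluster 1 mixes device-related actions.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
true_labels = ["cycling"] * 4 + ["texting", "texting", "calling", "using_laptop"]

purities = cluster_purities(cluster_ids, true_labels)
mean_purity = sum(p for _, p in purities.values()) / len(purities)
print(purities, mean_purity)
```

In this toy case the pure cluster scores 1.0 and the mixed one 0.5, which is the same pattern described above: distinctive classes lift the mean while look-alike classes pull it down.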


## Cluster Insights

| Cluster | Dominant Action | Why it separates |
|---------|-----------------|------------------|
| Cycling | cycling ~91% | Bicycle frames, wheels, outdoor trails — unique to this class |
| Running | running ~91% | Dynamic outdoor motion, athletic clothing |
| Eating | eating ~89% | Close-up faces, hands, food, plates |
| Sleeping | sleeping ~83% | Horizontal body posture, bedroom scenes |
| Mixed | texting/laptop | Same seated indoor pose, device near face |

---

## Recommendation System

### How It Works

1. User uploads any photo of a person
2. CLIP compresses it into a 512-dim normalized vector
3. Cosine similarity is computed against all 4,995 stored embeddings
4. Top-3 most similar images are returned with similarity scores
5. Zero-shot action prediction via nearest class centroid
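
Steps 3 to 5 can be sketched with NumPy as follows. This is an illustration under assumptions, not the app's actual code: the function names are hypothetical, the 2-dimensional toy vectors stand in for 512-dimensional CLIP embeddings, and all vectors are assumed to be L2-normalized already.

```python
import numpy as np

def top_k_similar(query, db, k=3):
    """Indices and cosine scores of the k most similar rows (unit-length vectors assumed)."""
    scores = db @ query                # dot product equals cosine similarity here
    order = np.argsort(-scores)[:k]    # highest score first
    return order, scores[order]

def nearest_centroid_label(query, db, labels):
    """Predict the class whose mean embedding is most similar to the query."""
    classes = sorted(set(labels))
    labels_arr = np.array(labels)
    centroids = np.stack([db[labels_arr == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # re-normalize the means
    return classes[int(np.argmax(centroids @ query))]

# Toy database of unit-length vectors with their action labels.
db = np.array([[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]])
labels = ["cycling", "cycling", "sleeping", "sleeping"]
query = np.array([1.0, 0.0])  # identical to row 0, so its top score is 1.0

idx, scores = top_k_similar(query, db, k=3)
predicted = nearest_centroid_label(query, db, labels)
print(idx.tolist(), np.round(scores, 4).tolist(), predicted)
```

A single matrix-vector product over a few thousand 512-dimensional rows is cheap, so an exhaustive search like this is a reasonable fit for a database of this size without an approximate nearest-neighbor index.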

### Pipeline Test Results

![Screenshot 2026-05-23 at 23.32.43](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/1wk07lQJfiKLn2xodFijT.png)

Query: cycling photo → Top-3 all labeled CYCLING, scores: 1.0000, 0.9065, 0.8960

### Visual Output Demo

![Screenshot 2026-05-23 at 23.33.47](https://cdn-uploads.huggingface.co/production/uploads/69d0dcfd321ba9ab6851619d/2muAM5REe5r87YXhK0klK.png)

The visual output shows the recommendation pipeline working as intended. All 3 results returned for the query image share its true label and have high similarity scores, so the system retrieved the correct action class for this query. This indicates that the CLIP embeddings capture the meaning of the action rather than just surface-level pixel similarity.

## Evaluation

To evaluate the recommendation system, I tested the pipeline on unseen images from the test split and measured how often the Top-3 returned images matched the true action class of the query.

| Metric | Value |
|--------|-------|
| **Mean Cluster Purity** | 66.2% |
| **Embedding Dim** | 512 |
| **DB Size** | 4,995 images |

High-purity actions (cycling, running, drinking) achieve near-perfect
retrieval. Low-purity actions (texting, using_laptop) score lower due
to genuine visual similarity between classes.
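
The Top-3 match rate described in this section could be computed along these lines. This is a sketch under assumptions: `top3_match_rate` is a hypothetical helper, and the toy inputs are made up for illustration rather than taken from the project's results.

```python
def top3_match_rate(true_labels, retrieved_labels):
    """Average fraction of retrieved labels that equal the query's true label."""
    per_query = [
        sum(lab == truth for lab in retrieved) / len(retrieved)
        for truth, retrieved in zip(true_labels, retrieved_labels)
    ]
    return sum(per_query) / len(per_query)

# Two illustrative queries: one distinctive action, one device-related action.
true_labels = ["cycling", "texting"]
retrieved_labels = [
    ["cycling", "cycling", "cycling"],        # all three results match
    ["texting", "calling", "using_laptop"],   # only one of three matches
]

rate = top3_match_rate(true_labels, retrieved_labels)
print(round(rate, 3))
```

Reporting this rate per class next to the purity figures would show directly which actions retrieve well and which get confused with each other.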

## App

- **Input:** Upload any photo of a person or use webcam
- **Output:** Predicted action label + confidence + Top-3 similar images with similarity scores
- **Model:** openai/clip-vit-base-patch32
- **Method:** Cosine similarity + zero-shot centroid prediction

### Ethical & Business Considerations

- **Avoiding False Positives:** Since actions like texting, calling, and using a laptop look very similar visually, this model should not be used for tracking employee or student productivity. Doing so could easily lead to unfair or incorrect conclusions.
- **Privacy First:** Any application using real-world photos must protect user privacy. In a real business setting, user-uploaded images should be processed securely and never stored without explicit permission.