title: Scene Graph Generator
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
Scene Graph Generator (Multimodal AI System)
A multimodal computer vision system that takes an input image, detects objects, predicts relationships between them, constructs a structured scene graph, and generates a natural language description of the scene.
Live Demo (Hugging Face Spaces):
https://huggingface.co/spaces//scene-graph-generator
Features
- Object Detection using DETR (ResNet-50)
- Relationship Prediction (Custom Trained Model)
- Spatial Reasoning (Hybrid AI with Geometry Rules)
- Scene Graph Construction (Directed Graph)
- Graph Visualization (NetworkX + Matplotlib)
- Graph-to-Text Generation (FLAN-T5)
- Interactive UI (Gradio)
- Deployed on Hugging Face Spaces (CPU)
How It Works (End-to-End Pipeline)
1. Input
- User uploads an image (JPG/PNG) via the Gradio UI
- Image is converted from PIL → OpenCV format
2. Object Detection
- Uses `facebook/detr-resnet-50` from Hugging Face
- Outputs:
  - Object labels (COCO classes)
  - Bounding boxes
  - Confidence scores
- Applies a confidence threshold (≥ 0.7) to filter noise
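The thresholding step can be sketched in plain Python. The detection record layout (`label`, `score`, `box`) and the helper name `filter_detections` are assumptions made for illustration, not the project's actual code; only the 0.7 cutoff comes from the description above.

```python
from typing import Dict, List

# Hypothetical detection record (assumed layout):
# {"label": str, "score": float, "box": [x_min, y_min, x_max, y_max]}
Detection = Dict[str, object]


def filter_detections(detections: List[Detection], threshold: float = 0.7) -> List[Detection]:
    """Keep only detections whose confidence score is at least `threshold`."""
    return [d for d in detections if float(d["score"]) >= threshold]


raw = [
    {"label": "laptop", "score": 0.98, "box": [120, 200, 420, 380]},
    {"label": "mouse", "score": 0.91, "box": [450, 330, 500, 370]},
    {"label": "cup", "score": 0.42, "box": [10, 10, 40, 60]},  # below the cutoff
]
kept = filter_detections(raw)
print([d["label"] for d in kept])
```

A higher cutoff gives a sparser but cleaner graph; a lower one keeps more objects at the cost of spurious nodes.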
3. Pairwise Object Processing
- Generates object pairs using `itertools.combinations`
- Extracts bounding boxes for each pair
- Creates union region for relation inference
- Filters duplicate object pairs
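A minimal sketch of the pairing step, assuming boxes are `(x_min, y_min, x_max, y_max)` tuples. The names `union_box` and `make_pairs`, and the exact duplicate-filtering rule, are illustrative assumptions rather than the project's implementation.

```python
from itertools import combinations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def union_box(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def make_pairs(objects: List[Tuple[str, Box]]) -> List[Tuple[str, str, Box]]:
    """Return (label_a, label_b, union_region) for each distinct unordered pair."""
    seen = set()
    pairs = []
    for (label_a, box_a), (label_b, box_b) in combinations(objects, 2):
        key = frozenset([(label_a, box_a), (label_b, box_b)])
        # Skip a detection paired with an identical copy of itself,
        # and skip pairs that were already emitted.
        if len(key) == 1 or key in seen:
            continue
        seen.add(key)
        pairs.append((label_a, label_b, union_box(box_a, box_b)))
    return pairs


objects = [
    ("laptop", (120, 200, 420, 380)),
    ("table", (0, 300, 640, 480)),
    ("laptop", (120, 200, 420, 380)),  # duplicate detection
]
print(make_pairs(objects))
```

Because `combinations` yields unordered pairs, the direction of each relation still has to be decided by the later prediction step.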
4. Relationship Prediction
- Custom classifier trained on a Visual Genome subset (~10K samples)
- Predicts semantic relations: `on`, `holding`, `behind`, etc.
- Trained using PyTorch (10 epochs)
5. Spatial Reasoning (Hybrid AI)
- Uses bounding box geometry to compute: `left_of`, `right_of`, `above`, `below`, `near`
- Hybrid logic:
  - Uses the semantic relation from the model (if confident)
  - Otherwise falls back to spatial rules
- Reduces bias (e.g., “everything = on”)
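The hybrid rule can be sketched as follows. The center-based geometry, the `near_px` distance, and the `min_conf` cutoff are assumed values chosen for illustration; the project's actual rules and thresholds are not stated here.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max); y grows downward


def center(box: Box) -> Tuple[float, float]:
    """Center point of a box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def spatial_relation(subj: Box, obj: Box, near_px: float = 50.0) -> str:
    """Geometric relation of subj relative to obj, based on box centers."""
    (sx, sy), (ox, oy) = center(subj), center(obj)
    dx, dy = sx - ox, sy - oy
    if (dx * dx + dy * dy) ** 0.5 <= near_px:
        return "near"
    if abs(dx) >= abs(dy):  # horizontal offset dominates
        return "left_of" if dx < 0 else "right_of"
    return "above" if dy < 0 else "below"


def hybrid_relation(model_label: Optional[str], model_conf: float,
                    subj: Box, obj: Box, min_conf: float = 0.6) -> str:
    """Prefer the model's semantic label when confident, else use geometry."""
    if model_label is not None and model_conf >= min_conf:
        return model_label
    return spatial_relation(subj, obj)


mouse = (450, 330, 500, 370)
laptop = (120, 200, 420, 380)
print(hybrid_relation("on", 0.9, mouse, laptop))  # confident model label wins
print(hybrid_relation("on", 0.3, mouse, laptop))  # falls back to geometry
```

Gating on confidence is what keeps a classifier that over-predicts one frequent label from dominating the graph.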
6. Graph Construction
- Builds a directed graph (NetworkX DiGraph)
- Nodes → objects
- Edges → relationships
- Removes duplicates and limits edges for clarity
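The deduplication and edge cap can be illustrated without any graph library. This sketch uses a plain adjacency dictionary in place of a NetworkX `DiGraph`; the function name `build_adjacency` and the `max_edges` default are assumptions.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_adjacency(triples: List[Triple], max_edges: int = 10) -> Dict[str, Dict[str, str]]:
    """Directed adjacency map where graph[subject][object] = relation."""
    graph: Dict[str, Dict[str, str]] = {}
    edge_count = 0
    for subj, rel, obj in triples:
        # Every object becomes a node, even if it ends up with no edges.
        graph.setdefault(subj, {})
        graph.setdefault(obj, {})
        if obj in graph[subj]:
            continue  # duplicate edge: keep the first relation seen
        if edge_count >= max_edges:
            continue  # edge budget used up
        graph[subj][obj] = rel
        edge_count += 1
    return graph


triples = [
    ("laptop", "on", "table"),
    ("laptop", "on", "table"),        # duplicate
    ("mouse", "next_to", "laptop"),
    ("cup", "near", "table"),         # dropped by the cap below
]
print(build_adjacency(triples, max_edges=2))
```

With NetworkX, the same edges could be loaded into a `DiGraph` with `add_edge(subj, obj, label=rel)`, where the attribute name `label` is again an assumption.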
7. Graph Visualization
- Uses NetworkX + Matplotlib
- Displays:
  - Directed edges with labels
  - Clean layout for readability
8. Graph → Text (NLP)
- Uses `google/flan-t5-small`
- Converts structured triples into natural language
Example: `laptop → on → table`, `mouse → next_to → laptop`
Output: "A laptop is placed on a table with a mouse next to it."
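One way to turn triples into model input is to flatten them into a short prompt. The prompt wording and the helper name `triples_to_prompt` below are assumptions for illustration; the prompt actually sent to FLAN-T5 is not given here.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def triples_to_prompt(triples: List[Triple]) -> str:
    """Flatten triples into one instruction string for a text-to-text model."""
    facts = "; ".join(f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples)
    return f"Describe the scene in one sentence: {facts}."


prompt = triples_to_prompt([("laptop", "on", "table"), ("mouse", "next_to", "laptop")])
print(prompt)
# Describe the scene in one sentence: laptop on table; mouse next to laptop.
```

The resulting string would then be passed to the text generation model, which produces the final description.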
9. UI (Gradio)
- Upload image
- View:
  - Scene graph
  - Generated description
- Fully interactive and browser-based
Tech Stack
- **Computer Vision:** DETR (Hugging Face Transformers)
- **Deep Learning:** PyTorch
- **Graph Processing:** NetworkX
- **NLP:** FLAN-T5
- **Image Processing:** OpenCV
- **Frontend/UI:** Gradio
- **Deployment:** Hugging Face Spaces
Project Structure
    scene-graph-generator/
    │
    ├── app.py
    ├── requirements.txt
    ├── README.md
    │
    ├── src/
    │   ├── pipeline.py
    │   ├── detection.py
    │   ├── spatial_rules.py
    │   ├── relationship_infer.py
    │   ├── scene_graph.py
    │   ├── visualization.py
    │   ├── text_generation.py
Installation (Local Setup)
git clone https://github.com/<your-username>/scene-graph-generator.git
cd scene-graph-generator
pip install -r requirements.txt
python app.py