---
title: Scene Graph Generator
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# 🧠 Scene Graph Generator (Multimodal AI System)

A multimodal computer vision system that takes an input image, detects objects, predicts the relationships between them, constructs a structured scene graph, and generates a natural language description of the scene.

🔗 **Live Demo (Hugging Face Spaces):** https://huggingface.co/spaces//scene-graph-generator

---

# 🚀 Features

- 🖼️ Object Detection using DETR (ResNet-50)
- 🔗 Relationship Prediction (Custom Trained Model)
- 📐 Spatial Reasoning (Hybrid AI with Geometry Rules)
- 🧩 Scene Graph Construction (Directed Graph)
- 📊 Graph Visualization (NetworkX + Matplotlib)
- 🧠 Graph-to-Text Generation (FLAN-T5)
- 🌐 Interactive UI (Gradio)
- ☁️ Deployed on Hugging Face Spaces (CPU)

---

# 🧠 How It Works (End-to-End Pipeline)

### 1. Input

- The user uploads an image (JPG/PNG) via the Gradio UI
- The image is converted from PIL → OpenCV format

---

### 2. Object Detection

- Uses `facebook/detr-resnet-50` from Hugging Face
- Outputs:
  - Object labels (COCO classes)
  - Bounding boxes
  - Confidence scores
- Applies a confidence threshold (≥ 0.7) to filter out noisy detections

---

### 3. Pairwise Object Processing

- Generates object pairs using `itertools.combinations`
- Extracts the bounding boxes for each pair
- Creates a union region for relation inference
- Filters out duplicate object pairs

---

### 4. Relationship Prediction

- Custom classifier trained on a Visual Genome subset (~10K samples)
- Predicts semantic relations:
  - `on`, `holding`, `behind`, etc.
- Trained using PyTorch (10 epochs)

---

### 5. Spatial Reasoning (Hybrid AI)

- Uses bounding box geometry to compute:
  - `left_of`, `right_of`, `above`, `below`, `near`
- Hybrid logic:
  - Uses the semantic relation from the model when it is confident
  - Otherwise falls back to the spatial rules
- Reduces prediction bias (e.g., “everything = on”)

---

### 6. Graph Construction

- Builds a **directed graph (NetworkX DiGraph)**
- Nodes → objects
- Edges → relationships
- Removes duplicates and limits the number of edges for clarity

---

### 7. Graph Visualization

- Uses NetworkX + Matplotlib
- Displays:
  - Directed edges with labels
  - Clean layout for readability

---

### 8. Graph → Text (NLP)

- Uses `google/flan-t5-small`
- Converts structured triples into natural language

Example:

- laptop → on → table
- mouse → next_to → laptop

Output:

"A laptop is placed on a table with a mouse next to it."

---

### 9. UI (Gradio)

- Upload an image
- View:
  - Scene graph
  - Generated description
- Fully interactive and browser-based

---

# 🏗️ Tech Stack

- **Computer Vision:** DETR (Hugging Face Transformers)
- **Deep Learning:** PyTorch
- **Graph Processing:** NetworkX
- **NLP:** FLAN-T5
- **Image Processing:** OpenCV
- **Frontend/UI:** Gradio
- **Deployment:** Hugging Face Spaces

---

# 📁 Project Structure

```
scene-graph-generator/
│
├── app.py
├── requirements.txt
├── README.md
│
├── src/
│   ├── pipeline.py
│   ├── detection.py
│   ├── spatial_rules.py
│   ├── relationship_infer.py
│   ├── scene_graph.py
│   ├── visualization.py
│   ├── text_generation.py
```

---

# ⚙️ Installation (Local Setup)

```bash
git clone https://github.com//scene-graph-generator.git
cd scene-graph-generator
pip install -r requirements.txt
python app.py
```
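
# 🧪 Illustrative Sketch: Pairing and Hybrid Relations

To make pipeline steps 3 and 5 more concrete, here is a minimal sketch of how object pairs might be generated with `itertools.combinations` and how a confident model prediction could take priority over a geometric fallback. This is not the code in `src/spatial_rules.py`. The function names, the `SEMANTIC_CONFIDENCE` cutoff, the `NEAR_FACTOR` heuristic, and the `(x_min, y_min, x_max, y_max)` box format are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical cutoff: the README does not state when the model counts as "confident".
SEMANTIC_CONFIDENCE = 0.6
# Hypothetical heuristic: centers closer than this fraction of the mean
# box diagonal are labeled "near".
NEAR_FACTOR = 0.5


def center(box):
    """Center of an (x_min, y_min, x_max, y_max) box in image coordinates."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def diagonal(box):
    """Length of the box diagonal, used to scale the "near" test."""
    x_min, y_min, x_max, y_max = box
    return ((x_max - x_min) ** 2 + (y_max - y_min) ** 2) ** 0.5


def spatial_relation(subject_box, object_box):
    """Return left_of, right_of, above, below, or near from box geometry."""
    sx, sy = center(subject_box)
    ox, oy = center(object_box)
    dx, dy = ox - sx, oy - sy
    distance = (dx ** 2 + dy ** 2) ** 0.5
    mean_diag = (diagonal(subject_box) + diagonal(object_box)) / 2.0
    if distance < NEAR_FACTOR * mean_diag:
        return "near"
    if abs(dx) >= abs(dy):
        # Object center lies to the right, so the subject is left of it.
        return "left_of" if dx > 0 else "right_of"
    # Image y grows downward: a larger object y means the subject is higher up.
    return "above" if dy > 0 else "below"


def hybrid_relation(subject_box, object_box, semantic=None):
    """Prefer a confident (label, score) prediction, else use geometry."""
    if semantic is not None:
        label, score = semantic
        if score >= SEMANTIC_CONFIDENCE:
            return label
    return spatial_relation(subject_box, object_box)


def build_triples(detections, semantic_predictions=None):
    """Turn [(label, box), ...] into unique (subject, relation, object) triples.

    semantic_predictions maps (i, j) index pairs to (label, score) tuples.
    """
    semantic_predictions = semantic_predictions or {}
    triples, seen = [], set()
    pairs = combinations(enumerate(detections), 2)
    for (i, (s_label, s_box)), (j, (o_label, o_box)) in pairs:
        relation = hybrid_relation(s_box, o_box, semantic_predictions.get((i, j)))
        triple = (s_label, relation, o_label)
        if triple not in seen:  # drop exact duplicate triples
            seen.add(triple)
            triples.append(triple)
    return triples


if __name__ == "__main__":
    demo = [("laptop", (100, 100, 300, 250)), ("table", (50, 240, 600, 400))]
    print(build_triples(demo, {(0, 1): ("on", 0.9)}))
```

Triples in this form could then be added as labeled edges to a NetworkX `DiGraph` or joined into a prompt for the text generation step. The thresholds here are placeholders and would need tuning against real detections.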