---
title: Scene Graph Generator
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# 🧠 Scene Graph Generator (Multimodal AI System)

A multimodal computer vision system that takes an input image, detects objects, predicts the relationships between them, constructs a structured scene graph, and generates a natural-language description of the scene.

**Live Demo (Hugging Face Spaces):**
https://huggingface.co/spaces/<your-username>/scene-graph-generator

---

# Features

- Object Detection using DETR (ResNet-50)
- Relationship Prediction (Custom Trained Model)
- Spatial Reasoning (Hybrid AI with Geometry Rules)
- Scene Graph Construction (Directed Graph)
- Graph Visualization (NetworkX + Matplotlib)
- Graph-to-Text Generation (FLAN-T5)
- Interactive UI (Gradio)
- Deployed on Hugging Face Spaces (CPU)

---

# 🧠 How It Works (End-to-End Pipeline)

### 1. Input

- User uploads an image (JPG/PNG) via the Gradio UI
- Image is converted from PIL → OpenCV format

---

### 2. Object Detection

- Uses `facebook/detr-resnet-50` from Hugging Face
- Outputs:
  - Object labels (COCO classes)
  - Bounding boxes
  - Confidence scores
- Keeps only detections with confidence ≥ 0.7 to filter out noise
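
The filtering step can be sketched as follows. This is a minimal illustration that assumes detections have already been decoded into plain dictionaries; the `label`, `score`, and `box` keys and the `filter_detections` helper are hypothetical names, not taken from the project code.

```python
CONFIDENCE_THRESHOLD = 0.7  # threshold stated in this README


def filter_detections(detections, threshold=CONFIDENCE_THRESHOLD):
    """Keep only detections whose confidence score is at least `threshold`."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detector output; boxes are (x_min, y_min, x_max, y_max) in pixels.
detections = [
    {"label": "laptop", "score": 0.98, "box": (120, 200, 420, 380)},
    {"label": "mouse", "score": 0.91, "box": (450, 330, 500, 370)},
    {"label": "cup", "score": 0.42, "box": (10, 10, 60, 80)},
]

kept = filter_detections(detections)
print([d["label"] for d in kept])  # prints ['laptop', 'mouse']
```
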

---

### 3. Pairwise Object Processing

- Generates object pairs using `itertools.combinations`
- Extracts bounding boxes for each pair
- Creates union region for relation inference
- Filters duplicate object pairs
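
A minimal sketch of the pairing step, assuming each object is a dictionary with a `box` in `(x_min, y_min, x_max, y_max)` form. The helper names `union_box` and `candidate_pairs` are illustrative and may differ from what `src/pipeline.py` actually uses.

```python
from itertools import combinations


def union_box(box_a, box_b):
    """Smallest box (x_min, y_min, x_max, y_max) that encloses both inputs."""
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )


def candidate_pairs(objects):
    """Yield (subject, object, union_region) once per unordered pair."""
    for subj, obj in combinations(objects, 2):
        yield subj, obj, union_box(subj["box"], obj["box"])


# Hypothetical filtered detections.
objects = [
    {"label": "laptop", "box": (120, 200, 420, 380)},
    {"label": "mouse", "box": (450, 330, 500, 370)},
    {"label": "table", "box": (0, 350, 640, 480)},
]

pairs = list(candidate_pairs(objects))
for subj, obj, region in pairs:
    print(subj["label"], obj["label"], region)
```

Because `combinations` emits each unordered pair exactly once, `(laptop, mouse)` and `(mouse, laptop)` never both appear, which covers the most common source of duplicate pairs.
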

---

### 4. Relationship Prediction

- Custom classifier trained on a Visual Genome subset (~10K samples)
- Predicts semantic relations:
  - `on`, `holding`, `behind`, etc.
- Trained with PyTorch for 10 epochs

---

### 5. Spatial Reasoning (Hybrid AI)

- Uses bounding box geometry to compute:
  - `left_of`, `right_of`, `above`, `below`, `near`
- Hybrid logic:
  - Uses the semantic relation from the model when its prediction is confident
  - Otherwise falls back to the spatial rules
- Reduces the classifier's bias toward a single relation (e.g., "everything = on")
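
One way the hybrid logic could look is sketched below. The center-based comparison, the `near_px` distance, and the `min_conf` cutoff are all assumptions made for illustration; the actual rules in `src/spatial_rules.py` may differ.

```python
import math


def center(box):
    """Center point of a (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def spatial_relation(subj_box, obj_box, near_px=50):
    """Geometric relation of subject to object (image y grows downward)."""
    sx, sy = center(subj_box)
    ox, oy = center(obj_box)
    dx, dy = sx - ox, sy - oy
    if math.hypot(dx, dy) <= near_px:  # assumed distance cutoff
        return "near"
    if abs(dx) >= abs(dy):  # horizontal offset dominates
        return "left_of" if dx < 0 else "right_of"
    return "above" if dy < 0 else "below"


def choose_relation(semantic_label, semantic_conf, subj_box, obj_box, min_conf=0.6):
    """Prefer the classifier when confident, else fall back to geometry."""
    if semantic_conf >= min_conf:  # assumed confidence cutoff
        return semantic_label
    return spatial_relation(subj_box, obj_box)


print(choose_relation("on", 0.9, (0, 0, 10, 10), (200, 0, 210, 10)))  # prints on
print(choose_relation("on", 0.3, (0, 0, 10, 10), (200, 0, 210, 10)))  # prints left_of
```

With this kind of fallback, a low-confidence `on` prediction is replaced by a geometric label instead of being kept by default.
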

---

### 6. Graph Construction

- Builds a **directed graph (NetworkX DiGraph)**
  - Nodes → objects
  - Edges → relationships
- Removes duplicate edges and caps the number of edges for clarity
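
The deduplication and edge cap can be illustrated with a dependency-free sketch. It uses a plain adjacency dictionary rather than NetworkX, and both `build_scene_graph` and the `max_edges` default are hypothetical. With NetworkX, each kept edge would correspond to a call such as `G.add_edge(subj, obj, label=relation)`, where the `label` attribute name is likewise an assumption.

```python
def build_scene_graph(triples, max_edges=10):
    """Build {subject: {object: relation}} from (subject, relation, object) triples.

    Repeated (subject, object) edges keep the first relation seen, and no
    more than `max_edges` edges are added.
    """
    graph = {}
    edge_count = 0
    for subj, relation, obj in triples:
        if obj in graph.get(subj, {}):
            continue  # duplicate directed edge, skip it
        if edge_count >= max_edges:
            break  # edge budget used up
        graph.setdefault(subj, {})[obj] = relation
        graph.setdefault(obj, {})  # register the target as a node too
        edge_count += 1
    return graph


triples = [
    ("laptop", "on", "table"),
    ("laptop", "on", "table"),  # duplicate, ignored
    ("mouse", "next_to", "laptop"),
]

graph = build_scene_graph(triples)
print(graph)
```
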

---

### 7. Graph Visualization

- Uses NetworkX + Matplotlib
- Displays:
  - Directed edges with labels
  - Clean layout for readability

---

### 8. Graph → Text (NLP)

- Uses `google/flan-t5-small`
- Converts structured triples into natural language

Example:

- `laptop → on → table`
- `mouse → next_to → laptop`

Output:

"A laptop is placed on a table with a mouse next to it."
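
Before the triples reach the language model they have to be flattened into a text prompt. The sketch below shows one possible way to do that; the prompt wording and the `triples_to_prompt` helper are assumptions, not the project's actual prompt.

```python
def triples_to_prompt(triples):
    """Flatten (subject, relation, object) triples into an instruction prompt."""
    facts = [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples]
    return "Describe the scene in one sentence: " + "; ".join(facts) + "."


triples = [("laptop", "on", "table"), ("mouse", "next_to", "laptop")]
prompt = triples_to_prompt(triples)
print(prompt)
# prints: Describe the scene in one sentence: laptop on table; mouse next to laptop.
```

The resulting string would then be passed to `google/flan-t5-small` for generation. The sentence the model returns depends on the checkpoint and decoding settings, so the example output above should be read as illustrative.
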

---

### 9. UI (Gradio)

- Upload image
- View:
  - Scene graph
  - Generated description
- Fully interactive and browser-based

---

# Tech Stack

- **Computer Vision:** DETR (Hugging Face Transformers)
- **Deep Learning:** PyTorch
- **Graph Processing:** NetworkX
- **NLP:** FLAN-T5
- **Image Processing:** OpenCV
- **Frontend/UI:** Gradio
- **Deployment:** Hugging Face Spaces

---

# Project Structure

scene-graph-generator/
│
├── app.py
├── requirements.txt
├── README.md
│
├── src/
│   ├── pipeline.py
│   ├── detection.py
│   ├── spatial_rules.py
│   ├── relationship_infer.py
│   ├── scene_graph.py
│   ├── visualization.py
│   └── text_generation.py

---

# Installation (Local Setup)

| ```bash | |
| git clone https://github.com/<your-username>/scene-graph-generator.git | |
| cd scene-graph-generator | |
| pip install -r requirements.txt | |
| python app.py |