title: VISIONARY-Interior Design Engine
emoji: 🌆
colorFrom: green
colorTo: gray
sdk: gradio
sdk_version: 6.15.2
python_version: '3.13'
app_file: app.py
pinned: false
license: mit

VISIONARY: AI-Powered Interior Design Engine

Matan Zigelman
Reichman University


📖 Project Overview

Visionary is a machine learning application designed to provide context-aware interior design recommendations through deep semantic feature extraction. The core objective of the project is to explore how neural networks can understand the complex structural and stylistic nuances of an architectural space.

By utilizing the openai/clip-vit-base-patch32 vision-language model, the system converts raw images into 512-dimensional semantic embeddings. This mathematical representation enables the application to group and retrieve rooms based on abstract design concepts—such as "minimalist," "industrial," or "warm lighting"—allowing users to seamlessly discover relevant interior designs using either visual inspiration photos or text prompts.

Tech Stack: Python | PyTorch | Hugging Face (Transformers, Datasets) | Scikit-Learn (K-Means, t-SNE) | Pandas | Gradio


🗄️ 1. Dataset Architecture

For the foundation of this engine, I selected the my_interior_design_dataset from Hugging Face. I specifically chose this dataset because its dual-label metadata provides a rich, multi-dimensional ground truth, allowing the evaluation of both functional and stylistic semantic similarities.

| Feature | Details |
| --- | --- |
| Total Volume | 33,453 images (~33,500 in the training split) |
| Storage Size | 535 MB |
| Standardization | All images have a uniform 256x256 pixel resolution |
| Style Labels | 11 distinct design classes (e.g., Rustic, Bohemian, Transitional) |
| Room Labels | 5 functional architectural classes (e.g., Bedroom, Kitchen, Living Room) |

📊 2. Exploratory Data Analysis (EDA)

Before computing embeddings, I ran an exploratory data analysis to validate the dataset's integrity and to look for visual biases.

2.1 Initial Sanity Checks & Data Integrity

  • Missing Value Verification: I inspected the metadata arrays and confirmed zero missing values (NaNs). Every visual sample is explicitly paired with valid ground-truth labels.
  • Dimension Uniformity: Batched tensor operations require identical spatial dimensions. All tested samples are exactly 256x256, so images can be stacked into batches without per-image resizing.
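
The two checks above could be reproduced roughly as follows. This is a minimal sketch rather than the notebook's code: the record layout (`image`, `style`, `room` keys) and the helper name `check_integrity` are assumptions, and the sample data is synthetic.

```python
import numpy as np

EXPECTED_SHAPE = (256, 256, 3)  # height, width, RGB channels


def check_integrity(records, expected_shape=EXPECTED_SHAPE):
    """Return (missing_label_count, wrong_shape_count) for a list of records."""
    missing = sum(
        1 for r in records
        if r.get("style") is None or r.get("room") is None
    )
    wrong_shape = sum(1 for r in records if r["image"].shape != expected_shape)
    return missing, wrong_shape


# Synthetic stand-in for a few dataset rows (hypothetical layout).
rng = np.random.default_rng(0)
records = [
    {
        "image": rng.integers(0, 256, size=EXPECTED_SHAPE, dtype=np.uint8),
        "style": i % 11,
        "room": i % 5,
    }
    for i in range(8)
]

missing, wrong_shape = check_integrity(records)
print(missing, wrong_shape)
```

Both counts should be zero for a dataset that passes the checks described above.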

2.2 Class Distribution Analysis

Understanding the distribution of our labels is critical to identifying potential model biases.

  • Room Types: There is a moderate class imbalance. Class 4 (Living Room) is the dominant category with >10,000 samples, whereas Class 3 contains ~4,000.
  • Interior Styles: The stylistic distribution is more balanced, gently sloping from ~4,000 samples down to ~2,100 in the least frequent class.
  • Takeaway: Every class has enough samples to be well represented in the embedding catalog. Living Rooms are over-represented, however, so retrieval results may skew toward them.
Class Distribution Bar Charts
Figure 1: Distribution of Room Types and Interior Styles across the dataset.
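
One simple way to quantify the imbalance described above is the ratio between the largest and smallest class counts. The sketch below uses toy labels and a hypothetical helper name (`imbalance_ratio`); the real counts come from the dataset metadata.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels only; they are not the dataset's actual distribution.
room_labels = [4] * 10 + [3] * 4 + [0] * 6

print(Counter(room_labels).most_common())
print(imbalance_ratio(room_labels))
```

A ratio close to 1 indicates a balanced label set; larger values indicate stronger imbalance.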

2.3 RGB Channel Distribution

Unlike tabular data, images are composed of pixel intensities across Red, Green, and Blue channels.

  • Insights: The histogram shows a predominantly bright, well-lit dataset, with the Red channel dominating the higher intensity ranges. This is consistent with interior photography, which often features warm ambient lighting, wooden textures, and earth-toned furniture. This warm-tone bias is likely reflected in the CLIP embeddings and may influence similarity scores.
RGB Histogram
Figure 2: RGB Pixel Intensity Distribution highlighting warm-tone dominance.
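
A per-channel histogram like the one in Figure 2 can be computed with NumPy. This is a sketch under assumptions: the images are taken to be a `uint8` array shaped `(N, H, W, 3)`, the function name `channel_histograms` is hypothetical, and random pixels stand in for real photos.

```python
import numpy as np


def channel_histograms(images, bins=32):
    """Per-channel intensity histograms for a uint8 batch shaped (N, H, W, 3)."""
    edges = np.linspace(0, 256, bins + 1)
    return {
        name: np.histogram(images[..., idx], bins=edges)[0]
        for idx, name in enumerate(("red", "green", "blue"))
    }


# Random pixels as a placeholder for real images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 256, 256, 3), dtype=np.uint8)

hists = channel_histograms(images)
print({name: int(h.sum()) for name, h in hists.items()})
```

On real data, comparing the upper bins of the three histograms would show whether the Red channel dominates the bright range.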

2.4 Spatial Variance & The "Mean Image"

To identify structural patterns, I computed the mathematical "Mean Image" by averaging pixel arrays within specific classes (e.g., Bedrooms).

  • Insights: The resulting images were amorphous and blurry, lacking distinct structural boundaries such as beds or counters. This indicates a high degree of spatial variance: rooms are photographed from very different angles and depths.
  • Conclusion: This supports the decision to use deep learning embeddings over pixel-level computer vision techniques. The system needs a model that detects semantic objects regardless of where they appear in the frame.
Mean Images by Room Type
Figure 3: Mean Image representation showing high spatial variance.
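
The mean image is a pixel-wise average over all images of one class. A minimal sketch, assuming a `uint8` image array shaped `(N, H, W, 3)` and integer labels; the helper name `mean_image` and the constant-valued test images are illustrative only.

```python
import numpy as np


def mean_image(images, labels, target_label):
    """Average all images of one class; returns a float array shaped (H, W, 3)."""
    mask = np.asarray(labels) == target_label
    if not mask.any():
        raise ValueError(f"no images with label {target_label}")
    # Cast to float first so the sum cannot overflow uint8.
    return images[mask].astype(np.float64).mean(axis=0)


# Three constant-valued images: two in class 0, one in class 1.
images = np.stack(
    [np.full((256, 256, 3), v, dtype=np.uint8) for v in (10, 30, 200)]
)
labels = [0, 0, 1]

avg = mean_image(images, labels, target_label=0)
print(avg.shape, avg[0, 0, 0])
```

When the photos in a class share no consistent layout, this average washes out into the blurry result described above.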

🧠 3. Semantic Embeddings & Clustering Pipeline

I drew a reproducible random subset of 5,000 images (seed=42) to keep computation manageable while still covering the dataset's room types and styles.
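
A seeded subset can be drawn in several ways, and the exact call used in the notebook is not shown here. The sketch below uses only the standard library; the helper name `sample_indices` is hypothetical, and the same effect could likely be achieved with the Datasets library's `shuffle(seed=42)` followed by `select`.

```python
import random


def sample_indices(total, k, seed=42):
    """Draw k distinct row indices reproducibly from range(total)."""
    return sorted(random.Random(seed).sample(range(total), k))


subset = sample_indices(total=33_453, k=5_000)
print(len(subset), len(set(subset)))
```

Because the generator is seeded, calling the function again with the same arguments returns the same indices.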

3.1 Feature Extraction (CLIP)

I used Hugging Face's openai/clip-vit-base-patch32. With PyTorch inference under torch.no_grad() and GPU acceleration, each image was passed through the CLIP vision encoder to produce a dense 512-dimensional vector. These vectors capture stylistic as well as functional cues.
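
The extraction step might look like the sketch below. It is an illustration, not the notebook's code: the function name `embed_images`, the batch size, and the L2 normalization are assumptions. The features are computed from the vision tower and projection layer explicitly, which is equivalent to the model's image-feature path.

```python
def embed_images(image_paths, model_name="openai/clip-vit-base-patch32",
                 batch_size=32):
    """Return an (N, 512) tensor of L2-normalized CLIP image embeddings."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = CLIPModel.from_pretrained(model_name).to(device).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    chunks = []
    for start in range(0, len(image_paths), batch_size):
        paths = image_paths[start:start + batch_size]
        batch = [Image.open(p).convert("RGB") for p in paths]
        inputs = processor(images=batch, return_tensors="pt").to(device)
        with torch.no_grad():  # inference only, no gradients needed
            pooled = model.vision_model(
                pixel_values=inputs["pixel_values"]
            ).pooler_output
            feats = model.visual_projection(pooled)
        # Normalizing here is an assumption; it makes dot products equal
        # cosine similarities later on.
        feats = feats / feats.norm(dim=-1, keepdim=True)
        chunks.append(feats.cpu())
    return torch.cat(chunks)  # expects a non-empty list of paths
```

The processor handles resizing and normalization of the input images, so the 256x256 source images do not need manual preprocessing.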

3.2 Unsupervised K-Means Clustering

To test the quality of these embeddings, I applied a K-Means clustering algorithm, explicitly instructing it to find 5 clusters (n_clusters=5) to mirror the 5 ground-truth room types.
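
With scikit-learn, the clustering step is a few lines. The sketch below runs on synthetic blobs instead of the real embedding matrix; `n_init` and `random_state` are assumed values, since only `n_clusters=5` is stated above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in for the (5000, 512) embedding matrix:
# five well-separated blobs of 40 points each in 512 dimensions.
centers = rng.normal(size=(5, 512))
embeddings = np.vstack(
    [c + 0.05 * rng.normal(size=(40, 512)) for c in centers]
)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(embeddings)

# Number of points assigned to each cluster.
print(np.bincount(cluster_ids, minlength=5))
```

K-Means never sees the room labels, so any agreement between clusters and room types comes from the embeddings alone.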

3.3 Dimensionality Reduction (t-SNE) & Coherence

Projecting the 512 dimensions down to a 2D scatter plot via t-SNE showed clear structure: rather than a single diffuse cloud, the points formed cohesive "islands."
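
A 2D projection of this kind can be produced with scikit-learn's `TSNE`. The hyperparameters below (`perplexity=30`, `init="pca"`, `random_state=42`) are assumptions, and random vectors stand in for the real embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Random placeholder for the embedding matrix (200 points, 512 dimensions).
embeddings = rng.normal(size=(200, 512)).astype(np.float32)

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
coords = tsne.fit_transform(embeddings)

print(coords.shape)
```

Each row of `coords` is an (x, y) position that can be scatter-plotted and colored by cluster or room label.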

By cross-tabulating the generated clusters against actual room labels, distinct identities emerged:

  • Cluster 0 (The Bathroom Cluster): Highly specialized, containing 706 Bathrooms and almost no overlap with other room types.
  • Cluster 1 (Kitchen & Dining): A semantic overlap reflecting real-world open-concept architecture.
  • Cluster 4: Dominated by Living Rooms.
  • Clusters 2 & 3 (Soft-Furnishings): A blend of Bedrooms and Living Rooms, driven by shared semantic textures (rugs, beds, upholstery).
t-SNE Scatter Plot
Figure 4: t-SNE 2D Projection showing distinct semantic clustering of 512D embeddings.
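
The cross-tabulation of clusters against room labels is a contingency table. A minimal NumPy sketch with toy assignments follows; the helper name `contingency` and the per-cluster purity measure are illustrative additions, not results from the project.

```python
import numpy as np


def contingency(cluster_ids, room_labels, n_clusters, n_rooms):
    """Rows are clusters, columns are room labels, cells are image counts."""
    table = np.zeros((n_clusters, n_rooms), dtype=int)
    np.add.at(table, (cluster_ids, room_labels), 1)
    return table


# Toy assignments for eight images.
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
room_labels = np.array([1, 1, 1, 2, 3, 4, 4, 0])

table = contingency(cluster_ids, room_labels, n_clusters=3, n_rooms=5)
# Purity: share of each cluster taken by its most common room label.
purity = table.max(axis=1) / table.sum(axis=1)

print(table)
print(purity)
```

A cluster with purity near 1 corresponds to a single room type, while lower values indicate a blend of room types.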

🚀 4. Recommendation Engine & Final Application

The 512D embedding matrix was exported as a .parquet file, which serves as the catalog for the deployed application.

System Architecture:

  1. Input Processing: When a user interacts with the Gradio UI, either by uploading an inspiration photo or by typing a text prompt, the system passes the input through the CLIP processor and model, mapping it into the same vector space as the catalog embeddings.
  2. Cosine Similarity: The engine computes the cosine of the angle between the user's vector and each of the 5,000 vectors in the catalog.
  3. Top-K Retrieval: The algorithm ranks the catalog by similarity and returns the top 3 matches alongside their cosine similarity scores.
Recommendation Engine Results
Figure 5: System successfully matching a textual/visual input to real-world interior designs.
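
Steps 2 and 3 above can be sketched with NumPy. The function name `top_k_cosine` is hypothetical, and the 2D toy vectors stand in for the 512-dimensional catalog.

```python
import numpy as np


def top_k_cosine(query, catalog, k=3):
    """Return (indices, scores) of the k catalog rows most similar to query."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of every catalog row with the query
    order = np.argsort(scores)[::-1][:k]  # highest similarity first
    return order, scores[order]


# Toy 2D catalog; real embeddings would have 512 columns.
catalog = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.0])

idx, scores = top_k_cosine(query, catalog, k=3)
print(idx, scores)
```

Cosine similarity ignores vector length, so the query `[2.0, 0.0]` matches the catalog row `[1.0, 0.0]` with a score of 1. The scores are similarities in the range -1 to 1, not calibrated probabilities.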

🎯 Conclusion & Technical Achievements

Testing the complete pipeline suggests that the system goes beyond matching raw colors: retrieved results reflect room function and stylistic intent. The backend was deployed as an interactive Hugging Face Space, covering the full workflow from data analysis to a working application.


📁 Project Resources

For a complete technical deep dive, the core development files and generated datasets are available directly within this repository:

  • 📓 Colab (A3.ipynb): Contains the full algorithmic workflow, including the comprehensive Exploratory Data Analysis (EDA), CLIP feature extraction, K-Means clustering, and t-SNE visualizations.
  • 🗄️ Embeddings Database (interior_embeddings.parquet): The pre-computed 512-dimensional CLIP embeddings matrix, serving as the foundational knowledge base for the recommendation engine's vector search.