# 🧠 GenVQA – Generative Visual Question Answering

*A hybrid neuro-symbolic VQA system that intelligently routes between pure neural networks and knowledge-grounded reasoning*


## Overview

GenVQA is an advanced Visual Question Answering system that combines the best of both worlds:

  • Neural networks for perception-based visual questions
  • Symbolic reasoning for knowledge-intensive questions

The system automatically classifies incoming questions and routes them to the optimal processing pipeline, ensuring accurate and grounded answers.


## System Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                            CLIENT                                │
│         Expo React Native App (iOS/Android/Web)                  │
│         • Image upload via camera/gallery                        │
│         • Question input with suggested prompts                  │
│         • Multi-turn conversational interface                    │
│         • Google OAuth authentication                            │
└───────────────────────────┬──────────────────────────────────────┘
                            │ HTTP POST /api/answer
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                      BACKEND API LAYER                           │
│                    FastAPI (backend_api.py)                      │
│         • Request handling & validation                          │
│         • Session management & authentication                    │
│         • Multi-turn conversation tracking                       │
└───────────────────────────┬──────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                   INTELLIGENT ROUTING LAYER                      │
│                     (ensemble_vqa_app.py)                        │
│                                                                  │
│   CLIP Semantic Classifier:                                      │
│   Encodes question → Compares similarity:                        │
│   "This is a reasoning question about facts"                     │
│              vs                                                  │
│   "This is a visual perception question"                         │
│                                                                  │
│           Similarity > threshold?                                │
│                     │                                            │
│                     ├─────────┬────────┐                         │
│                     │         │        │                         │
│               REASONING    VISUAL   SPATIAL                      │
│                     │         │        │                         │
└─────────────────────┼─────────┼────────┼─────────────────────────┘
                      │         │        │
        ┌─────────────┘         │        └─────────────┐
        ▼                       ▼                      ▼
┌──────────────────┐   ┌───────────────────┐   ┌─────────────────┐
│ NEURO-SYMBOLIC   │   │  NEURAL VQA PATH  │   │ SPATIAL ADAPTER │
│    PIPELINE      │   │                   │   │      PATH       │
│                  │   │  CLIP + GRU +     │   │                 │
│ ① VQA Model      │   │  Attention        │   │  Enhanced with  │
│    Detects       │   │                   │   │  spatial        │
│    Objects       │   │  Direct answer    │   │  self-attention │
│    (e.g. "soup") │   │  prediction from  │   │  for left/right │
│                  │   │  image features   │   │  above/below    │
│ ② Wikidata API   │   │                   │   │  questions      │
│    Fetches Facts │   │  Outputs:         │   │                 │
│    P31: category │   │  "red"            │   │  Outputs:       │
│    P186: material│   └────────┬──────────┘   │  "on the left"  │
│    P2101: melting│            │              └────────┬────────┘
│    P366: use     │            │                       │
│    P2054: density│            │                       │
│                  │            │                       │
│ ③ Groq LLM       │            │                       │
│    Verbalizes    │            │                       │
│    from facts    │            │                       │
│    (instead of   │            │                       │
│    free          │            │                       │
│    reasoning)    │            │                       │
│                  │            │                       │
│ Outputs:         │            │                       │
│ "Soup is made of │            │                       │
│  water and       │            │                       │
│  vegetables,     │            │                       │
│  used for eating"│            │                       │
└────────┬─────────┘            │                       │
         │                      │                       │
         └──────────┬───────────┴───────────────────────┘
                    ▼
         ┌──────────────────────┐
         │  GROQ ACCESSIBILITY  │
         │       SERVICE        │
         │                      │
         │  Generates 2-sentence│
         │  screen-reader       │
         │  friendly description│
         │  for every answer    │
         └──────────┬───────────┘
                    │
                    ▼
              JSON Response
         {
           "answer": "...",
           "model_used": "neuro_symbolic|base|spatial",
           "confidence": 0.85,
           "kg_enhancement": true/false,
           "wikidata_entity": "Q123456",
           "description": "...",
           "session_id": "..."
         }
```
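The response shape above can be pinned down in code. This is a minimal sketch using a stdlib dataclass; the class name `AnswerResponse` is hypothetical, and the real backend may define this as a Pydantic model for FastAPI validation instead:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical model mirroring the JSON response fields above.
@dataclass
class AnswerResponse:
    answer: str
    model_used: str                        # "neuro_symbolic" | "base" | "spatial"
    confidence: float
    kg_enhancement: bool
    description: str                       # screen-reader friendly description
    session_id: str
    wikidata_entity: Optional[str] = None  # only set on the neuro-symbolic path

resp = AnswerResponse(
    answer="red",
    model_used="base",
    confidence=0.85,
    kg_enhancement=False,
    description="A red car is visible in the image.",
    session_id="550e8400-e29b-41d4-a716-446655440000",
)
print(asdict(resp)["model_used"])  # base
```

Keeping the schema in one typed definition means every pathway (neuro-symbolic, base, spatial) returns the same envelope, so the mobile client can render all three uniformly.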

## Neural vs Neuro-Symbolic: Deep Dive

### Neural Pathway

When Used: Perceptual questions about what's directly visible

  • "What color is the car?"
  • "How many people are in the image?"
  • "Is the dog sitting or standing?"

Architecture:

```
Image Input
    │
    ▼
┌─────────────────────────────┐
│    CLIP Vision Encoder      │
│    (ViT-B/16)               │
│    • Pre-trained on 400M    │
│      image-text pairs       │
│    • 512-dim embeddings     │
└──────────┬──────────────────┘
           │
           ▼
      [512-dim vector] ────────────┐
                                   │
Question Input                     │
    │                              │
    ▼                              │
┌─────────────────────────────┐    │
│   GPT-2 Text Encoder        │    │
│   (distilgpt2)              │    │
│   • Contextual embeddings   │    │
│   • 768-dim output          │    │
└──────────┬──────────────────┘    │
           │                       │
           ▼                       │
      [768-dim vector]             │
           │                       │
           ▼                       │
    ┌──────────────┐               │
    │ Linear Proj  │               │
    │ 768 → 512    │               │
    └──────┬───────┘               │
           │                       │
           └───────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │  Multimodal Fusion   │
            │  • Gated combination │
            │  • 3-layer MLP       │
            │  • ReLU + Dropout    │
            └──────────┬───────────┘
                       │
                       ▼
            ┌──────────────────────┐
            │  GRU Decoder with    │
            │  Attention Mechanism │
            │                      │
            │  • Hidden: 512-dim   │
            │  • 2 layers          │
            │  • Seq2seq decoding  │
            │  • Attention over    │
            │    fused features    │
            └──────────┬───────────┘
                       │
                       ▼
                 Answer Tokens
                 "red car"
```

Key Components:

  • CLIP: Zero-shot image understanding, robust to domain shift
  • GPT-2: Contextual question encoding
  • Attention: Decoder focuses on relevant image regions per word
  • GRU: Sequential answer generation with memory
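The "gated combination" step can be illustrated numerically. This is a toy NumPy sketch under assumed weight shapes (the real model presumably implements this as PyTorch layers, and the exact gating formula is an assumption): the text embedding is projected from 768 to 512 dims, then a learned sigmoid gate blends it element-wise with the image embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, txt_feat, W_proj, W_gate):
    """Project text 768 -> 512, then gate element-wise between modalities."""
    txt_512 = txt_feat @ W_proj                                   # (768,) -> (512,)
    gate = sigmoid(np.concatenate([img_feat, txt_512]) @ W_gate)  # (512,) in (0, 1)
    return gate * img_feat + (1.0 - gate) * txt_512               # (512,) fused

img = rng.standard_normal(512)              # CLIP image embedding
txt = rng.standard_normal(768)              # GPT-2 question embedding
W_proj = rng.standard_normal((768, 512)) * 0.01   # assumed linear projection
W_gate = rng.standard_normal((1024, 512)) * 0.01  # assumed gate weights

fused = gated_fusion(img, txt, W_proj, W_gate)
print(fused.shape)  # (512,)
```

The gate lets the network decide, per dimension, whether the image or the question should dominate the fused representation that the GRU decoder attends over.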

Training:

  • Dataset: VQA v2 (curated, balanced subset)
  • Loss: Cross-entropy over answer vocabulary
  • Fine-tuning: Last 2 CLIP layers + full decoder
  • Accuracy: ~39% on general VQA, ~28% on spatial questions

### Neuro-Symbolic Pathway (Knowledge-Grounded Reasoning)

When Used: Questions requiring external knowledge or reasoning

  • "Can soup melt?"
  • "What is ice cream made of?"
  • "Does this float in water?"

Architecture:

```
Step 1: NEURAL DETECTION
─────────────────────────
Image + Question
    │
    ▼
┌──────────────────────┐
│   VQA Model          │
│   (same as above)    │
│                      │
│   Predicts: "soup"   │
└──────────┬───────────┘
           │
           ▼
    Detected Object
       "soup"

Step 2: SYMBOLIC FACT RETRIEVAL
────────────────────────────────
    "soup"
       │
       ▼
┌──────────────────────────────────┐
│    Wikidata SPARQL Queries       │
│                                  │
│ ① Entity Resolution:             │
│    "soup" → Q41415 (Wikidata ID) │
│                                  │
│ ② Fetch ALL Relevant Properties: │
│                                  │
│    P31  (instance of):           │
│         → "food"                 │
│         → "liquid food"          │
│         → "dish"                 │
│                                  │
│    P186 (made of):               │
│         → "water"                │
│         → "vegetables"           │
│         → "broth"                │
│                                  │
│    P366 (used for):              │
│         → "consumption"          │
│         → "nutrition"            │
│                                  │
│    P2101 (melting point):        │
│         → (not found)            │
│                                  │
│    P2054 (density):              │
│         → ~1000 kg/m³            │
│         → (floats/sinks calc)    │
│                                  │
│    P2777 (flash point):          │
│         → (not found)            │
└──────────────┬───────────────────┘
               │
               ▼
    Structured Knowledge Graph
    {
      "entity": "soup (Q41415)",
      "categories": ["food", "liquid"],
      "materials": ["water", "vegetables"],
      "uses": ["consumption"],
      "density": 1000,
      "melting_point": null
    }

Step 3: LLM VERBALIZATION (NOT REASONING!)
───────────────────────────────────────────
    Knowledge Graph
         │
         ▼
┌────────────────────────────────────┐
│         Groq API                   │
│     (Llama 3.3 70B)                │
│                                    │
│  System Prompt:                    │
│  "You are a fact verbalizer.       │
│   Answer ONLY from provided        │
│   Wikidata facts. Do NOT use       │
│   your training knowledge.         │
│   If facts don't contain the       │
│   answer, say 'unknown from        │
│   available data'."                │
│                                    │
│  User Input:                       │
│  Question: "Can soup melt?"        │
│  Facts: {structured data above}    │
└────────────┬───────────────────────┘
             │
             ▼
    Natural Language Answer
    "According to Wikidata, soup is
     a liquid food made of water and
     vegetables. Since it's already
     liquid, it doesn't have a melting
     point like solids do. It can
     freeze, but not melt."
```
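Step 2 can be sketched as two query builders. The endpoints named in the docstrings are the real Wikidata APIs (`wbsearchentities` for entity resolution, the SPARQL endpoint for claims); the exact property-to-slot mapping and the single-query shape are assumptions about how the backend batches its lookups:

```python
# Properties of interest, as listed in the diagram above.
PROPS = {
    "P31": "categories", "P186": "materials", "P366": "uses",
    "P2101": "melting_point", "P2054": "density", "P2777": "flash_point",
}

def search_params(label: str) -> dict:
    """Query params for entity resolution, e.g. "soup" -> Q41415,
    via GET https://www.wikidata.org/w/api.php."""
    return {"action": "wbsearchentities", "search": label,
            "language": "en", "format": "json", "limit": 1}

def facts_query(entity_id: str) -> str:
    """SPARQL for https://query.wikidata.org/sparql fetching all
    properties of interest for one entity in a single round trip."""
    values = " ".join(f"wdt:{pid}" for pid in PROPS)
    return (
        "SELECT ?prop ?valueLabel WHERE {\n"
        f"  VALUES ?prop {{ {values} }}\n"
        f"  wd:{entity_id} ?prop ?value.\n"
        '  SERVICE wikibase:label '
        '{ bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

print("wd:Q41415" in facts_query("Q41415"))  # True
```

Batching all properties into one `VALUES` clause keeps fact retrieval to two HTTP round trips per question (resolve, then fetch), which matters since this sits on the request path.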

Critical Design Principle:

Groq is a verbalizer, NOT a reasoner. All reasoning happens in the symbolic layer (Wikidata facts). Groq only translates structured facts into natural language.

Why This Matters:

  • Without facts: Groq hallucinates from training data
  • With facts: Groq grounds answers in real-time data
  • Result: Factual accuracy, no made-up information
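The verbalizer constraint is enforced entirely through the prompt. Here is a sketch of the request body sent to Groq's OpenAI-compatible chat completions endpoint (`https://api.groq.com/openai/v1/chat/completions`); the model id and `temperature=0` choice are assumptions:

```python
import json

SYSTEM_PROMPT = (
    "You are a fact verbalizer. Answer ONLY from provided Wikidata facts. "
    "Do NOT use your training knowledge. If facts don't contain the answer, "
    "say 'unknown from available data'."
)

def verbalizer_payload(question: str, facts: dict) -> dict:
    """Build the chat-completions request body for the verbalization step."""
    return {
        "model": "llama-3.3-70b-versatile",  # assumed Groq model id
        "temperature": 0.0,                  # deterministic: verbalize, don't improvise
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Question: {question}\nFacts: {json.dumps(facts)}"},
        ],
    }

payload = verbalizer_payload("Can soup melt?", {"density": 1000, "melting_point": None})
print(payload["messages"][0]["role"])  # system
```

Because the facts travel inside the user message and the system prompt forbids outside knowledge, a missing fact degrades to an explicit "unknown" rather than a hallucinated answer.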

Knowledge Base Properties Fetched:

| Property | Wikidata Code | Example Value |
|----------|---------------|---------------|
| Category | P31 | "food", "tool", "animal" |
| Material | P186 | "metal", "wood", "plastic" |
| Melting Point | P2101 | 273.15 K (0°C) |
| Density | P2054 | 917 kg/m³ (floats/sinks) |
| Use | P366 | "eating", "transportation" |
| Flash Point | P2777 | 310 K (flammable) |
| Location | P276 | "ocean", "forest" |
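The "floats/sinks" check derived from P2054 is a one-line comparison against the density of water. A minimal sketch (function name is hypothetical):

```python
WATER_DENSITY = 1000.0  # kg/m³

def floats_in_water(density_kg_m3: float) -> bool:
    """Answer 'does this float?' from a Wikidata P2054 density value."""
    return density_kg_m3 < WATER_DENSITY

print(floats_in_water(917))   # ice: True
print(floats_in_water(7870))  # iron: False
```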

### Spatial Reasoning Pathway

When Used: Questions about relative positions

  • "What is to the left of the car?"
  • "Is the cat above or below the table?"

Architecture Enhancement:

```
Base VQA Model
    │
    ▼
┌──────────────────────────────┐
│  Spatial Self-Attention      │
│  • Multi-head attention (8)  │
│  • Learns spatial relations  │
│  • Position-aware weighting  │
└──────────┬───────────────────┘
           │
           ▼
    Spatial-aware answer
    "on the left side"
```

Keyword Triggering:

  • Detects: left, right, above, below, top, bottom, next to, behind, between, etc.
  • Routes to spatial adapter model
  • Enhanced accuracy on positional questions
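A sketch of the keyword trigger, assuming a simple substring check over a lowercased question (the exact keyword list in the backend may differ, and naive substring matching has known false positives, e.g. "right" inside "bright"):

```python
SPATIAL_KEYWORDS = {
    "left", "right", "above", "below", "top", "bottom",
    "next to", "behind", "between", "under", "in front of",
}

def contains_spatial_keywords(question: str) -> bool:
    """True if the question mentions a spatial relation keyword.
    Naive substring matching; a production check might use word boundaries."""
    q = question.lower()
    return any(kw in q for kw in SPATIAL_KEYWORDS)

print(contains_spatial_keywords("What is to the left of the car?"))  # True
print(contains_spatial_keywords("What color is the car?"))           # False
```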

## Intelligent Routing System

CLIP-Based Semantic Routing:

```python
# Encode question with CLIP
question_embedding = clip.encode_text(question)

# Compare against two templates
reasoning_prompt = "This is a reasoning question about facts and knowledge"
visual_prompt = "This is a visual perception question about what you see"

reasoning_similarity = cosine_similarity(question_embedding,
                                         clip.encode_text(reasoning_prompt))
visual_similarity = cosine_similarity(question_embedding,
                                      clip.encode_text(visual_prompt))

# Route decision
if reasoning_similarity > visual_similarity + THRESHOLD:
    route_to_neuro_symbolic()
elif contains_spatial_keywords(question):
    route_to_spatial_adapter()
else:
    route_to_base_neural()
```

Routing Logic:

  1. Neuro-Symbolic if CLIP classifies as reasoning (>0.6 similarity)
  2. Spatial if contains spatial keywords (left, right, above, etc.)
  3. Base Neural for all other visual perception questions

## Multi-Turn Conversation Support

Conversation Manager Features:

  • Session tracking with UUID
  • Context retention across turns
  • Pronoun resolution (it, this, that → previous object)
  • Automatic session expiry (30 min timeout)
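The four features above can be sketched in one small class. This is a minimal illustration, not the backend's actual implementation; the class and method names (`ConversationManager`, `remember`, `resolve`) are hypothetical:

```python
import time
import uuid

SESSION_TIMEOUT = 30 * 60  # seconds: the 30 min expiry mentioned above
PRONOUNS = {"it", "this", "that"}

class ConversationManager:
    """Session tracking with naive pronoun resolution and expiry."""

    def __init__(self):
        self.sessions = {}

    def new_session(self) -> str:
        sid = str(uuid.uuid4())
        self.sessions[sid] = {"objects": [], "last_seen": time.time()}
        return sid

    def remember(self, sid: str, obj: str) -> None:
        """Record an object detected in this turn, e.g. 'car'."""
        self.sessions[sid]["objects"].append(obj)

    def resolve(self, sid: str, question: str) -> str:
        """Replace it/this/that with the most recently seen object."""
        sess = self.sessions.get(sid)
        if sess is None or time.time() - sess["last_seen"] > SESSION_TIMEOUT:
            self.sessions.pop(sid, None)  # expired: nothing to resolve against
            return question
        sess["last_seen"] = time.time()
        if not sess["objects"]:
            return question
        latest = sess["objects"][-1]
        words = [latest if w.lower() in PRONOUNS else w
                 for w in question.rstrip("?").split()]
        return " ".join(words) + ("?" if question.endswith("?") else "")

mgr = ConversationManager()
sid = mgr.new_session()
mgr.remember(sid, "car")
print(mgr.resolve(sid, "Can it float?"))  # Can car float?
```

Resolving the pronoun before routing is what lets a follow-up like "Can it float?" reach the neuro-symbolic pipeline with a concrete entity ("car") to look up in Wikidata.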

Example Conversation:

```
Turn 1:
User: "What is this?"
VQA: "A red car"
Objects: ["car"]

Turn 2:
User: "Can it float?"              # "it" = "car"
System: Resolves "it" → "car"
VQA: [Neuro-Symbolic] "According to Wikidata, cars are made
      of metal and plastic with density around 800-1000 kg/m³,
      which is close to water. Most cars would sink."

Turn 3:
User: "What color is it again?"    # Still referring to car
VQA: [Neural] "red"                # From Turn 1 context
```

## Quick Start

### Prerequisites

  • Python 3.10+
  • CUDA GPU (recommended, 4GB+ VRAM)
  • Node.js 16+ (for mobile UI)
  • Groq API key (get one free)

### Backend Setup

```bash
# 1. Clone repository
git clone https://github.com/YourUsername/vqa_coes.git
cd vqa_coes

# 2. Install dependencies
pip install -r requirements_api.txt

# 3. Set environment variables
echo "GROQ_API_KEY=your_groq_api_key_here" > .env

# 4. Download model checkpoints (if not included)
# Ensure these files exist in project root:
#   - vqa_checkpoint.pt (base model)
#   - vqa_spatial_checkpoint.pt (spatial model)

# 5. Start API server
python backend_api.py

# Server will start at http://localhost:8000
```

### Mobile UI Setup

```bash
# 1. Navigate to UI folder
cd ui

# 2. Install dependencies
npm install

# 3. Configure API endpoint
# Edit ui/src/config/api.js
# Change: export const API_BASE_URL = 'http://YOUR_LOCAL_IP:8000';

# 4. Start Expo
npx expo start --clear

# Scan QR code with Expo Go app, or press 'w' for web
```

## 🔧 API Reference

### POST /api/answer

Answer a visual question with optional conversation context.

Request:

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this float in water?" \
  -F "session_id=optional-uuid-here"
```

Response:

```json
{
  "answer": "According to Wikidata, this object has a density of 917 kg/m³, which is less than water (1000 kg/m³), so it would float.",
  "model_used": "neuro_symbolic",
  "confidence": 0.87,
  "kg_enhancement": true,
  "wikidata_entity": "Q41576",
  "description": "The object appears to be made of ice. Based on its physical properties from scientific data, it would float on water due to lower density.",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "conversation_turn": 2
}
```


## 📄 License

MIT License - see LICENSE file for details

---