---
title: LocalAGI AI Sommelier
emoji: 🍸
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# 🍸 LocalAGI: The AI Sommelier

## 📖 Overview

LocalAGI is a multimodal Retrieval-Augmented Generation (RAG) application that acts as an intelligent, interactive bartender. Combining computer vision with vector search, it lets users upload a photo of any liquor bottle and instantly receive curated cocktail recipes for that specific spirit, drawn from a custom-ingested recipe library.

Engineered to run entirely on CPU-only cloud environments (like Hugging Face Spaces), this project showcases optimization techniques including dynamic image cropping, intelligent text splitting, and dual-pass vision logic.

## ✨ Key Features

- Visual Brand Recognition: Utilizes a Vision-Language Model (VLM) to read labels and identify specific alcohol brands from user-uploaded photos, going beyond generic object categorization.
- Custom Knowledge Base (RAG): Ingests raw .txt and .pdf recipe books, intelligently splitting them into discrete recipe chunks using RegEx and LangChain, and stores them in a local Chroma vector database.
- Smart Cropping Pipeline: Implements YOLOv8 to locate bottles or glasses in an image, applying dynamic 25% padding to isolate the label and strip away background noise.
- Hardware-Optimized Processing: Features custom logic to downscale images and restrict token generation limits, allowing complex 2-billion-parameter models to run efficiently on free-tier cloud CPUs.
- Interactive UI: A Gradio interface featuring a conversational chat format, session state memory, and a hidden "Vision Debug" gallery for real-time insight into the AI's detection process.
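The "hard cut" recipe splitting mentioned above can be sketched as follows. This is a minimal illustration, assuming each recipe in the source file begins with a `## `-style title line; the actual pattern used in app.py may differ.

```python
import re

# Hypothetical "hard cut" splitter: break the recipe book exactly at every
# line that looks like a recipe title. The title pattern is an assumption
# for illustration; the real ingestion regex may differ.
TITLE_RE = re.compile(r"^(?=## )", flags=re.MULTILINE)

def hard_cut_split(text: str) -> list[str]:
    """Split at the start of each recipe, keeping titles attached to bodies."""
    chunks = [c.strip() for c in TITLE_RE.split(text)]
    return [c for c in chunks if c]  # drop any empty leading chunk
```

Because the split point is a zero-width lookahead, each chunk retains its own title line, so the Chroma documents stay self-describing.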

## 🛠️ Technical Stack

- Frontend/UI: Gradio
- Computer Vision: Ultralytics YOLOv8 (Object Detection)
- Vision-Language Model: HuggingFaceTB/SmolVLM-Instruct (Label OCR & Context)
- Vector Database: ChromaDB
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
- Orchestration: LangChain (Document Loaders, Text Splitters)
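As a rough illustration of how the stack fits together at query time, the toy sketch below ranks recipe chunks by cosine similarity and keeps the top matches. The bag-of-words `embed` function is a deliberate stand-in; the app itself uses sentence-transformers/all-MiniLM-L6-v2 vectors inside ChromaDB.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for MiniLM dense vectors.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, like a Chroma query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In the real pipeline, `top_k("Absolut Vodka", recipes)` corresponds to a Chroma similarity query with `n_results=4`.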

## 🧠 How It Works Under the Hood

  1. Document Ingestion: The user uploads a recipe book. The system uses a strict "Hard Cut" method to split the document exactly at the start of every new recipe, ensuring clean data retrieval.
  2. Object Detection: When a photo is uploaded, YOLOv8 scans the image for bottles (Class 39) or glasses (Class 40/41), creating a focused, padded crop of the object.
  3. Vision Processing: The cropped image is aggressively downscaled (384x384) and passed to SmolVLM. The model is restricted to a 15-token output to rapidly extract just the brand name (e.g., "Absolut Vodka").
  4. Fallback Logic: If the VLM returns a generic term (e.g., just "Vodka") because of a bad crop, the system automatically triggers a secondary pass on the full, uncropped image to recover the brand name.
  5. Context Retrieval (RAG): The extracted brand name is embedded and queried against the Chroma database, retrieving the top 4 most relevant, full-text recipes.
  6. Chat Output: The system formats the retrieved recipes and returns them to the user via the conversational UI.
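Steps 2–4 above hinge on two small pieces of logic: expanding the YOLO bounding box by 25% on each side before cropping, and deciding whether the VLM's answer is too generic to trust. A minimal sketch of both, independent of the actual app.py implementation (the generic-term list is an assumption):

```python
def pad_box(x1, y1, x2, y2, img_w, img_h, pad=0.25):
    """Expand a detection box by `pad` of its size on each side, clamped to the image."""
    w, h = x2 - x1, y2 - y1
    return (
        max(0, int(x1 - pad * w)),
        max(0, int(y1 - pad * h)),
        min(img_w, int(x2 + pad * w)),
        min(img_h, int(y2 + pad * h)),
    )

# Assumed list of answers too generic to identify a brand.
GENERIC_TERMS = {"vodka", "gin", "rum", "whiskey", "tequila", "wine", "beer", "bottle"}

def needs_full_image_pass(vlm_answer: str) -> bool:
    """True when the cropped pass produced only a generic category."""
    return vlm_answer.strip().lower() in GENERIC_TERMS
```

With Ultralytics, the raw box would come from the detector's `xyxy` results filtered to COCO classes 39–41, and the padded crop would then be downscaled to 384×384 before being handed to SmolVLM.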

## 🚀 Future Roadmap

- Integration with hardware-accelerated APIs (Groq/Gemini) for sub-3-second vision processing.
- User inventory tracking to suggest recipes based on combinations of multiple owned bottles.