Update README.md
README.md
CHANGED
|
@@ -1,12 +1,33 @@
| 1 |
+
# 🍸 LocalAGI: The AI Sommelier
|
| 2 |
+
|
| 3 |
+
## 📖 Overview
|
| 4 |
+
LocalAGI is a multimodal Retrieval-Augmented Generation (RAG) application that acts as an interactive bartender. It combines computer vision with vector search: users upload a photo of a liquor bottle and receive cocktail recipes that use that specific spirit, retrieved from a custom-ingested recipe library.
|
| 5 |
+
|
| 6 |
+
Engineered to run entirely on CPU-only cloud environments (such as Hugging Face Spaces), this project relies on several optimizations, including dynamic image cropping, recipe-aware text splitting, and dual-pass vision logic.
|
| 7 |
+
|
| 8 |
+
## ✨ Key Features
|
| 9 |
+
* **Visual Brand Recognition:** Uses a Vision-Language Model (VLM) to read labels and identify specific alcohol brands in user-uploaded photos, rather than stopping at generic object categories.
|
| 10 |
+
* **Custom Knowledge Base (RAG):** Ingests raw `.txt` and `.pdf` recipe books, intelligently splitting them into discrete recipe chunks using RegEx and LangChain, and stores them in a local Chroma vector database.
|
| 11 |
+
* **Smart Cropping Pipeline:** Implements YOLOv8 to locate bottles or glasses in an image, applying dynamic 25% padding to isolate the label and strip away background noise.
|
| 12 |
+
* **Hardware-Optimized Processing:** Downscales images and caps the number of generated tokens so that a roughly 2-billion-parameter model can run on free-tier cloud CPUs.
|
| 13 |
+
* **Interactive UI:** A Gradio 6.0 interface featuring a conversational chat format, session state memory, and a hidden "Vision Debug" gallery for real-time insight into the AI's detection process.
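
The RegEx-based recipe splitting mentioned under "Custom Knowledge Base" can be illustrated with a minimal sketch. It assumes every recipe begins with a heading line such as `Recipe: Negroni`; that marker, the `RECIPE_START` pattern, and the `split_recipes` helper are hypothetical and are not taken from the project's code.

```python
import re

# Assumption: each recipe starts with a line like "Recipe: <name>".
# Adjust the pattern to whatever heading the recipe book really uses.
RECIPE_START = re.compile(r"^(?=Recipe:\s)", flags=re.MULTILINE)


def split_recipes(text: str) -> list[str]:
    """Cut the text at the start of every recipe heading and drop empty chunks."""
    chunks = RECIPE_START.split(text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


book = "Intro text\nRecipe: Negroni\n1 oz gin\nRecipe: Mule\n2 oz vodka\n"
for chunk in split_recipes(book):
    print(repr(chunk))
```

Because the pattern is a zero-width lookahead, the heading stays attached to the recipe that follows it. Any text before the first heading (here `Intro text`) comes back as its own chunk and may need filtering before it is stored.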
|
| 14 |
+
|
| 15 |
+
## 🛠️ Technical Stack
|
| 16 |
+
* **Frontend/UI:** Gradio 6.0
|
| 17 |
+
* **Computer Vision:** Ultralytics YOLOv8 (Object Detection)
|
| 18 |
+
* **Vision-Language Model:** HuggingFaceTB/SmolVLM-Instruct (Label OCR & Context)
|
| 19 |
+
* **Vector Database:** ChromaDB
|
| 20 |
+
* **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2`
|
| 21 |
+
* **Orchestration:** LangChain (Document Loaders, Text Splitters)
|
| 22 |
+
|
| 23 |
+
## 🧠 How It Works Under the Hood
|
| 24 |
+
1. **Document Ingestion:** The user uploads a recipe book. The system uses a strict "Hard Cut" method that splits the document exactly at the start of every new recipe, so each stored chunk is one complete recipe.
|
| 25 |
+
2. **Object Detection:** When a photo is uploaded, YOLOv8 scans the image for bottles (COCO class 39) or glasses (classes 40 and 41, wine glass and cup), creating a focused, padded crop of the object.
|
| 26 |
+
3. **Vision Processing:** The cropped image is downscaled to 384x384 and passed to SmolVLM. The model is restricted to a 15-token output so that it quickly extracts just the brand name (e.g., "Absolut Vodka").
|
| 27 |
+
4. **Fallback Logic:** If the VLM returns a generic term (e.g., just "Vodka") due to a bad crop, the system automatically triggers a secondary pass on the full, uncropped image to give brand identification a second chance.
|
| 28 |
+
5. **Context Retrieval (RAG):** The extracted brand name is embedded and queried against the Chroma database, retrieving the top 4 most relevant, full-text recipes.
|
| 29 |
+
6. **Chat Output:** The system formats the retrieved recipes and returns them to the user via the conversational UI.
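
Steps 2 to 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's implementation: the 25% padding is assumed to apply per side, relative to the box's own width and height. `GENERIC_TERMS`, `padded_box`, `is_generic`, and `identify_brand` are hypothetical names, and `read_label` stands in for the SmolVLM call.

```python
from typing import Any, Callable, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels

# Hypothetical list of answers that count as "too generic to be a brand".
GENERIC_TERMS = {"vodka", "gin", "rum", "whiskey", "whisky", "tequila", "bottle"}


def padded_box(box: Box, image_size: Tuple[int, int], pad: float = 0.25) -> Box:
    """Grow a detection box by `pad` of its size on each side, clamped to the image."""
    x1, y1, x2, y2 = box
    width, height = image_size
    dx = int((x2 - x1) * pad)
    dy = int((y2 - y1) * pad)
    return (max(0, x1 - dx), max(0, y1 - dy), min(width, x2 + dx), min(height, y2 + dy))


def is_generic(label: str) -> bool:
    """True when the label is empty or only names a spirit category."""
    cleaned = label.strip().lower()
    return not cleaned or cleaned in GENERIC_TERMS


def identify_brand(cropped: Any, full: Any, read_label: Callable[[Any], str]) -> str:
    """First pass on the crop; second pass on the full image if the result is generic."""
    first = read_label(cropped)
    if not is_generic(first):
        return first.strip()
    second = read_label(full)
    # Keep the first answer if the second pass does no better.
    return second.strip() if not is_generic(second) else first.strip()


# Usage with a stubbed reader in place of the real vision model.
fake_reader = {"crop": "Vodka", "full": "Absolut Vodka"}.get
print(padded_box((100, 100, 200, 300), (640, 480)))
print(identify_brand("crop", "full", fake_reader))
```

Passing the model call in as a function keeps the fallback rule testable without loading any weights. The second pass only runs when the first answer is generic, so the common case costs a single inference.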
|
| 30 |
+
|
| 31 |
+
## 🚀 Future Roadmap
|
| 32 |
+
* Integration with hardware-accelerated APIs (Groq/Gemini) for sub-3-second vision processing.
|
| 33 |
+
* User inventory tracking to suggest recipes based on a combination of multiple owned bottles.
|