Orchestrator_final / methodology.md
Anvit25's picture
Update methodology.md
2bc6924 verified
# Methodology
The chatbot integrates multiple AI workflows into a single Gradio UI. The process follows these main stages:
## Input Handling
Users interact via a multimodal text box (supports text, image, and audio).
The chatbot determines whether the query contains:
Text only
Image file
Audio file
## Intent Classification
Text queries are processed through a rule-based intent classifier (intents.json).
Example intents:
"chat" β†’ Send to hosted chatbot LLM.
"search_local_image" β†’ Trigger local semantic image search.
"request_image_analysis" β†’ Ask user to upload an image.
"request_audio_analysis" β†’ Ask user to upload audio.
## Local Semantic Search
Metadata from image.json provides descriptions for images in /images/.
Each description is encoded using SentenceTransformers (all-MiniLM-L6-v2).
Query embeddings are compared with stored embeddings using cosine similarity.
If similarity > threshold (0.4), best match image is returned.
## Image Analysis Workflow
Uploaded images are passed to the vision model (via gradio_client).
Raw AI output (JSON) is summarized with Groq API (LLaMA-3.3-70B).
Final user-facing response is a friendly explanation.
## Audio Analysis Workflow
Uploaded audio is processed via the audio model (Gradio client).
Returns prediction text (e.g., transcription or classification).
Packaged as a human-readable response.
## Groq Summarization
Any complex JSON output (e.g., image analysis) is summarized.
A system prompt guides Groq to produce short, user-friendly summaries.
Ensures technical data is explained in simple language.
## Conversation Management
All interactions are stored in Chatbot history.
User query + bot response pairs are maintained for continuity.
Multimodal interactions (e.g., image + explanation) are rendered in chat.
## Architecture at a Glance
User Input (Text / Image / Audio)
β”‚
β–Ό
Intent Classifier ──► Rule-based (intents.json)
β”‚
β”œβ”€ Chat β†’ Chatbot Client (LLM)
β”œβ”€ Search Local Image β†’ Embedding Match
β”œβ”€ Image Analysis β†’ Vision Client + Groq Summary
└─ Audio Analysis β†’ Audio Client
β–Ό
Response Generator (Groq Narrative + History)
β–Ό
Gradio Chat UI