---
title: Samarth
emoji: 🧑‍💻
colorFrom: indigo
colorTo: purple
sdk: streamlit
pinned: false
short_description: 'Ask questions about Agriculture — Samarth gives answer from '
sdk_version: 1.51.0
---

# 📊 Samarth: Data-Aware Question Answering System

An AI-powered **Question Answering System** built with **LangChain** and **Streamlit**, designed to intelligently answer user queries over multiple government data sources.

---

## 🧠 Project Overview

This system enables natural-language querying over diverse datasets. An LLM first analyzes each dataset to automatically generate structured **metadata** (including a summary, columns, and use cases). These metadata representations are embedded into a **vector database** for semantic similarity search.

When a user asks a question:

1. The system finds the **most relevant dataset** using semantic search.
2. An **LLM generates an appropriate SQL query** for that dataset.
3. The query is executed on the dataset, and another LLM **interprets the result into a human-readable answer**.
4. If the dataset lacks relevant information, the system responds gracefully that no answer is available.

---

## ⚙️ How to Use

1. **Run the Streamlit app**

   > streamlit run app.py

2. **Upload or choose your dataset(s)**
   - The app supports multiple tabular datasets (CSV).

3. **Ask a natural-language question**

   Example:
   > "What was the average annual rainfall in Telangana in 2000?"

4. **View the result**
   - The system automatically identifies the most relevant dataset.
   - It generates, executes, and interprets a SQL query.
   - You receive a concise, natural answer with a verified data source link.

---

## 🗂️ Adding New Datasets

To include new datasets in the system, follow these steps:

1. **Download and place your dataset file**
   - Save the new dataset (CSV format) inside the `/datasets` folder.

2. **Update the metadata generation script**
   - Open `generate_metadata.py`.
   - Add the dataset details to the corresponding lists:
     - `dataset_links` → the dataset's source link
     - `dataset_names` → a short descriptive name for the dataset
     - `datasets_list` → the filename of the dataset (as saved in the `/datasets` folder)

3. **Generate metadata**
   - Run the following command to generate structured metadata for all datasets:

   > python generate_metadata.py

4. **Restart the Streamlit app**
   - Once metadata is generated, rerun the app:

   > streamlit run app.py

---

## 🧩 Tech Stack

* **LangChain** – prompt orchestration and LLM integration
* **Sentence Transformers + FAISS** – vector similarity search
* **Streamlit** – interactive web UI
* **Google Gemini / Generative AI** – SQL and natural-language generation

---

## 📝 Project Flow Summary

    User Query
      ↓
    Semantic Search (Vector DB)
      ↓
    SQL Query Generation (LLM)
      ↓
    SQL Execution on Dataset
      ↓
    Natural Language Answer (LLM)
      ↓
    Final Answer + Source Link

---
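To make the semantic-search step concrete, here is a minimal, self-contained sketch of how dataset selection works. The real app uses Sentence Transformers embeddings with a FAISS index; this sketch substitutes a toy bag-of-words embedding and cosine similarity so it runs with the standard library alone. The dataset filenames and metadata summaries below are hypothetical examples, not the project's actual data.

```python
# Toy stand-in for the semantic-search step (real app: Sentence
# Transformers + FAISS). Metadata texts are hypothetical.
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One LLM-generated metadata summary per dataset (hypothetical).
metadata = {
    "rainfall.csv": "annual rainfall by state and year",
    "crop_yield.csv": "crop yield production by district",
}
vectors = {name: embed(text) for name, text in metadata.items()}

def most_relevant_dataset(question):
    """Return the dataset whose metadata best matches the question."""
    q = embed(question)
    return max(vectors, key=lambda name: cosine(q, vectors[name]))

print(most_relevant_dataset("average annual rainfall in Telangana in 2000"))
# -> rainfall.csv
```

In the actual app, swapping `embed` for a sentence-transformer model and the `max` loop for a FAISS nearest-neighbor lookup gives the same retrieval behavior at scale.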
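Steps 2 and 3 of the pipeline (SQL generation and execution) can be sketched as loading a CSV into an in-memory SQLite table and running a query against it. In the sketch below the SQL string is hand-written where the app would have Gemini generate it from the user's question; the table name, columns, and data are hypothetical.

```python
# Hedged sketch of SQL execution over a CSV dataset. The query string
# stands in for LLM-generated SQL; schema and data are hypothetical.
import csv
import io
import sqlite3

# Stand-in for a CSV file from the /datasets folder.
csv_text = """state,year,rainfall_mm
Telangana,2000,905
Telangana,2001,812
Kerala,2000,3055
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rainfall (state TEXT, year INTEGER, rainfall_mm REAL)")
conn.executemany(
    "INSERT INTO rainfall VALUES (:state, :year, :rainfall_mm)", rows
)

# In the app, an LLM produces this query from the natural-language question.
sql = "SELECT AVG(rainfall_mm) FROM rainfall WHERE state = 'Telangana' AND year = 2000"
(result,) = conn.execute(sql).fetchone()
print(result)  # -> 905.0
```

A second LLM call would then turn `905.0` back into a sentence such as "The average annual rainfall in Telangana in 2000 was 905 mm," together with the dataset's source link.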
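When registering a new dataset, the three lists in `generate_metadata.py` must stay index-aligned, since entry *i* of each list describes the same dataset. A hypothetical sketch (the link, name, and filename shown are illustrative, not real entries):

```python
# Hypothetical illustration of the parallel lists in generate_metadata.py.
dataset_links = [
    "https://example.gov/rainfall-data",  # hypothetical source link
]
dataset_names = [
    "State-wise annual rainfall",  # short descriptive name
]
datasets_list = [
    "rainfall.csv",  # filename as saved in the /datasets folder
]

# Entry i of each list must refer to the same dataset.
assert len(dataset_links) == len(dataset_names) == len(datasets_list)
```

After appending one entry to each list, rerun `python generate_metadata.py` so the new metadata is embedded into the vector database before restarting the app.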