---
title: Samarth
emoji: 🧑‍💻
colorFrom: indigo
colorTo: purple
sdk: streamlit
pinned: false
short_description: Ask agriculture questions, answered from government datasets
sdk_version: 1.51.0
---
# 📊 Samarth: Data-Aware Question Answering System
An AI-powered **Question Answering System** built using **LangChain** and **Streamlit**, designed to intelligently answer user queries based on multiple government data sources.
---
## 🧠 Project Overview
This system enables natural language querying over diverse datasets.
Each dataset is first analyzed to automatically generate structured **metadata** (including summary, columns, and use cases) using an LLM.
These metadata representations are embedded into a **vector database** for semantic similarity search.
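The retrieval step can be sketched with a toy stand-in. The snippet below substitutes a bag-of-words cosine similarity for the real Sentence Transformers embeddings and FAISS index, and the metadata summaries and filenames are invented examples, not the project's actual metadata:

```python
from collections import Counter
import math

# Hypothetical metadata summaries, one per dataset (the real system
# generates these with an LLM in generate_metadata.py).
METADATA = {
    "rainfall.csv": "annual rainfall by state and year in millimetres",
    "crop_yield.csv": "crop production and yield by district and season",
}

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the app uses Sentence Transformers.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_relevant_dataset(question: str) -> str:
    # Rank every dataset's metadata against the question, keep the best.
    q = embed(question)
    return max(METADATA, key=lambda name: cosine(q, embed(METADATA[name])))

print(most_relevant_dataset("What was the annual rainfall in Telangana?"))
# prints "rainfall.csv"
```

In the real app, FAISS replaces the linear scan over `METADATA`, which matters once the number of datasets grows.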
When a user asks a question:
1. The system finds the **most relevant dataset** using semantic search.
2. It uses an **LLM to generate an appropriate SQL query** for that dataset.
3. The query is executed on the dataset, and the **result is interpreted into a human-readable answer** by another LLM.
4. If the dataset lacks relevant information, the system responds gracefully that no answer is available.
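Steps 2 and 3 can be sketched with Python's built-in `sqlite3` module. The table, column names, rows, and SQL string below are all illustrative; in the app the SQL is produced by the LLM from the user's question, and the data comes from a CSV in `/datasets`:

```python
import sqlite3

# Toy stand-in for one dataset; in the app the table is loaded from a CSV.
rows = [("Telangana", 2000, 912.4), ("Telangana", 2001, 875.0),
        ("Kerala", 2000, 3055.2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rainfall (state TEXT, year INTEGER, rainfall_mm REAL)")
conn.executemany("INSERT INTO rainfall VALUES (?, ?, ?)", rows)

# In the real pipeline this SQL string is generated by the LLM from the
# question and the dataset's metadata.
sql = "SELECT AVG(rainfall_mm) FROM rainfall WHERE state = 'Telangana' AND year = 2000"
(result,) = conn.execute(sql).fetchone()

# A second LLM call normally phrases the result; a template stands in here.
print(f"The average annual rainfall in Telangana in 2000 was {result:.1f} mm.")
```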
---
## βš™οΈ How to Use
1. **Run the Streamlit App**
> streamlit run app.py
2. **Upload or choose your dataset(s)**
- The app supports multiple tabular datasets (CSV).
3. **Ask a natural language question**
Example:
> "What was the average annual rainfall in Telangana in 2000?"
4. **View the result**
* The system automatically identifies the most relevant dataset.
* It generates, executes, and interprets a SQL query.
* You receive a concise, natural answer with a verified data source link.
## πŸ—‚οΈ Adding New Datasets
To include new datasets in the system, follow these simple steps:
1. **Download and place your dataset file**
- Save the new dataset (CSV format) inside the `/datasets` folder.
2. **Update the metadata generation script**
- Open `generate_metadata.py`.
- Add the dataset details in the respective lists:
     - `dataset_links` → the dataset's source link
     - `dataset_names` → a short descriptive name for the dataset
     - `datasets_list` → the filename of the dataset (as saved in the `/datasets` folder)
3. **Generate metadata**
- Run the following command to generate structured metadata for all datasets:
> python generate_metadata.py
4. **Restart the Streamlit app**
- Once metadata is generated, rerun the app:
> streamlit run app.py
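As an illustration, the three parallel lists in `generate_metadata.py` might look like the sketch below. The list names come from the steps above, but the URL, dataset name, and filename are placeholders, not real entries:

```python
# Illustrative entries for the three parallel lists in generate_metadata.py.
dataset_links = [
    "https://data.gov.in/example/rainfall",  # hypothetical source URL
]
dataset_names = [
    "Annual rainfall by state",
]
datasets_list = [
    "rainfall.csv",  # file saved under /datasets
]

# The script pairs the lists entry by entry, so they must stay the same length
# and a new dataset needs one entry appended to each list.
assert len(dataset_links) == len(dataset_names) == len(datasets_list)
for link, name, filename in zip(dataset_links, dataset_names, datasets_list):
    print(f"{name}: datasets/{filename} (source: {link})")
```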
---
## 🧩 Tech Stack
* **LangChain** – for prompt orchestration and LLM integration
* **Sentence Transformers + FAISS** – for vector similarity search
* **Streamlit** – for interactive web UI
* **Google Gemini / Generative AI** – for SQL and natural-language generation
---
## πŸ“ Project Flow Summary
User Query
↓
Semantic Search (Vector DB)
↓
SQL Query Generation (LLM)
↓
SQL Execution on Dataset
↓
Natural Language Answer (LLM)
↓
Final Answer + Source Link
---