---
title: Samarth
emoji: 🧑‍💻
colorFrom: indigo
colorTo: purple
sdk: streamlit
pinned: false
short_description: Ask agriculture questions, answered from government datasets
sdk_version: 1.51.0
---
# 📊 Samarth: Data-Aware Question Answering System
An AI-powered **Question Answering System** built using **LangChain** and **Streamlit**, designed to intelligently answer user queries based on multiple government data sources.
---
## 🧠 Project Overview
This system enables natural language querying over diverse datasets.
Each dataset is first analyzed to automatically generate structured **metadata** (including summary, columns, and use cases) using an LLM.
These metadata representations are embedded into a **vector database** for semantic similarity search.
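The retrieval step can be sketched with a toy stand-in. The snippet below substitutes a bag-of-words cosine similarity for the real Sentence Transformers embeddings and FAISS index, and the metadata summaries and filenames are invented examples, not the project's actual metadata:

```python
from collections import Counter
import math

# Hypothetical metadata summaries, one per dataset (the real system
# generates these with an LLM in generate_metadata.py).
METADATA = {
    "rainfall.csv": "annual rainfall by state and year in millimetres",
    "crop_yield.csv": "crop production and yield by district and season",
}

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the app uses Sentence Transformers.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_relevant_dataset(question: str) -> str:
    # Rank every dataset's metadata against the question, keep the best.
    q = embed(question)
    return max(METADATA, key=lambda name: cosine(q, embed(METADATA[name])))

print(most_relevant_dataset("What was the annual rainfall in Telangana?"))
# prints "rainfall.csv"
```

In the real app, FAISS replaces the linear scan over `METADATA`, which matters once the number of datasets grows.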
When a user asks a question:
1. The system finds the **most relevant dataset** using semantic search.
2. It uses an **LLM to generate an appropriate SQL query** for that dataset.
3. The query is executed on the dataset, and the **result is interpreted into a human-readable answer** by another LLM.
4. If the dataset lacks relevant information, the system responds gracefully that no answer is available.
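Steps 2 and 3 can be sketched with Python's built-in `sqlite3` module. The table, column names, rows, and SQL string below are all illustrative; in the app the SQL is produced by the LLM from the user's question, and the data comes from a CSV in `/datasets`:

```python
import sqlite3

# Toy stand-in for one dataset; in the app the table is loaded from a CSV.
rows = [("Telangana", 2000, 912.4), ("Telangana", 2001, 875.0),
        ("Kerala", 2000, 3055.2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rainfall (state TEXT, year INTEGER, rainfall_mm REAL)")
conn.executemany("INSERT INTO rainfall VALUES (?, ?, ?)", rows)

# In the real pipeline this SQL string is generated by the LLM from the
# question and the dataset's metadata.
sql = "SELECT AVG(rainfall_mm) FROM rainfall WHERE state = 'Telangana' AND year = 2000"
(result,) = conn.execute(sql).fetchone()

# A second LLM call normally phrases the result; a template stands in here.
print(f"The average annual rainfall in Telangana in 2000 was {result:.1f} mm.")
```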
---
## βš™οΈ How to Use
1. **Run the Streamlit App**
> streamlit run app.py
2. **Upload or choose your dataset(s)**
- The app supports multiple tabular datasets (CSV).
3. **Ask a natural language question**
Example:
> "What was the average annual rainfall in Telangana in 2000?"
4. **View the result**
* The system automatically identifies the most relevant dataset.
* It generates, executes, and interprets a SQL query.
* You receive a concise, natural answer with a verified data source link.
## πŸ—‚οΈ Adding New Datasets
To include new datasets in the system, follow these simple steps:
1. **Download and place your dataset file**
- Save the new dataset (CSV format) inside the `/datasets` folder.
2. **Update the metadata generation script**
- Open `generate_metadata.py`.
- Add the dataset details in the respective lists:
     - `dataset_links` → the dataset's source link
     - `dataset_names` → a short descriptive name for the dataset
     - `datasets_list` → the filename of the dataset (as saved in the `/datasets` folder)
3. **Generate metadata**
- Run the following command to generate structured metadata for all datasets:
> python generate_metadata.py
4. **Restart the Streamlit app**
- Once metadata is generated, rerun the app:
> streamlit run app.py
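As an illustration, the three parallel lists in `generate_metadata.py` might look like the sketch below. The list names come from the steps above, but the URL, dataset name, and filename are placeholders, not real entries:

```python
# Illustrative entries for the three parallel lists in generate_metadata.py.
dataset_links = [
    "https://data.gov.in/example/rainfall",  # hypothetical source URL
]
dataset_names = [
    "Annual rainfall by state",
]
datasets_list = [
    "rainfall.csv",  # file saved under /datasets
]

# The script pairs the lists entry by entry, so they must stay the same length
# and a new dataset needs one entry appended to each list.
assert len(dataset_links) == len(dataset_names) == len(datasets_list)
for link, name, filename in zip(dataset_links, dataset_names, datasets_list):
    print(f"{name}: datasets/{filename} (source: {link})")
```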
---
## 🧩 Tech Stack
* **LangChain** – for prompt orchestration and LLM integration
* **Sentence Transformers + FAISS** – for vector similarity search
* **Streamlit** – for interactive web UI
* **Google Gemini / Generative AI** – for SQL and natural-language generation
---
## πŸ“ Project Flow Summary
User Query
↓
Semantic Search (Vector DB)
↓
SQL Query Generation (LLM)
↓
SQL Execution on Dataset
↓
Natural Language Answer (LLM)
↓
Final Answer + Source Link
---