---
title: Samarth
emoji: 🧑‍💻
colorFrom: indigo
colorTo: purple
sdk: streamlit
pinned: false
short_description: 'Ask questions about Agriculture - Samarth gives answer from '
sdk_version: 1.51.0
---

# 📊 Samarth: Data-Aware Question Answering System

An AI-powered Question Answering System built using LangChain and Streamlit, designed to intelligently answer user queries based on multiple government data sources.


## 🧠 Project Overview

This system enables natural language querying over diverse datasets.
Each dataset is first analyzed to automatically generate structured metadata (including summary, columns, and use cases) using an LLM.
These metadata representations are embedded into a vector database for semantic similarity search.
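The metadata record the overview describes can be sketched as a plain Python structure. The field names (`summary`, `columns`, `use_cases`) and the flattening helper below are illustrative assumptions; the actual `generate_metadata.py` may use different keys and formatting.

```python
# Hedged sketch: the shape of a structured metadata record and how it might
# be flattened into a single text blob before embedding. Field names and the
# sample dataset are assumptions for illustration only.

def metadata_to_document(meta: dict) -> str:
    """Flatten one metadata record into a text blob suitable for embedding."""
    return " | ".join([
        meta["name"],
        meta["summary"],
        "columns: " + ", ".join(meta["columns"]),
        "use cases: " + "; ".join(meta["use_cases"]),
    ])

rainfall_meta = {
    "name": "telangana_rainfall",
    "summary": "Annual rainfall statistics for Telangana by year.",
    "columns": ["district", "year", "annual_rainfall_mm"],
    "use_cases": ["rainfall trend analysis", "drought assessment"],
}

doc = metadata_to_document(rainfall_meta)
```

In the real system, blobs like `doc` would be embedded with Sentence Transformers and indexed in FAISS for semantic search.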

When a user asks a question:

  1. The system finds the most relevant dataset using semantic search.
  2. It uses an LLM to generate an appropriate SQL query for that dataset.
  3. The query is executed on the dataset, and the result is interpreted into a human-readable answer by another LLM.
  4. If the dataset lacks relevant information, the system responds gracefully that no answer is available.
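Step 1 (dataset routing) can be sketched with a toy similarity function. This is a minimal stand-in: the real app uses Sentence Transformers embeddings with FAISS, not the bag-of-words cosine used here, and the dataset summaries below are invented for illustration.

```python
# Minimal sketch of dataset routing via semantic similarity.
# Bag-of-words cosine is a stand-in for Sentence Transformers + FAISS.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. The real system uses a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical dataset summaries, as produced by the metadata step.
dataset_summaries = {
    "rainfall.csv": "annual rainfall statistics by state and year",
    "crop_yield.csv": "crop production and yield per hectare by district",
}

def route(question: str) -> str:
    # Pick the dataset whose metadata is most similar to the question.
    q = embed(question)
    return max(dataset_summaries,
               key=lambda f: cosine(q, embed(dataset_summaries[f])))

best = route("What was the average annual rainfall in Telangana in 2000?")
```

Here the rainfall question routes to `rainfall.csv` because its summary shares the words "annual" and "rainfall" with the query.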

βš™οΈ How to Use

  1. Run the Streamlit App

    streamlit run app.py

  2. Upload or choose your dataset(s)

    • The app supports multiple tabular datasets (CSV).
  3. Ask a natural language question Example:

    β€œWhat was the average annual rainfall in Telangana in 2000?”

  4. View the result

    • The system automatically identifies the most relevant dataset.
    • It generates, executes, and interprets a SQL query.
    • You receive a concise, natural answer with a verified data source link.

πŸ—‚οΈ Adding New Datasets

To include new datasets in the system, follow these simple steps:

  1. Download and place your dataset file

    • Save the new dataset (CSV format) inside the /datasets folder.
  2. Update the metadata generation script

    • Open generate_metadata.py.
    • Add the dataset details in the respective lists:
      • dataset_links β†’ the dataset’s source link
      • dataset_names β†’ a short descriptive name for the dataset
      • datasets_list β†’ the filename of the dataset (as saved in the /datasets folder)
  3. Generate metadata

  • Run the following command to generate structured metadata for all datasets:

    python generate_metadata.py

  1. Restart the Streamlit app
  • Once metadata is generated, rerun the app:

    streamlit run app.py
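Step 2 above amounts to appending one entry to each of the three parallel lists in `generate_metadata.py`. The entry values below are placeholders, not real links or filenames; substitute your dataset's actual details.

```python
# Hedged sketch of the three parallel lists in generate_metadata.py.
# All values are placeholders for illustration.
dataset_links = [
    "https://example.gov/rainfall-dataset",   # source link (placeholder)
]
dataset_names = [
    "Telangana annual rainfall",              # short descriptive name
]
datasets_list = [
    "telangana_rainfall.csv",                 # filename under /datasets
]

# The same index in each list describes one dataset, so the lists must
# always stay the same length.
assert len(dataset_links) == len(dataset_names) == len(datasets_list)
```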


## 🧩 Tech Stack

- LangChain – for prompt orchestration and LLM integration
- Sentence Transformers + FAISS – for vector similarity search
- Streamlit – for the interactive web UI
- Google Gemini / Generative AI – for SQL and natural-language generation

πŸ“ Project Flow Summary

User Query ↓ Semantic Search (Vector DB) ↓ SQL Query Generation (LLM) ↓ SQL Execution on Dataset ↓ Natural Language Answer (LLM) ↓ Final Answer + Source Link