title: HindiRAG
emoji: 💻
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
sdk_version: latest
app_file: Dockerfile
pinned: false

HindiRAG: Multi-Language Indic RAG System

A Retrieval-Augmented Generation (RAG) system for Indic languages, built on the Sarvam-1 model and the Qdrant vector database.

Features

  • Support for 10 Indic Languages: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
  • Automatic Language Detection: The system detects the query language and responds in the same language
  • Sarvam-1 Model: Optimized for Indic language generation
  • Qdrant Vector Database: Efficient semantic search
  • HuggingFace Datasets Integration: Load datasets directly from HuggingFace
  • Document Ingestion: Support for JSON and TXT formats

Supported Languages

| Language  | Native Name | Script     |
|-----------|-------------|------------|
| Hindi     | हिंदी       | Devanagari |
| Bengali   | বাংলা       | Bengali    |
| Gujarati  | ગુજરાતી     | Gujarati   |
| Kannada   | ಕನ್ನಡ       | Kannada    |
| Malayalam | മലയാളം      | Malayalam  |
| Marathi   | मराठी       | Devanagari |
| Odia      | ଓଡ଼ିଆ       | Odia       |
| Punjabi   | ਪੰਜਾਬੀ      | Gurmukhi   |
| Tamil     | தமிழ்       | Tamil      |
| Telugu    | తెలుగు      | Telugu     |

Quick Start

1. Clone and Setup

git clone <repository-url>
cd HindiRAG

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment

cp .env.example .env

Edit .env to configure:

  • HF_DATASETS: HuggingFace datasets to load (e.g., miracl/miracl-corpus:hi:train)
  • QDRANT_HOST: Qdrant host (default: localhost)
  • QDRANT_PORT: Qdrant port (default: 6333)

4. Start Qdrant

docker run -p 6333:6333 qdrant/qdrant

5. Run the Application

# Load datasets and start the system
python main.py

# Or run the frontend directly
streamlit run frontend/app.py

Configuration

Environment Variables

# HuggingFace Datasets (comma-separated)
# Format: dataset_name:config:split
HF_DATASETS=miracl/miracl-corpus:hi:train

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333

# Generation Parameters
TEMPERATURE=0.7
MAX_NEW_TOKENS=1024
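
As an illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical names, not part of the project, and the fallbacks for `TEMPERATURE` and `MAX_NEW_TOKENS` simply mirror the sample values above rather than any documented default.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    hf_datasets: str
    qdrant_host: str
    qdrant_port: int
    temperature: float
    max_new_tokens: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        hf_datasets=env.get("HF_DATASETS", ""),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        # Assumed fallbacks: copied from the sample .env values above.
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_new_tokens=int(env.get("MAX_NEW_TOKENS", "1024")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"QDRANT_PORT": "6334"})
print(settings.qdrant_host, settings.qdrant_port)
```

Converting the port and generation parameters at load time means a malformed value fails immediately with a `ValueError` instead of surfacing later inside the client or the model call.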

Loading HuggingFace Datasets

Set the HF_DATASETS environment variable to one or more dataset specifications:

# Single dataset
HF_DATASETS=miracl/miracl-corpus:hi:train

# Multiple datasets
HF_DATASETS=miracl/miracl-corpus:hi:train,wikipedia:hi:train

# Dataset without config
HF_DATASETS=squad::train
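
The `dataset_name:config:split` format can be parsed in a few lines. The following is a minimal sketch under the assumption that an empty middle field means "no config"; `DatasetSpec` and `parse_hf_datasets` are hypothetical names and may not match the project's own loader.

```python
from typing import List, NamedTuple, Optional


class DatasetSpec(NamedTuple):
    name: str
    config: Optional[str]  # None when the config field is empty
    split: str


def parse_hf_datasets(value: str) -> List[DatasetSpec]:
    """Parse a comma-separated list of name:config:split specifications."""
    specs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and blank entries
        parts = item.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected name:config:split, got {item!r}")
        name, config, split = (p.strip() for p in parts)
        if not name or not split:
            raise ValueError(f"name and split are required in {item!r}")
        specs.append(DatasetSpec(name, config or None, split))
    return specs


for spec in parse_hf_datasets("miracl/miracl-corpus:hi:train,squad::train"):
    print(spec)
```

Each parsed tuple maps naturally onto a call such as `datasets.load_dataset(spec.name, spec.config, split=spec.split)`.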

Data Format

Place your documents in the data/ directory:

JSON Format:

[
  {
    "title": "Document Title",
    "author": "Author Name",
    "text": "Document content...",
    "genre": "story"
  }
]

TXT Format:

  • Plain text files
  • Multiple documents separated by double newlines
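
To make the two formats concrete, here is a minimal sketch of a loader that handles both. It is not the project's `document_ingestor.py`; the function name `load_documents` and the choice to use the file name as the title of TXT documents are assumptions made for illustration.

```python
import json
import re
from pathlib import Path
from typing import Dict, List


def load_documents(path: Path) -> List[Dict[str, str]]:
    """Return a list of document dicts from one .json or .txt file."""
    raw = path.read_text(encoding="utf-8")
    if path.suffix.lower() == ".json":
        records = json.loads(raw)
        # Keep only records that carry a non-empty "text" field.
        return [r for r in records if isinstance(r, dict) and r.get("text")]
    # TXT: a blank line (two or more newlines) separates documents.
    chunks = re.split(r"\n\s*\n", raw)
    return [
        {"title": path.stem, "text": chunk.strip()}
        for chunk in chunks
        if chunk.strip()
    ]
```

A typical call would be `load_documents(Path("data/stories.txt"))`, where the file name is only an example.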

Docker Deployment

# Build the image
docker build -t hindi-rag .

# Run with docker-compose
docker-compose up

# Or run manually (the app listens on port 7860, per app_port: 7860 above)
docker run -p 7860:7860 -p 6333:6333 hindi-rag

Architecture

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   User Query    │────▶│  Language    │────▶│  Sarvam-1 LLM   │
│  (Any Indic)    │     │  Detector    │     │  (Generation)   │
└─────────────────┘     └──────────────┘     └─────────────────┘
                               │
                               ▼
                        ┌──────────────┐     ┌─────────────────┐
                        │   Qdrant     │◀────│   Embedding     │
                        │ Vector DB    │     │   Generator     │
                        └──────────────┘     └─────────────────┘
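
One way to read the diagram is as a four-stage pipeline: detect the language, embed the query, retrieve passages, then generate. The sketch below wires those stages together with injected callables. The function `answer_query` and its parameters are hypothetical, the stand-ins are toys, and the actual control flow in `src/rag_system.py` may differ.

```python
from typing import Callable, Dict, List, Sequence


def answer_query(
    query: str,
    detect: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str, List[str], str], str],
    top_k: int = 5,
) -> Dict[str, object]:
    """Run one query through detect -> embed -> search -> generate."""
    language = detect(query)                      # language detector
    vector = embed(query)                         # embedding generator
    passages = search(vector, top_k)              # vector database lookup
    answer = generate(query, passages, language)  # LLM generation
    return {"answer": answer, "language": language, "passages": passages}


# Toy stand-ins so the sketch runs without a model or a database.
result = answer_query(
    "What is RAG?",
    detect=lambda q: "en",
    embed=lambda q: [float(len(q))],
    search=lambda vec, k: ["passage A", "passage B"][:k],
    generate=lambda q, ctx, lang: f"[{lang}] {len(ctx)} passages used",
    top_k=1,
)
print(result["answer"])
```

Passing the stages in as callables keeps each component replaceable and lets the flow be tested without loading Sarvam-1 or starting Qdrant.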

Project Structure

HindiRAG/
├── src/
│   ├── llm_manager.py          # Sarvam-1 LLM management
│   ├── rag_system.py           # RAG system with multi-language support
│   ├── language_detector.py    # Indic language detection
│   ├── embedding_generator.py  # Embedding generation
│   ├── qdrant_setup.py         # Qdrant database setup
│   ├── document_ingestor.py    # Document ingestion
│   └── load_huggingface_dataset.py  # HuggingFace dataset loader
├── frontend/
│   └── app.py                  # Streamlit frontend
├── data/                       # Document storage
├── main.py                     # Main entry point
├── requirements.txt            # Dependencies
└── .env.example                # Environment template

API Usage

from src.rag_system import HindiRAGSystem

# Initialize the system
rag = HindiRAGSystem()

# Query in any supported language (Hindi here: "How has nature been described?")
result = rag.query("प्रकृति का वर्णन कैसे किया गया है?", top_k=5)

print(f"Answer: {result['answer']}")
print(f"Detected Language: {result['language_name']}")
print(f"Supported: {result['is_supported']}")

Troubleshooting

LLM Initialization Failed

  • Ensure you have enough memory for the Sarvam-1 model (~8 GB)
  • Check that your internet connection is working so the model can be downloaded

Qdrant Connection Error

  • Verify Qdrant is running: docker ps | grep qdrant
  • Check host/port in .env file
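
If Docker reports the container as running but the application still cannot connect, a quick TCP check can tell a networking problem from a configuration one. This is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and it confirms only that something accepts connections on the port, not that it is a healthy Qdrant instance.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if not is_port_open("localhost", 6333):
    print("Nothing is listening on localhost:6333; is Qdrant running?")
```

Substitute the QDRANT_HOST and QDRANT_PORT values from your .env file for the host and port used here.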

Language Detection Issues

  • The system detects the language from Unicode script ranges
  • Short queries may be detected with lower confidence
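
To show why short or mixed-script queries score lower, here is a minimal sketch of script detection by Unicode block. It is not the code in `src/language_detector.py`; `SCRIPT_RANGES` and `detect_script` are hypothetical names. It identifies scripts rather than languages: Hindi and Marathi share Devanagari, so ranges alone cannot separate them, and a real detector needs an additional signal for that pair.

```python
import unicodedata
from collections import Counter
from typing import Optional, Tuple

# Unicode blocks for the scripts listed under Supported Languages.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def detect_script(text: str) -> Tuple[Optional[str], float]:
    """Return (script, confidence), where confidence is the share of
    letters and combining marks that fall in the winning script."""
    counts: Counter = Counter()
    total = 0
    for ch in text:
        # Count only letters (L*) and marks (M*); skip digits, spaces and
        # punctuation such as the danda, which several scripts share.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        total += 1
        code = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code <= high:
                counts[script] += 1
                break
    if not counts:
        return None, 0.0
    script, hits = counts.most_common(1)[0]
    return script, hits / total


print(detect_script("தமிழ்"))
print(detect_script("हिंदी ok"))
```

Because the confidence is a ratio over counted characters, a query with only a few characters, or one that mixes in Latin words, has less evidence behind the winning script.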

License

MIT License

Acknowledgments