metadata
title: HindiRAG
emoji: ๐ป
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
sdk_version: latest
app_file: Dockerfile
pinned: false
HindiRAG: Multi-Language Indic RAG System
A Retrieval-Augmented Generation (RAG) system for Indic languages, built on the Sarvam-1 model and the Qdrant vector database.
Features
- Support for 10 Indic Languages: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
- Automatic Language Detection: The system detects the query language and responds in the same language
- Sarvam-1 Model: Optimized for Indic language generation
- Qdrant Vector Database: Efficient semantic search
- HuggingFace Datasets Integration: Load datasets directly from HuggingFace
- Document Ingestion: Support for JSON and TXT formats
Supported Languages
| Language | Native Name | Script |
|---|---|---|
| Hindi | हिंदी | Devanagari |
| Bengali | বাংলা | Bengali |
| Gujarati | ગુજરાતી | Gujarati |
| Kannada | ಕನ್ನಡ | Kannada |
| Malayalam | മലയാളം | Malayalam |
| Marathi | मराठी | Devanagari |
| Odia | ଓଡ଼ିଆ | Odia |
| Punjabi | ਪੰਜਾਬੀ | Gurmukhi |
| Tamil | தமிழ் | Tamil |
| Telugu | తెలుగు | Telugu |
Quick Start
1. Clone and Setup
git clone <repository-url>
cd HindiRAG
2. Install Dependencies
pip install -r requirements.txt
3. Configure Environment
cp .env.example .env
Edit .env to configure:
- HF_DATASETS: HuggingFace datasets to load (e.g., miracl/miracl-corpus:hi:train)
- QDRANT_HOST: Qdrant host (default: localhost)
- QDRANT_PORT: Qdrant port (default: 6333)
4. Start Qdrant
docker run -p 6333:6333 qdrant/qdrant
5. Run the Application
# Load datasets and start the system
python main.py
# Or run the frontend directly
streamlit run frontend/app.py
Configuration
Environment Variables
# HuggingFace Datasets (comma-separated)
# Format: dataset_name:config:split
HF_DATASETS=miracl/miracl-corpus:hi:train
# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
# Generation Parameters
TEMPERATURE=0.7
MAX_NEW_TOKENS=1024
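As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical names, not part of the project. The fallbacks for `TEMPERATURE` and `MAX_NEW_TOKENS` are assumptions taken from the example values above. The code also assumes the variables are already in the process environment, for example exported by a .env loader.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Typed view of the environment variables (hypothetical helper)."""
    hf_datasets: str
    qdrant_host: str
    qdrant_port: int
    temperature: float
    max_new_tokens: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping; defaults to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        hf_datasets=env.get("HF_DATASETS", ""),
        # Defaults for host and port are the ones documented above.
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        # Assumed fallbacks, copied from the example values.
        temperature=float(env.get("TEMPERATURE", "0.7")),
        max_new_tokens=int(env.get("MAX_NEW_TOKENS", "1024")),
    )


settings = load_settings({"QDRANT_PORT": "7000"})
print(settings)
```

Converting to `int` and `float` at load time surfaces a malformed value (such as `QDRANT_PORT=abc`) as a `ValueError` at startup rather than later during a query.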
Loading HuggingFace Datasets
Set the HF_DATASETS environment variable to one or more dataset specifications:
# Single dataset
HF_DATASETS=miracl/miracl-corpus:hi:train
# Multiple datasets
HF_DATASETS=miracl/miracl-corpus:hi:train,wikipedia:hi:train
# Dataset without config
HF_DATASETS=squad::train
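The `dataset_name:config:split` format above can be parsed with a few lines of string handling. The following is a minimal sketch, not the project's actual loader; `DatasetSpec` and `parse_hf_datasets` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Optional


class DatasetSpec(NamedTuple):
    """One parsed entry from HF_DATASETS (hypothetical helper type)."""
    name: str
    config: Optional[str]
    split: str


def parse_hf_datasets(value: str) -> List[DatasetSpec]:
    """Parse a comma-separated list of dataset_name:config:split entries."""
    specs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        parts = entry.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected dataset_name:config:split, got {entry!r}")
        name, config, split = (p.strip() for p in parts)
        if not name or not split:
            raise ValueError(f"dataset name and split are required in {entry!r}")
        # An empty middle field (as in squad::train) means "no config".
        specs.append(DatasetSpec(name, config or None, split))
    return specs


for spec in parse_hf_datasets("miracl/miracl-corpus:hi:train,squad::train"):
    print(spec)
```

Each parsed entry could then be passed to a dataset loader of your choice; how the project itself does this is not shown here.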
Data Format
Place your documents in the data/ directory:
JSON Format:
[
{
"title": "Document Title",
"author": "Author Name",
"text": "Document content...",
"genre": "story"
}
]
TXT Format:
- Plain text files
- Multiple documents separated by double newlines
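To make the two formats concrete, here is a hedged sketch of how files in data/ could be turned into records. It is not the code in `document_ingestor.py`; the function names are hypothetical, and using the file name as the title for TXT documents is an assumption.

```python
import json
import re
from pathlib import Path
from typing import Dict, List


def split_txt_documents(raw: str) -> List[str]:
    """Split plain text into documents on blank lines (double newlines)."""
    # \n\s*\n also matches blank lines that contain stray spaces or \r.
    chunks = re.split(r"\n\s*\n", raw)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def load_documents(path: Path) -> List[Dict]:
    """Return a list of records with at least a "text" field."""
    raw = path.read_text(encoding="utf-8")
    suffix = path.suffix.lower()
    if suffix == ".json":
        records = json.loads(raw)
        # Keep only objects that carry a non-empty "text" field.
        return [r for r in records if isinstance(r, dict) and r.get("text")]
    if suffix == ".txt":
        # Assumption: TXT documents have no metadata, so reuse the file name.
        return [{"title": path.stem, "text": t} for t in split_txt_documents(raw)]
    raise ValueError(f"unsupported file type: {path.suffix}")


print(split_txt_documents("First document.\n\nSecond document.\n"))
```

Reading with an explicit `encoding="utf-8"` matters for Indic text, since the platform default encoding may differ.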
Docker Deployment
# Build the image
docker build -t hindi-rag .
# Run with docker-compose
docker-compose up
# Or run manually
docker run -p 8501:8501 -p 6333:6333 hindi-rag
Architecture
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   User Query    │────▶│   Language   │────▶│  Sarvam-1 LLM   │
│   (Any Indic)   │     │   Detector   │     │  (Generation)   │
└─────────────────┘     └──────────────┘     └─────────────────┘
                                                      │
                                                      ▼
                        ┌──────────────┐     ┌─────────────────┐
                        │    Qdrant    │◀────│    Embedding    │
                        │  Vector DB   │     │    Generator    │
                        └──────────────┘     └─────────────────┘
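The retrieve-then-generate flow in the diagram can be illustrated with a toy, dependency-free sketch. Everything here is a stand-in: a bag-of-words `embed` replaces the real embedding generator, a Python list replaces Qdrant, and `build_prompt` only shows the kind of prompt that might be handed to the LLM. None of these names come from the project.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: counts of lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 5) -> List[str]:
    """Rank documents by similarity to the query and keep the top_k."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, passages: List[str], language_name: str) -> str:
    """Assemble retrieved passages and the query into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Answer in {language_name} using only this context:\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )


docs = ["the river flows east", "mountains are tall", "river and forest"]
passages = retrieve("river", docs, top_k=2)
print(build_prompt("Where does the river flow?", passages, "Hindi"))
```

In the real system the vector search would be delegated to Qdrant and the prompt sent to Sarvam-1; the sketch only shows how the stages connect.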
Project Structure
HindiRAG/
├── src/
│   ├── llm_manager.py              # Sarvam-1 LLM management
│   ├── rag_system.py               # RAG system with multi-language support
│   ├── language_detector.py        # Indic language detection
│   ├── embedding_generator.py      # Embedding generation
│   ├── qdrant_setup.py             # Qdrant database setup
│   ├── document_ingestor.py        # Document ingestion
│   └── load_huggingface_dataset.py # HuggingFace dataset loader
├── frontend/
│   └── app.py                      # Streamlit frontend
├── data/                           # Document storage
├── main.py                         # Main entry point
├── requirements.txt                # Dependencies
└── .env.example                    # Environment template
API Usage
from src.rag_system import HindiRAGSystem
# Initialize the system
rag = HindiRAGSystem()
# Query in any supported language (Hindi here: "How has nature been described?")
result = rag.query("प्रकृति का वर्णन कैसे किया गया है?", top_k=5)
print(f"Answer: {result['answer']}")
print(f"Detected Language: {result['language_name']}")
print(f"Supported: {result['is_supported']}")
Troubleshooting
LLM Initialization Failed
- Ensure you have enough memory for the Sarvam-1 model (~8 GB)
- Check internet connection for model download
Qdrant Connection Error
- Verify Qdrant is running: docker ps | grep qdrant
- Check host/port in the .env file
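If Docker is not available for the check above, a plain TCP probe can confirm whether anything is listening on the configured host and port. This is a generic standard-library sketch; `qdrant_reachable` is a hypothetical helper name, and a successful connection only shows that the port is open, not that the service behind it is Qdrant.

```python
import socket


def qdrant_reachable(host: str = "localhost", port: int = 6333,
                     timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    print("Port open:", qdrant_reachable())
```

Pass the values of `QDRANT_HOST` and `QDRANT_PORT` from your .env file if they differ from the defaults.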
Language Detection Issues
- The system uses Unicode ranges for detection
- Short queries may yield lower detection confidence
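For intuition about range-based detection, the sketch below counts characters per Unicode script block and reports the dominant script together with its share of the matched characters. It is an illustration only and may differ from `language_detector.py`; the function name and the confidence definition are assumptions. Note that Hindi and Marathi share the Devanagari block, so a range check alone identifies the script, not which of the two languages was used.

```python
from collections import Counter
from typing import Optional, Tuple

# Unicode block ranges for the scripts of the supported languages.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # Hindi and Marathi
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),    # Punjabi
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def detect_script(text: str) -> Tuple[Optional[str], float]:
    """Return (script, confidence); (None, 0.0) if no Indic characters."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[script] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    script, hits = counts.most_common(1)[0]
    # Confidence: share of matched characters in the winning script.
    return script, hits / total


print(detect_script("தமிழ்"))
print(detect_script("hello"))
```

With only a handful of characters, a single stray character from another script shifts the ratio noticeably, which is one way short queries end up with lower confidence.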
License
MIT License
Acknowledgments
- Sarvam-1 for the Indic language model
- Qdrant for vector database
- HuggingFace for datasets