
Swami Vivekananda AI

An AI system embodying the wisdom and teachings of Swami Vivekananda, built on a fully config-driven architecture with no hardcoded logic.

Features

  • Config-Driven: Everything configurable via config.yaml
  • NLP Processing: spaCy + NLTK for intelligent text processing
  • RAG System: LangChain-based Retrieval-Augmented Generation
  • MPS Support: Optimized for Apple Silicon (M4 Mac Mini)
  • GGUF Models: Support for Mistral 7B quantized models
  • Modular Design: Clean, maintainable, production-ready code

Requirements

  • Python 3.9+
  • Mac Mini M4 with MPS (or CUDA/CPU)
  • 16GB+ RAM recommended
  • 10GB+ free disk space

Quick Start

1. Clone & Setup

# Clone repository
git clone <your-repo-url>
cd vivekananda-ai

# Run automated setup
chmod +x setup.sh
./setup.sh

2. Add Your Data

# Add PDF files
cp /path/to/pdfs/* data/raw/complete_works/

# Add your JSON dataset
cp vivekananda_dataset_1.json data/processed/

# Add Mistral model
cp mistral-7b-instruct-v0.1.Q4_K_M.gguf models/base/

3. Configure

Edit config.yaml to customize:

  • Model settings
  • Embedding parameters
  • RAG configuration
  • NLP preprocessing options

4. Create Vector Database

source venv/bin/activate
python scripts/01_embed_data.py

Expected output:

VIVEKANANDA AI - VECTOR DATABASE CREATION

STEP 1: LOADING DOCUMENTS
Found 9 PDF files
Loading: SWAMI-VIVEKANANDA-VOL-1.pdf
Loaded 234 pages
...

STEP 2: PROCESSING WITH NLP
Processing documents with NLP...
...

SUCCESS! VECTOR DATABASE READY
Total chunks embedded: 8,469

5. Test RAG

python scripts/02_query_rag.py

6. Run Streamlit App

streamlit run app.py

πŸ“ Project Structure

vivekananda-ai/
│
├── config.yaml                 # Central configuration
├── requirements.txt            # Python dependencies
├── setup.sh                    # Automated setup script
│
├── utils.py                    # Core utilities
├── nlp_processor.py            # NLP processing (spaCy + NLTK)
│
├── scripts/
│   ├── 01_embed_data.py       # Create vector database
│   ├── 02_query_rag.py        # Test RAG retrieval
│   ├── 03_test_mistral.py     # Test Mistral model
│   └── 04_prepare_finetune.py # Prepare fine-tuning data
│
├── data/
│   ├── raw/                   # Original PDFs
│   ├── processed/             # JSON datasets
│   └── extracted_text/        # Extracted text files
│
├── vectorstore/
│   └── vivekananda_db/        # FAISS vector database
│
├── models/
│   ├── base/                  # Base models (GGUF)
│   └── fine_tuned/            # Fine-tuned adapters
│
├── outputs/
│   ├── logs/                  # Application logs
│   └── results/               # Results & analysis
│
└── app.py                     # Streamlit app

Configuration

All settings in config.yaml:

Model Settings

model:
  generation:
    max_tokens: 512
    temperature: 0.7
    top_p: 0.9

Embeddings

embeddings:
  model_name: "sentence-transformers/all-mpnet-base-v2"
  chunk:
    size: 500
    overlap: 50

NLP Processing

nlp:
  spacy:
    model: "en_core_web_sm"
  nltk:
    tokenizer: "punkt"
  preprocessing:
    lemmatize: true
    remove_stopwords: false

RAG

rag:
  retrieval:
    top_k: 5
    similarity_threshold: 0.5

How It Works

1. Document Loading

  • Loads PDFs, text files, and JSON Q&A pairs
  • Uses LangChain loaders for consistency
  • Extracts Vivekananda-specific metadata

2. NLP Processing

  • spaCy: Sentence segmentation, lemmatization, NER
  • NLTK: Tokenization, stopword removal, stemming
  • Config-driven preprocessing pipeline
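The config-driven idea above can be sketched in plain Python. This is an illustrative stand-in, not the project's actual `nlp_processor.py`: the real pipeline uses spaCy and NLTK, and the `preprocess` function, `STOPWORDS` set, and config keys here are hypothetical simplifications showing how `config.yaml` flags toggle each step.

```python
# Illustrative sketch of a config-driven preprocessing pipeline.
# The real pipeline uses spaCy/NLTK; these stdlib stand-ins show
# how boolean flags from config.yaml switch steps on and off.

STOPWORDS = {"the", "is", "of", "a", "an"}  # tiny illustrative subset

def preprocess(text: str, config: dict) -> str:
    tokens = text.split()
    if config.get("lowercase", False):
        tokens = [t.lower() for t in tokens]
    if config.get("remove_stopwords", False):
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)

cfg = {"lowercase": True, "remove_stopwords": True}
print(preprocess("Work is Worship", cfg))  # -> work worship
```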

3. Chunking

  • Intelligent sentence-boundary-aware chunking
  • Configurable size and overlap
  • Preserves context
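A minimal sketch of sentence-boundary-aware chunking, assuming character-based sizes for simplicity (the actual implementation segments sentences with spaCy and reads `size`/`overlap` from `config.yaml`; the regex splitter and `chunk_text` name here are illustrative):

```python
import re

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split on sentence-ending punctuation so chunks never cut mid-sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > size:
            chunks.append(current)
            # Carry the tail of the previous chunk forward to preserve context.
            current = current[-overlap:] + " " + sent if overlap else sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three. Four.", size=10, overlap=0))
```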

4. Embedding

  • HuggingFace sentence transformers
  • MPS-accelerated on Apple Silicon
  • Normalized embeddings for semantic search

5. Vector Store

  • FAISS for fast similarity search
  • Persistent storage
  • Metadata filtering support

6. RAG Retrieval

  • Query → Embed → Retrieve top-k chunks
  • Re-ranking (optional)
  • Context assembly for LLM
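The retrieval step can be illustrated without FAISS. This stdlib sketch shows the core idea (with L2-normalized embeddings, cosine similarity reduces to a dot product) and mirrors the `top_k`/`similarity_threshold` settings from `config.yaml`; the function names and toy vectors are hypothetical:

```python
import math

def normalize(v):
    # Scale to unit length so cosine similarity is a plain dot product.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def retrieve(query_vec, doc_vecs, top_k=5, threshold=0.5):
    q = normalize(query_vec)
    scored = []
    for i, d in enumerate(doc_vecs):
        score = sum(a * b for a, b in zip(q, normalize(d)))
        if score >= threshold:  # drop weak matches, as similarity_threshold does
            scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(retrieve([1.0, 0.1], docs, top_k=2))  # nearest document indices first
```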

7. Generation

  • Mistral 7B GGUF (quantized)
  • Context-aware prompts
  • Vivekananda-style responses
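Context-aware prompting can be sketched as simple template assembly. The system text and `build_prompt` helper below are illustrative placeholders (the project's real prompt lives in its config/prompt files); the `[INST] ... [/INST]` wrapper follows the standard Mistral-instruct chat format:

```python
# Hypothetical persona system prompt; the project defines its own.
SYSTEM = "You are an AI embodying the teachings of Swami Vivekananda."

def build_prompt(question: str, context_chunks: list[str]) -> str:
    # Join the retrieved chunks and wrap everything in Mistral's
    # instruction-format markers.
    context = "\n\n".join(context_chunks)
    return (
        f"<s>[INST] {SYSTEM}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question} [/INST]"
    )

prompt = build_prompt("What is Karma Yoga?", ["Karma Yoga is selfless action."])
print(prompt)
```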

Dataset

Your JSON dataset structure:

{
  "instruction": "What is Karma Yoga?",
  "response": "Karma Yoga is the path of selfless action...",
  "source": "Karma-Yoga.pdf",
  "work_type": "Karma-Yoga",
  "topic": "philosophy"
}
  • 52 Q&A pairs (current)
  • Expandable to 500-2000 for fine-tuning
  • Rich metadata for filtering

Advanced Usage

Modify Embedding Model

Edit config.yaml:

embeddings:
  model_name: "BAAI/bge-small-en-v1.5"  # Faster alternative

Change Chunk Size

embeddings:
  chunk:
    size: 300  # Smaller chunks
    overlap: 30

Adjust NLP Processing

nlp:
  preprocessing:
    lowercase: true
    remove_stopwords: true
    lemmatize: true

Tune RAG Retrieval

rag:
  retrieval:
    top_k: 10  # More context
    similarity_threshold: 0.3  # Lower threshold

Testing

Test Embeddings

python scripts/01_embed_data.py

Test RAG Retrieval

python scripts/02_query_rag.py "What is Karma Yoga?"

Test Mistral Model

python scripts/03_test_mistral.py

Interactive Testing

streamlit run app.py

Performance

On Mac Mini M4:

| Task                | Time      | Memory |
|---------------------|-----------|--------|
| Embedding 9 volumes | 10-15 min | 4-6 GB |
| Query retrieval     | 50-200 ms | 2-3 GB |
| Model inference     | 1-3 sec   | 6-8 GB |

Troubleshooting

spaCy Model Not Found

python -m spacy download en_core_web_sm

NLTK Data Missing

import nltk
nltk.download('punkt')
nltk.download('stopwords')

MPS Not Available

Check config.yaml:

hardware:
  device: "cpu"  # Fallback to CPU

Out of Memory

Reduce batch size in config.yaml:

embeddings:
  batch_size: 16  # Smaller batches

Next Steps

  1. Fine-Tuning: Train on your 52 Q&A pairs
  2. Expand Dataset: Create 500+ more examples
  3. Evaluation: Test response quality
  4. Deployment: FastAPI → AWS
  5. Frontend: Build user interface


Contributing

  1. Fork the repository
  2. Create feature branch
  3. Make changes
  4. Test thoroughly
  5. Submit pull request

License

MIT License - See LICENSE file

Acknowledgments

  • Swami Vivekananda's teachings
  • Ramakrishna Mission for digitized texts
  • LangChain community
  • HuggingFace team

Built with ❤️ for spreading Vivekananda's wisdom

For questions: Create an issue

Vivekananda AI – High-Fidelity Persona Fine-Tuning Pipeline

This repository includes a production-ready pipeline to build a high-fidelity, persona-aligned AI system in the voice of Swami Vivekananda using a Mistral architecture.

Features

  • Supervised fine-tuning (SFT) with LoRA/QLoRA (Transformers + Accelerate + PEFT)
  • Preference optimization via DPO (TRL)
  • Optional RAG interface (FAISS + sentence-transformers)
  • Strict persona system prompt enforcing tone and guardrails
  • Inference with probabilistic decoding and token probability tracing
  • Dockerfile for GPU training environment

Repository Additions

  • training/data_preprocess.py: Convert processed dataset to JSONL instruction format
  • training/sft_train.py: SFT training with LoRA/QLoRA
  • training/dpo_train.py: DPO preference optimization
  • training/export_adapter.py: (optional) merge/unload adapters – to be added if needed
  • inference/generate.py: Persona inference with sampling controls and token probabilities
  • rag/build_index.py, rag/retrieve.py: Optional retrieval augmentation
  • configs/train_sft.yaml, configs/train_dpo.yaml: Training configs
  • prompts/system_vivekananda.txt: Persona system prompt
  • docker/Dockerfile: GPU training environment

Quick Start

  1. Preprocess dataset to JSONL

    • python training/data_preprocess.py --source data/processed/vivekananda_dataset_1.json --out data/datasets/sft_train.jsonl --val data/datasets/sft_val.jsonl --val-ratio 0.05
  2. SFT training (LoRA/QLoRA)

    • Ensure GPU environment (Docker provided below)
    • accelerate launch training/sft_train.py --config configs/train_sft.yaml --train data/datasets/sft_train.jsonl --val data/datasets/sft_val.jsonl --output runs/sft_vivekananda
  3. DPO preference optimization

    • Prepare preference dataset JSONL with {prompt, chosen, rejected}
    • accelerate launch training/dpo_train.py --config configs/train_dpo.yaml --prefs data/datasets/dpo_prefs.jsonl --sft-checkpoint runs/sft_vivekananda/step_XXXX --output runs/dpo_vivekananda
  4. Optional RAG

    • Build index: python rag/build_index.py --source data/processed/vivekananda_dataset_1.json --out vectorstore/faiss_index
    • Retrieve for inference: python rag/retrieve.py --index vectorstore/faiss_index --query "Your question"
    • Pass concatenated context into inference via --rag-context
  5. Inference with persona and probabilistic decoding

    • python inference/generate.py --model mistralai/Mistral-7B-Instruct-v0.2 --adapter runs/sft_vivekananda/step_XXXX --prompt "Question" --temperature 0.3 --top_p 0.85 --top_k 40 --max_tokens 512 --rag-context "<CONTEXT>..."
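Step 1's JSONL conversion can be sketched as follows. This is a hedged stand-in for what `training/data_preprocess.py` does, not its actual code: one JSON record per line plus a small validation split. The `to_jsonl` name and exact split logic are assumptions; field names follow the dataset schema above.

```python
import json
import random

def to_jsonl(records, train_path, val_path, val_ratio=0.05, seed=42):
    # Shuffle deterministically, then split off a small validation set.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_ratio))
    # Write one JSON object per line (JSONL), as SFT trainers expect.
    with open(val_path, "w", encoding="utf-8") as f:
        for r in shuffled[:n_val]:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    with open(train_path, "w", encoding="utf-8") as f:
        for r in shuffled[n_val:]:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```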

Docker (GPU Training)

  • Build: docker build -t vivekananda-ai:train -f docker/Dockerfile .
  • Run: docker run --gpus all -it -v $(pwd):/workspace vivekananda-ai:train bash

Notes

  • Ensure bitsandbytes detects the GPU; use a CUDA-compatible base image.
  • For strict persona enforcement, customize prompts/system_vivekananda.txt.
  • Avoid fabrication; when facts are missing, the prompt instructs the model to say so plainly.