# Swami Vivekananda AI
An AI system embodying the wisdom and teachings of Swami Vivekananda, built with pure logic, no hardcoded values, and a fully config-driven architecture.
## Features
- **Config-Driven**: Everything configurable via `config.yaml`
- **NLP Processing**: spaCy + NLTK for intelligent text processing
- **RAG System**: LangChain-based Retrieval-Augmented Generation
- **MPS Support**: Optimized for Apple Silicon (M4 Mac Mini)
- **GGUF Models**: Support for quantized Mistral 7B models
- **Modular Design**: Clean, maintainable, production-ready code
## Requirements
- Python 3.9+
- Mac Mini M4 with MPS (or CUDA/CPU)
- 16GB+ RAM recommended
- 10GB+ free disk space
## Quick Start
### 1. Clone & Setup
```bash
# Clone repository
git clone <your-repo-url>
cd vivekananda-ai

# Run automated setup
chmod +x setup.sh
./setup.sh
```
### 2. Add Your Data
```bash
# Add PDF files
cp /path/to/pdfs/* data/raw/complete_works/

# Add your JSON dataset
cp vivekananda_dataset_1.json data/processed/

# Add Mistral model
cp mistral-7b-instruct-v0.1.Q4_K_M.gguf models/base/
```
### 3. Configure
Edit `config.yaml` to customize:
- Model settings
- Embedding parameters
- RAG configuration
- NLP preprocessing options
### 4. Create Vector Database
```bash
source venv/bin/activate
python scripts/01_embed_data.py
```
Expected output:

```
VIVEKANANDA AI - VECTOR DATABASE CREATION

STEP 1: LOADING DOCUMENTS
Found 9 PDF files
Loading: SWAMI-VIVEKANANDA-VOL-1.pdf
Loaded 234 pages
...

STEP 2: PROCESSING WITH NLP
Processing documents with NLP...
...

SUCCESS! VECTOR DATABASE READY
Total chunks embedded: 8,469
```
### 5. Test RAG
```bash
python scripts/02_query_rag.py
```
### 6. Run Streamlit App
```bash
streamlit run app.py
```
## Project Structure
```
vivekananda-ai/
│
├── config.yaml              # Central configuration
├── requirements.txt         # Python dependencies
├── setup.sh                 # Automated setup script
│
├── utils.py                 # Core utilities
├── nlp_processor.py         # NLP processing (spaCy + NLTK)
│
├── scripts/
│   ├── 01_embed_data.py        # Create vector database
│   ├── 02_query_rag.py         # Test RAG retrieval
│   ├── 03_test_mistral.py      # Test Mistral model
│   └── 04_prepare_finetune.py  # Prepare fine-tuning data
│
├── data/
│   ├── raw/                 # Original PDFs
│   ├── processed/           # JSON datasets
│   └── extracted_text/      # Extracted text files
│
├── vectorstore/
│   └── vivekananda_db/      # FAISS vector database
│
├── models/
│   ├── base/                # Base models (GGUF)
│   └── fine_tuned/          # Fine-tuned adapters
│
├── outputs/
│   ├── logs/                # Application logs
│   └── results/             # Results & analysis
│
└── app.py                   # Streamlit app
```
## Configuration
All settings are in `config.yaml`:
### Model Settings
```yaml
model:
  generation:
    max_tokens: 512
    temperature: 0.7
    top_p: 0.9
```
### Embeddings
```yaml
embeddings:
  model_name: "sentence-transformers/all-mpnet-base-v2"
  chunk:
    size: 500
    overlap: 50
```
### NLP Processing
```yaml
nlp:
  spacy:
    model: "en_core_web_sm"
  nltk:
    tokenizer: "punkt"
  preprocessing:
    lemmatize: true
    remove_stopwords: false
```
### RAG
```yaml
rag:
  retrieval:
    top_k: 5
    similarity_threshold: 0.5
```
## How It Works
### 1. Document Loading
- Loads PDFs, text files, and JSON Q&A pairs
- Uses LangChain loaders for consistency
- Extracts Vivekananda-specific metadata
### 2. NLP Processing
- spaCy: Sentence segmentation, lemmatization, NER
- NLTK: Tokenization, stopword removal, stemming
- Config-driven preprocessing pipeline
### 3. Chunking
- Intelligent sentence-boundary-aware chunking
- Configurable size and overlap
- Preserves context (see the sketch below)
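A minimal sketch of the idea, assuming character-based sizes and NLTK's `sent_tokenize`; the repo's actual chunker is config-driven and may differ:

```python
from nltk.tokenize import sent_tokenize

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Greedy sentence-boundary-aware chunking (sizes in characters)."""
    chunks, current = [], ""
    for sent in sent_tokenize(text):
        # Close the current chunk when this sentence would overflow it
        if current and len(current) + len(sent) + 1 > size:
            chunks.append(current)
            # Seed the next chunk with the tail of the last one to keep context
            current = current[-overlap:]
        current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```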
### 4. Embedding
- HuggingFace sentence transformers
- MPS-accelerated on Apple Silicon
- Normalized embeddings for semantic search (example below)
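For illustration, here is what this step looks like with `sentence-transformers` used directly; the model name mirrors the `config.yaml` above, and the repo's actual wiring (e.g. through LangChain) may differ:

```python
from sentence_transformers import SentenceTransformer

# Model and device mirror the config.yaml settings above
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="mps")

chunks = ["Arise, awake, and stop not till the goal is reached."]
# normalize_embeddings=True makes cosine similarity a plain dot product
vectors = model.encode(chunks, batch_size=16, normalize_embeddings=True)
print(vectors.shape)  # (num_chunks, 768) for all-mpnet-base-v2
```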
### 5. Vector Store
- FAISS for fast similarity search
- Persistent storage
- Metadata filtering support (sketch below)
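A minimal sketch of building a persistent index with raw FAISS; the repo's store under `vectorstore/vivekananda_db/` is managed by LangChain, so the paths and shapes here are illustrative:

```python
import os

import faiss
import numpy as np

# Stand-in for the normalized chunk embeddings from the previous step
vectors = np.random.rand(8469, 768).astype("float32")
faiss.normalize_L2(vectors)

# Inner product over unit vectors == cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Persist so the index survives restarts (path is illustrative)
os.makedirs("vectorstore/vivekananda_db", exist_ok=True)
faiss.write_index(index, "vectorstore/vivekananda_db/index.faiss")
```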
### 6. RAG Retrieval
- Query → Embed → Retrieve top-k chunks
- Re-ranking (optional)
- Context assembly for the LLM (sketched below)
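A sketch of the retrieve-and-assemble step under the same assumptions as above; `chunk_texts` is a hypothetical stand-in for the stored chunk payloads:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="mps")
index = faiss.read_index("vectorstore/vivekananda_db/index.faiss")

query = "What is Karma Yoga?"
q = model.encode([query], normalize_embeddings=True).astype("float32")

scores, ids = index.search(q, 5)  # top_k from config.yaml
# Drop hits below the configured similarity threshold
hits = [(i, s) for i, s in zip(ids[0], scores[0]) if s >= 0.5]
# context = "\n\n".join(chunk_texts[i] for i, _ in hits)  # feed to the LLM
```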
### 7. Generation
- Mistral 7B GGUF (quantized)
- Context-aware prompts
- Vivekananda-style responses (see the sketch below)
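A hedged sketch of running a GGUF model locally, assuming `llama-cpp-python`; the README does not name the loader, so treat this as one plausible setup rather than the repo's code:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/base/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    n_ctx=4096,       # room for the prompt plus retrieved chunks
    n_gpu_layers=-1,  # offload all layers (Metal on Apple Silicon)
)

context = "(retrieved chunks go here)"
prompt = (
    "[INST] Using the context below, answer in Swami Vivekananda's voice.\n\n"
    f"{context}\n\nQuestion: What is Karma Yoga? [/INST]"
)
out = llm(prompt, max_tokens=512, temperature=0.7, top_p=0.9)
print(out["choices"][0]["text"])
```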
## Dataset
Your JSON dataset structure:
```json
{
  "instruction": "What is Karma Yoga?",
  "response": "Karma Yoga is the path of selfless action...",
  "source": "Karma-Yoga.pdf",
  "work_type": "Karma-Yoga",
  "topic": "philosophy"
}
```
- 52 Q&A pairs (current)
- Expandable to 500-2000 for fine-tuning
- Rich metadata for filtering (example below)
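For example, the metadata fields make slicing the dataset trivial; a quick sketch, assuming the file is a JSON array of records like the one above:

```python
import json

with open("data/processed/vivekananda_dataset_1.json") as f:
    records = json.load(f)

# Slice the Q&A pairs by their work_type metadata
karma_yoga = [r for r in records if r.get("work_type") == "Karma-Yoga"]
print(f"{len(karma_yoga)} of {len(records)} pairs come from Karma-Yoga")
```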
## Advanced Usage
### Modify Embedding Model
Edit `config.yaml`:
```yaml
embeddings:
  model_name: "BAAI/bge-small-en-v1.5"  # Faster alternative
```
### Change Chunk Size
```yaml
embeddings:
  chunk:
    size: 300    # Smaller chunks
    overlap: 30
```
### Adjust NLP Processing
```yaml
nlp:
  preprocessing:
    lowercase: true
    remove_stopwords: true
    lemmatize: true
```
### Tune RAG Retrieval
```yaml
rag:
  retrieval:
    top_k: 10                  # More context
    similarity_threshold: 0.3  # Lower threshold
```
## Testing
### Test Embeddings
```bash
python scripts/01_embed_data.py
```
### Test RAG Retrieval
```bash
python scripts/02_query_rag.py "What is Karma Yoga?"
```
### Test Mistral Model
```bash
python scripts/03_test_mistral.py
```
### Interactive Testing
```bash
streamlit run app.py
```
## Performance
On Mac Mini M4:
| Task | Time | Memory |
|---|---|---|
| Embedding 9 volumes | 10-15 min | 4-6 GB |
| Query retrieval | 50-200 ms | 2-3 GB |
| Model inference | 1-3 sec | 6-8 GB |
## Troubleshooting
### spaCy Model Not Found
```bash
python -m spacy download en_core_web_sm
```
### NLTK Data Missing
```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
```
### MPS Not Available
Check `config.yaml`:
```yaml
hardware:
  device: "cpu"  # Fallback to CPU
```
### Out of Memory
Reduce the batch size in `config.yaml`:
```yaml
embeddings:
  batch_size: 16  # Smaller batches
```
## Next Steps
- Fine-Tuning: Train on your 52 Q&A pairs
- Expand Dataset: Create 500+ more examples
- Evaluation: Test response quality
- Deployment: FastAPI → AWS
- Frontend: Build user interface
## Resources
- LangChain Documentation
- spaCy Documentation
- NLTK Documentation
- FAISS Documentation
- Mistral Documentation
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## License
MIT License - See LICENSE file
## Acknowledgments
- Swami Vivekananda's teachings
- Ramakrishna Mission for digitized texts
- LangChain community
- HuggingFace team
Built with ❤️ for spreading Vivekananda's wisdom
For questions: Create an issue
# Vivekananda AI: High-Fidelity Persona Fine-Tuning Pipeline
This repository includes a production-ready pipeline to build a high-fidelity, persona-aligned AI system in the voice of Swami Vivekananda using a Mistral architecture.
## Features
- Supervised fine-tuning (SFT) with LoRA/QLoRA (Transformers + Accelerate + PEFT)
- Preference optimization via DPO (TRL)
- Optional RAG interface (FAISS + sentence-transformers)
- Strict persona system prompt enforcing tone and guardrails
- Inference with probabilistic decoding and token probability tracing
- Dockerfile for GPU training environment
## Repository Additions
- `training/data_preprocess.py`: Convert the processed dataset to JSONL instruction format
- `training/sft_train.py`: SFT training with LoRA/QLoRA
- `training/dpo_train.py`: DPO preference optimization
- `training/export_adapter.py`: (optional) merge/unload adapters; to be added if needed
- `inference/generate.py`: Persona inference with sampling controls and token probabilities
- `rag/build_index.py`, `rag/retrieve.py`: Optional retrieval augmentation
- `configs/train_sft.yaml`, `configs/train_dpo.yaml`: Training configs
- `prompts/system_vivekananda.txt`: Persona system prompt
- `docker/Dockerfile`: GPU training environment
## Quick Start
### 1. Preprocess dataset to JSONL
```bash
python training/data_preprocess.py --source data/processed/vivekananda_dataset_1.json --out data/datasets/sft_train.jsonl --val data/datasets/sft_val.jsonl --val-ratio 0.05
```
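For orientation, a minimal sketch of the kind of conversion this step performs; the authoritative template lives in `training/data_preprocess.py`, and the output field names here are assumptions:

```python
import json

# Illustrative only: the real instruction template is defined by the script
with open("data/processed/vivekananda_dataset_1.json") as src:
    records = json.load(src)

with open("data/datasets/sft_train.jsonl", "w") as dst:
    for r in records:
        # One JSON object per line, pairing instruction with response
        dst.write(json.dumps({"prompt": r["instruction"],
                              "response": r["response"]}) + "\n")
```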
### 2. SFT training (LoRA/QLoRA)
- Ensure GPU environment (Docker provided below)
```bash
accelerate launch training/sft_train.py --config configs/train_sft.yaml --train data/datasets/sft_train.jsonl --val data/datasets/sft_val.jsonl --output runs/sft_vivekananda
```
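For context, a sketch of the kind of LoRA/QLoRA setup `configs/train_sft.yaml` typically drives; the rank, target modules, and dropout below are illustrative defaults, not the repo's values:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) quantized base model
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", quantization_config=bnb
)

# Low-rank adapters on the attention projections
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the 7B weights train
```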
### 3. DPO preference optimization
- Prepare a preference dataset JSONL with `{prompt, chosen, rejected}` fields (illustrated below)

```bash
accelerate launch training/dpo_train.py --config configs/train_dpo.yaml --prefs data/datasets/dpo_prefs.jsonl --sft-checkpoint runs/sft_vivekananda/step_XXXX --output runs/dpo_vivekananda
```
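Each line of the preference file is one JSON object; an illustrative (invented) record:

```json
{"prompt": "What is true strength?", "chosen": "Strength is life, weakness is death. True strength is of the spirit...", "rejected": "Strength just means physical power."}
```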
### 4. Optional RAG
- Build the index:

```bash
python rag/build_index.py --source data/processed/vivekananda_dataset_1.json --out vectorstore/faiss_index
```

- Retrieve for inference:

```bash
python rag/retrieve.py --index vectorstore/faiss_index --query "Your question"
```

- Pass the concatenated context into inference via `--rag-context`
### 5. Inference with persona and probabilistic decoding
```bash
python inference/generate.py --model mistralai/Mistral-7B-Instruct-v0.2 --adapter runs/sft_vivekananda/step_XXXX --prompt "Question" --temperature 0.3 --top_p 0.85 --top_k 40 --max_tokens 512 --rag-context "<CONTEXT>..."
```
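A sketch of how adapter loading and token-probability tracing can be done with Transformers + PEFT; `inference/generate.py` is the authoritative implementation, and this is only a plausible reconstruction:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, "runs/sft_vivekananda/step_XXXX")

inputs = tok("[INST] What is fearlessness? [/INST]", return_tensors="pt")
out = model.generate(
    **inputs, max_new_tokens=64, do_sample=True,
    temperature=0.3, top_p=0.85, top_k=40,
    return_dict_in_generate=True, output_scores=True,
)

# Per-step log-probabilities of the sampled tokens
logprobs = model.compute_transition_scores(out.sequences, out.scores,
                                           normalize_logits=True)
```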
## Docker (GPU Training)
- Build:

```bash
docker build -t vivekananda-ai:train -f docker/Dockerfile .
```

- Run:

```bash
docker run --gpus all -it -v $(pwd):/workspace vivekananda-ai:train bash
```
## Notes
- Ensure `bitsandbytes` detects the GPU; use a CUDA-compatible base image.
- For strict persona enforcement, customize `prompts/system_vivekananda.txt`.
- Avoid fabrication; when facts are missing, the prompt instructs the model to say so plainly.