---
title: Fine-Tuned RAG Framework For Python Documentation QA
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 6.6.0
app_file: app.py
pinned: false
license: mit
short_description: Fine-tuned RAG system for Python documentation queries
---
# Fine-Tuned RAG Framework for Python Documentation Q&A

A Retrieval-Augmented Generation (RAG) system that answers questions about Python's standard library by combining vector search with a GPT-2 model fine-tuned via LoRA. This project demonstrates the complete pipeline of building a RAG system: data collection, model fine-tuning, vector database implementation, and evaluation.
## About
This portfolio project showcases practical skills in building RAG systems by implementing a question-answering system for Python documentation. The system combines semantic search over Python documentation with a fine-tuned language model to generate contextually relevant answers.
**Author:** Spencer Purdy

**Development Environment:** Google Colab Pro (A100 GPU, High RAM)
## Features
- Data Collection: Automated scraping of Python 3 official documentation
- Document Processing: Chunking with overlap for optimal retrieval
- Vector Database: ChromaDB with sentence-transformers embeddings
- Model Fine-Tuning: GPT-2 fine-tuned using LoRA/PEFT for parameter efficiency
- RAG Pipeline: Combines retrieval and generation for grounded responses
- Interactive Interface: Gradio web application for querying the system
- Comprehensive Evaluation: ROUGE, BERTScore, and retrieval accuracy metrics
- Performance Monitoring: Tracks latency and sources retrieved per query
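The latency and source tracking mentioned above can be sketched as a thin wrapper around the pipeline. This is a minimal illustration, not the notebook's actual code: `track_query`, `fake_rag`, and the metric keys are hypothetical names, and the real pipeline would return answers from the fine-tuned model.

```python
import time

def track_query(rag_answer_fn, query):
    """Run a RAG query and record latency plus the number of sources used.

    `rag_answer_fn` is assumed to return an (answer_text, source_list) pair.
    """
    start = time.perf_counter()
    answer, sources = rag_answer_fn(query)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"answer": answer, "latency_ms": latency_ms, "num_sources": len(sources)}

# Illustrative stand-in for the real retrieval-plus-generation pipeline.
def fake_rag(query):
    return "The datetime module supplies classes for dates and times.", ["datetime.html"]

metrics = track_query(fake_rag, "What is the datetime module used for?")
```

Averaging these per-query dictionaries over a test set yields the latency and sources-per-query figures reported below.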
## Dataset
- Source: Python 3 Official Documentation (docs.python.org)
- License: Python Software Foundation License (PSF License, GPL-compatible)
- Documents Collected: 67
- Total Chunks: 5,257
- Training Samples Generated: 734
## System Performance

Performance metrics evaluated on a 50-query test set:
| Metric | Value |
|---|---|
| Retrieval Accuracy | 94.0% |
| ROUGE-L F1 | 0.063 |
| BERTScore F1 | 0.794 |
| Average Latency | 2,084ms (~2 seconds) |
| Average Sources Retrieved | 1.2 per query |
**Model:** GPT-2 (124M parameters) with LoRA fine-tuning

**Training Steps:** 500

**Embedding Model:** `sentence-transformers/all-MiniLM-L6-v2`
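To see why rank-16 LoRA is parameter-efficient: each adapted weight update is factored into two small matrices A (r × d_in) and B (d_out × r), so only r·(d_in + d_out) parameters train per matrix. A quick back-of-the-envelope check for GPT-2 small (hidden size 768, 12 transformer blocks), assuming LoRA targets only the fused attention projection `c_attn` (768 → 2304), which is a common default for GPT-2 but not confirmed by this README:

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the low-rank pair A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

# GPT-2 small: hidden size 768; c_attn fuses Q, K, V into one 768 -> 2304 projection.
per_layer = lora_trainable_params(768, 2304, r=16)
total = per_layer * 12  # 12 transformer blocks in GPT-2 small

print(per_layer, total)  # 49152 per layer, 589824 total
```

Under these assumptions only about 0.59M of the 124M parameters (roughly 0.5%) are updated during fine-tuning.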
## Technical Stack
- Model Fine-Tuning: transformers, peft (LoRA)
- Vector Database: ChromaDB
- Embeddings: sentence-transformers
- Evaluation Metrics: rouge-score, bert-score
- UI Framework: Gradio
- Data Processing: beautifulsoup4, requests, pandas, numpy
- Development: Google Colab Pro with A100 GPU
## Setup and Usage

### Running in Google Colab
1. Clone this repository or download the notebook file
2. Upload `Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb` to Google Colab
3. Select Runtime > Change runtime type > A100 GPU (or T4 GPU for free tier)
4. Run all cells sequentially
The notebook will automatically:
- Install required dependencies
- Collect Python documentation
- Process and chunk documents
- Fine-tune the GPT-2 model with LoRA
- Build the vector database
- Evaluate the system
- Launch a Gradio interface with a shareable link
### Running Locally
```bash
# Clone the repository
git clone https://github.com/SpencerCPurdy/Fine-Tuned_RAG_Framework_for_Python_Documentation_QA.git
cd Fine-Tuned_RAG_Framework_for_Python_Documentation_QA

# Install dependencies
pip install torch transformers datasets peft gradio pandas numpy scikit-learn tqdm requests beautifulsoup4 rouge-score bert-score accelerate sentence-transformers chromadb

# Run the notebook
jupyter notebook "Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb"
```
**Note:** The first run takes approximately 10-15 minutes for data collection, training, and setup.
## Project Structure

```
├── Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb
├── README.md
├── LICENSE
└── .gitignore
```
The notebook contains the following components:
- Configuration & Setup: System parameters, random seed initialization
- Data Collection: Web scraping of Python documentation
- Document Processing: Chunking and preprocessing
- Vector Database: ChromaDB initialization and document embedding
- Model Fine-Tuning: GPT-2 fine-tuning with LoRA
- RAG Pipeline: Retrieval and generation integration
- Evaluation: Comprehensive metrics computation
- Gradio Interface: Interactive web application
## Key Implementation Details
- Reproducibility: All random seeds set to 42 for deterministic results
- LoRA Configuration: Rank 16, alpha 32, dropout 0.05
- Chunk Size: 400 characters with 50 character overlap
- Retrieval: Top-3 documents with minimum relevance score of 0.15
- Generation Parameters: Temperature 0.7, top-p 0.9, max 150 new tokens
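The chunking scheme above (400 characters with a 50-character overlap) can be sketched in a few lines. This is an illustrative character-window splitter; the notebook's actual implementation may additionally respect sentence or section boundaries.

```python
def chunk_text(text, size=400, overlap=50):
    """Split text into fixed-size character chunks with a sliding overlap.

    Each chunk repeats the last `overlap` characters of the previous one,
    so answers spanning a chunk boundary remain retrievable.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000)  # 3 chunks: [0:400], [350:750], [700:1000]
```

The overlap ensures that a sentence cut at one chunk's end reappears at the start of the next, which improves retrieval recall at a modest storage cost.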
## Limitations and Known Issues

### Data Limitations
- Limited to Python standard library documentation only (no third-party packages)
- Documentation snapshot may be outdated for latest Python versions
- Coverage of some modules may be incomplete
### Performance Limitations
- The ROUGE-L F1 of 0.063 indicates that generated answers share little exact wording with the reference answers, even when the meaning is similar
- Answers can be verbose or include unnecessary details
- May generate plausible-sounding but incorrect information (hallucination)
- Sometimes fails to retrieve relevant sources for niche topics
### Input Limitations
- Maximum query length: 500 characters
- Best performance on clear, focused questions
- Ambiguous questions may produce generic answers
### General Limitations
- Not suitable for production use without further validation and testing
- Best for conceptual questions rather than version-specific details
- Always verify critical information with official Python documentation
## Evaluation Results
The system was evaluated on 50 test queries:
- Retrieval Performance: Successfully retrieved relevant documents for 94% of queries
- Generation Quality: BERTScore F1 of 0.794 indicates reasonable semantic similarity
- Answer Format: Low ROUGE scores suggest answers need better formatting and conciseness
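For context on the metrics above: ROUGE-L scores the longest common subsequence (LCS) of tokens between a generated answer and a reference. The project uses the `rouge-score` package; this is only a minimal stdlib sketch of the same F1 computation, with whitespace tokenization as a simplifying assumption.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS precision and recall over tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the json module reads json files",
                   "use the json module to read files")
```

Because LCS rewards exact token order, a semantically correct but reworded answer scores low on ROUGE-L while still scoring high on embedding-based metrics like BERTScore, which is exactly the pattern in the table above.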
## Example Queries
The system handles questions such as:
- "What is the datetime module used for?"
- "How do I read and write JSON files in Python?"
- "Explain list comprehensions in Python"
- "What are the main features of the collections module?"
- "How do I use regular expressions in Python?"
- "What is the difference between os and pathlib?"
## RAG Configuration
- Chunk Size: 400 characters
- Chunk Overlap: 50 characters
- Retrieval Top-K: 3 documents
- Minimum Relevance Score: 0.15
- Vector Database: ChromaDB with persistent storage
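The top-k retrieval with a relevance floor described above can be sketched as follows. The toy three-dimensional vectors and document IDs are illustrative only; the project uses MiniLM sentence embeddings stored in ChromaDB, so treat this as a sketch of the filtering logic, not the actual retriever.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, docs, top_k=3, min_score=0.15):
    """Return up to top_k (doc_id, score) pairs at or above the relevance floor."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(d, s) for d, s in scored[:top_k] if s >= min_score]

docs = {
    "json-intro": [0.9, 0.1, 0.0],
    "os-path":    [0.1, 0.9, 0.0],
    "unrelated":  [0.0, 0.0, 1.0],
}
hits = retrieve([1.0, 0.2, 0.0], docs)  # "unrelated" falls below the 0.15 floor
```

The minimum-score cutoff explains why the system averages only 1.2 sources per query despite a top-k of 3: weakly related chunks are dropped rather than passed to the generator.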
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Data Attribution
- Python Documentation: Python Software Foundation (PSF License)
- GPT-2 Model: OpenAI (MIT License)
- Sentence-Transformers: Apache 2.0 License
## Acknowledgments
- Python Software Foundation for excellent documentation
- Hugging Face for transformers and PEFT libraries
- Open-source community for the tools and frameworks used
## Contact
Spencer Purdy
GitHub: @SpencerCPurdy
This is a portfolio project developed to demonstrate RAG system implementation, model fine-tuning, and NLP engineering capabilities. The system is intended for educational and demonstrational purposes. Always verify important information with official Python documentation at https://docs.python.org/3/