--- title: Fine-Tuned RAG Framework For Python Documentation QA colorFrom: gray colorTo: gray sdk: gradio sdk_version: 6.6.0 app_file: app.py pinned: false license: mit short_description: Fine-tuned RAG system for Python documentation queries --- # Fine-Tuned RAG Framework for Python Documentation Q&A A Retrieval-Augmented Generation (RAG) system that answers questions about Python's standard library using a fine-tuned GPT-2 model with LoRA and vector search. This project demonstrates the complete pipeline of building a RAG system including data collection, model fine-tuning, vector database implementation, and evaluation. ## About This portfolio project showcases practical skills in building RAG systems by implementing a question-answering system for Python documentation. The system combines semantic search over Python documentation with a fine-tuned language model to generate contextually relevant answers. **Author:** Spencer Purdy **Development Environment:** Google Colab Pro (A100 GPU, High RAM) ## Features - **Data Collection**: Automated scraping of Python 3 official documentation - **Document Processing**: Chunking with overlap for optimal retrieval - **Vector Database**: ChromaDB with sentence-transformers embeddings - **Model Fine-Tuning**: GPT-2 fine-tuned using LoRA/PEFT for parameter efficiency - **RAG Pipeline**: Combines retrieval and generation for grounded responses - **Interactive Interface**: Gradio web application for querying the system - **Comprehensive Evaluation**: ROUGE, BERTScore, and retrieval accuracy metrics - **Performance Monitoring**: Tracks latency and sources retrieved per query ## Dataset - **Source:** Python 3 Official Documentation (docs.python.org) - **License:** Python Software Foundation License (PSF License, GPL-compatible) - **Documents Collected:** 67 - **Total Chunks:** 5,257 - **Training Samples Generated:** 734 ## System Performance Performance metrics evaluated on test set: | Metric | Score | |--------|-------| | Retrieval Accuracy | 94.0% | | ROUGE-L F1 | 0.063 | | BERTScore F1 | 0.794 | | Average Latency | 2,084ms (~2 seconds) | | Average Sources Retrieved | 1.2 per query | **Model:** GPT-2 (124M parameters) with LoRA fine-tuning **Training Steps:** 500 **Embedding Model:** sentence-transformers/all-MiniLM-L6-v2 ## Technical Stack - **Model Fine-Tuning:** transformers, peft (LoRA) - **Vector Database:** ChromaDB - **Embeddings:** sentence-transformers - **Evaluation Metrics:** rouge-score, bert-score - **UI Framework:** Gradio - **Data Processing:** beautifulsoup4, requests, pandas, numpy - **Development:** Google Colab Pro with A100 GPU ## Setup and Usage ### Running in Google Colab 1. Clone this repository or download the notebook file 2. Upload `Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb` to Google Colab 3. Select Runtime > Change runtime type > A100 GPU (or T4 GPU for free tier) 4. Run all cells sequentially The notebook will automatically: - Install required dependencies - Collect Python documentation - Process and chunk documents - Fine-tune the GPT-2 model with LoRA - Build the vector database - Evaluate the system - Launch a Gradio interface with a shareable link ### Running Locally ```bash # Clone the repository git clone https://github.com/SpencerCPurdy/Fine-Tuned_RAG_Framework_for_Python_Documentation_QA.git cd Fine-Tuned_RAG_Framework_for_Python_Documentation_QA # Install dependencies pip install torch transformers datasets peft gradio pandas numpy scikit-learn tqdm requests beautifulsoup4 rouge-score bert-score accelerate sentence-transformers chromadb # Run the notebook jupyter notebook "Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb" ``` **Note:** First run will take approximately 10-15 minutes for data collection, training, and setup. ## Project Structure ``` ├── Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb ├── README.md ├── LICENSE └── .gitignore ``` The notebook contains the following components: 1. **Configuration & Setup**: System parameters, random seed initialization 2. **Data Collection**: Web scraping of Python documentation 3. **Document Processing**: Chunking and preprocessing 4. **Vector Database**: ChromaDB initialization and document embedding 5. **Model Fine-Tuning**: GPT-2 fine-tuning with LoRA 6. **RAG Pipeline**: Retrieval and generation integration 7. **Evaluation**: Comprehensive metrics computation 8. **Gradio Interface**: Interactive web application ## Key Implementation Details - **Reproducibility:** All random seeds set to 42 for deterministic results - **LoRA Configuration:** Rank 16, alpha 32, dropout 0.05 - **Chunk Size:** 400 characters with 50 character overlap - **Retrieval:** Top-3 documents with minimum relevance score of 0.15 - **Generation Parameters:** Temperature 0.7, top-p 0.9, max 150 new tokens ## Limitations and Known Issues ### Data Limitations - Limited to Python standard library documentation only (no third-party packages) - Documentation snapshot may be outdated for latest Python versions - Coverage of some modules may be incomplete ### Performance Limitations - ROUGE-L F1 score of 0.063 indicates generated answers differ significantly from reference formats - Answers can be verbose or include unnecessary details - May generate plausible-sounding but incorrect information (hallucination) - Sometimes fails to retrieve relevant sources for niche topics ### Input Limitations - Maximum query length: 500 characters - Best performance on clear, focused questions - Ambiguous questions may produce generic answers ### General Limitations - Not suitable for production use without further validation and testing - Best for conceptual questions rather than version-specific details - Always verify critical information with official Python documentation ## Evaluation Results The system was evaluated on 50 test queries: - **Retrieval Performance:** Successfully retrieved relevant documents for 94% of queries - **Generation Quality:** BERTScore F1 of 0.794 indicates reasonable semantic similarity - **Answer Format:** Low ROUGE scores suggest answers need better formatting and conciseness ## Example Queries The system handles questions such as: - "What is the datetime module used for?" - "How do I read and write JSON files in Python?" - "Explain list comprehensions in Python" - "What are the main features of the collections module?" - "How do I use regular expressions in Python?" - "What is the difference between os and pathlib?" ## RAG Configuration - **Chunk Size:** 400 characters - **Chunk Overlap:** 50 characters - **Retrieval Top-K:** 3 documents - **Minimum Relevance Score:** 0.15 - **Vector Database:** ChromaDB with persistent storage ## License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. ## Data Attribution - **Python Documentation:** Python Software Foundation (PSF License) - **GPT-2 Model:** OpenAI (MIT License) - **Sentence-Transformers:** Apache 2.0 License ## Acknowledgments - Python Software Foundation for excellent documentation - Hugging Face for transformers and PEFT libraries - Open-source community for the tools and frameworks used ## Contact **Spencer Purdy** GitHub: [@SpencerCPurdy](https://github.com/SpencerCPurdy) --- *This is a portfolio project developed to demonstrate RAG system implementation, model fine-tuning, and NLP engineering capabilities. The system is intended for educational and demonstrational purposes. Always verify important information with official Python documentation at https://docs.python.org/3/*