---
title: Fine-Tuned RAG Framework For Python Documentation QA
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 6.6.0
app_file: app.py
pinned: false
license: mit
short_description: Fine-tuned RAG system for Python documentation queries
---
# Fine-Tuned RAG Framework for Python Documentation Q&A

A Retrieval-Augmented Generation (RAG) system that answers questions about Python's standard library by combining vector search with a GPT-2 model fine-tuned via LoRA. This project demonstrates the complete pipeline of building a RAG system: data collection, model fine-tuning, vector database implementation, and evaluation.
## About
This portfolio project showcases practical skills in building RAG systems by implementing a question-answering system for Python documentation. The system combines semantic search over Python documentation with a fine-tuned language model to generate contextually relevant answers.
**Author:** Spencer Purdy

**Development Environment:** Google Colab Pro (A100 GPU, High RAM)
## Features
- Data Collection: Automated scraping of Python 3 official documentation
- Document Processing: Chunking with overlap for optimal retrieval
- Vector Database: ChromaDB with sentence-transformers embeddings
- Model Fine-Tuning: GPT-2 fine-tuned using LoRA/PEFT for parameter efficiency
- RAG Pipeline: Combines retrieval and generation for grounded responses
- Interactive Interface: Gradio web application for querying the system
- Comprehensive Evaluation: ROUGE, BERTScore, and retrieval accuracy metrics
- Performance Monitoring: Tracks latency and sources retrieved per query
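The latency and source tracking mentioned above can be sketched as a thin wrapper around the pipeline. This is a minimal illustration, not the notebook's actual code: `track_query`, `fake_rag`, and the metric keys are hypothetical names, and the real pipeline would return answers from the fine-tuned model.

```python
import time

def track_query(rag_answer_fn, query):
    """Run a RAG query and record latency plus the number of sources used.

    `rag_answer_fn` is assumed to return an (answer_text, source_list) pair.
    """
    start = time.perf_counter()
    answer, sources = rag_answer_fn(query)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"answer": answer, "latency_ms": latency_ms, "num_sources": len(sources)}

# Illustrative stand-in for the real retrieval-plus-generation pipeline.
def fake_rag(query):
    return "The datetime module supplies classes for dates and times.", ["datetime.html"]

metrics = track_query(fake_rag, "What is the datetime module used for?")
```

Averaging these per-query dictionaries over a test set yields the latency and sources-per-query figures reported below.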
## Dataset
- Source: Python 3 Official Documentation (docs.python.org)
- License: Python Software Foundation License (PSF License, GPL-compatible)
- Documents Collected: 67
- Total Chunks: 5,257
- Training Samples Generated: 734
## System Performance

Performance metrics evaluated on a 50-query test set:
| Metric | Value |
|---|---|
| Retrieval Accuracy | 94.0% |
| ROUGE-L F1 | 0.063 |
| BERTScore F1 | 0.794 |
| Average Latency | 2,084ms (~2 seconds) |
| Average Sources Retrieved | 1.2 per query |
**Model:** GPT-2 (124M parameters) with LoRA fine-tuning

**Training Steps:** 500

**Embedding Model:** `sentence-transformers/all-MiniLM-L6-v2`
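To see why rank-16 LoRA is parameter-efficient: each adapted weight update is factored into two small matrices A (r × d_in) and B (d_out × r), so only r·(d_in + d_out) parameters train per matrix. A quick back-of-the-envelope check for GPT-2 small (hidden size 768, 12 transformer blocks), assuming LoRA targets only the fused attention projection `c_attn` (768 → 2304), which is a common default for GPT-2 but not confirmed by this README:

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the low-rank pair A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

# GPT-2 small: hidden size 768; c_attn fuses Q, K, V into one 768 -> 2304 projection.
per_layer = lora_trainable_params(768, 2304, r=16)
total = per_layer * 12  # 12 transformer blocks in GPT-2 small

print(per_layer, total)  # 49152 per layer, 589824 total
```

Under these assumptions only about 0.59M of the 124M parameters (roughly 0.5%) are updated during fine-tuning.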
## Technical Stack
- Model Fine-Tuning: transformers, peft (LoRA)
- Vector Database: ChromaDB
- Embeddings: sentence-transformers
- Evaluation Metrics: rouge-score, bert-score
- UI Framework: Gradio
- Data Processing: beautifulsoup4, requests, pandas, numpy
- Development: Google Colab Pro with A100 GPU
## Setup and Usage

### Running in Google Colab
1. Clone this repository or download the notebook file
2. Upload `Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb` to Google Colab
3. Select Runtime > Change runtime type > A100 GPU (or T4 GPU for free tier)
4. Run all cells sequentially
The notebook will automatically:
- Install required dependencies
- Collect Python documentation
- Process and chunk documents
- Fine-tune the GPT-2 model with LoRA
- Build the vector database
- Evaluate the system
- Launch a Gradio interface with a shareable link
### Running Locally
```bash
# Clone the repository
git clone https://github.com/SpencerCPurdy/Fine-Tuned_RAG_Framework_for_Python_Documentation_QA.git
cd Fine-Tuned_RAG_Framework_for_Python_Documentation_QA

# Install dependencies
pip install torch transformers datasets peft gradio pandas numpy scikit-learn tqdm requests beautifulsoup4 rouge-score bert-score accelerate sentence-transformers chromadb

# Run the notebook
jupyter notebook "Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb"
```
**Note:** The first run takes approximately 10-15 minutes for data collection, training, and setup.
## Project Structure

```
├── Fine-Tuned RAG Framework for Python Documentation Q&A.ipynb
├── README.md
├── LICENSE
└── .gitignore
```
The notebook contains the following components:
- Configuration & Setup: System parameters, random seed initialization
- Data Collection: Web scraping of Python documentation
- Document Processing: Chunking and preprocessing
- Vector Database: ChromaDB initialization and document embedding
- Model Fine-Tuning: GPT-2 fine-tuning with LoRA
- RAG Pipeline: Retrieval and generation integration
- Evaluation: Comprehensive metrics computation
- Gradio Interface: Interactive web application
## Key Implementation Details
- Reproducibility: All random seeds set to 42 for deterministic results
- LoRA Configuration: Rank 16, alpha 32, dropout 0.05
- Chunk Size: 400 characters with 50 character overlap
- Retrieval: Top-3 documents with minimum relevance score of 0.15
- Generation Parameters: Temperature 0.7, top-p 0.9, max 150 new tokens
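The chunking scheme above (400 characters with a 50-character overlap) can be sketched in a few lines. This is an illustrative character-window splitter; the notebook's actual implementation may additionally respect sentence or section boundaries.

```python
def chunk_text(text, size=400, overlap=50):
    """Split text into fixed-size character chunks with a sliding overlap.

    Each chunk repeats the last `overlap` characters of the previous one,
    so answers spanning a chunk boundary remain retrievable.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000)  # 3 chunks: [0:400], [350:750], [700:1000]
```

The overlap ensures that a sentence cut at one chunk's end reappears at the start of the next, which improves retrieval recall at a modest storage cost.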
## Limitations and Known Issues

### Data Limitations
- Limited to Python standard library documentation only (no third-party packages)
- Documentation snapshot may be outdated for latest Python versions
- Coverage of some modules may be incomplete
### Performance Limitations
- The ROUGE-L F1 of 0.063 indicates that generated answers share little exact wording with the reference answers, even when the meaning is similar
- Answers can be verbose or include unnecessary details
- May generate plausible-sounding but incorrect information (hallucination)
- Sometimes fails to retrieve relevant sources for niche topics
### Input Limitations
- Maximum query length: 500 characters
- Best performance on clear, focused questions
- Ambiguous questions may produce generic answers
### General Limitations
- Not suitable for production use without further validation and testing
- Best for conceptual questions rather than version-specific details
- Always verify critical information with official Python documentation
## Evaluation Results
The system was evaluated on 50 test queries:
- Retrieval Performance: Successfully retrieved relevant documents for 94% of queries
- Generation Quality: BERTScore F1 of 0.794 indicates reasonable semantic similarity
- Answer Format: Low ROUGE scores suggest answers need better formatting and conciseness
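For context on the metrics above: ROUGE-L scores the longest common subsequence (LCS) of tokens between a generated answer and a reference. The project uses the `rouge-score` package; this is only a minimal stdlib sketch of the same F1 computation, with whitespace tokenization as a simplifying assumption.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS precision and recall over tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the json module reads json files",
                   "use the json module to read files")
```

Because LCS rewards exact token order, a semantically correct but reworded answer scores low on ROUGE-L while still scoring high on embedding-based metrics like BERTScore, which is exactly the pattern in the table above.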
## Example Queries
The system handles questions such as:
- "What is the datetime module used for?"
- "How do I read and write JSON files in Python?"
- "Explain list comprehensions in Python"
- "What are the main features of the collections module?"
- "How do I use regular expressions in Python?"
- "What is the difference between os and pathlib?"
## RAG Configuration
- Chunk Size: 400 characters
- Chunk Overlap: 50 characters
- Retrieval Top-K: 3 documents
- Minimum Relevance Score: 0.15
- Vector Database: ChromaDB with persistent storage
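The top-k retrieval with a relevance floor described above can be sketched as follows. The toy three-dimensional vectors and document IDs are illustrative only; the project uses MiniLM sentence embeddings stored in ChromaDB, so treat this as a sketch of the filtering logic, not the actual retriever.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, docs, top_k=3, min_score=0.15):
    """Return up to top_k (doc_id, score) pairs at or above the relevance floor."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(d, s) for d, s in scored[:top_k] if s >= min_score]

docs = {
    "json-intro": [0.9, 0.1, 0.0],
    "os-path":    [0.1, 0.9, 0.0],
    "unrelated":  [0.0, 0.0, 1.0],
}
hits = retrieve([1.0, 0.2, 0.0], docs)  # "unrelated" falls below the 0.15 floor
```

The minimum-score cutoff explains why the system averages only 1.2 sources per query despite a top-k of 3: weakly related chunks are dropped rather than passed to the generator.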
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Data Attribution
- Python Documentation: Python Software Foundation (PSF License)
- GPT-2 Model: OpenAI (MIT License)
- Sentence-Transformers: Apache 2.0 License
## Acknowledgments
- Python Software Foundation for excellent documentation
- Hugging Face for transformers and PEFT libraries
- Open-source community for the tools and frameworks used
## Contact
Spencer Purdy
GitHub: @SpencerCPurdy
This is a portfolio project developed to demonstrate RAG system implementation, model fine-tuning, and NLP engineering capabilities. The system is intended for educational and demonstrational purposes. Always verify important information with official Python documentation at https://docs.python.org/3/