---
title: Bioethics RAG
emoji: 🧬
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.49.1
app_file: streamlit_app.py
pinned: false
---
# Bioethics AI Assistant
A retrieval-augmented generation (RAG) system that answers bioethics questions by retrieving relevant passages from academic papers and generating cited responses grounded in that context.
## Features
- **Semantic Search**: FAISS vector store with OpenAI embeddings for accurate document retrieval
- **Confidence-based Citations**: Automatic citation generation with confidence levels based on similarity scores
- **Streaming Responses**: Real-time answer generation with interactive UI
- **Document Processing**: Automated PDF text extraction, cleaning, and chunking
- **Conversation History**: Context-aware responses that consider previous exchanges
## Architecture
- **Workflow**: User Query → Embedding → Vector Search → Context Assembly → LLM → Cited Response
- **Document Processing**: PyMuPDF and PyPDF2 for text extraction and metadata parsing
- **Vector Store**: FAISS with cosine similarity search on normalized embeddings
- **Language Model**: OpenAI GPT-4o-mini with streaming support
- **Frontend**: Streamlit with custom CSS for chat interface
## Technical Implementation
### Document Processing Pipeline
- PDF text extraction with page markers and metadata inference
- Text cleaning (whitespace normalization, header/footer removal)
- Semantic chunking with configurable overlap for context preservation
- Automated metadata extraction (title, authors, publication year)
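
The cleaning and overlapping-chunk steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses fixed-size character windows rather than semantic boundaries, and the names `clean_text` and `chunk_text` and the default `chunk_size` and `overlap` values are assumptions.

```python
import re


def clean_text(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Hypothetical helper: the sizes are illustrative defaults, not values
    taken from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than `overlap`.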
### Retrieval System
- 3072-dimensional embeddings using OpenAI's `text-embedding-3-large`
- L2-normalized vectors for cosine similarity computation
- Confidence thresholds for citation reliability (high: 0.8+, medium: 0.65+, low: 0.5+)
- Thread-safe operations with persistent storage
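
To illustrate why normalization matters: once vectors have unit L2 norm, their dot product equals their cosine similarity, so an inner-product search ranks by cosine. The sketch below shows this in plain Python and maps a score to the confidence bands listed above. The function names are hypothetical, and returning `None` for scores under 0.5 is an assumption about how sub-threshold results are treated.

```python
import math

# Thresholds copied from the list above, ordered from strictest to loosest.
CONFIDENCE_THRESHOLDS = [("high", 0.8), ("medium", 0.65), ("low", 0.5)]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the L2-normalized vectors, i.e. their cosine similarity."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def confidence_level(score):
    """Return the first band whose minimum the score meets, or None (assumed)."""
    for label, minimum in CONFIDENCE_THRESHOLDS:
        if score >= minimum:
            return label
    return None
```

For example, `confidence_level(cosine_similarity(query_vec, chunk_vec))` would label a retrieved chunk, where `query_vec` and `chunk_vec` stand in for embedding vectors.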
### Response Generation
- Context assembly from retrieved chunks with confidence-based grouping
- Conversation history integration (last 4 exchanges)
- Citation formatting based on similarity confidence levels
- Streaming response delivery with real-time UI updates
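
A rough sketch of the history trimming and confidence-based context grouping described above. The chunk dictionary keys (`score`, `source`, `text`), the function names, and the output layout are all assumptions made for illustration; only the four-exchange window and the 0.8 / 0.65 / 0.5 bands come from this README.

```python
def trim_history(history, max_exchanges=4):
    """Keep only the most recent exchanges (e.g. question/answer pairs)."""
    if max_exchanges <= 0:
        return []
    return history[-max_exchanges:]


def group_chunks(chunks):
    """Bucket retrieved chunks into confidence bands, best scores first."""
    groups = {"high": [], "medium": [], "low": []}
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        score = chunk["score"]
        if score >= 0.8:
            groups["high"].append(chunk)
        elif score >= 0.65:
            groups["medium"].append(chunk)
        elif score >= 0.5:
            groups["low"].append(chunk)
        # Assumption: chunks scoring below 0.5 are left out of the context.
    return groups


def build_context(chunks):
    """Render the grouped chunks as a plain-text context block for the prompt."""
    lines = []
    for label, members in group_chunks(chunks).items():
        if not members:
            continue
        lines.append(f"[{label} confidence]")
        for chunk in members:
            lines.append(f"({chunk['source']}) {chunk['text']}")
    return "\n".join(lines)
```

Grouping by band before prompting lets the model see which passages are strongly matched, which is one way to support citations that reflect similarity confidence.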
## Installation
```bash
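# Clone the repository and enter it
git clone <repository-url>
cd bioethics-rag
# Install Python dependencies
pip install -r requirements.txt
# Store the OpenAI API key in .env (this overwrites any existing .env)
echo "OPENAI_API_KEY=your_key_here" > .env
# Launch the Streamlit app
streamlit run streamlit_app.py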
```
Built to demonstrate RAG system design, vector search implementation, and conversational AI interfaces.