| title: Bioethics RAG | |
| emoji: 🧬 | |
| colorFrom: pink | |
| colorTo: purple | |
| sdk: streamlit | |
| sdk_version: 1.49.1 | |
| app_file: streamlit_app.py | |
| pinned: false | |
| # Bioethics AI Assistant | |
| A retrieval-augmented generation (RAG) system that answers bioethics questions by retrieving relevant passages from academic papers and generating cited responses grounded in that context. | |
| ## Features | |
| - **Semantic Search**: FAISS vector store with OpenAI embeddings for accurate document retrieval | |
| - **Confidence-based Citations**: Automatic citation generation with confidence levels based on similarity scores | |
| - **Streaming Responses**: Real-time answer generation with interactive UI | |
| - **Document Processing**: Automated PDF text extraction, cleaning, and chunking | |
| - **Conversation History**: Context-aware responses that consider previous exchanges | |
| ## Architecture | |
| - **Workflow**: User Query → Embedding → Vector Search → Context Assembly → LLM → Cited Response | |
| - **Document Processing**: PyMuPDF and PyPDF2 for text extraction and metadata parsing | |
| - **Vector Store**: FAISS with cosine similarity search on normalized embeddings | |
| - **Language Model**: OpenAI GPT-4o-mini with streaming support | |
| - **Frontend**: Streamlit with custom CSS for chat interface | |
| ## Technical Implementation | |
| ### Document Processing Pipeline | |
| - PDF text extraction with page markers and metadata inference | |
| - Text cleaning (whitespace normalization, header/footer removal) | |
| - Semantic chunking with configurable overlap for context preservation | |
| - Automated metadata extraction (title, authors, publication year) | |
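The chunking step above can be illustrated with a minimal sketch. It uses a fixed-size word window rather than the semantic chunking the project describes, and the function name `chunk_words` and its default sizes are assumptions made for illustration, not values taken from the repository.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows; each window repeats the last
    `overlap` words of the previous one to preserve context."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()  # split() also collapses runs of whitespace
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

A larger overlap reduces the chance that a sentence is cut off from its context at a chunk boundary, at the cost of storing and embedding more duplicated text.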
| ### Retrieval System | |
| - 3072-dimensional embeddings using OpenAI's `text-embedding-3-large` | |
| - L2-normalized vectors for cosine similarity computation | |
| - Confidence thresholds for citation reliability (high: 0.8+, medium: 0.65+, low: 0.5+) | |
| - Thread-safe operations with persistent storage | |
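The relationship between L2 normalization, cosine similarity, and the confidence thresholds listed above can be sketched in plain Python. The thresholds come from the list; the function names are hypothetical, and the real system presumably performs the similarity search inside FAISS rather than in Python loops.

```python
import math

# Thresholds from the list above, checked from highest to lowest.
CONFIDENCE_THRESHOLDS = (("high", 0.8), ("medium", 0.65), ("low", 0.5))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """For unit-length vectors, the inner product is the cosine similarity."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def confidence_level(score):
    """Map a similarity score to a confidence label, or None if below 0.5."""
    for label, threshold in CONFIDENCE_THRESHOLDS:
        if score >= threshold:
            return label
    return None
```

Because the stored vectors are normalized once at indexing time, an inner-product index (for example FAISS `IndexFlatIP`) returns cosine scores directly; the source does not state which index type the project actually uses.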
| ### Response Generation | |
| - Context assembly from retrieved chunks with confidence-based grouping | |
| - Conversation history integration (last 4 exchanges) | |
| - Citation formatting based on similarity confidence levels | |
| - Streaming response delivery with real-time UI updates | |
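One way the context assembly and history handling above might look is sketched below. The dictionary keys (`text`, `source`, `confidence`), the section labels, and both function names are assumptions for illustration; the actual prompt format is not shown in this README.

```python
def trim_history(messages, max_exchanges=4):
    """Keep the most recent exchanges, assuming each exchange is one
    user message followed by one assistant message."""
    if max_exchanges <= 0:
        return []
    return messages[-2 * max_exchanges:]

def build_context(chunks):
    """Group retrieved chunks by confidence label, highest confidence first.

    `chunks` is a list of dicts with hypothetical keys
    'text', 'source', and 'confidence'.
    """
    sections = []
    for label in ("high", "medium", "low"):
        group = [c for c in chunks if c["confidence"] == label]
        if not group:
            continue  # omit empty confidence sections entirely
        sections.append(f"[{label} confidence]")
        sections.extend(f"({c['source']}) {c['text']}" for c in group)
    return "\n".join(sections)
```

Placing high-confidence chunks first gives the model the most reliable evidence at the top of the prompt, and the labels let the citation wording reflect how strongly each source matched the query.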
| ## Installation | |
| ```bash | |
| git clone <repository-url> | |
| cd bioethics-rag | |
| pip install -r requirements.txt | |
| echo "OPENAI_API_KEY=your_key_here" > .env | |
| streamlit run streamlit_app.py | |
| ``` | |
| Built to demonstrate RAG system design, vector search implementation, and conversational AI interfaces. | |