bioethics-rag / README.md
ciorant's picture
Update README.md
cfe0034 verified

A newer version of the Streamlit SDK is available: 1.55.0

Upgrade
metadata
title: Bioethics RAG
emoji: 🧬
colorFrom: pink
colorTo: purple
sdk: streamlit
sdk_version: 1.49.1
app_file: streamlit_app.py
pinned: false

Bioethics AI Assistant

A retrieval-augmented generation (RAG) system that provides intelligent answers to bioethics questions by searching through academic papers and generating contextually-aware responses.

Features

  • Semantic Search: FAISS vector store with OpenAI embeddings for accurate document retrieval
  • Confidence-based Citations: Automatic citation generation with confidence levels based on similarity scores
  • Streaming Responses: Real-time answer generation with interactive UI
  • Document Processing: Automated PDF text extraction, cleaning, and chunking
  • Conversation History: Context-aware responses that consider previous exchanges

Architecture

  • Workflow: User Query → Embedding → Vector Search → Context Assembly → LLM → Cited Response
  • Document Processing: PyMuPDF and PyPDF2 for text extraction and metadata parsing
  • Vector Store: FAISS with cosine similarity search on normalized embeddings
  • Language Model: OpenAI GPT-4o-mini with streaming support
  • Frontend: Streamlit with custom CSS for chat interface

Technical Implementation

Document Processing Pipeline

  • PDF text extraction with page markers and metadata inference
  • Text cleaning (whitespace normalization, header/footer removal)
  • Semantic chunking with configurable overlap for context preservation
  • Automated metadata extraction (title, authors, publication year)

Retrieval System

  • 3072-dimensional embeddings using OpenAI's text-embedding-3-large
  • L2-normalized vectors for cosine similarity computation
  • Confidence thresholds for citation reliability (high: 0.8+, medium: 0.65+, low: 0.5+)
  • Thread-safe operations with persistent storage

Response Generation

  • Context assembly from retrieved chunks with confidence-based grouping
  • Conversation history integration (last 4 exchanges)
  • Citation formatting based on similarity confidence levels
  • Streaming response delivery with real-time UI updates

Installation

git clone <repository-url>
cd bioethics-rag
pip install -r requirements.txt
echo "OPENAI_API_KEY=your_key_here" > .env
streamlit run streamlit_app.py

Built for demonstration of RAG system design, vector search implementation, and conversational AI interfaces.