Spaces:

ciorant
/

bioethics-rag

Sleeping

App Files Files Community

bioethics-rag / README.md

ciorant

Update README.md

cfe0034 verified 6 months ago

preview code

raw

history blame contribute delete

2.42 kB

	---
	title: Bioethics RAG
	emoji: 🧬
	colorFrom: pink
	colorTo: purple
	sdk: streamlit
	sdk_version: 1.49.1
	app_file: streamlit_app.py
	pinned: false
	---

	# Bioethics AI Assistant

	A retrieval-augmented generation (RAG) system that provides intelligent answers to bioethics questions by searching through academic papers and generating contextually-aware responses.

	## Features

	- Semantic Search: FAISS vector store with OpenAI embeddings for accurate document retrieval
	- Confidence-based Citations: Automatic citation generation with confidence levels based on similarity scores
	- Streaming Responses: Real-time answer generation with interactive UI
	- Document Processing: Automated PDF text extraction, cleaning, and chunking
	- Conversation History: Context-aware responses that consider previous exchanges

	## Architecture
	- Workflow: User Query → Embedding → Vector Search → Context Assembly → LLM → Cited Response
	- Document Processing: PyMuPDF and PyPDF2 for text extraction and metadata parsing
	- Vector Store: FAISS with cosine similarity search on normalized embeddings
	- Language Model: OpenAI GPT-4o-mini with streaming support
	- Frontend: Streamlit with custom CSS for chat interface

	## Technical Implementation

	### Document Processing Pipeline
	- PDF text extraction with page markers and metadata inference
	- Text cleaning (whitespace normalization, header/footer removal)
	- Semantic chunking with configurable overlap for context preservation
	- Automated metadata extraction (title, authors, publication year)

	### Retrieval System
	- 3072-dimensional embeddings using OpenAI's `text-embedding-3-large`
	- L2-normalized vectors for cosine similarity computation
	- Confidence thresholds for citation reliability (high: 0.8+, medium: 0.65+, low: 0.5+)
	- Thread-safe operations with persistent storage

	### Response Generation
	- Context assembly from retrieved chunks with confidence-based grouping
	- Conversation history integration (last 4 exchanges)
	- Citation formatting based on similarity confidence levels
	- Streaming response delivery with real-time UI updates

	## Installation

	```bash
	git clone <repository-url>
	cd bioethics-rag
	pip install -r requirements.txt
	echo "OPENAI_API_KEY=your_key_here" > .env
	streamlit run streamlit_app.py
	```

	Built for demonstration of RAG system design, vector search implementation, and conversational AI interfaces.