---
title: BPL RAG Spring 2026
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
---
# Boston Public Library - RAG Search System (Spring 2026)
Natural language search for the BPL Digital Commonwealth collection using Retrieval-Augmented Generation.
## 🎯 Project Overview
This system allows users to search through the Boston Public Library's digital collections using natural language queries. Built as part of the BU Spark! DS549 course in Spring 2026.
### Key Features
- **Semantic Search**: Natural language queries across 445K+ BPL items
- **Full-Text Search**: Search within document content, not just metadata
- **Metadata Filtering**: Time-based, location-based, and type-based filtering
- **Evaluation Framework**: Built-in testing with gold-standard datasets
- **PostgreSQL + pgvector**: Scalable vector database backend
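As an illustration of how metadata filtering can be combined with vector search in pgvector, here is a minimal sketch that builds a parameterized SQL query. The table and column names (`items`, `embedding`, `year`, `item_type`) and the function `build_filtered_query` are hypothetical and do not come from this repository. The only pgvector feature relied on is the `<=>` cosine-distance operator.

```python
from typing import Any, Dict, Optional, Tuple


def build_filtered_query(
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    item_type: Optional[str] = None,
    limit: int = 10,
) -> Tuple[str, Dict[str, Any]]:
    """Build a similarity query with optional metadata filters.

    Assumed (hypothetical) schema: items(id, title, year, item_type, embedding).
    The caller adds params["query_vec"] (e.g. "[0.1,0.2,...]") before executing.
    """
    conditions = []
    params: Dict[str, Any] = {"limit": limit}

    # Only add a filter clause when the caller supplied that filter.
    if year_from is not None:
        conditions.append("year >= %(year_from)s")
        params["year_from"] = year_from
    if year_to is not None:
        conditions.append("year <= %(year_to)s")
        params["year_to"] = year_to
    if item_type is not None:
        conditions.append("item_type = %(item_type)s")
        params["item_type"] = item_type

    parts = [
        # <=> is pgvector's cosine-distance operator (smaller is closer).
        "SELECT id, title, embedding <=> %(query_vec)s::vector AS distance",
        "FROM items",
    ]
    if conditions:
        parts.append("WHERE " + " AND ".join(conditions))
    parts.append("ORDER BY distance")
    parts.append("LIMIT %(limit)s")
    return "\n".join(parts), params
```

For a query such as the 18th-century Worcester maps example, a caller might pass `year_from=1700, year_to=1799, item_type="map"`. Values are passed as named parameters rather than interpolated into the SQL string, which avoids injection issues with user-supplied filters.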
## 🚀 Quick Start
1. Enter your query in natural language (e.g., "What were important events in Boston in 1919?")
2. View retrieved documents with relevance scores
3. Read AI-generated contextual explanations
## 📊 Example Queries
- "Find pictures of JFK's house on Cape Cod"
- "Are there any maps of Worcester, MA from the 18th century?"
- "Show me depictions of indigenous Americans"
- "What were important historical events in Boston in 1919?"
## 🛠️ Technical Stack
- **Frontend**: Streamlit
- **Database**: PostgreSQL with pgvector extension
- **Embeddings**: Sentence Transformers
- **LLM**: OpenAI GPT-4 / Anthropic Claude
- **Retrieval**: BM25 + Vector Search with metadata filtering
- **Evaluation**: DeepEval framework
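This README does not say how the BM25 and vector search results are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of this project's retrieval code. The function name and the constant `k = 60` (a conventional default) are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score ranks first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # hypothetical keyword-search ranking
vector_hits = ["b", "d", "a"]  # hypothetical embedding-search ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF uses only rank positions, so it does not need BM25 scores and cosine distances to be on a comparable scale.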
## 📁 Project Structure
```
current_spring2026/
├── app.py          # Main Streamlit application
├── pipeline.py     # RAG pipeline orchestration
├── config.py       # Configuration management
├── database/       # Database connection & queries
├── embedding/      # Vector embeddings
├── retrieval/      # Document retrieval logic
├── generation/     # LLM response generation
├── evaluation/     # Testing & metrics
├── ingestion/      # Data processing pipeline
└── scripts/        # Utility scripts
```
## 🔐 Environment Variables
Required secrets (set in Space Settings → Repository secrets):
```
OPENAI_API_KEY=your_openai_key
DATABASE_URL=postgresql://user:pass@host:port/dbname
```
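Secrets set on a Space are exposed to the app as environment variables. A minimal sketch of reading and validating them at startup might look like the following. The function `load_settings` is hypothetical and is not necessarily how `config.py` handles configuration.

```python
import os
from typing import Dict, Mapping, Optional

# The two variables listed above as required.
REQUIRED_VARS = ("OPENAI_API_KEY", "DATABASE_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: source[name] for name in REQUIRED_VARS}
```

Failing fast with the names of the missing variables is usually easier to debug than a connection error that surfaces later in the pipeline.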
## 📈 Spring 2026 Improvements
Building on Fall 2025 work, this version adds:
- ✅ Full-text document search (not just metadata)
- ✅ Gold-standard evaluation dataset
- ✅ Structured logging of queries and metrics
- ✅ Improved retrieval metrics (Precision, Recall, MRR)
- ✅ Enhanced UI with developer debug view
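For readers unfamiliar with the retrieval metrics named above, the following sketch shows their standard definitions. It is a standalone illustration with made-up document IDs, not the code in `evaluation/`, and the function names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average over queries of 1 / (rank of the first relevant document).

    A query with no relevant document in its results contributes 0.
    """
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Two hypothetical queries with gold-standard relevant sets.
run_1 = (["d1", "d2", "d3", "d4"], {"d2", "d4"})
run_2 = (["x", "y"], {"x"})

print(precision_at_k(run_1[0], run_1[1], k=2))  # 0.5
print(recall_at_k(run_1[0], run_1[1], k=2))     # 0.5
print(mean_reciprocal_rank([run_1, run_2]))     # 0.75
```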
## 📚 Data
- **Source**: Digital Commonwealth (BPL subset)
- **Records**: ~445,000 items
- **Types**: Photographs, maps, newspapers, manuscripts, books
- **Full-text**: 210K items with OCR (~1.5M pages)
## 👥 Team
**Spring 2026 Spark! Team**
- Boston University Data Science (DS549)
- Client: Eben English, Boston Public Library
**Previous Semesters**
- Fall 2025: Infrastructure & pgvector migration
- Fall 2024: Initial RAG prototype
## 📄 License
MIT License
## 🔗 Links
- [Digital Commonwealth](https://www.digitalcommonwealth.org/)
- [BPL Digital Repository](https://www.digitalcommonwealth.org/institutions/boston-public-library)
- [Project Documentation](https://github.com/your-repo-link)
---
*Built with ❤️ by BU Spark! for Boston Public Library*