metadata
title: BPL RAG Spring 2026
emoji: π
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
Boston Public Library - RAG Search System (Spring 2026)
Natural language search for the BPL Digital Commonwealth collection using Retrieval-Augmented Generation.
🎯 Project Overview
This system lets users search the Boston Public Library's digital collections with natural language queries. It was built as part of the BU Spark! DS549 course in Spring 2026.
Key Features
- Semantic Search: Natural language queries across 445K+ BPL items
- Full-Text Search: Search within document content, not just metadata
- Metadata Filtering: Time-based, location-based, and type-based filtering
- Evaluation Framework: Built-in testing with gold-standard datasets
- PostgreSQL + pgvector: Scalable vector database backend
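As a rough illustration of how metadata filtering and vector search could be combined in a single pgvector query, the sketch below builds a parameterized SQL statement. The table `items`, its columns (`year`, `place`, `item_type`, `embedding`), and the `Filters` class are hypothetical names chosen for this example and are not taken from the project code. The only pgvector feature relied on is the `<=>` cosine-distance operator.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class Filters:
    """Hypothetical filter set; the field names are illustrative only."""
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    place: Optional[str] = None
    item_type: Optional[str] = None


def build_search_sql(filters: Filters, top_k: int = 10) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params) for a filtered nearest-neighbour query.

    Assumes a hypothetical table items(id, title, year, place, item_type, embedding).
    """
    conditions = {
        "year_from": "year >= %(year_from)s",
        "year_to": "year <= %(year_to)s",
        "place": "place = %(place)s",
        "item_type": "item_type = %(item_type)s",
    }
    clauses = []
    params: Dict[str, Any] = {"top_k": top_k}
    # Only add a WHERE condition for filters the user actually set.
    for name, clause in conditions.items():
        value = getattr(filters, name)
        if value is not None:
            clauses.append(clause)
            params[name] = value

    parts = [
        # <=> is pgvector's cosine-distance operator (smaller means closer).
        "SELECT id, title, embedding <=> %(query_vec)s::vector AS distance",
        "FROM items",
    ]
    if clauses:
        parts.append("WHERE " + " AND ".join(clauses))
    parts.append("ORDER BY distance LIMIT %(top_k)s")
    return " ".join(parts), params
```

Before executing the statement, the caller would still need to add `params["query_vec"]`, for example as a pgvector text literal such as `"[0.1,0.2,0.3]"`. Keeping values in named parameters rather than formatting them into the string avoids SQL injection from user-supplied filter values.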
Quick Start
- Enter your query in natural language (e.g., "What were important events in Boston in 1919?")
- View retrieved documents with relevance scores
- Read AI-generated contextual explanations
Example Queries
- "Find pictures of JFK's house on Cape Cod"
- "Are there any maps of Worcester, MA from the 18th century?"
- "Show me depictions of indigenous Americans"
- "What were important historical events in Boston in 1919?"
🛠️ Technical Stack
- Frontend: Streamlit
- Database: PostgreSQL with pgvector extension
- Embeddings: Sentence Transformers
- LLM: OpenAI GPT-4 / Anthropic Claude
- Retrieval: BM25 + Vector Search with metadata filtering
- Evaluation: DeepEval framework
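The stack lists BM25 and vector search side by side but does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below. This is an assumption for illustration, not necessarily what `retrieval/` implements. The function name and the constant `k=60` (a conventional default for RRF) are also choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id so the
    output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example: ids returned by a keyword (BM25) search and by a vector search.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF only uses rank positions, so it avoids having to normalize BM25 scores and cosine distances onto a common scale.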
Project Structure
current_spring2026/
├── app.py          # Main Streamlit application
├── pipeline.py     # RAG pipeline orchestration
├── config.py       # Configuration management
├── database/       # Database connection & queries
├── embedding/      # Vector embeddings
├── retrieval/      # Document retrieval logic
├── generation/     # LLM response generation
├── evaluation/     # Testing & metrics
├── ingestion/      # Data processing pipeline
└── scripts/        # Utility scripts
Environment Variables
Required secrets (set in Space Settings → Repository secrets):
OPENAI_API_KEY=your_openai_key
DATABASE_URL=postgresql://user:pass@host:port/dbname
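A minimal sketch of how an application might read and sanity-check these two variables at startup is shown below. The function name `load_settings` and the shape of the returned dictionary are hypothetical. The project's actual `config.py` may be organized differently.

```python
import os
from typing import Dict, Mapping, Optional
from urllib.parse import urlparse

REQUIRED_VARS = ("OPENAI_API_KEY", "DATABASE_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Read required secrets from the environment and fail early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    url = urlparse(env["DATABASE_URL"])
    if url.scheme not in ("postgresql", "postgres"):
        raise ValueError("DATABASE_URL must start with postgresql://")

    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "database_url": env["DATABASE_URL"],
        # Parsed pieces are convenient for logging without exposing the password.
        "db_host": url.hostname,
        "db_name": url.path.lstrip("/"),
    }
```

Failing at startup with the names of the missing variables tends to be easier to debug than a connection error surfacing later inside a request.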
Spring 2026 Improvements
Building on Fall 2025 work, this version adds:
- Full-text document search (not just metadata)
- Gold-standard evaluation dataset
- Structured logging of queries and metrics
- Improved retrieval metrics (Precision, Recall, MRR)
- Enhanced UI with developer debug view
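For reference, the three retrieval metrics named above can be computed as in the sketch below, using their standard definitions. The function names and the example document ids are illustrative. The project's `evaluation/` module may compute them differently, for example through DeepEval.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of 1/rank of the first relevant document, over all queries.

    A query with no relevant document in its results contributes 0.
    """
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Example with made-up ids: two of the three relevant items are retrieved.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d4", "d9"}
print(precision_at_k(retrieved, relevant, 2))  # 0.5
print(recall_at_k(retrieved, relevant, 4))     # 0.6666666666666666
```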
Data
- Source: Digital Commonwealth (BPL subset)
- Records: ~445,000 items
- Types: Photographs, maps, newspapers, manuscripts, books
- Full-text: 210K items with OCR (~1.5M pages)
👥 Team
Spring 2026 Spark! Team
- Boston University Data Science (DS549)
- Client: Eben English, Boston Public Library
Previous Semesters
- Fall 2025: Infrastructure & pgvector migration
- Fall 2024: Initial RAG prototype
License
MIT License
Links
Built with ❤️ by BU Spark! for Boston Public Library