metadata
title: BPL RAG Spring 2026
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit

Boston Public Library - RAG Search System (Spring 2026)

Natural language search for the BPL Digital Commonwealth collection using Retrieval-Augmented Generation.

🎯 Project Overview

This system lets users search the Boston Public Library's digital collections with natural language queries. It was built as part of the BU Spark! DS549 course in Spring 2026.

Key Features

  • Semantic Search: Natural language queries across 445K+ BPL items
  • Full-Text Search: Search within document content, not just metadata
  • Metadata Filtering: Time-based, location-based, and type-based filtering
  • Evaluation Framework: Built-in testing with gold-standard datasets
  • PostgreSQL + pgvector: Scalable vector database backend
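
As a rough illustration of how metadata filtering might combine with a pgvector similarity search, the sketch below builds a parameterized SQL query. The table and column names (`items`, `embedding`, `year`, `item_type`) and the function `build_filtered_query` are hypothetical and not taken from this repository; `<=>` is pgvector's cosine distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
from __future__ import annotations

from typing import Any, Optional


def build_filtered_query(
    query_embedding: list[float],
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    item_type: Optional[str] = None,
    limit: int = 10,
) -> tuple[str, list[Any]]:
    """Build a parameterized similarity query with optional metadata filters.

    Table and column names here are hypothetical placeholders.
    """
    # pgvector accepts a text literal such as '[0.1,0.2]' cast to ::vector.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

    conditions: list[str] = []
    filter_params: list[Any] = []
    if year_from is not None:
        conditions.append("year >= %s")
        filter_params.append(year_from)
    if year_to is not None:
        conditions.append("year <= %s")
        filter_params.append(year_to)
    if item_type is not None:
        conditions.append("item_type = %s")
        filter_params.append(item_type)

    where_clause = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    sql = (
        "SELECT id, title, embedding <=> %s::vector AS distance "
        "FROM items "
        f"{where_clause} "
        "ORDER BY distance "
        "LIMIT %s"
    )
    # Parameter order follows placeholder order: vector, filters, limit.
    params = [vector_literal, *filter_params, limit]
    return sql, params
```

The returned pair could be passed to a cursor's `execute(sql, params)`. Keeping the filter values as parameters rather than interpolating them into the string avoids SQL injection from user-supplied filters.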

🚀 Quick Start

  1. Enter your query in natural language (e.g., "What were important events in Boston in 1919?")
  2. View retrieved documents with relevance scores
  3. Read AI-generated contextual explanations

📊 Example Queries

  • "Find pictures of JFK's house on Cape Cod"
  • "Are there any maps of Worcester, MA from the 18th century?"
  • "Show me depictions of indigenous Americans"
  • "What were important historical events in Boston in 1919?"

🛠️ Technical Stack

  • Frontend: Streamlit
  • Database: PostgreSQL with pgvector extension
  • Embeddings: Sentence Transformers
  • LLM: OpenAI GPT-4 / Anthropic Claude
  • Retrieval: BM25 + Vector Search with metadata filtering
  • Evaluation: DeepEval framework
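
This README does not say how the BM25 and vector search results are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of this project's pipeline. The function name `reciprocal_rank_fusion` and the constant `k=60` are illustrative choices.

```python
from __future__ import annotations


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Higher total score ranks first.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]    # hypothetical keyword-search results
    vector_hits = ["b", "c", "d"]  # hypothetical embedding-search results
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores and vector distances onto a shared scale.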

📁 Project Structure

current_spring2026/
├── app.py                      # Main Streamlit application
├── pipeline.py                 # RAG pipeline orchestration
├── config.py                   # Configuration management
├── database/                   # Database connection & queries
├── embedding/                  # Vector embeddings
├── retrieval/                  # Document retrieval logic
├── generation/                 # LLM response generation
├── evaluation/                 # Testing & metrics
├── ingestion/                  # Data processing pipeline
└── scripts/                    # Utility scripts

🔐 Environment Variables

Required secrets (set in Space Settings → Repository secrets):

OPENAI_API_KEY=your_openai_key
DATABASE_URL=postgresql://user:pass@host:port/dbname
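
A minimal sketch of how an app might check for these two values at startup, assuming the secrets are exposed to the process as environment variables. The helper `load_settings` is hypothetical and is not necessarily how `config.py` works.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# The two variable names listed in this README.
REQUIRED_VARS = ("OPENAI_API_KEY", "DATABASE_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the required settings, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing fast with a message that names the missing variables tends to be easier to debug than a connection error raised later in the pipeline.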

📈 Spring 2026 Improvements

Building on Fall 2025 work, this version adds:

  • ✅ Full-text document search (not just metadata)
  • ✅ Gold-standard evaluation dataset
  • ✅ Structured logging of queries and metrics
  • ✅ Improved retrieval metrics (Precision, Recall, MRR)
  • ✅ Enhanced UI with developer debug view
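
For reference, the retrieval metrics named above can be computed roughly as follows. This is a generic sketch using the standard definitions, not the project's evaluation code; the function names and sample document IDs are illustrative.

```python
from __future__ import annotations


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant document per query.

    A query with no relevant document in its results contributes 0.
    """
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

Each `(retrieved, relevant)` pair corresponds to one query from a gold-standard set: the ranked IDs the system returned and the IDs judged relevant for that query.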

📚 Data

  • Source: Digital Commonwealth (BPL subset)
  • Records: ~445,000 items
  • Types: Photographs, maps, newspapers, manuscripts, books
  • Full-text: 210K items with OCR (~1.5M pages)

👥 Team

Spring 2026 Spark! Team

  • Boston University Data Science (DS549)
  • Client: Eben English, Boston Public Library

Previous Semesters

  • Fall 2025: Infrastructure & pgvector migration
  • Fall 2024: Initial RAG prototype

📄 License

MIT License

🔗 Links


Built with ❤️ by BU Spark! for Boston Public Library