
han-na

Initial Spring 2026 deployment

a3ae00a about 2 months ago


metadata

title: BPL RAG Spring 2026
emoji: 📚
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit

Boston Public Library - RAG Search System (Spring 2026)

Natural language search for the BPL Digital Commonwealth collection using Retrieval-Augmented Generation.

🎯 Project Overview

This system lets users search the Boston Public Library's digital collections with natural language queries. It was built as part of the BU Spark! DS549 course in Spring 2026.

Key Features

Semantic Search: Natural language queries across 445K+ BPL items
Full-Text Search: Search within document content, not just metadata
Metadata Filtering: Time-based, location-based, and type-based filtering
Evaluation Framework: Built-in testing with gold-standard datasets
PostgreSQL + pgvector: Scalable vector database backend
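
As a rough illustration of how semantic search and metadata filtering can combine in a single pgvector query, here is a minimal sketch. The table and column names (`items`, `embedding`, `year`, `item_type`) and the function `build_search_query` are hypothetical and are not taken from this repository; the `%s` placeholders assume a psycopg-style driver. The `<=>` operator is pgvector's cosine distance.

```python
from typing import Any, Optional


def build_search_query(
    query_embedding: list[float],
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    item_type: Optional[str] = None,
    limit: int = 10,
) -> tuple[str, list[Any]]:
    """Build a parameterized similarity query with optional metadata filters.

    Schema names here are assumptions made for illustration only.
    """
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

    conditions: list[str] = []
    filter_params: list[Any] = []
    if year_from is not None:
        conditions.append("year >= %s")
        filter_params.append(year_from)
    if year_to is not None:
        conditions.append("year <= %s")
        filter_params.append(year_to)
    if item_type is not None:
        conditions.append("item_type = %s")
        filter_params.append(item_type)

    where = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    sql = (
        # 1 - cosine distance gives cosine similarity.
        "SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM items "
        f"{where} "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    # Parameter order must match placeholder order in the SQL text.
    params = [vector_literal, *filter_params, vector_literal, limit]
    return sql, params
```

A query such as "maps of Worcester, MA from the 18th century" could then map to `build_search_query(embedding, year_from=1700, year_to=1799, item_type="map")`, with the returned SQL and parameters passed to the database cursor.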

🚀 Quick Start

Enter your query in natural language (e.g., "What were important events in Boston in 1919?")
View retrieved documents with relevance scores
Read AI-generated contextual explanations

📊 Example Queries

"Find pictures of JFK's house on Cape Cod"
"Are there any maps of Worcester, MA from the 18th century?"
"Show me depictions of indigenous Americans"
"What were important historical events in Boston in 1919?"

🛠️ Technical Stack

Frontend: Streamlit
Database: PostgreSQL with pgvector extension
Embeddings: Sentence Transformers
LLM: OpenAI GPT-4 / Anthropic Claude
Retrieval: BM25 + Vector Search with metadata filtering
Evaluation: DeepEval framework
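
The stack above combines BM25 and vector search, but this README does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of this project's code. The function name and the constant `k = 60` are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, with
    ranks starting at 1. Higher fused scores rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for stability.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]      # hypothetical BM25 ranking
    vector_hits = ["b", "d", "a"]    # hypothetical vector-search ranking
    for doc_id, score in reciprocal_rank_fusion([bm25_hits, vector_hits]):
        print(doc_id, round(score, 5))
```

RRF only needs ranks, so it avoids having to normalize BM25 scores and cosine similarities onto a shared scale.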

📁 Project Structure

current_spring2026/
├── app.py                      # Main Streamlit application
├── pipeline.py                 # RAG pipeline orchestration
├── config.py                   # Configuration management
├── database/                   # Database connection & queries
├── embedding/                  # Vector embeddings
├── retrieval/                  # Document retrieval logic
├── generation/                 # LLM response generation
├── evaluation/                 # Testing & metrics
├── ingestion/                  # Data processing pipeline
└── scripts/                    # Utility scripts

🔐 Environment Variables

Required secrets (set in Space Settings → Repository secrets):

OPENAI_API_KEY=your_openai_key
DATABASE_URL=postgresql://user:pass@host:port/dbname
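
A minimal sketch of how an app might read and validate these two secrets at startup. The `Settings` class and `load_settings` function are hypothetical names, not the actual contents of `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse

REQUIRED_VARS = ("OPENAI_API_KEY", "DATABASE_URL")


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    database_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read required secrets and fail early with a clear message."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    # Only the URL scheme is checked here; the driver validates the rest.
    scheme = urlparse(env["DATABASE_URL"]).scheme
    if scheme not in ("postgresql", "postgres"):
        raise ValueError("DATABASE_URL must use the postgresql:// scheme")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        database_url=env["DATABASE_URL"],
    )
```

Failing at startup makes a missing secret visible in the Space logs immediately, rather than on the first user query.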

📈 Spring 2026 Improvements

Building on Fall 2025 work, this version adds:

✅ Full-text document search (not just metadata)
✅ Gold-standard evaluation dataset
✅ Structured logging of queries and metrics
✅ Improved retrieval metrics (Precision, Recall, MRR)
✅ Enhanced UI with developer debug view
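
For reference, the retrieval metrics named above can be computed as in the sketch below. This is plain Python written for illustration; it is not the project's evaluation code and does not use the DeepEval API. Function names and the sample document IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant document per query.

    A query with no relevant document in its results contributes 0.
    """
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


if __name__ == "__main__":
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = {"d2", "d4", "d9"}
    print(precision_at_k(retrieved, relevant, k=2))
    print(recall_at_k(retrieved, relevant, k=4))
    print(mean_reciprocal_rank([(retrieved, relevant), (["x", "y"], {"x"})]))
```

In a gold-standard setup, `relevant` would come from the labeled dataset and `retrieved` from the pipeline's ranked output for the same query.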

📚 Data

Source: Digital Commonwealth (BPL subset)
Records: ~445,000 items
Types: Photographs, maps, newspapers, manuscripts, books
Full-text: 210K items with OCR (~1.5M pages)

👥 Team

Spring 2026 Spark! Team

Boston University Data Science (DS549)
Client: Eben English, Boston Public Library

Previous Semesters

Fall 2025: Infrastructure & pgvector migration
Fall 2024: Initial RAG prototype

📄 License

MIT License

🔗 Links

Built with ❤️ by BU Spark! for Boston Public Library