| title: BPL RAG Spring 2026 | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: streamlit | |
| sdk_version: 1.32.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| # Boston Public Library - RAG Search System (Spring 2026) | |
| Natural language search for the BPL Digital Commonwealth collection using Retrieval-Augmented Generation. | |
| ## 🎯 Project Overview | |
| This system lets users search the Boston Public Library's digital collections with natural language queries. It was built as part of the BU Spark! DS549 course in Spring 2026. | |
| ### Key Features | |
| - **Semantic Search**: Natural language queries across 445K+ BPL items | |
| - **Full-Text Search**: Search within document content, not just metadata | |
| - **Metadata Filtering**: Time-based, location-based, and type-based filtering | |
| - **Evaluation Framework**: Built-in testing with gold-standard datasets | |
| - **PostgreSQL + pgvector**: Scalable vector database backend | |
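As a rough illustration of how metadata filtering could combine with pgvector similarity search, the sketch below builds a parameterized SQL query. The table and column names (`items`, `embedding`, `year`, `item_type`) and the function name are hypothetical and not taken from this repository; `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import Any, Optional


def build_filtered_vector_query(
    query_embedding: list[float],
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    item_type: Optional[str] = None,
    limit: int = 10,
) -> tuple[str, list[Any]]:
    """Build a parameterized pgvector query with optional metadata filters.

    Schema names (items, embedding, year, item_type) are assumptions.
    """
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

    conditions: list[str] = []
    filter_params: list[Any] = []
    if year_from is not None:
        conditions.append("year >= %s")
        filter_params.append(year_from)
    if year_to is not None:
        conditions.append("year <= %s")
        filter_params.append(year_to)
    if item_type is not None:
        conditions.append("item_type = %s")
        filter_params.append(item_type)

    where_clause = ("WHERE " + " AND ".join(conditions)) if conditions else ""
    sql = (
        "SELECT id, title, embedding <=> %s::vector AS distance "
        "FROM items "
        f"{where_clause} "
        "ORDER BY distance "
        "LIMIT %s"
    )
    # Parameter order follows placeholder order: vector, filters, limit.
    params = [vector_literal, *filter_params, limit]
    return sql, params
```

The returned pair can be passed to a database cursor, for example `cursor.execute(sql, params)`, so that user-supplied filter values are never interpolated into the SQL string.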
| ## Quick Start | |
| 1. Enter your query in natural language (e.g., "What were important events in Boston in 1919?") | |
| 2. View retrieved documents with relevance scores | |
| 3. Read AI-generated contextual explanations | |
| ## Example Queries | |
| - "Find pictures of JFK's house on Cape Cod" | |
| - "Are there any maps of Worcester, MA from the 18th century?" | |
| - "Show me depictions of indigenous Americans" | |
| - "What were important historical events in Boston in 1919?" | |
| ## 🛠️ Technical Stack | |
| - **Frontend**: Streamlit | |
| - **Database**: PostgreSQL with pgvector extension | |
| - **Embeddings**: Sentence Transformers | |
| - **LLM**: OpenAI GPT-4 / Anthropic Claude | |
| - **Retrieval**: BM25 + Vector Search with metadata filtering | |
| - **Evaluation**: DeepEval framework | |
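The stack above pairs BM25 with vector search but does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as this project's actual method. The function name and the constant `k=60` are illustrative choices.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[str]], k: int = 60
) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank is
    1-based); scores are summed across lists.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_results = ["a", "b", "c"]      # hypothetical keyword-search ranking
vector_results = ["b", "c", "d"]    # hypothetical embedding-search ranking
fused = reciprocal_rank_fusion([bm25_results, vector_results])
```

Documents that rank well in both lists rise to the top, and because only ranks are used, BM25 scores and vector distances never need to be put on a common scale.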
| ## Project Structure | |
| ``` | |
| current_spring2026/ | |
| ├── app.py # Main Streamlit application | |
| ├── pipeline.py # RAG pipeline orchestration | |
| ├── config.py # Configuration management | |
| ├── database/ # Database connection & queries | |
| ├── embedding/ # Vector embeddings | |
| ├── retrieval/ # Document retrieval logic | |
| ├── generation/ # LLM response generation | |
| ├── evaluation/ # Testing & metrics | |
| ├── ingestion/ # Data processing pipeline | |
| └── scripts/ # Utility scripts | |
| ``` | |
| ## Environment Variables | |
| Required secrets (set in Space Settings → Repository secrets): | |
| ``` | |
| OPENAI_API_KEY=your_openai_key | |
| DATABASE_URL=postgresql://user:pass@host:port/dbname | |
| ``` | |
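A minimal sketch of how an application might validate these two secrets at startup is shown below. The `load_settings` function and the keys of the returned dictionary are hypothetical; the repository's own `config.py` may be organized differently.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse

REQUIRED_VARS = ("OPENAI_API_KEY", "DATABASE_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read required secrets and split DATABASE_URL into its parts.

    Raises RuntimeError if a required variable is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )

    parsed = urlparse(env["DATABASE_URL"])
    if parsed.scheme not in ("postgresql", "postgres"):
        raise ValueError("DATABASE_URL must use the postgresql:// scheme")

    return {
        "openai_api_key": env["OPENAI_API_KEY"],
        "db_host": parsed.hostname,
        "db_port": parsed.port or 5432,  # default PostgreSQL port
        "db_name": parsed.path.lstrip("/"),
        "db_user": parsed.username,
    }
```

Failing fast with a message that names the missing variable tends to be easier to debug in a hosted Space than a connection error raised later in the request path.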
| ## Spring 2026 Improvements | |
| Building on Fall 2025 work, this version adds: | |
| - ✅ Full-text document search (not just metadata) | |
| - ✅ Gold-standard evaluation dataset | |
| - ✅ Structured logging of queries and metrics | |
| - ✅ Improved retrieval metrics (Precision, Recall, MRR) | |
| - ✅ Enhanced UI with developer debug view | |
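For reference, the three retrieval metrics named above can be computed as in the following sketch. It uses the standard textbook definitions (precision@k, recall@k, and mean reciprocal rank); the function names are illustrative and the project's evaluation module may differ in detail.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

With a gold-standard dataset, each query contributes one `(retrieved, relevant)` pair, so the same labeled data can feed all three metrics.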
| ## Data | |
| - **Source**: Digital Commonwealth (BPL subset) | |
| - **Records**: ~445,000 items | |
| - **Types**: Photographs, maps, newspapers, manuscripts, books | |
| - **Full-text**: 210K items with OCR (~1.5M pages) | |
| ## Team | |
| **Spring 2026 Spark! Team** | |
| - Boston University Data Science (DS549) | |
| - Client: Eben English, Boston Public Library | |
| **Previous Semesters** | |
| - Fall 2025: Infrastructure & pgvector migration | |
| - Fall 2024: Initial RAG prototype | |
| ## License | |
| MIT License | |
| ## Links | |
| - [Digital Commonwealth](https://www.digitalcommonwealth.org/) | |
| - [BPL Digital Repository](https://www.digitalcommonwealth.org/institutions/boston-public-library) | |
| - [Project Documentation](https://github.com/your-repo-link) | |
| --- | |
| *Built with ❤️ by BU Spark! for Boston Public Library* | |