Team Handoff: Data Layer - How to Use It
What's Built
Two data stores that work together for the chatbot:
- SQLite database (bps_schools.db): 1,019 schools after dedup (111 BPS + 908 non-BPS) with structured fields for hard filtering (grade, age, location, provider type, boolean program flags).
- FAISS vector store (vector_store/school_index.faiss): 111 BPS schools with text descriptions embedded as 384-dim vectors for soft filtering / RAG (e.g., "I want a school with strong arts programs").
Source data: raw_data/choice_tool_raw.json.
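The two layers combine: the hard filter narrows the set to eligible schools, then semantic search ranks them by the user's preferences. Because the vector store holds only the 111 BPS schools, semantic ranking applies to BPS schools only.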
How to Use in the Chatbot
Everything goes through one class:
from database import BPSDatabase
db = BPSDatabase()
Key methods:
1. Hard filter (concrete constraints):
# BPS schools serving 3rd grade
results = db.hard_filter(grade=3)
# Schools within 1 mile of a location
results = db.hard_filter(grade=1, lat=42.35, lon=-71.06, radius_miles=1.0)
# Non-BPS schools for a 3-year-old (36 months)
results = db.find_schools_by_age(36)
# Boolean filters (UPK, ADA, accepts_ccfa, headstart, etc.)
results = db.find_schools_by_filters(accepts_ccfa=1, headstart=1)
2. Semantic search (fuzzy queries, BPS only):
results = db.semantic_search("bilingual Spanish dual language", top_k=5)
3. Combined search (hard filter first, then rank by semantic match):
results = db.search(
    query="arts programs visual arts music",
    grade=6,
    provider_type="Boston Public School",
    top_k=5
)
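The filter-then-rank flow can be pictured with a small stand-alone sketch. It does not call BPSDatabase or FAISS: the toy records, the 3-dim vectors, and the helper names (`passes_hard_filter`, `rank_by_similarity`, `combined_search`) are all hypothetical, and cosine similarity is an assumed scoring choice. The real index stores 384-dim embeddings.

```python
import math

# Hypothetical toy records. Real rows live in bps_schools.db and real
# embeddings are 384-dim vectors in the FAISS index.
SCHOOLS = [
    {"id": "a", "grade_min": 0, "grade_max": 5, "vec": [0.9, 0.1, 0.0]},
    {"id": "b", "grade_min": 6, "grade_max": 8, "vec": [0.1, 0.9, 0.0]},
    {"id": "c", "grade_min": 6, "grade_max": 12, "vec": [0.8, 0.2, 0.1]},
]


def passes_hard_filter(school, grade):
    """Keep a school only if it serves the requested grade."""
    return school["grade_min"] <= grade <= school["grade_max"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(candidates, query_vec, top_k):
    """Order candidates by similarity to the query, best first."""
    scored = sorted(candidates, key=lambda s: cosine(s["vec"], query_vec), reverse=True)
    return [s["id"] for s in scored[:top_k]]


def combined_search(query_vec, grade, top_k=5):
    """Hard filter first, then semantic ranking of the survivors."""
    eligible = [s for s in SCHOOLS if passes_hard_filter(s, grade)]
    return rank_by_similarity(eligible, query_vec, top_k)
```

Filtering before ranking means a school that matches the query text well but does not serve the requested grade can never appear in the results.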
Other useful methods:
db.get_school_detail("school-id") # Full record + RAG description
db.get_all_provider_types() # List of distinct provider types
db.find_schools_near(42.35, -71.08) # Proximity search
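Radius filtering and proximity search both need a distance between two latitude/longitude points. This handoff does not say how database.py computes it. The sketch below assumes the haversine great-circle formula, a common choice for this. `haversine_miles` and `within_radius` are hypothetical helpers, not methods of BPSDatabase.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(schools, lat, lon, radius_miles):
    """Return ids of schools whose coordinates fall inside the radius."""
    return [
        s["id"]
        for s in schools
        if haversine_miles(lat, lon, s["latitude"], s["longitude"]) <= radius_miles
    ]
```

Near Boston's latitude, 0.02 degrees of longitude is roughly one mile, so the two example points used above (-71.06 and -71.08) sit about a mile apart.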
Schema
Primary key is id (TEXT). Key columns:
| Column | Type | Notes |
|---|---|---|
| id | TEXT | Primary key |
| school | TEXT | School name |
| provider_type | TEXT | "Boston Public School", "Family Child Care", etc. |
| grade_min / grade_max | INTEGER | BPS only. K0=-2, K1=-1, K2=0, 1-12 |
| grade_min_sped | INTEGER | BPS special education lower bound |
| age_min_months / age_max_months | INTEGER | Non-BPS only. Age range in months |
| latitude / longitude | REAL | For proximity search |
| Boolean flags | INTEGER | UPK, ADA, accepts_ccfa, headstart, has_language_program, has_advanced_placement, has_international_baccalaureate, uniform, special_admission, surround_care, build_care, tuition |
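To show how these columns support hard filtering, here is a minimal sketch against an in-memory SQLite database. The table name `schools`, the cut-down column list, the sample rows, and both helper functions are assumptions for illustration. The real table name and the SQL that database.py runs are not given in this handoff.

```python
import sqlite3

# In-memory stand-in for bps_schools.db with a subset of the columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE schools (
        id TEXT PRIMARY KEY, school TEXT, provider_type TEXT,
        grade_min INTEGER, grade_max INTEGER,
        age_min_months INTEGER, age_max_months INTEGER,
        headstart INTEGER)"""
)
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("s1", "Example Elementary", "Boston Public School", -1, 5, None, None, 0),
        ("s2", "Example High", "Boston Public School", 9, 12, None, None, 0),
        ("s3", "Example Child Care", "Family Child Care", None, None, 12, 60, 1),
    ],
)


def schools_serving_grade(conn, grade):
    """BPS-style lookup: the grade must fall inside [grade_min, grade_max]."""
    rows = conn.execute(
        "SELECT id FROM schools WHERE grade_min <= ? AND grade_max >= ? ORDER BY id",
        (grade, grade),
    )
    return [row[0] for row in rows]


def schools_serving_age(conn, age_months):
    """Non-BPS-style lookup: the age in months must fall inside the range."""
    rows = conn.execute(
        "SELECT id FROM schools WHERE age_min_months <= ? AND age_max_months >= ? ORDER BY id",
        (age_months, age_months),
    )
    return [row[0] for row in rows]
```

Rows with NULL in the grade columns (non-BPS) drop out of the grade query on their own, because a comparison with NULL is never true in SQL. The same holds for BPS rows in the age query.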
Grade Encoding
| User says | Pass as |
|---|---|
| K0 / Pre-K | -2 |
| K1 / Kindergarten | -1 |
| K2 | 0 |
| 1st grade | 1 |
| ... | ... |
| 12th grade | 12 |
Non-BPS schools use age in months instead of grades.
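A chatbot will usually receive the grade as free text, so a small translation step is needed before calling the hard filter. The sketch below follows the table above. `encode_grade` and `years_to_months` are hypothetical helpers, and the accepted spellings are an assumption about user input.

```python
import re

# Labels taken from the encoding table above.
GRADE_CODES = {
    "K0": -2,
    "PRE-K": -2,
    "K1": -1,
    "KINDERGARTEN": -1,
    "K2": 0,
}


def encode_grade(label):
    """Map a label such as 'K1', '3rd grade', or 12 to its integer code."""
    text = str(label).strip().upper()
    if text in GRADE_CODES:
        return GRADE_CODES[text]
    match = re.search(r"\d+", text)
    if match and 1 <= int(match.group()) <= 12:
        return int(match.group())
    raise ValueError(f"Unrecognized grade: {label!r}")


def years_to_months(years):
    """Non-BPS lookups take age in months, e.g. a 3-year-old is 36 months."""
    return int(years * 12)
```

Raising on unknown input, rather than guessing, lets the chatbot ask a follow-up question instead of running a filter on the wrong grade.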
File Structure
data/
├── database.py              # BPSDatabase class (import this)
├── build_database.py        # Build pipeline (run once to rebuild)
├── bps_schools.db           # SQLite database (don't edit directly)
├── raw_data/
│   └── choice_tool_raw.json # Source data (1,027 records, 1,019 after dedup)
├── vector_store/
│   ├── school_index.faiss   # FAISS vector index (BPS only, 111 schools)
│   ├── documents.json       # Text descriptions
│   └── metadata.json        # Structured metadata
└── TEAM_HANDOFF.md          # This file
Setup
pip install faiss-cpu sentence-transformers numpy
The database and vector store are already built (committed to the repo). Only run build_database.py if you change the source data.
Rebuilding
cd data
python build_database.py # Build both DB and vector store
python build_database.py --db-only # SQLite only
python build_database.py --vector-only # Vector store only
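The three invocations above imply that the two flags are alternatives and that running with no flag builds everything. How build_database.py actually parses its arguments is not shown in this handoff. The following is only a sketch of that behaviour using the standard argparse module, with `parse_build_args` as a hypothetical name.

```python
import argparse


def parse_build_args(argv):
    """Translate the CLI flags into which stores to rebuild (assumed logic)."""
    parser = argparse.ArgumentParser(description="Rebuild the data layer (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--db-only", action="store_true", help="SQLite only")
    group.add_argument("--vector-only", action="store_true", help="Vector store only")
    args = parser.parse_args(argv)
    return {
        "build_db": not args.vector_only,
        "build_vectors": not args.db_only,
    }
```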
What's NOT in the Data
- Walk zone boundaries (approximated with radius search)
- Transportation eligibility rules
- School capacity / available seats
- MCAS scores / accountability ratings