Team Handoff: Data Layer - How to Use It

What's Built

Two data stores that work together for the chatbot:

  1. SQLite database (bps_schools.db): 1,019 schools after dedup (111 BPS + 908 non-BPS), with structured fields for hard filtering (grade, age, location, provider type, boolean program flags).

  2. FAISS vector store (vector_store/school_index.faiss): the 111 BPS schools, each with a text description embedded as a 384-dimensional vector for soft filtering / RAG (e.g., "I want a school with strong arts programs").

Source data: raw_data/choice_tool_raw.json.

The two layers combine: the hard filter narrows the set to eligible schools, then semantic search ranks those schools by the user's preferences.
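The filter-then-rank idea can be sketched without any of the project's dependencies. This is a minimal illustration, not the actual BPSDatabase implementation: the function names, the toy records, and the 3-dimensional vectors are all made up for the example (the real embeddings are 384-dimensional and live in the FAISS index).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_then_rank(schools, query_vec, grade, top_k=5):
    """Hard-filter by grade range, then rank the survivors by similarity."""
    eligible = [s for s in schools if s["grade_min"] <= grade <= s["grade_max"]]
    ranked = sorted(eligible, key=lambda s: cosine(s["vec"], query_vec), reverse=True)
    return [s["school"] for s in ranked[:top_k]]

# Hypothetical records; names and vectors are illustrative only.
schools = [
    {"school": "Arts Academy", "grade_min": 6, "grade_max": 12, "vec": [0.9, 0.1, 0.0]},
    {"school": "STEM Middle", "grade_min": 6, "grade_max": 8, "vec": [0.1, 0.9, 0.0]},
    {"school": "Early Learning Center", "grade_min": -2, "grade_max": 1, "vec": [1.0, 0.0, 0.0]},
]
arts_query = [1.0, 0.0, 0.0]  # stand-in for an embedded "arts programs" query

top = filter_then_rank(schools, arts_query, grade=6, top_k=2)
print(top)
```

Note that "Early Learning Center" has the closest vector to the query but is dropped by the grade filter before ranking, which is the point of running the hard filter first.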

How to Use in the Chatbot

Everything goes through one class:

from database import BPSDatabase

db = BPSDatabase()

Key methods:

1. Hard filter (concrete constraints):

# BPS schools serving 3rd grade
results = db.hard_filter(grade=3)

# Schools within 1 mile of a location
results = db.hard_filter(grade=1, lat=42.35, lon=-71.06, radius_miles=1.0)

# Non-BPS schools for a 3-year-old (36 months)
results = db.find_schools_by_age(36)

# Boolean filters (UPK, ADA, accepts_ccfa, headstart, etc.)
results = db.find_schools_by_filters(accepts_ccfa=1, headstart=1)

2. Semantic search (fuzzy queries, BPS only):

results = db.semantic_search("bilingual Spanish dual language", top_k=5)

3. Combined search (hard filter first, then rank by semantic match):

results = db.search(
    query="arts programs visual arts music",
    grade=6,
    provider_type="Boston Public School",
    top_k=5
)

Other useful methods:

db.get_school_detail("school-id")   # Full record + RAG description
db.get_all_provider_types()          # List of distinct provider types
db.find_schools_near(42.35, -71.08)  # Proximity search
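
For readers who want to see what a radius or proximity filter typically computes, here is a sketch using the haversine great-circle distance. Whether BPSDatabase uses haversine internally is an assumption; the helper names and the two sample schools below are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(schools, lat, lon, radius_miles):
    """Return names of schools inside the radius, nearest first."""
    hits = []
    for s in schools:
        d = haversine_miles(lat, lon, s["latitude"], s["longitude"])
        if d <= radius_miles:
            hits.append((d, s["school"]))
    return [name for _, name in sorted(hits)]

# Hypothetical coordinates, chosen only to illustrate the cutoff.
sample = [
    {"school": "Near School", "latitude": 42.351, "longitude": -71.061},
    {"school": "Far School", "latitude": 42.40, "longitude": -71.06},
]
nearby = within_radius(sample, 42.35, -71.06, radius_miles=1.0)
print(nearby)
```

A radius measured this way is a straight-line distance, so it only approximates walking distance.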

Schema

Primary key is id (TEXT). Key columns:

| Column | Type | Notes |
| --- | --- | --- |
| id | TEXT | Primary key |
| school | TEXT | School name |
| provider_type | TEXT | "Boston Public School", "Family Child Care", etc. |
| grade_min / grade_max | INTEGER | BPS only. K0 = -2, K1 = -1, K2 = 0, grades 1-12 as 1-12 |
| grade_min_sped | INTEGER | BPS special education lower bound |
| age_min_months / age_max_months | INTEGER | Non-BPS only. Age range in months |
| latitude / longitude | REAL | For proximity search |
| Boolean flags | INTEGER | UPK, ADA, accepts_ccfa, headstart, has_language_program, has_advanced_placement, has_international_baccalaureate, uniform, special_admission, surround_care, build_care, tuition |
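
To show how these columns support range filters, here is a small sketch against an in-memory SQLite database. The table name `schools`, the sample rows, and the exact SQL are assumptions for illustration; the real queries live inside database.py and may differ. Only a subset of the columns is included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schools (
        id TEXT PRIMARY KEY, school TEXT, provider_type TEXT,
        grade_min INTEGER, grade_max INTEGER,
        age_min_months INTEGER, age_max_months INTEGER,
        accepts_ccfa INTEGER, headstart INTEGER
    )
""")
# Hypothetical rows: BPS rows carry grades, non-BPS rows carry ages in months.
rows = [
    ("a1", "Example Elementary", "Boston Public School", -1, 6, None, None, 0, 0),
    ("b2", "Example Family Care", "Family Child Care", None, None, 2, 60, 1, 1),
    ("c3", "Example High", "Boston Public School", 9, 12, None, None, 0, 0),
]
conn.executemany("INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

def by_grade(grade):
    """Schools whose grade range covers the given encoded grade."""
    sql = ("SELECT school FROM schools "
           "WHERE grade_min <= ? AND grade_max >= ? ORDER BY school")
    return [r[0] for r in conn.execute(sql, (grade, grade))]

def by_age(months):
    """Schools whose age range (in months) covers the given age."""
    sql = ("SELECT school FROM schools "
           "WHERE age_min_months <= ? AND age_max_months >= ? ORDER BY school")
    return [r[0] for r in conn.execute(sql, (months, months))]

print(by_grade(3))   # grade 3
print(by_age(36))    # a 3-year-old
```

Because comparisons against NULL are never true in SQL, a grade filter skips rows that have only age fields, and an age filter skips rows that have only grade fields.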

Grade Encoding

| User says | Pass as |
| --- | --- |
| K0 / Pre-K | -2 |
| K1 / Kindergarten | -1 |
| K2 | 0 |
| 1st grade | 1 |
| ... | ... |
| 12th grade | 12 |

Non-BPS schools use age in months instead of grades.
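
A chatbot will usually need to turn what the user typed into one of these integers before calling the filter methods. The helper below is a hypothetical sketch, not part of database.py: it handles only the K0/K1/K2 codes and numeric grades, and leaves free-text labels such as "Pre-K" to the caller.

```python
import re

# Encoding taken from the table above.
KINDER_CODES = {"k0": -2, "k1": -1, "k2": 0}

def encode_grade(text):
    """Map 'K0', 'K2', '3', '3rd grade', ... to the integer grade encoding."""
    token = text.strip().lower()
    if token in KINDER_CODES:
        return KINDER_CODES[token]
    # Accept "3", "3rd", or "3rd grade" for grades 1 through 12.
    match = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)?(?:\s+grade)?", token)
    if match and 1 <= int(match.group(1)) <= 12:
        return int(match.group(1))
    raise ValueError(f"Unrecognized grade: {text!r}")

print(encode_grade("K1"))
print(encode_grade("3rd grade"))
```

Raising on unrecognized input, rather than guessing, lets the chatbot ask a follow-up question instead of silently filtering on the wrong grade.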

File Structure

data/
├── database.py               # BPSDatabase class: import this
├── build_database.py         # Build pipeline (run once to rebuild)
├── bps_schools.db            # SQLite database (don't edit directly)
├── raw_data/
│   └── choice_tool_raw.json  # Source data (1,027 records, 1,019 after dedup)
├── vector_store/
│   ├── school_index.faiss    # FAISS vector index (BPS only, 111 schools)
│   ├── documents.json        # Text descriptions
│   └── metadata.json         # Structured metadata
└── TEAM_HANDOFF.md           # This file

Setup

pip install faiss-cpu sentence-transformers numpy

The database and vector store are already built (committed to the repo). Only run build_database.py if you change the source data.

Rebuilding

cd data
python build_database.py                # Build both DB and vector store
python build_database.py --db-only      # SQLite only
python build_database.py --vector-only  # Vector store only

What's NOT in the Data

  • Walk zone boundaries (approximated with radius search)
  • Transportation eligibility rules
  • School capacity / available seats
  • MCAS scores / accountability ratings