{ "cells": [ { "cell_type": "markdown", "id": "e2192e55", "metadata": {}, "source": [ "# Project: CCSS Standard Alignment using BM25 and SPLADE\n", "\n", "---\n", "\n", "## Background\n", "\n", "### BM25 (Best Matching 25)\n", "\n", "BM25 is a **traditional lexical retrieval model** used in information retrieval systems such as search engines. It builds on the **term frequency–inverse document frequency (TF-IDF)** idea, adding term-frequency saturation and normalization for document length.\n", "\n", "**Core Characteristics:**\n", "- Lexical-only: matches exact tokens, not synonyms or paraphrases\n", "- Scores documents using a tunable function (parameters `k1` and `b`) of:\n", "  - **Term frequency (TF)** – how often a query term appears in the document\n", "  - **Inverse Document Frequency (IDF)** – how rare the term is across the collection\n", "  - **Document length normalization**\n", "- Fast and interpretable\n", "\n", "**Strengths:**\n", "- Simple and fast\n", "- Strong for keyword-heavy queries\n", "- Works well on small datasets\n", "\n", "**Limitations:**\n", "- Cannot understand synonyms, rephrasing, or context\n", "\n", "---\n", "\n", "### SPLADE (Sparse Lexical and Expansion Model)\n", "\n", "SPLADE is a **neural sparse retriever** that combines the **interpretability of sparse vectors** with the **semantic power of transformers (like BERT)**.\n", "\n", "**How it works:**\n", "- Instead of dense embeddings (like those produced by SBERT), SPLADE generates **sparse term-weighted vectors** over the model's vocabulary\n", "- These vectors can:\n", "  - Activate terms **not explicitly in the query** (semantic expansion)\n", "  - Assign importance scores to vocabulary terms\n", "- Works with **inverted indexes**, as BM25 does, but with learned term weights\n", "\n", "**Strengths:**\n", "- Captures paraphrasing and synonyms\n", "- Sparse and interpretable\n", "- Tends to work better than BM25 on natural-language queries\n", "\n", "**Limitations:**\n", "- Slower than BM25\n", "- Typically needs a GPU for efficient inference\n", "\n", "---\n", "\n", "## Project Overview\n", "\n", 
"### Goal:\n", "\n", "Build a system that **automatically aligns educational content (e.g., lesson descriptions, learning objectives)** to the most relevant **Common Core State Standards (CCSS)** for English Language Arts (ELA).\n", "\n", "---\n", "\n", "### Approach:\n", "\n", "We implement and compare **two retrieval pipelines**:\n", "\n", "| Component | Pipeline 1 | Pipeline 2 |\n", "|---------------|----------------------|------------------------|\n", "| Model | BM25 | SPLADE |\n", "| Representation | Token frequency | Sparse transformer weights |\n", "| Input | Free-form text | Free-form text |\n", "| Output | Top-N most relevant CCSS standards with scores | Top-N most relevant CCSS standards with scores |\n", "\n", "---\n", "\n", "### Dataset:\n", "\n", "- Source: `CCSS Common Core Standards.xlsx`\n", "- Focus: Only **ELA standards**\n", "- Fields used: `ID`, `Sub Category`, `State Standard`\n", "\n", "---\n", "\n", "### Output Format:\n", "\n", "Each pipeline returns a list of matches:\n", "```json\n", "[\n", "  {\n", "    \"rank\": 1,\n", "    \"score\": 10.87,\n", "    \"ID\": \"4.RI.2\",\n", "    \"Category\": \"Reading Informational\",\n", "    \"Sub Category\": \"Key Ideas and Details\",\n", "    \"State Standard\": \"Determine the main idea of a text...\"\n", "  },\n", "  ...\n", "]\n", "```" ] }, { "cell_type": "code", "execution_count": 28, "id": "cfa8b1b6", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": 46, "id": "748918e3", "metadata": {}, "outputs": [], "source": [ "stop_words = set(stopwords.words('english'))" ] }, { "cell_type": "code", "execution_count": 62, "id": "3cf17d78", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('/Users/shivendragupta/Desktop/internship25/CCSS/data/CCSS Common Core Standards(English Standards).csv')" ] }, { "cell_type": "code", "execution_count": 63, "id": "ee2b47e5", 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | ID | Category | Sub Category | State Standard |\n", "
|---|---|---|---|---|\n", "
| 0 | K.RL.1 | Reading Literature | Key Ideas and Details | With prompting and support, ask and answer que... |\n", "
| 1 | K.RL.2 | Reading Literature | Key Ideas and Details | With prompting and support, retell familiar st... |\n", "
| 2 | K.RL.3 | Reading Literature | Key Ideas and Details | With prompting and support, identify character... |\n", "
| 3 | K.RL.4 | Reading Literature | Craft and Structure | Ask and answer questions about unknown words i... |\n", "
| 4 | K.RL.5 | Reading Literature | Craft and Structure | Recognize common types of texts (e.g., storybo... |\n", "