{ "cells": [ { "cell_type": "markdown", "id": "e2192e55", "metadata": {}, "source": [ "# Project: CCSS Standard Alignment using BM25 and SPLADE\n", "\n", "---\n", "\n", "## Background\n", "\n", "### BM25 (Best Matching 25)\n", "\n", "BM25 is a **traditional lexical retrieval model** used in information retrieval systems (like search engines). It builds on the **term frequency–inverse document frequency (TF-IDF)** idea, adding term-frequency saturation and normalization for document length.\n", "\n", "**Core Characteristics:**\n", "- Lexical-only: matches exact words (not synonyms/paraphrases)\n", "- Scores documents using a tunable function (parameters `k1` and `b`) of:\n", "  - **Term frequency (TF)** – how often a query term appears in the doc\n", "  - **Inverse Document Frequency (IDF)** – how rare the term is overall\n", "  - **Document length normalization**\n", "- Fast and interpretable\n", "\n", "**Strengths:**\n", "- Simple and fast\n", "- Strong for keyword-heavy queries\n", "- Works well on small datasets\n", "\n", "**Limitations:**\n", "- Cannot understand synonyms, rephrasing, or context\n", "\n", "---\n", "\n", "### SPLADE (Sparse Lexical and Expansion Model)\n", "\n", "SPLADE is a **neural sparse retriever** that combines the **interpretability of sparse vectors** with the **semantic power of transformers (like BERT)**.\n", "\n", "**How it works:**\n", "- Instead of dense embeddings (like SBERT), SPLADE generates **sparse term-weighted vectors** over the model's vocabulary\n", "- These vectors can:\n", "  - Activate terms **not explicitly in the query** (semantic expansion)\n", "  - Assign importance scores to vocabulary terms\n", "- Works with **inverted indexes**, as BM25 does, but with term weights learned by a neural model\n", "\n", "**Strengths:**\n", "- Captures paraphrasing and synonyms\n", "- Sparse and interpretable\n", "- Works better on natural language queries\n", "\n", "**Limitations:**\n", "- Slower than BM25\n", "- Typically needs a GPU for efficient inference\n", "\n", "---\n", "\n", "## Project Overview\n", "\n", 
"### Goal:\n", "\n", "Build a system that **automatically aligns educational content (e.g., lesson descriptions, learning objectives)** to the most relevant **Common Core State Standards (CCSS)** for English Language Arts (ELA).\n", "\n", "---\n", "\n", "### Approach:\n", "\n", "We implement and compare **two retrieval pipelines**:\n", "\n", "| Component | Pipeline 1 | Pipeline 2 |\n", "|---------------|----------------------|------------------------|\n", "| Model | BM25 | SPLADE |\n", "| Representation | Token frequency | Sparse transformer weights |\n", "| Input | Free-form text | Free-form text |\n", "| Output | Top-N most relevant CCSS standards with scores | Top-N most relevant CCSS standards with scores |\n", "\n", "---\n", "\n", "### Dataset:\n", "\n", "- Source: `CCSS Common Core Standards.xlsx`\n", "- Focus: Only **ELA standards**\n", "- Fields used: `ID`, `Category`, `Sub Category`, `State Standard`\n", "\n", "---\n", "\n", "### Output Format:\n", "\n", "Each pipeline returns a list of matches:\n", "```json\n", "[\n", "  {\n", "    \"rank\": 1,\n", "    \"score\": 10.87,\n", "    \"ID\": \"4.RI.2\",\n", "    \"Category\": \"Reading Informational\",\n", "    \"Sub Category\": \"Key Ideas and Details\",\n", "    \"State Standard\": \"Determine the main idea of a text...\"\n", "  },\n", "  ...\n", "]\n", "```" ] }, { "cell_type": "code", "execution_count": 28, "id": "cfa8b1b6", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re\n", "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "code", "execution_count": 46, "id": "748918e3", "metadata": {}, "outputs": [], "source": [ "stop_words = set(stopwords.words('english'))" ] }, { "cell_type": "code", "execution_count": 62, "id": "3cf17d78", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('/Users/shivendragupta/Desktop/internship25/CCSS/data/CCSS Common Core Standards(English Standards).csv')" ] }, { "cell_type": "code", "execution_count": 63, "id": "ee2b47e5", 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "<table>\n", "  <thead>\n", "    <tr><th></th><th>ID</th><th>Category</th><th>Sub Category</th><th>State Standard</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>K.RL.1</td><td>Reading Literature</td><td>Key Ideas and Details</td><td>With prompting and support, ask and answer que...</td></tr>\n", "    <tr><th>1</th><td>K.RL.2</td><td>Reading Literature</td><td>Key Ideas and Details</td><td>With prompting and support, retell familiar st...</td></tr>\n", "    <tr><th>2</th><td>K.RL.3</td><td>Reading Literature</td><td>Key Ideas and Details</td><td>With prompting and support, identify character...</td></tr>\n", "    <tr><th>3</th><td>K.RL.4</td><td>Reading Literature</td><td>Craft and Structure</td><td>Ask and answer questions about unknown words i...</td></tr>\n", "    <tr><th>4</th><td>K.RL.5</td><td>Reading Literature</td><td>Craft and Structure</td><td>Recognize common types of texts (e.g., storybo...</td></tr>\n", "  </tbody>\n", "</table>" ], "text/plain": [ " ID Category Sub Category \\\n", "0 K.RL.1 Reading Literature Key Ideas and Details \n", "1 K.RL.2 Reading Literature Key Ideas and Details \n", "2 K.RL.3 Reading Literature Key Ideas and Details \n", "3 K.RL.4 Reading Literature Craft and Structure \n", "4 K.RL.5 Reading Literature Craft and Structure \n", "\n", " State Standard \n", "0 With prompting and support, ask and answer que... \n", "1 With prompting and support, retell familiar st... \n", "2 With prompting and support, identify character... \n", "3 Ask and answer questions about unknown words i... \n", "4 Recognize common types of texts (e.g., storybo... " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head() # Display the first few rows of the DataFrame" ] }, { "cell_type": "code", "execution_count": 64, "id": "3958653b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1486, 4)" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 65, "id": "0e747290", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ID 501\n", "Category 501\n", "Sub Category 501\n", "State Standard 501\n", "dtype: int64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "code", "execution_count": 66, "id": "34001c04", "metadata": {}, "outputs": [], "source": [ "df.dropna(inplace=True) # Drop rows with any NaN values" ] }, { "cell_type": "markdown", "id": "5e02750b", "metadata": {}, "source": [ "# ```Preprocessing data```" ] }, { "cell_type": "code", "execution_count": 67, "id": "506a332c", "metadata": {}, "outputs": [], "source": [ "def clean_text(text: str) -> str:\n", "    # Normalize whitespace and replace typographic quotes and dashes with ASCII ones\n", "    text = text.strip()\n", "    text = text.replace(\"\\n\", \" \").replace(\"\\xa0\", \" \")\n", "    text = text.replace(\"“\", \"\\\"\").replace(\"”\", \"\\\"\").replace(\"–\", \"-\")\n", "    return text" ] }, 
{ "cell_type": "markdown", "id": "a97a9cfb", "metadata": {}, "source": [ "## ```Lower Casing```" ] }, { "cell_type": "code", "execution_count": 38, "id": "2f843d49", "metadata": {}, "outputs": [], "source": [ "def lowercase(text: str) -> str:\n", "    return text.lower()" ] }, { "cell_type": "markdown", "id": "4bc931f1", "metadata": {}, "source": [ "## ```Removing Punctuation```" ] }, { "cell_type": "code", "execution_count": 39, "id": "734d4b30", "metadata": {}, "outputs": [], "source": [ "def remove_punctuation(text: str) -> str:\n", "    # Drop every character that is neither a word character nor whitespace\n", "    return re.sub(r\"[^\\w\\s]\", \"\", text)" ] }, { "cell_type": "markdown", "id": "d12ab012", "metadata": {}, "source": [ "## ``` Removing Stop Words ```" ] }, { "cell_type": "code", "execution_count": 49, "id": "b925980c", "metadata": {}, "outputs": [], "source": [ "def remove_stopwords(text: str) -> str:\n", "    tokens = word_tokenize(text)\n", "    return ' '.join([word for word in tokens if word not in stop_words])" ] }, { "cell_type": "markdown", "id": "814c3818", "metadata": {}, "source": [ "## ``` Lemmatization ```" ] }, { "cell_type": "code", "execution_count": 41, "id": "70500704", "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()" ] }, { "cell_type": "code", "execution_count": 50, "id": "7e287fc2", "metadata": {}, "outputs": [], "source": [ "def lemmatize_tokens(text: str) -> str:\n", "    tokens = word_tokenize(text)\n", "    # Without a POS tag, WordNetLemmatizer treats every token as a noun\n", "    return ' '.join([lemmatizer.lemmatize(token) for token in tokens])\n" ] }, { "cell_type": "markdown", "id": "39bd33a6", "metadata": {}, "source": [ "## ``` Pipeline ```" ] }, { "cell_type": "code", "execution_count": 68, "id": "b443ffec", "metadata": {}, "outputs": [], "source": [ "def preprocessing_pipeline(text: str) -> str:\n", "    text = clean_text(text)\n", "    text = lowercase(text)\n", "    text = remove_punctuation(text)\n", "    text = remove_stopwords(text)\n", "    text = lemmatize_tokens(text)\n", "    return text" ] }, { "cell_type": "code", "execution_count": 69, "id": "13a7d65a", 
"metadata": {}, "outputs": [], "source": [ "df['State Standard'] = df['State Standard'].apply(preprocessing_pipeline)" ] }, { "cell_type": "code", "execution_count": 70, "id": "f5a8cb5d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 prompting support ask answer question key deta...\n", "1 prompting support retell familiar story includ...\n", "2 prompting support identify character setting m...\n", "3 ask answer question unknown word text\n", "4 recognize common type text eg storybook poem\n", " ... \n", "980 use technology including internet produce publ...\n", "981 conduct short well sustained research project ...\n", "982 gather relevant information multiple authorita...\n", "983 draw evidence informational text support analy...\n", "984 write routinely extended time frame time refle...\n", "Name: State Standard, Length: 985, dtype: object" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['State Standard']" ] }, { "cell_type": "markdown", "id": "ecb1369e", "metadata": {}, "source": [ "## ``` BM25 Retriever Function ```" ] }, { "cell_type": "code", "execution_count": 73, "id": "41e30b91", "metadata": {}, "outputs": [], "source": [ "from rank_bm25 import BM25Okapi" ] }, { "cell_type": "code", "execution_count": 74, "id": "d34de1f1", "metadata": {}, "outputs": [], "source": [ "tokenized_docs = [doc.lower().split() for doc in df['State Standard']]" ] }, { "cell_type": "code", "execution_count": 77, "id": "32594e27", "metadata": {}, "outputs": [], "source": [ "bm25 = BM25Okapi(tokenized_docs)" ] }, { "cell_type": "code", "execution_count": 155, "id": "3d552a24", "metadata": {}, "outputs": [], "source": [ "def retrieve_top_n_bm25(query: str, top_n: int = 5):\n", "    # Preprocess the query the same way as the indexed standards\n", "    query_tokens = preprocessing_pipeline(query)\n", "    tokenized_query = query_tokens.split()\n", "\n", "    scores = bm25.get_scores(tokenized_query)\n", "\n", "    # Positions of the top_n highest-scoring standards, best first\n", "    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]\n", "\n", 
"    results = []\n", "    for idx in top_indices:\n", "        row = df.iloc[idx]\n", "        results.append({\n", "            \"ID\": row[\"ID\"],\n", "            \"Category\": row[\"Category\"],\n", "            \"Sub Category\": row[\"Sub Category\"],\n", "            \"State Standard\": row[\"State Standard\"],\n", "            \"score\": round(float(scores[idx]), 4)\n", "        })\n", "    return results\n" ] }, { "cell_type": "code", "execution_count": 123, "id": "5a18deb6", "metadata": {}, "outputs": [], "source": [ "query = \"Identify the main idea of a text\"" ] }, { "cell_type": "code", "execution_count": 124, "id": "11954a8a", "metadata": {}, "outputs": [], "source": [ "results = retrieve_top_n_bm25(query, top_n=5)" ] }, { "cell_type": "markdown", "id": "49538e11", "metadata": {}, "source": [ "## ``` Top 5 Results from BM25 Retrieval ```" ] }, { "cell_type": "code", "execution_count": 125, "id": "f7d12c4d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'ID': '1.RI.2',\n", " 'Category': 'Reading Informational',\n", " 'Sub Category': 'Key Ideas and Details',\n", " 'State Standard': 'identify main topic retell key detail text',\n", " 'score': 10.666},\n", " {'ID': '3.RI.2',\n", " 'Category': 'Reading Informational',\n", " 'Sub Category': 'Key Ideas and Details',\n", " 'State Standard': 'determine main idea text recount key detail explain support main idea',\n", " 'score': 10.0953},\n", " {'ID': 'K.RI.2',\n", " 'Category': 'Reading Informational',\n", " 'Sub Category': 'Key Ideas and Details',\n", " 'State Standard': 'prompting support identify main topic retell key detail text',\n", " 'score': 9.8043},\n", " {'ID': '2.RI.6',\n", " 'Category': 'Reading Informational',\n", " 'Sub Category': 'Craft and Structure',\n", " 'State Standard': 'identify main purpose text including author want answer explain describe',\n", " 'score': 9.4236},\n", " {'ID': '2.RI.2',\n", " 'Category': 'Reading Informational',\n", " 'Sub Category': 'Key Ideas and Details',\n", " 'State Standard': 
'identify main topic multiparagraph text well focus specific paragraph within text',\n", " 'score': 9.3944}]" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results" ] }, { "cell_type": "markdown", "id": "1dd7ac6e", "metadata": {}, "source": [ "## ``` Using the SPLADE Sparse Retriever ```" ] }, { "cell_type": "code", "execution_count": 126, "id": "f8c3fee3", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "be42ffa9ef0949679ea06670a3436378", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/466 [00:00 *\"Identify the main idea of a text\"*\n", "\n", "---\n", "\n", "### Top-5 Results: **BM25**\n", "\n", "| Rank | ID | Category | Sub Category | State Standard | Score |\n", "|------|---------|----------------------|----------------------|----------------------------------------------------------------------------------|---------|\n", "| 1 | 1.RI.2 | Reading Informational| Key Ideas and Details| identify main topic retell key detail text | 10.666 |\n", "| 2 | 3.RI.2 | Reading Informational| Key Ideas and Details| determine main idea text recount key detail explain support main idea | 10.0953 |\n", "| 3 | K.RI.2 | Reading Informational| Key Ideas and Details| prompting support identify main topic retell key detail text | 9.8043 |\n", "| 4 | 2.RI.6 | Reading Informational| Craft and Structure | identify main purpose text including author want answer explain describe | 9.4236 |\n", "| 5 | 2.RI.2 | Reading Informational| Key Ideas and Details| identify main topic multiparagraph text well focus specific paragraph within text | 9.3944 |\n", "\n", "---\n", "\n", "### Top-5 Results: **SPLADE (Sparse Embedding Model)**\n", "\n", "| Rank | ID | Category | Sub Category | State Standard | Score |\n", 
"|------|---------|----------------------|----------------------|----------------------------------------------------------------------------------|---------|\n", "| 1 | 4.RI.2 | Reading Informational| Key Ideas and Details| determine main idea text explain supported key detail summarize text | 21.3089 |\n", "| 2 | 3.RI.2 | Reading Informational| Key Ideas and Details| determine main idea text recount key detail explain support main idea | 20.8493 |\n", "| 3 | 5.RI.2 | Reading Informational| Key Ideas and Details| determine two main idea text explain supported key detail summarize text | 20.2714 |\n", "| 4 | 3.SL.2 | Speaking & Listening | Comprehension and Collaboration | determine main idea supporting detail text read aloud information presented diverse medium format including visually quantitatively orally | 17.5151 |\n", "| 5 | 2.RI.6 | Reading Informational| Craft and Structure | identify main purpose text including author want answer explain describe | 17.512 |\n", "\n", "---\n", "\n", "### Insights:\n", "\n", "- Both **BM25 and SPLADE** place **\"3.RI.2\"** and **\"2.RI.6\"** in their top 5; these are the only two standards the lists share.\n", "- **SPLADE ranks paraphrased variants higher**: its top four results all say \"determine\" rather than the query's \"identify\", and add terms such as \"summarize\" and \"supported key detail\". This is consistent with its learned term expansion.\n", "- SPLADE's top three are the grade 3-5 \"main idea\" standards (**\"4.RI.2\"**, **\"3.RI.2\"**, **\"5.RI.2\"**), which match the intent of the query even though the wording differs.\n", "- BM25 relies on **exact term overlap**, so it favors standards that contain the query's own words, such as \"identify main topic\".\n", "\n", "---\n", "\n", "### Conclusion:\n", "\n", "| Feature | BM25 | SPLADE |\n", "|--------------------------|----------------------------|----------------------------------|\n", "| Matching Type | Exact lexical match | Semantic sparse match |\n", "| Interpretability | High (term overlap) | High (per-term weights) |\n", "| Handles Paraphrasing | No | Yes |\n", "| Use Case Fit | Good for short, exact queries | Great for natural language input |\n", "\n", 
"---\n", "\n", "### Top-1 Accuracy\n", "\n", "| Model | Top-1 Accuracy |\n", "|---------|----------------|\n", "| BM25 | **0.9959** |\n", "| SPLADE | **0.9797** |\n", "\n", "---\n", "\n", "### Insights\n", "\n", "- **BM25** reaches near-perfect Top-1 accuracy because each query is identical to an indexed document, so exact term matching almost always ranks the source standard first.\n", "- **SPLADE** scores slightly lower because it can rank **paraphrases or semantic neighbors** above the original text, even when that text is in the index.\n", "- Because the queries are the indexed standards themselves, these figures measure self-retrieval, not accuracy on new lesson text.\n", "\n", "---\n" ] } ], "metadata": { "kernelspec": { "display_name": "venv", "language": "python", "name": "venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }