# Project: CCSS Standard Alignment using BM25 and SPLADE

---

## Background

### BM25 (Best Matching 25)

BM25 is a **traditional lexical retrieval model** used in information retrieval systems (like search engines). It builds on the **term frequency–inverse document frequency (TF-IDF)** idea, adding term-frequency saturation and normalization for document length.

**Core Characteristics:**
- Lexical-only: matches exact words (not synonyms/paraphrases)
- Scores documents using a tunable function of:
  - **Term frequency (TF)** – how often a query term appears in the doc
  - **Inverse Document Frequency (IDF)** – how rare the term is overall
  - **Document length normalization**
- Fast and interpretable
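
The scoring function described above can be sketched in a few lines. This is a minimal illustration, not the `rank_bm25` implementation used later in this notebook: the function name `bm25_scores`, the toy documents, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every tokenized doc against the query with an Okapi BM25-style formula."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" variant keeps weights non-negative)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical, already tokenized)
docs = [
    "determine main idea text".split(),
    "ask answer question unknown word".split(),
    "identify main topic retell key detail".split(),
]
scores = bm25_scores("main idea".split(), docs)
print(scores)
```

Both query terms occur in the first toy document, so it scores highest; the third shares only `main`, and the second shares nothing and scores zero.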

**Strengths:**
- Simple and fast
- Strong for keyword-heavy queries
- Works well on small datasets

**Limitations:**
- Cannot understand synonyms, rephrasing, or context

---

### SPLADE (Sparse Lexical and Expansion Model)

SPLADE is a **neural sparse retriever** that combines the **interpretability of sparse vectors** with the **semantic power of transformers (like BERT)**.

**How it works:**
- Instead of dense embeddings (like BERT or SBERT), SPLADE generates **sparse term-weighted vectors**
- These vectors can:
  - Activate terms **not explicitly in the query** (semantic expansion)
  - Assign importance scores to vocabulary terms
- Supports use of **inverted indexes** like BM25, but with neural knowledge
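
The term-weighting step can be illustrated without a model. The sketch below uses made-up logits over a four-word vocabulary (`splade_pool`, `vocab`, and all numbers are hypothetical) and applies `log(1 + ReLU(x))` followed by max-pooling over token positions, which is the pooling the SPLADE cell later in this notebook uses.

```python
import math

def splade_pool(logits, vocab):
    """Max-pool log(1 + ReLU(logit)) over token positions; keep only non-zero terms."""
    weights = {}
    for j, term in enumerate(vocab):
        w = max(math.log1p(max(row[j], 0.0)) for row in logits)
        if w > 0.0:
            weights[term] = w
    return weights

# Hypothetical 2-token input scored against a 4-term vocabulary
vocab = ["main", "idea", "central", "banana"]
logits = [
    [2.0, 0.5, 1.0, -3.0],  # logits at position 0
    [0.1, 3.0, 0.0, -1.0],  # logits at position 1
]
vec = splade_pool(logits, vocab)
print(vec)
```

In this toy case `central` receives a positive weight even though it need not appear in the input (the expansion effect), while `banana` has only negative logits and is dropped, which is what keeps the vector sparse.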

**Strengths:**
- Captures paraphrasing and synonyms
- Sparse and interpretable
- Works better on natural language queries

**Limitations:**
- Slower than BM25
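- Benefits from a GPU for efficient inference at scale (the 985 standards in this notebook were encoded in about 40 seconds without moving the model to a GPU)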

---

## Project Overview

### Goal:

Build a system that **automatically aligns educational content (e.g., lesson descriptions, learning objectives)** to the most relevant **Common Core State Standards (CCSS)** for English Language Arts (ELA).

---

### Approach:

We implement and compare **two retrieval pipelines**:

| Component      | Pipeline 1                                     | Pipeline 2                                     |
|----------------|------------------------------------------------|------------------------------------------------|
| Model          | BM25                                           | SPLADE                                         |
| Representation | Token frequency                                | Sparse transformer weights                     |
| Input          | Free-form text                                 | Free-form text                                 |
| Output         | Top-N most relevant CCSS standards with scores | Top-N most relevant CCSS standards with scores |

---

### Dataset:

- Source: `CCSS Common Core Standards.xlsx`, loaded in this notebook as `CCSS Common Core Standards(English Standards).csv`
- Focus: Only **ELA standards**
- Fields used: `ID`, `Category`, `Sub Category`, `State Standard`

---

### Output Format:

Each pipeline returns a list of matches:
```json
[
  {
    "rank": 1,
    "score": 10.87,
    "ID": "4.RI.2",
    "Category": "Reading Informational",
    "Sub Category": "Key Ideas and Details",
    "State Standard": "Determine the main idea of a text..."
  },
  ...
]
```


In [28]:
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [46]:
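# Assumes the NLTK stopwords, tokenizer (punkt), and WordNet data have been fetched with nltk.download()
stop_words = set(stopwords.words('english'))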

In [62]:
df = pd.read_csv('/Users/shivendragupta/Desktop/internship25/CCSS/data/CCSS Common Core Standards(English Standards).csv')

In [63]:
df.head()  # Display the first few rows of the DataFrame

Unnamed: 0,ID,Category,Sub Category,State Standard
0,K.RL.1,Reading Literature,Key Ideas and Details,"With prompting and support, ask and answer que..."
1,K.RL.2,Reading Literature,Key Ideas and Details,"With prompting and support, retell familiar st..."
2,K.RL.3,Reading Literature,Key Ideas and Details,"With prompting and support, identify character..."
3,K.RL.4,Reading Literature,Craft and Structure,Ask and answer questions about unknown words i...
4,K.RL.5,Reading Literature,Craft and Structure,"Recognize common types of texts (e.g., storybo..."


In [64]:
df.shape

(1486, 4)

In [65]:
df.isna().sum()

ID                501
Category          501
Sub Category      501
State Standard    501
dtype: int64

In [66]:
df.dropna(inplace=True)  # Drop rows with any NaN values

# ```Preprocessing data```

In [67]:
def clean_text(text: str) -> str:
    text = text.strip()
    text = text.replace("\n", " ").replace("\xa0", " ")
    text = text.replace("“", "\"").replace("”", "\"").replace("–", "-")
    return text

## ```Lower Casing```

In [38]:
def lowercase(text: str) -> str:
    return text.lower()

## ```Removing Punctuation```

In [39]:
def remove_punctuation(text: str) -> str:
    return re.sub(r"[^\w\s]", "", text)
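
A quick check of what this regex does to typical standard text may be useful. The sketch below repeats the function so it runs on its own; the two input strings are illustrative examples, not rows quoted from the dataset.

```python
import re

def remove_punctuation(text: str) -> str:
    # Drop every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

abbrev = remove_punctuation("common types of texts (e.g., storybooks, poems)")
hyphen = remove_punctuation("multi-paragraph text")
print(abbrev)  # common types of texts eg storybooks poems
print(hyphen)  # multiparagraph text
```

Because the character is removed rather than replaced by a space, abbreviations collapse (`e.g.` becomes `eg`) and hyphenated words fuse (`multi-paragraph` becomes `multiparagraph`), which explains tokens of that shape in the preprocessed standards.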

##  ``` Removing Stop Words ```

In [49]:
def remove_stopwords(text: str) -> str:
    tokens = word_tokenize(text)
    return ' '.join([word for word in tokens if word not in stop_words])

## ``` Lemmatization ```

In [41]:
lemmatizer = WordNetLemmatizer()

In [50]:
def lemmatize_tokens(text: str) -> str:
    tokens = word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(token) for token in tokens])


## ``` Pipeline ```

In [68]:
def preprocessing_pipeline(text: str) -> str:
    text = clean_text(text)
    text = lowercase(text)
    text = remove_punctuation(text)
    text = remove_stopwords(text)
    text = lemmatize_tokens(text)
    return text

In [69]:
df['State Standard'] = df['State Standard'].apply(preprocessing_pipeline)

In [70]:
df['State Standard']

0      prompting support ask answer question key deta...
1      prompting support retell familiar story includ...
2      prompting support identify character setting m...
3                  ask answer question unknown word text
4           recognize common type text eg storybook poem
                             ...                        
980    use technology including internet produce publ...
981    conduct short well sustained research project ...
982    gather relevant information multiple authorita...
983    draw evidence informational text support analy...
984    write routinely extended time frame time refle...
Name: State Standard, Length: 985, dtype: object

## ``` BM25 Retriever Function ```

In [73]:
from rank_bm25 import BM25Okapi

In [74]:
tokenized_docs = [doc.lower().split() for doc in df['State Standard']]

In [77]:
bm25 = BM25Okapi(tokenized_docs)

In [155]:
def retrieve_top_n_bm25(query: str, top_n=5):
    # Apply the same preprocessing to the query as to the indexed standards
    tokenized_query = preprocessing_pipeline(query).split()

    # One BM25 score per indexed standard
    scores = bm25.get_scores(tokenized_query)

    # Indices of the top_n highest-scoring standards, best first
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]

    results = []
    for idx in top_indices:
        row = df.iloc[idx]
        results.append({
            "ID": row["ID"],
            "Category": row["Category"],
            "Sub Category": row["Sub Category"],
            "standard": row["State Standard"],
            "score": round(scores[idx], 4)
        })
    return results
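
The top-N selection step sorts every score before slicing. For larger corpora, one possible alternative is `heapq.nlargest`, which the Python documentation describes as equivalent to `sorted(iterable, key=key, reverse=True)[:n]`. The helper name `top_n_indices` and the sample scores below are hypothetical.

```python
import heapq

def top_n_indices(scores, n=5):
    """Indices of the n largest scores, highest first, without sorting the full list."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)

sample_scores = [0.2, 3.1, 0.0, 7.4, 1.5]
best = top_n_indices(sample_scores, n=3)
print(best)  # [3, 1, 4]
```

With only 985 standards the full sort is already fast, so this is an optional refinement rather than a needed change.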


In [123]:
query = "Identify the main idea of a text"

In [124]:
results = retrieve_top_n_bm25(query, top_n=5)

## ``` Top 5 Results from BM25 Retrieval ```

In [125]:
results

[{'ID': '1.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details',
  'State Standard': 'identify main topic retell key detail text',
  'score': 10.666},
 {'ID': '3.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details',
  'State Standard': 'determine main idea text recount key detail explain support main idea',
  'score': 10.0953},
 {'ID': 'K.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details',
  'State Standard': 'prompting support identify main topic retell key detail text',
  'score': 9.8043},
 {'ID': '2.RI.6',
  'Category': 'Reading Informational',
  'Sub Category': 'Craft and Structure',
  'State Standard': 'identify main purpose text including author want answer explain describe',
  'score': 9.4236},
 {'ID': '2.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details',
  'State Standard': 'identify main topic multiparagraph text well focus specific paragraph within text',
  'score': 9.3944}]

## ``` Using SPLADE sparse retriever ```

In [126]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import torch.nn.functional as F

model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [127]:
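def get_splade_sparse_vector(text):
    """Return a {token: weight} dict of the non-zero SPLADE activations for `text`."""
    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        logits = model(**inputs).logits.squeeze(0)  # [seq_len, vocab_size]
        # SPLADE weighting: log(1 + ReLU(logits)), max-pooled over sequence positions
        splade_weights = torch.log1p(F.relu(logits)).max(dim=0).values  # [vocab_size]
        # squeeze(-1) keeps a 1-D index tensor even if only one term is active
        indices = torch.nonzero(splade_weights).squeeze(-1)
        tokens = tokenizer.convert_ids_to_tokens(indices.tolist())
        return dict(zip(tokens, splade_weights[indices].tolist()))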


In [128]:
from tqdm import tqdm

# Reset index to align doc IDs
df = df.reset_index(drop=True)

# Get list of standard texts
standard_texts = df["State Standard"].astype(str).tolist()

# Compute sparse vectors
splade_doc_vectors = [get_splade_sparse_vector(text) for text in tqdm(standard_texts)]


100%|██████████| 985/985 [00:39<00:00, 24.77it/s]


In [129]:
def dot_product_sparse(query_vec, doc_vec):
    return sum(query_vec.get(term, 0.0) * doc_vec.get(term, 0.0) for term in query_vec)
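
A small worked example may make the sparse dot product concrete. The vectors below are invented for illustration (real SPLADE vectors hold many more terms), and the function is restated so the block runs on its own.

```python
def dot_product_sparse(query_vec, doc_vec):
    # Only terms present in both vectors contribute to the score
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

# Hypothetical term weights
query_vec = {"main": 1.2, "idea": 1.5, "central": 0.4}
doc_a = {"main": 1.0, "idea": 2.0, "text": 0.8}
doc_b = {"central": 1.0, "theme": 1.5}

score_a = dot_product_sparse(query_vec, doc_a)  # 1.2*1.0 + 1.5*2.0
score_b = dot_product_sparse(query_vec, doc_b)  # 0.4*1.0
print(score_a, score_b)
```

`doc_b` shares no surface word with a query like "main idea", yet it still scores above zero because the expanded query vector carries weight on `central`.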


In [131]:
def retrieve_top_n_splade(query, top_n=5):
    query_vec = get_splade_sparse_vector(query)
    scores = [
        (dot_product_sparse(query_vec, doc_vec), idx)
        for idx, doc_vec in enumerate(splade_doc_vectors)
    ]
    
    top_matches = sorted(scores, reverse=True)[:top_n]
    
    results = []
    for score, idx in top_matches:
        results.append({
            "rank": len(results) + 1,
            "score": round(score, 4),
            "standard": df.iloc[idx]["State Standard"],
            "ID": df.iloc[idx]["ID"],
            "Category": df.iloc[idx]["Category"],
            "Sub Category": df.iloc[idx]["Sub Category"]
        })
    return results
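
The function above scores the query against every document vector in turn. Because the vectors are sparse, the same scores could also be computed through an inverted index, as mentioned in the Background section. The sketch below is one possible way to do that; `build_inverted_index`, `score_with_index`, and the toy vectors are hypothetical names and data, not part of the notebook.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def score_with_index(query_vec, index):
    """Accumulate dot-product scores by visiting only postings of the query's terms."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; documents sharing no term never appear
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy sparse vectors (made-up weights)
doc_vectors = [
    {"main": 1.0, "idea": 2.0},
    {"theme": 1.5},
    {"idea": 0.5, "text": 1.0},
]
index = build_inverted_index(doc_vectors)
ranked = score_with_index({"idea": 1.0, "main": 0.5}, index)
print(ranked)
```

Documents with no overlapping term are never touched, which is where the savings come from on larger collections; at 985 standards the brute-force loop is already adequate.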


In [146]:
query = "Identify the main idea of a text"
results = retrieve_top_n_splade(query)


In [147]:
results

[{'rank': 1,
  'score': 21.3089,
  'standard': 'determine main idea text explain supported key detail summarize text',
  'ID': '4.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details'},
 {'rank': 2,
  'score': 20.8493,
  'standard': 'determine main idea text recount key detail explain support main idea',
  'ID': '3.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details'},
 {'rank': 3,
  'score': 20.2714,
  'standard': 'determine two main idea text explain supported key detail summarize text',
  'ID': '5.RI.2',
  'Category': 'Reading Informational',
  'Sub Category': 'Key Ideas and Details'},
 {'rank': 4,
  'score': 17.5151,
  'standard': 'determine main idea supporting detail text read aloud information presented diverse medium format including visually quantitatively orally',
  'ID': '3.SL.2',
  'Category': 'Speaking & Listening',
  'Sub Category': 'Comprehension and Collaboration'},
 {'rank': 5,
  'score': 17.512,
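  'standard': 'identify main purpose text including author want answer explain describe',
  'ID': '2.RI.6',
  'Category': 'Reading Informational',
  'Sub Category': 'Craft and Structure'}]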

In [148]:
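def evaluate_top1_accuracy(df, retrieve_fn):
    """Self-retrieval check: query with each indexed standard and test whether it comes back at rank 1."""
    correct = 0
    total = len(df)

    for i in range(total):
        # The query is the already-preprocessed standard text itself
        query = df.loc[i, "State Standard"]
        expected = query.strip().lower()

        results = retrieve_fn(query, top_n=1)
        predicted = results[0]["standard"].strip().lower()

        # Matching is on text, so identically worded standards count as correct
        if predicted == expected:
            correct += 1

    accuracy = round(correct / total, 4)
    print(f"Top-1 Accuracy: {accuracy}")
    return accuracy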
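
The evaluation loop can be exercised without either model by passing stub retrievers. The sketch below is a simplified stand-in that works on a plain list instead of the DataFrame; `top1_accuracy`, `corpus`, and both stub retrievers are hypothetical.

```python
def top1_accuracy(texts, retrieve_fn):
    """Fraction of texts that a retriever returns as its own top-1 result."""
    correct = sum(
        retrieve_fn(text, top_n=1)[0]["standard"] == text
        for text in texts
    )
    return round(correct / len(texts), 4)

# Hypothetical corpus of preprocessed standards
corpus = ["determine main idea text", "ask answer question", "retell familiar story"]

def exact_retriever(query, top_n=1):
    # Always returns the query itself: the best case for self-retrieval
    return [{"standard": query}][:top_n]

def constant_retriever(query, top_n=1):
    # Always returns the first corpus entry, regardless of the query
    return [{"standard": corpus[0]}][:top_n]

acc_exact = top1_accuracy(corpus, exact_retriever)
acc_constant = top1_accuracy(corpus, constant_retriever)
print(acc_exact, acc_constant)  # 1.0 0.3333
```

The two stubs bracket the metric: a retriever that echoes the query scores 1.0, while one that ignores the query is right only for the single entry it always returns.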


In [156]:
# For BM25
evaluate_top1_accuracy(df, retrieve_top_n_bm25)

Top-1 Accuracy: 0.9959


0.9959

In [153]:
# For SPLADE
evaluate_top1_accuracy(df, retrieve_top_n_splade)


Top-1 Accuracy: 0.9797


0.9797

## Comparison: BM25 vs SPLADE for CCSS Alignment

**Query:**  
> *"Identify the main idea of a text"*

---

### Top-5 Results: **BM25**

| Rank | ID      | Category             | Sub Category         | State Standard                                                                 | Score   |
|------|---------|----------------------|----------------------|----------------------------------------------------------------------------------|---------|
| 1    | 1.RI.2  | Reading Informational| Key Ideas and Details| identify main topic retell key detail text                                       | 10.666  |
| 2    | 3.RI.2  | Reading Informational| Key Ideas and Details| determine main idea text recount key detail explain support main idea           | 10.0953 |
| 3    | K.RI.2  | Reading Informational| Key Ideas and Details| prompting support identify main topic retell key detail text                    | 9.8043  |
| 4    | 2.RI.6  | Reading Informational| Craft and Structure  | identify main purpose text including author want answer explain describe       | 9.4236  |
| 5    | 2.RI.2  | Reading Informational| Key Ideas and Details| identify main topic multiparagraph text well focus specific paragraph within text | 9.3944  |

---

### Top-5 Results: **SPLADE (Sparse Embedding Model)**

| Rank | ID      | Category             | Sub Category         | State Standard                                                                 | Score   |
|------|---------|----------------------|----------------------|----------------------------------------------------------------------------------|---------|
| 1    | 4.RI.2  | Reading Informational| Key Ideas and Details| determine main idea text explain supported key detail summarize text           | 21.3089 |
| 2    | 3.RI.2  | Reading Informational| Key Ideas and Details| determine main idea text recount key detail explain support main idea           | 20.8493 |
| 3    | 5.RI.2  | Reading Informational| Key Ideas and Details| determine two main idea text explain supported key detail summarize text       | 20.2714 |
| 4    | 3.SL.2  | Speaking & Listening | Comprehension and Collaboration | determine main idea supporting detail text read aloud information presented diverse medium format including visually quantitatively orally | 17.5151 |
| 5    | 2.RI.6  | Reading Informational| Craft and Structure  | identify main purpose text including author want answer explain describe       | 17.512  |

---

### Insights:

- Both **BM25 and SPLADE** place **"3.RI.2"** at rank 2 and include **"2.RI.6"** in the top 5.
- **SPLADE ranks the "determine main idea" standards highest** (4.RI.2, 3.RI.2, 5.RI.2), even though the query says "identify" rather than "determine", which is consistent with its semantic term expansion.
- SPLADE retrieves **higher-grade matches** like **"5.RI.2"** and **"4.RI.2"**, which are **semantically related** to the query but not lexically identical to it.
- BM25 relies on **exact term overlap**, favoring standards that repeat the query's own words, such as "identify main topic".

---

### Conclusion:

| Feature                  | BM25                       | SPLADE                           |
|--------------------------|----------------------------|----------------------------------|
| Matching Type            | Exact lexical match        | Semantic sparse match            |
| Interpretability         | High (term overlap)     | High (per-term weights)       |
| Handles Paraphrasing     | No                      | Yes                           |
| Use Case Fit             | Good for short, exact queries | Great for natural language input |

---

### Top-1 Accuracy

| Model   | Top-1 Accuracy |
|---------|----------------|
| BM25    | **0.9959**     |
| SPLADE  | **0.9797**     |

---

### Insights

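- **BM25** achieves near-perfect accuracy because each query is itself an indexed document, so exact term matching almost always returns it first. The figure is therefore a self-retrieval sanity check rather than a measure of alignment quality on unseen text.
- **SPLADE** performs slightly lower because it may **rank paraphrases or semantic neighbors** above the original text, even when that text is present in the index.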

---
