---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- student-checkins
- roadblock-detection
- nlp
- active-learning
- education
- classification
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# 🚧 Roadblock Classification Model (v2)

## 📌 Overview

The **Roadblock Classification Model (v2)** is a fine-tuned transformer model built on BERT that classifies student check-ins into two categories:

- **ROADBLOCK** → The student cannot move forward
- **NOT_ROADBLOCK** → The student is still making progress

The model is designed to capture **semantic meaning**, not just keywords, enabling it to differentiate between **difficulty** and **true blockage**.

---

## 🧠 Motivation

### ❌ Problem with Version 1

The first version of this model attempted to classify:

- struggles
- confusion
- being stuck

**all under one label.**

This created a major issue:

> The model could not distinguish between **temporary difficulty** and **actual inability to proceed**.

---

### 🔥 Why Version 2 Was Created

Version 2 was developed to **separate the definitions clearly**:

| Concept | Meaning |
|---------|---------|
| **Struggle** | The student is experiencing difficulty |
| **Roadblock** | The student cannot move forward |

---

### 💥 Key Insight

> Not all struggles are roadblocks.
Example:

| Check-in | Correct Label |
|----------|---------------|
| "I had problems but made progress" | NOT_ROADBLOCK |
| "I can't fix my code and I'm stuck" | ROADBLOCK |

---

## ⚙️ Model Architecture

- Base Model: `bert-base-uncased`
- Task: Binary Classification
- Framework: Hugging Face Transformers
- Training Environment: Google Colab (GPU)

---

## 📊 Dataset Design

The dataset was **synthetically generated and refined iteratively** to ensure:

### ✅ Semantic Accuracy

- Focus on meaning, not keywords

### ✅ Balanced Classes

- Controlled ROADBLOCK vs NOT_ROADBLOCK distribution

### ✅ Language Diversity

- Includes:
  - formal phrasing
  - informal/slang expressions
  - varied sentence structures

---

## 🚨 Bias Identification and Correction

### 🔍 Initial Problem

Early versions of the dataset showed **strong keyword bias**, such as:

- `"problem"` → always NOT_ROADBLOCK
- `"can't"` → always ROADBLOCK
- `"stuck"` → always ROADBLOCK

---

### ⚠️ Why This Was Dangerous

The model learned:

> ❌ keyword → label

instead of

> ✅ meaning → label

This caused incorrect predictions in real-world scenarios.

---

### 🔧 Bias Mitigation Strategy

To eliminate this bias, the dataset was redesigned to include:

#### 1. Keyword Symmetry

Each keyword appears under **both labels**:

| Keyword | ROADBLOCK | NOT_ROADBLOCK |
|---------|-----------|---------------|
| "problem" | ✔️ | ✔️ |
| "can't" | ✔️ | ✔️ |
| "stuck" | ✔️ | ✔️ |

---

#### 2. Contrastive Examples

Pairs of sentences with similar wording but different meanings:

- "I can't fix it and I'm stuck" → ROADBLOCK
- "I can't fix it yet but I'm making progress" → NOT_ROADBLOCK

---

#### 3. Pattern Diversity

Avoided over-reliance on patterns like:

- `"but"` → NOT_ROADBLOCK

Instead, the dataset also includes:

- "and I fixed it"
- "and it's working now"
- "and I solved it"

---

### ✅ Result

The model now learns:

> **progress vs. no progress**

instead of relying on surface-level patterns.
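The keyword-symmetry property described above can be audited with a few lines of plain Python. This is a minimal sketch: the mini-dataset, keyword list, and helper names below are illustrative assumptions, not part of this repository's actual training data or tooling.

```python
from collections import defaultdict

# Hypothetical mini-dataset illustrating keyword symmetry:
# each trigger word appears under BOTH labels.
dataset = [
    ("I had a problem but I solved it", "NOT_ROADBLOCK"),
    ("This problem is blocking me completely", "ROADBLOCK"),
    ("I can't fix it yet but I'm making progress", "NOT_ROADBLOCK"),
    ("I can't fix my code and I'm stuck", "ROADBLOCK"),
    ("I was stuck earlier and it's working now", "NOT_ROADBLOCK"),
    ("Still stuck, no idea what to do", "ROADBLOCK"),
]

KEYWORDS = ["problem", "can't", "stuck"]


def keyword_label_counts(data, keywords):
    """Count how often each keyword occurs under each label."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, label in data:
        lowered = text.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw][label] += 1
    return counts


def is_symmetric(counts, labels=("ROADBLOCK", "NOT_ROADBLOCK")):
    """A keyword is 'symmetric' if it occurs under every label at least once."""
    return {kw: all(counts[kw][lb] > 0 for lb in labels) for kw in counts}


counts = keyword_label_counts(dataset, KEYWORDS)
print(is_symmetric(counts))
# {'problem': True, "can't": True, 'stuck': True}
```

A keyword that maps to `False` here would be a candidate for the contrastive-example treatment: add sentences that use the same word under the missing label.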
---

## 🧪 Model Evaluation

The model was tested on:

### 1. Clean Synthetic Data

- Achieved near-perfect validation scores (expected, given the similarity to the training data)

### 2. Edge Cases

- Handled ambiguous phrasing correctly

### 3. Realistic Language

Test examples:

| Input | Prediction |
|-------|------------|
| "lowkey stuck but I think I got it" | NOT_ROADBLOCK |
| "this bug annoying but I fixed it" | NOT_ROADBLOCK |
| "ngl I can't get this working" | ROADBLOCK |
| "still stuck idk what to do" | ROADBLOCK |

---

### ⚠️ Observed Limitation

A minor generalization gap remains:

- "I was confused but it's working now" → incorrectly predicted ROADBLOCK

---

### 🔧 Fix Approach

Instead of regenerating the dataset:

> Add targeted examples that cover the missing language patterns.

---

## 🔁 Active Learning Strategy

This model is designed to serve as a **base model for active learning**.

---

### 🔥 Active Learning Workflow

1. The model predicts on real check-ins
2. Incorrect predictions are identified
3. High-value error samples are collected
4. Corrected examples are added to the dataset
5. The model is retrained

---

### 💥 Key Principle

> High-confidence errors are more valuable than random samples.

---

### 🎯 Goal

Continuously improve the model using **real-world feedback**, not just synthetic data.

---

## 🚀 Future Improvements

- Integrate real Slack check-in data
- Expand the dataset with informal and noisy text
- Add confidence-based filtering for active learning
- Combine with a **Struggle Detection Model** for multi-signal analysis

---

## 🧠 Final Insight

This model represents a shift from:

> ❌ pattern-based classification

to

> ✅ meaning-based understanding

---

## 💯 Conclusion

The Roadblock Classification Model (v2):

- Correctly distinguishes **difficulty vs. blockage**
- Handles diverse language patterns
- Minimizes keyword bias
- Serves as a strong foundation for **active learning systems**

---

> 🔥 This is not just a model: it is a continuously improving system.
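The error-collection step of the active learning workflow (and the "high-confidence errors" principle) can be sketched in a few lines of plain Python. The prediction records and the confidence threshold below are illustrative assumptions, not real model output or an API of this repository.

```python
# Sketch of the "collect high-value error samples" step.
# Each record is (text, predicted_label, confidence, true_label);
# the values are illustrative, not real model output.
predictions = [
    ("I was confused but it's working now", "ROADBLOCK", 0.97, "NOT_ROADBLOCK"),
    ("still stuck idk what to do", "ROADBLOCK", 0.99, "ROADBLOCK"),
    ("this bug annoying but I fixed it", "NOT_ROADBLOCK", 0.55, "NOT_ROADBLOCK"),
    ("ngl I can't get this working", "NOT_ROADBLOCK", 0.62, "ROADBLOCK"),
]


def select_high_value_errors(records, min_confidence=0.9):
    """Select high-confidence errors: predictions that are confidently wrong.

    These are the most valuable retraining samples, since they expose
    patterns the model has learned incorrectly rather than mere noise.
    Returns (text, corrected_label) pairs ready to add to the dataset.
    """
    return [
        (text, true_label)
        for text, pred, conf, true_label in records
        if pred != true_label and conf >= min_confidence
    ]


# Corrected examples to fold back into the training set before retraining.
new_training_examples = select_high_value_errors(predictions)
print(new_training_examples)
# [("I was confused but it's working now", 'NOT_ROADBLOCK')]
```

Lowering `min_confidence` trades precision for recall on error harvesting: the 0.62-confidence mistake above would also be collected, but so would more borderline cases that may reflect noise rather than a systematic gap.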