---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- student-checkins
- roadblock-detection
- nlp
- active-learning
- education
- classification
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# 🚧 Roadblock Classification Model (v2)

## 📌 Overview

The **Roadblock Classification Model (v2)** is a fine-tuned transformer model built on BERT that classifies student check-ins into two categories:

- **ROADBLOCK** → The student cannot move forward
- **NOT_ROADBLOCK** → The student is still making progress

The model is designed to capture **semantic meaning**, not just keywords, enabling it to differentiate between **difficulty** and **true blockage**.

---

## 🧠 Motivation

### ❌ Problem with Version 1

The first version of this model attempted to classify:

- struggles
- confusion
- being stuck

**all under one label.**

This created a major issue:

> The model could not distinguish between **temporary difficulty** and **actual inability to proceed**.

---

### 🔥 Why Version 2 Was Created

Version 2 was developed to **separate the definitions clearly**:

| Concept | Meaning |
|---------|---------|
| **Struggle** | The student is experiencing difficulty |
| **Roadblock** | The student cannot move forward |

---

### 💥 Key Insight

> Not all struggles are roadblocks.
Example:

| Check-in | Correct Label |
|----------|---------------|
| "I had problems but made progress" | NOT_ROADBLOCK |
| "I can't fix my code and I'm stuck" | ROADBLOCK |

---

## ⚙️ Model Architecture

- Base Model: `bert-base-uncased`
- Task: Binary Classification
- Framework: Hugging Face Transformers
- Training Environment: Google Colab (GPU)

---

## 📊 Dataset Design

The dataset was **synthetically generated and refined iteratively** to ensure:

### ✅ Semantic Accuracy

- Focus on meaning, not keywords

### ✅ Balanced Classes

- Controlled ROADBLOCK vs NOT_ROADBLOCK distribution

### ✅ Language Diversity

- Includes:
  - formal phrasing
  - informal/slang expressions
  - varied sentence structures

---

## 🚨 Bias Identification and Correction

### 🔍 Initial Problem

Early versions of the dataset showed **strong keyword bias**, such as:

- `"problem"` → always NOT_ROADBLOCK
- `"can't"` → always ROADBLOCK
- `"stuck"` → always ROADBLOCK

---

### ⚠️ Why This Was Dangerous

The model learned:

> ❌ keyword → label

instead of

> ✅ meaning → label

This caused incorrect predictions in real-world scenarios.

---

### 🔧 Bias Mitigation Strategy

To eliminate this bias, the dataset was redesigned to include:

#### 1. Keyword Symmetry

Each keyword appears under **both labels**:

| Keyword | ROADBLOCK | NOT_ROADBLOCK |
|---------|-----------|---------------|
| "problem" | ✔️ | ✔️ |
| "can't" | ✔️ | ✔️ |
| "stuck" | ✔️ | ✔️ |

---

#### 2. Contrastive Examples

Pairs of sentences with similar wording but different meanings:

- "I can't fix it and I'm stuck" → ROADBLOCK
- "I can't fix it yet but I'm making progress" → NOT_ROADBLOCK

---

#### 3. Pattern Diversity

Avoided over-reliance on patterns like:

- `"but"` → NOT_ROADBLOCK

Instead, the dataset also includes:

- "and I fixed it"
- "and it's working now"
- "and I solved it"

---

### ✅ Result

The model now learns:

> **progress vs. no progress**

instead of relying on surface-level patterns.
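The keyword-symmetry property described above can be audited with a few lines of plain Python. This is a minimal sketch: the mini-dataset, keyword list, and helper names below are illustrative assumptions, not part of this repository's actual training data or tooling.

```python
from collections import defaultdict

# Hypothetical mini-dataset illustrating keyword symmetry:
# each trigger word appears under BOTH labels.
dataset = [
    ("I had a problem but I solved it", "NOT_ROADBLOCK"),
    ("This problem is blocking me completely", "ROADBLOCK"),
    ("I can't fix it yet but I'm making progress", "NOT_ROADBLOCK"),
    ("I can't fix my code and I'm stuck", "ROADBLOCK"),
    ("I was stuck earlier and it's working now", "NOT_ROADBLOCK"),
    ("Still stuck, no idea what to do", "ROADBLOCK"),
]

KEYWORDS = ["problem", "can't", "stuck"]


def keyword_label_counts(data, keywords):
    """Count how often each keyword occurs under each label."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, label in data:
        lowered = text.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw][label] += 1
    return counts


def is_symmetric(counts, labels=("ROADBLOCK", "NOT_ROADBLOCK")):
    """A keyword is 'symmetric' if it occurs under every label at least once."""
    return {kw: all(counts[kw][lb] > 0 for lb in labels) for kw in counts}


counts = keyword_label_counts(dataset, KEYWORDS)
print(is_symmetric(counts))
# {'problem': True, "can't": True, 'stuck': True}
```

A keyword that maps to `False` here would be a candidate for the contrastive-example treatment: add sentences that use the same word under the missing label.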
---

## 🧪 Model Evaluation

The model was tested on:

### 1. Clean Synthetic Data

- Achieved near-perfect validation scores (expected, given the similarity to the training data)

### 2. Edge Cases

- Handled ambiguous phrasing correctly

### 3. Realistic Language

Test examples:

| Input | Prediction |
|-------|------------|
| "lowkey stuck but I think I got it" | NOT_ROADBLOCK |
| "this bug annoying but I fixed it" | NOT_ROADBLOCK |
| "ngl I can't get this working" | ROADBLOCK |
| "still stuck idk what to do" | ROADBLOCK |

---

### ⚠️ Observed Limitation

A minor generalization gap remains:

- "I was confused but it's working now" → incorrectly predicted ROADBLOCK

---

### 🔧 Fix Approach

Instead of regenerating the dataset:

> Add targeted examples that cover the missing language patterns.

---

## 🔁 Active Learning Strategy

This model is designed to serve as a **base model for active learning**.

---

### 🔥 Active Learning Workflow

1. The model predicts on real check-ins
2. Incorrect predictions are identified
3. High-value error samples are collected
4. Corrected examples are added to the dataset
5. The model is retrained

---

### 💥 Key Principle

> High-confidence errors are more valuable than random samples.

---

### 🎯 Goal

Continuously improve the model using **real-world feedback**, not just synthetic data.

---

## 🚀 Future Improvements

- Integrate real Slack check-in data
- Expand the dataset with informal and noisy text
- Add confidence-based filtering for active learning
- Combine with a **Struggle Detection Model** for multi-signal analysis

---

## 🧠 Final Insight

This model represents a shift from:

> ❌ pattern-based classification

to

> ✅ meaning-based understanding

---

## 💯 Conclusion

The Roadblock Classification Model (v2):

- Correctly distinguishes **difficulty vs. blockage**
- Handles diverse language patterns
- Minimizes keyword bias
- Serves as a strong foundation for **active learning systems**

---

> 🔥 This is not just a model: it is a continuously improving system.
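The error-collection step of the active learning workflow (and the "high-confidence errors" principle) can be sketched in a few lines of plain Python. The prediction records and the confidence threshold below are illustrative assumptions, not real model output or an API of this repository.

```python
# Sketch of the "collect high-value error samples" step.
# Each record is (text, predicted_label, confidence, true_label);
# the values are illustrative, not real model output.
predictions = [
    ("I was confused but it's working now", "ROADBLOCK", 0.97, "NOT_ROADBLOCK"),
    ("still stuck idk what to do", "ROADBLOCK", 0.99, "ROADBLOCK"),
    ("this bug annoying but I fixed it", "NOT_ROADBLOCK", 0.55, "NOT_ROADBLOCK"),
    ("ngl I can't get this working", "NOT_ROADBLOCK", 0.62, "ROADBLOCK"),
]


def select_high_value_errors(records, min_confidence=0.9):
    """Select high-confidence errors: predictions that are confidently wrong.

    These are the most valuable retraining samples, since they expose
    patterns the model has learned incorrectly rather than mere noise.
    Returns (text, corrected_label) pairs ready to add to the dataset.
    """
    return [
        (text, true_label)
        for text, pred, conf, true_label in records
        if pred != true_label and conf >= min_confidence
    ]


# Corrected examples to fold back into the training set before retraining.
new_training_examples = select_high_value_errors(predictions)
print(new_training_examples)
# [("I was confused but it's working now", 'NOT_ROADBLOCK')]
```

Lowering `min_confidence` trades precision for recall on error harvesting: the 0.62-confidence mistake above would also be collected, but so would more borderline cases that may reflect noise rather than a systematic gap.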