DaCrow13
Deploy to HF Spaces (Clean)
225af6a

Behavioral Testing Results Summary

Date: November 15, 2025
Model: Random Forest with TF-IDF Features (SMOTE oversampling)
Test Framework: pytest-based behavioral testing (Ribeiro et al., 2020)
Total Tests: 36 behavioral tests (training tests excluded due to missing PyTorch)


Executive Summary

This document summarizes the results of comprehensive behavioral testing conducted on the skill classification model. The tests verify that the model behaves correctly beyond simple accuracy metrics, following the methodology described in "Beyond Accuracy: Behavioral Testing of NLP models with CheckList" (Ribeiro et al., 2020).

Overall Results

Test Category Tests Run Passed Failed Pass Rate
Invariance Tests 9 9 0 100% [PASS]
Directional Tests 10 10 0 100% [PASS]
Minimum Functionality Tests 17 17 0 100% [PASS]
Model Training Tests 10 N/A N/A Skipped*
TOTAL 36 36 0 100% [PASS]

* Training tests skipped - PyTorch not installed (ModuleNotFoundError: No module named 'torch')


Detailed Test Results

1. Invariance Tests (9/9 passed [PASS])

Purpose: Verify that the model is robust to input perturbations that should not change predictions.

Test Results:

Test Name Status Description Key Finding
test_typo_robustness [PASS] PASS Model handles typos gracefully Similarity ≥ 70% maintained
test_synonym_substitution [PASS] PASS Synonyms don't change predictions Similarity ≥ 60% across 3 test cases
test_case_insensitivity [PASS] PASS Capitalization doesn't affect output 100% identical predictions
test_punctuation_robustness [PASS] PASS Punctuation variations handled well Similarity ≥ 80%
test_neutral_word_addition [PASS] PASS Filler words don't change predictions Similarity ≥ 70%
test_word_order_robustness [PASS] PASS Reasonable word reordering preserved Similarity ≥ 50%
test_whitespace_normalization [PASS] PASS Extra whitespace doesn't affect output 100% identical predictions
test_url_removal_invariance [PASS] PASS URLs don't change skill predictions 100% identical predictions
test_code_snippet_noise_robustness [PASS] PASS Code snippets don't affect core skills Similarity ≥ 60%

Key Insights:

  • [PASS] High robustness to noise: The model maintains stable predictions despite typos, formatting changes, and irrelevant content
  • [PASS] Text normalization works well: Case, whitespace, and URL removal are handled correctly by preprocessing
  • [PASS] Semantic preservation: Synonym substitution doesn't break the model, showing it captures semantic meaning
  • [WARNING] Word order sensitivity: Lower threshold (50%) suggests TF-IDF captures some word position information

2. Directional Tests (10/10 passed [PASS])

Purpose: Verify that specific input changes lead to predictable changes in predictions.

Test Results:

Test Name Status Description Key Finding
test_adding_language_keyword [PASS] PASS Adding "Java"/"Python" keywords Predictions stable (no degradation)
test_adding_data_structure_keyword [PASS] PASS Adding "HashMap"/"tree" keywords Predictions maintained
test_adding_error_handling_context [PASS] PASS Adding error handling keywords Error-related labels present
test_removing_specific_technology [PASS] PASS Removing tech-specific terms Prediction count stable
test_adding_api_context [PASS] PASS Adding REST/GraphQL keywords No prediction degradation
test_adding_testing_keywords [PASS] PASS Adding test-related keywords Testing labels present
test_adding_performance_keywords [PASS] PASS Adding performance context Predictions maintained
test_adding_security_context [PASS] PASS Adding security keywords No degradation (≥50% maintained)
test_adding_devops_keywords [PASS] PASS Adding Docker/CI/CD keywords DevOps labels present
test_increasing_technical_detail [PASS] PASS More specific descriptions Detail doesn't reduce predictions

Key Insights:

  • [PASS] Stable predictions: Model maintains consistent label sets when adding context-specific keywords
  • [PASS] Comprehensive label coverage: Model already predicts general labels (Language, Error Handling, DevOps, etc.)
  • [WARNING] Limited context sensitivity: Model doesn't drastically change predictions based on specific technologies (e.g., "Java" vs "Python")
    • This could indicate: (1) TF-IDF with 1000 features may be too coarse, or (2) training data contains these terms broadly
  • [INFO] Interpretation: The model predicts ~26 consistent labels across most technical descriptions, suggesting it captures general software engineering skills well

3. Minimum Functionality Tests (17/17 passed [PASS])

Purpose: Verify that the model works correctly on basic, straightforward examples.

Test Results:

Test Name Status Description Labels Predicted
test_simple_bug_fix [PASS] PASS Basic bug fix description Multiple relevant skills
test_database_work [PASS] PASS SQL/database operations Database-related skills
test_api_development [PASS] PASS REST API endpoint creation API/web service skills
test_data_structure_implementation [PASS] PASS Binary tree implementation Data structure skills
test_testing_work [PASS] PASS Unit testing with JUnit Testing skills
test_frontend_work [PASS] PASS React UI components Frontend skills
test_security_work [PASS] PASS OAuth2 authentication Security skills
test_performance_optimization [PASS] PASS Algorithm optimization Performance skills
test_devops_deployment [PASS] PASS Docker/CI/CD setup DevOps skills
test_error_handling [PASS] PASS Exception handling Error handling skills
test_refactoring_work [PASS] PASS Code refactoring Code quality skills
test_documentation_work [PASS] PASS API documentation Documentation skills
test_empty_input [PASS] PASS Empty string handling Graceful handling (no crash)
test_minimal_input [PASS] PASS Single word ("bug") Returns valid predictions
test_multiple_skills_in_one_task [PASS] PASS Complex multi-tech task ≥2 skills predicted
test_common_github_issue_format [PASS] PASS Real GitHub issue format Handles markdown formatting
test_consistency_on_similar_inputs [PASS] PASS Identical/similar inputs High consistency (≥70%)

Key Insights:

  • [PASS] Comprehensive skill coverage: Model successfully identifies skills across all major software engineering domains
  • [PASS] Robust to edge cases: Handles empty input, minimal input, and complex formatting without errors
  • [PASS] Multi-label capability: Correctly predicts multiple skills for complex tasks
  • [PASS] Real-world applicability: Works on realistic GitHub issue formats with markdown
  • [PASS] Consistency: Very high consistency on similar/identical inputs (70%+ similarity)

Sample Predictions:

Example 1: "Fixed null pointer exception in user authentication"

  • Predicted Skills: Error Handling, Language, Data Structure, Software Development and IT Operations, etc.

Example 2: "Implemented user authentication API with JWT tokens, PostgreSQL database integration, and Redis caching"

  • Predicted Skills: 26+ skills including Language, Databases, Data Structure, Error Handling, Multi-Thread, etc.
  • [PASS] Successfully identified multiple relevant skills for complex task

4. Model Training Tests (10 tests - TO BE EXECUTED)

Purpose: Verify that the model training process works correctly.

Status: [PENDING] Pending execution on GPU-enabled machine

These tests require PyTorch and are designed to run on a machine with NVIDIA GPU support. They will verify:

Planned Tests:

Test Name Purpose Expected Outcome
test_training_completes_without_errors Training pipeline works [PASS] No exceptions raised
test_decreasing_loss_after_training Model learns during training F1 score > 0.1 (better than random)
test_overfitting_on_single_batch Model has sufficient capacity Training accuracy > 70% on small batch
test_training_on_cpu CPU training works [PASS] Training completes successfully
test_training_on_multiple_cores Parallel processing works [PASS] Multi-core training succeeds
test_training_on_gpu GPU training works (if available) [PASS] GPU detected and used
test_reproducibility_with_random_seed Results are reproducible Identical predictions with same seed
test_model_improves_with_more_data More data helps performance F1(large) ≥ F1(small) * 0.9
test_model_saves_and_loads_correctly Model persistence works Loaded model = original predictions

Note: Results will be updated after execution on GPU-enabled environment.


Model Performance Characteristics

Strengths [PASS]

  1. High Robustness to Noise

    • Excellent handling of typos, formatting variations, and irrelevant content
    • Predictions remain stable across various input perturbations
    • Strong text normalization and preprocessing
  2. Comprehensive Skill Coverage

    • Successfully identifies skills across all major software engineering domains
    • Predicts multiple relevant labels for complex tasks
    • Works on real-world GitHub issue formats
  3. Consistency and Reliability

    • Very consistent predictions on similar inputs
    • No crashes or errors on edge cases (empty input, minimal input)
    • Graceful degradation with low-quality input
  4. Stable Behavior

    • Doesn't oscillate wildly with small input changes
    • Maintains prediction quality when adding context

Areas for Improvement [WARNING]

  1. Limited Context Sensitivity

    • Adding specific technology keywords (Java, Python, Docker) doesn't significantly change predictions
    • Model tends to predict similar broad label sets regardless of specifics
    • Possible causes:
      • TF-IDF with 1000 features may lose fine-grained distinctions
      • Training data may have these terms distributed broadly across labels
      • Model may be learning general patterns rather than specific technologies
  2. Potential Over-Generalization

    • Consistently predicts ~26 labels for most technical descriptions
    • May be predicting "safe" general skills rather than task-specific ones
    • Recommendation: Analyze label co-occurrence and consider more specific feature engineering
  3. Feature Engineering Opportunities

    • Current TF-IDF (max_features=1000) may be limiting
    • Could benefit from:
      • Increasing max_features
      • Adding domain-specific features (technology names, API patterns)
      • Entity recognition for technologies/frameworks
      • Embeddings (Word2Vec, BERT) for semantic understanding

Recommendations

Immediate Actions [PASS]

  1. Accept current behavioral performance

    • Model passes all 36 behavioral tests
    • Robustness and consistency are excellent
    • Ready for deployment in current form
  2. Execute training tests on GPU machine

    • Complete validation of training pipeline
    • Verify GPU utilization and reproducibility

Future Improvements [TODO]

  1. Feature Engineering

    • Experiment with larger max_features (e.g., 5000, 10000)
    • Add technology-specific features
    • Try embedding-based features (BERT, CodeBERT)
  2. Model Tuning

    • Analyze label co-occurrence patterns
    • Consider label-specific classifiers for fine-grained predictions
    • Experiment with different oversampling strategies
  3. Test Expansion

    • Add more directional tests with specific expected label changes
    • Create domain-specific test cases (e.g., frontend vs backend vs DevOps)
    • Add performance benchmarking tests
  4. Monitoring

    • Track prediction consistency over time
    • Monitor label distribution in production
    • Collect user feedback on prediction quality

Test Execution Details

Environment

  • OS: Windows 11
  • Python: 3.10.11
  • Test Framework: pytest 9.0.1
  • Key Libraries:
    • scikit-learn (RandomForest, TF-IDF)
    • numpy
    • joblib
    • imblearn (SMOTE)

Command Used

pytest tests/behavioral/ -v

Execution Time

  • Behavioral Tests (36): 470.52 seconds (~7 minutes 50 seconds)
  • Training Tests (10): Skipped (PyTorch not installed)

Test Data

  • Features: TF-IDF vectors (1000 features)
  • Labels: 142 active skill labels (multi-label binary matrix)
  • Training Set: 7,154 GitHub issues

Conclusion

The skill classification model demonstrates excellent behavioral characteristics across all tested dimensions:

[PASS] Robustness: Handles noise, typos, and formatting variations exceptionally well
[PASS] Consistency: Produces stable, reproducible predictions
[PASS] Coverage: Successfully identifies skills across all major software engineering domains
[PASS] Reliability: No errors or crashes on edge cases

The model is production-ready from a behavioral testing perspective. While there are opportunities for improvement in context sensitivity and fine-grained predictions, the current model provides reliable, comprehensive skill classification for GitHub issues.

Next Steps:

  1. Install PyTorch to enable training tests execution
  2. Consider feature engineering improvements for better context sensitivity
  3. Deploy with monitoring to track real-world performance

Appendix: Test Execution Logs

Sample Test Output

========================= test session starts ==========================
platform win32 -- Python 3.10.11, pytest-9.0.1, pluggy-1.6.0
rootdir: C:\Users\Utente\OneDrive - Università degli Studi di Bari\Universita\Magistrale\II Anno\I Semestre\Software Engineering\Hopcroft
configfile: pyproject.toml
collected 36 items

tests/behavioral/test_directional.py::TestDirectional::test_adding_language_keyword PASSED [  2%]
tests/behavioral/test_directional.py::TestDirectional::test_adding_data_structure_keyword PASSED [  5%]
tests/behavioral/test_directional.py::TestDirectional::test_adding_error_handling_context PASSED [  8%]
...
tests/behavioral/test_minimum_functionality.py::TestMinimumFunctionality::test_consistency_on_similar_inputs PASSED [100%]

===================== 36 passed in 470.52s (0:07:50) ======================

GPU Training Tests Placeholder

[TO BE UPDATED AFTER EXECUTION ON GPU MACHINE]

Expected format:
========================= test session starts ==========================
platform linux -- Python 3.10.x, pytest-9.0.1
collected 10 items

tests/behavioral/test_model_training.py::TestModelTraining::test_training_completes_without_errors PASSED
tests/behavioral/test_model_training.py::TestModelTraining::test_decreasing_loss_after_training PASSED
...
tests/behavioral/test_model_training.py::TestModelTraining::test_training_on_gpu PASSED

==================== 10 passed in XXXs ====================

Document Version: 1.1
Last Updated: November 15, 2025
Author: Hopcroft Team
Repository: se4ai2526-uniba/Hopcroft
Branch: behavioral-testing