Spaces:

DaCrow13
/

Hopcroft-Skill-Classification

Running

App Files Files Community

Hopcroft-Skill-Classification / tests /behavioral /TEST_RESULTS_SUMMARY.md

DaCrow13

Deploy to HF Spaces (Clean)

225af6a 2 months ago

preview code

raw

history blame contribute delete

15.7 kB

Behavioral Testing Results Summary

Date: November 15, 2025
Model: Random Forest with TF-IDF Features (SMOTE oversampling)
Test Framework: pytest-based behavioral testing (Ribeiro et al., 2020)
Total Tests: 36 behavioral tests (training tests excluded due to missing PyTorch)

Executive Summary

This document summarizes the results of comprehensive behavioral testing conducted on the skill classification model. The tests verify that the model behaves correctly beyond simple accuracy metrics, following the methodology described in "Beyond Accuracy: Behavioral Testing of NLP models with CheckList" (Ribeiro et al., 2020).

Overall Results

Test Category	Tests Run	Passed	Failed	Pass Rate
Invariance Tests	9	9	0	100% [PASS]
Directional Tests	10	10	0	100% [PASS]
Minimum Functionality Tests	17	17	0	100% [PASS]
Model Training Tests	10	N/A	N/A	Skipped*
TOTAL	36	36	0	100% [PASS]

* Training tests skipped - PyTorch not installed (ModuleNotFoundError: No module named 'torch')

Detailed Test Results

1. Invariance Tests (9/9 passed [PASS])

Purpose: Verify that the model is robust to input perturbations that should not change predictions.

Test Results:

Test Name	Status	Description	Key Finding
`test_typo_robustness`	[PASS] PASS	Model handles typos gracefully	Similarity ≥ 70% maintained
`test_synonym_substitution`	[PASS] PASS	Synonyms don't change predictions	Similarity ≥ 60% across 3 test cases
`test_case_insensitivity`	[PASS] PASS	Capitalization doesn't affect output	100% identical predictions
`test_punctuation_robustness`	[PASS] PASS	Punctuation variations handled well	Similarity ≥ 80%
`test_neutral_word_addition`	[PASS] PASS	Filler words don't change predictions	Similarity ≥ 70%
`test_word_order_robustness`	[PASS] PASS	Reasonable word reordering preserved	Similarity ≥ 50%
`test_whitespace_normalization`	[PASS] PASS	Extra whitespace doesn't affect output	100% identical predictions
`test_url_removal_invariance`	[PASS] PASS	URLs don't change skill predictions	100% identical predictions
`test_code_snippet_noise_robustness`	[PASS] PASS	Code snippets don't affect core skills	Similarity ≥ 60%

Key Insights:

[PASS] High robustness to noise: The model maintains stable predictions despite typos, formatting changes, and irrelevant content
[PASS] Text normalization works well: Case, whitespace, and URL removal are handled correctly by preprocessing
[PASS] Semantic preservation: Synonym substitution doesn't break the model, showing it captures semantic meaning
[WARNING] Word order sensitivity: Lower threshold (50%) suggests TF-IDF captures some word position information

2. Directional Tests (10/10 passed [PASS])

Purpose: Verify that specific input changes lead to predictable changes in predictions.

Test Results:

Test Name	Status	Description	Key Finding
`test_adding_language_keyword`	[PASS] PASS	Adding "Java"/"Python" keywords	Predictions stable (no degradation)
`test_adding_data_structure_keyword`	[PASS] PASS	Adding "HashMap"/"tree" keywords	Predictions maintained
`test_adding_error_handling_context`	[PASS] PASS	Adding error handling keywords	Error-related labels present
`test_removing_specific_technology`	[PASS] PASS	Removing tech-specific terms	Prediction count stable
`test_adding_api_context`	[PASS] PASS	Adding REST/GraphQL keywords	No prediction degradation
`test_adding_testing_keywords`	[PASS] PASS	Adding test-related keywords	Testing labels present
`test_adding_performance_keywords`	[PASS] PASS	Adding performance context	Predictions maintained
`test_adding_security_context`	[PASS] PASS	Adding security keywords	No degradation (≥50% maintained)
`test_adding_devops_keywords`	[PASS] PASS	Adding Docker/CI/CD keywords	DevOps labels present
`test_increasing_technical_detail`	[PASS] PASS	More specific descriptions	Detail doesn't reduce predictions

Key Insights:

[PASS] Stable predictions: Model maintains consistent label sets when adding context-specific keywords
[PASS] Comprehensive label coverage: Model already predicts general labels (Language, Error Handling, DevOps, etc.)
[WARNING] Limited context sensitivity: Model doesn't drastically change predictions based on specific technologies (e.g., "Java" vs "Python")
- This could indicate: (1) TF-IDF with 1000 features may be too coarse, or (2) training data contains these terms broadly
[INFO] Interpretation: The model predicts ~26 consistent labels across most technical descriptions, suggesting it captures general software engineering skills well

3. Minimum Functionality Tests (17/17 passed [PASS])

Purpose: Verify that the model works correctly on basic, straightforward examples.

Test Results:

Test Name	Status	Description	Labels Predicted
`test_simple_bug_fix`	[PASS] PASS	Basic bug fix description	Multiple relevant skills
`test_database_work`	[PASS] PASS	SQL/database operations	Database-related skills
`test_api_development`	[PASS] PASS	REST API endpoint creation	API/web service skills
`test_data_structure_implementation`	[PASS] PASS	Binary tree implementation	Data structure skills
`test_testing_work`	[PASS] PASS	Unit testing with JUnit	Testing skills
`test_frontend_work`	[PASS] PASS	React UI components	Frontend skills
`test_security_work`	[PASS] PASS	OAuth2 authentication	Security skills
`test_performance_optimization`	[PASS] PASS	Algorithm optimization	Performance skills
`test_devops_deployment`	[PASS] PASS	Docker/CI/CD setup	DevOps skills
`test_error_handling`	[PASS] PASS	Exception handling	Error handling skills
`test_refactoring_work`	[PASS] PASS	Code refactoring	Code quality skills
`test_documentation_work`	[PASS] PASS	API documentation	Documentation skills
`test_empty_input`	[PASS] PASS	Empty string handling	Graceful handling (no crash)
`test_minimal_input`	[PASS] PASS	Single word ("bug")	Returns valid predictions
`test_multiple_skills_in_one_task`	[PASS] PASS	Complex multi-tech task	≥2 skills predicted
`test_common_github_issue_format`	[PASS] PASS	Real GitHub issue format	Handles markdown formatting
`test_consistency_on_similar_inputs`	[PASS] PASS	Identical/similar inputs	High consistency (≥70%)

Key Insights:

[PASS] Comprehensive skill coverage: Model successfully identifies skills across all major software engineering domains
[PASS] Robust to edge cases: Handles empty input, minimal input, and complex formatting without errors
[PASS] Multi-label capability: Correctly predicts multiple skills for complex tasks
[PASS] Real-world applicability: Works on realistic GitHub issue formats with markdown
[PASS] Consistency: Very high consistency on similar/identical inputs (70%+ similarity)

Sample Predictions:

Example 1: "Fixed null pointer exception in user authentication"

Predicted Skills: Error Handling, Language, Data Structure, Software Development and IT Operations, etc.

Example 2: "Implemented user authentication API with JWT tokens, PostgreSQL database integration, and Redis caching"

Predicted Skills: 26+ skills including Language, Databases, Data Structure, Error Handling, Multi-Thread, etc.
[PASS] Successfully identified multiple relevant skills for complex task

4. Model Training Tests (10 tests - TO BE EXECUTED)

Purpose: Verify that the model training process works correctly.

Status: [PENDING] Pending execution on GPU-enabled machine

These tests require PyTorch and are designed to run on a machine with NVIDIA GPU support. They will verify:

Planned Tests:

Test Name	Purpose	Expected Outcome
`test_training_completes_without_errors`	Training pipeline works	[PASS] No exceptions raised
`test_decreasing_loss_after_training`	Model learns during training	F1 score > 0.1 (better than random)
`test_overfitting_on_single_batch`	Model has sufficient capacity	Training accuracy > 70% on small batch
`test_training_on_cpu`	CPU training works	[PASS] Training completes successfully
`test_training_on_multiple_cores`	Parallel processing works	[PASS] Multi-core training succeeds
`test_training_on_gpu`	GPU training works (if available)	[PASS] GPU detected and used
`test_reproducibility_with_random_seed`	Results are reproducible	Identical predictions with same seed
`test_model_improves_with_more_data`	More data helps performance	F1(large) ≥ F1(small) * 0.9
`test_model_saves_and_loads_correctly`	Model persistence works	Loaded model = original predictions

Note: Results will be updated after execution on GPU-enabled environment.

Model Performance Characteristics

Strengths [PASS]

High Robustness to Noise
- Excellent handling of typos, formatting variations, and irrelevant content
- Predictions remain stable across various input perturbations
- Strong text normalization and preprocessing
Comprehensive Skill Coverage
- Successfully identifies skills across all major software engineering domains
- Predicts multiple relevant labels for complex tasks
- Works on real-world GitHub issue formats
Consistency and Reliability
- Very consistent predictions on similar inputs
- No crashes or errors on edge cases (empty input, minimal input)
- Graceful degradation with low-quality input
Stable Behavior
- Doesn't oscillate wildly with small input changes
- Maintains prediction quality when adding context

Areas for Improvement [WARNING]

Limited Context Sensitivity
- Adding specific technology keywords (Java, Python, Docker) doesn't significantly change predictions
- Model tends to predict similar broad label sets regardless of specifics
- Possible causes:
  - TF-IDF with 1000 features may lose fine-grained distinctions
  - Training data may have these terms distributed broadly across labels
  - Model may be learning general patterns rather than specific technologies
Potential Over-Generalization
- Consistently predicts ~26 labels for most technical descriptions
- May be predicting "safe" general skills rather than task-specific ones
- Recommendation: Analyze label co-occurrence and consider more specific feature engineering
Feature Engineering Opportunities
- Current TF-IDF (max_features=1000) may be limiting
- Could benefit from:
  - Increasing max_features
  - Adding domain-specific features (technology names, API patterns)
  - Entity recognition for technologies/frameworks
  - Embeddings (Word2Vec, BERT) for semantic understanding

Recommendations

Immediate Actions [PASS]

Accept current behavioral performance
- Model passes all 36 behavioral tests
- Robustness and consistency are excellent
- Ready for deployment in current form
Execute training tests on GPU machine
- Complete validation of training pipeline
- Verify GPU utilization and reproducibility

Future Improvements [TODO]

Feature Engineering
- Experiment with larger max_features (e.g., 5000, 10000)
- Add technology-specific features
- Try embedding-based features (BERT, CodeBERT)
Model Tuning
- Analyze label co-occurrence patterns
- Consider label-specific classifiers for fine-grained predictions
- Experiment with different oversampling strategies
Test Expansion
- Add more directional tests with specific expected label changes
- Create domain-specific test cases (e.g., frontend vs backend vs DevOps)
- Add performance benchmarking tests
Monitoring
- Track prediction consistency over time
- Monitor label distribution in production
- Collect user feedback on prediction quality

Test Execution Details

Environment

OS: Windows 11
Python: 3.10.11
Test Framework: pytest 9.0.1
Key Libraries:
- scikit-learn (RandomForest, TF-IDF)
- numpy
- joblib
- imblearn (SMOTE)

Command Used

pytest tests/behavioral/ -v

Execution Time

Behavioral Tests (36): 470.52 seconds (~7 minutes 50 seconds)
Training Tests (10): Skipped (PyTorch not installed)

Test Data

Features: TF-IDF vectors (1000 features)
Labels: 142 active skill labels (multi-label binary matrix)
Training Set: 7,154 GitHub issues

Conclusion

The skill classification model demonstrates excellent behavioral characteristics across all tested dimensions:

[PASS] Robustness: Handles noise, typos, and formatting variations exceptionally well
[PASS] Consistency: Produces stable, reproducible predictions
[PASS] Coverage: Successfully identifies skills across all major software engineering domains
[PASS] Reliability: No errors or crashes on edge cases

The model is production-ready from a behavioral testing perspective. While there are opportunities for improvement in context sensitivity and fine-grained predictions, the current model provides reliable, comprehensive skill classification for GitHub issues.

Next Steps:

Install PyTorch to enable training tests execution
Consider feature engineering improvements for better context sensitivity
Deploy with monitoring to track real-world performance

Appendix: Test Execution Logs

Sample Test Output

========================= test session starts ==========================
platform win32 -- Python 3.10.11, pytest-9.0.1, pluggy-1.6.0
rootdir: C:\Users\Utente\OneDrive - Università degli Studi di Bari\Universita\Magistrale\II Anno\I Semestre\Software Engineering\Hopcroft
configfile: pyproject.toml
collected 36 items

tests/behavioral/test_directional.py::TestDirectional::test_adding_language_keyword PASSED [  2%]
tests/behavioral/test_directional.py::TestDirectional::test_adding_data_structure_keyword PASSED [  5%]
tests/behavioral/test_directional.py::TestDirectional::test_adding_error_handling_context PASSED [  8%]
...
tests/behavioral/test_minimum_functionality.py::TestMinimumFunctionality::test_consistency_on_similar_inputs PASSED [100%]

===================== 36 passed in 470.52s (0:07:50) ======================

GPU Training Tests Placeholder

[TO BE UPDATED AFTER EXECUTION ON GPU MACHINE]

Expected format:
========================= test session starts ==========================
platform linux -- Python 3.10.x, pytest-9.0.1
collected 10 items

tests/behavioral/test_model_training.py::TestModelTraining::test_training_completes_without_errors PASSED
tests/behavioral/test_model_training.py::TestModelTraining::test_decreasing_loss_after_training PASSED
...
tests/behavioral/test_model_training.py::TestModelTraining::test_training_on_gpu PASSED

==================== 10 passed in XXXs ====================

Document Version: 1.1
Last Updated: November 15, 2025
Author: Hopcroft Team
Repository: se4ai2526-uniba/Hopcroft
Branch: behavioral-testing