---
tags:
- influence-guided-training
- dataset-curation
- distilgpt2
datasets:
- DamarJati/indocorpus-sastra
- crmamede/vulnerability_detection__explainability
- jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca
language:
- en
license: apache-2.0
---

# gpt-2-vuln-code

This model was trained using **influence-guided dataset selection**, a technique that uses influence scores to identify the most impactful training data for specific concepts.

## Model Description

- **Base Model**: distilgpt2
- **Training Concepts**: vulnerability detection, static code analysis, SAST, secure coding practices, CWE, CVE, automated security testing, code review tools, threat modeling
- **Training Method**: Influence-guided data selection
- **Compute Budget**: 100 training steps per condition
- **Total Datasets**: 3

## Training Approach

This model was trained using three different data selection strategies to validate the effectiveness of influence-guided training:

1. **Positive Influence**: Datasets with high positive influence scores (most aligned with target concepts)
2. **Random Baseline**: Randomly sampled datasets
3. **Negative Influence**: Datasets with high negative influence scores (least aligned)

## Benchmark Results

| Condition | Perplexity ↓ | Train Loss ↓ | Eval Loss ↓ |
|-----------|-------------|--------------|-------------|
| Positive | 12.17 | 2.9640 | 2.4989 |
| Random | 4.81 | 1.9605 | 1.5703 |

*Lower is better for all metrics.*

In this run the Random baseline scored lower (better) than the Positive condition on all three metrics. No results are reported for the Negative condition.

## Training Datasets

The model was trained on datasets selected through influence scoring:

- `DamarJati/indocorpus-sastra` (Influence: -0.867)
- `crmamede/vulnerability_detection__explainability` (Influence: 0.621)
- `jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca` (Influence: -0.526)

## Intended Use

This model is intended to demonstrate influence-guided training for:

- Concept-specific language modeling
- Data-efficient training
- Dataset curation research

## Limitations

- Trained on a limited compute budget for benchmarking purposes
- May not generalize well outside the target concepts: vulnerability detection, static code analysis, SAST, secure coding practices, CWE, CVE, automated security testing, code review tools, threat modeling
- Performance depends on the quality of influence score estimation

## Citation

If you use this model or the influence-guided training approach, please cite:

```bibtex
@software{influence_guided_training,
  title = {Influence-Guided Dataset Selection for Language Models},
  author = {Learning Curator by Durinn},
  year = {2025},
  url = {https://huggingface.co/durinn/gpt-2-vuln-code}
}
```

## Model Card Contact

For questions or feedback, visit [Durinn](https://durinn.ai/contact)

---

*Generated by Learning Curator - AI-powered dataset discovery and training plan optimization*
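
## Checking the Reported Perplexity

The perplexity and eval loss columns in the Benchmark Results table appear to be related by perplexity = exp(eval loss). The card does not state how perplexity was computed, so the sketch below assumes the eval loss is a mean per-token cross-entropy in natural-log units. The `perplexity` helper and the dictionary names are illustrative and not part of the original training code.

```python
import math

# Values copied from the Benchmark Results table in this card.
EVAL_LOSS = {"Positive": 2.4989, "Random": 1.5703}
REPORTED_PERPLEXITY = {"Positive": 12.17, "Random": 4.81}


def perplexity(eval_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats.

    Assumption: the reported eval loss uses natural logarithms.
    """
    return math.exp(eval_loss)


if __name__ == "__main__":
    for condition, loss in EVAL_LOSS.items():
        computed = perplexity(loss)
        reported = REPORTED_PERPLEXITY[condition]
        print(f"{condition}: exp({loss}) = {computed:.2f} (reported {reported})")
```

Under that assumption, both rows of the table are reproduced to two decimal places, which suggests the perplexity column was derived from the eval loss and not measured separately.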