+---
+tags:
+- influence-guided-training
+- dataset-curation
+- distilgpt2
+datasets:
+- DamarJati/indocorpus-sastra
+- crmamede/vulnerability_detection__explainability
+- jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca
+language:
+- en
+license: apache-2.0
+---
+# gpt-2-vuln-code
+This model was trained using **influence-guided dataset selection**, a technique that uses influence scores to identify the most impactful training data for specific concepts.
+## Model Description
+- **Base Model**: distilgpt2
+- **Training Concepts**: vulnerability detection, static code analysis, SAST, secure coding practices, CWE, CVE, automated security testing, code review tools, threat modeling
+- **Training Method**: Influence-guided data selection
+- **Compute Budget**: 100 steps per condition
+- **Total Datasets**: 3
+## Training Approach
+Three data selection strategies were compared under the same compute budget to evaluate influence-guided training:
+1. **Positive Influence**: Datasets with the highest positive influence scores (most aligned with the target concepts)
+2. **Random Baseline**: Randomly sampled datasets
+3. **Negative Influence**: Datasets with the most negative influence scores (least aligned with the target concepts)
+## Benchmark Results
+| Condition | Perplexity ↓ | Train Loss ↓ | Eval Loss ↓ |
+|-----------|-------------|--------------|-------------|
+| Positive | 12.17 | 2.9640 | 2.4989 |
+| Random | 4.81 | 1.9605 | 1.5703 |
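+*Lower is better for all metrics. In this run, the Random baseline scored lower than the Positive condition on all three metrics. No results are reported for the Negative Influence condition.*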
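The perplexity and eval loss columns in the table are consistent with the usual relationship perplexity = exp(mean per-token loss). The following is a minimal consistency check, assuming the reported eval loss is a mean negative log-likelihood in nats; it is not the evaluation script used to produce the table.

```python
import math

# Eval losses copied from the Benchmark Results table.
EVAL_LOSS = {"Positive": 2.4989, "Random": 1.5703}


def perplexity(mean_nll):
    """Return exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


for condition, loss in EVAL_LOSS.items():
    # Prints 12.17 for Positive and 4.81 for Random, matching the table.
    print(f"{condition}: {perplexity(loss):.2f}")
```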
+## Training Datasets
+The model was trained on datasets selected through influence scoring:
+- `DamarJati/indocorpus-sastra` (Influence: -0.867)
+- `crmamede/vulnerability_detection__explainability` (Influence: 0.621)
+- `jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca` (Influence: -0.526)
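As an illustration of how the three conditions described under Training Approach could be derived from scores like these, here is a minimal sketch. The helper `select_datasets`, the top-`k` rule, and the default `k=1` are assumptions made for this example; the card does not state how many datasets each condition used or how the random baseline was sampled.

```python
import random

# Influence scores as listed in this model card.
INFLUENCE = {
    "DamarJati/indocorpus-sastra": -0.867,
    "crmamede/vulnerability_detection__explainability": 0.621,
    "jason-oneal/mitre-stix-cve-exploitdb-dataset-alpaca": -0.526,
}


def select_datasets(scores, condition, k=1, seed=0):
    """Pick k dataset names for one condition (hypothetical helper)."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    if condition == "positive":
        return ranked[:k]
    if condition == "negative":
        return ranked[::-1][:k]  # lowest score first
    if condition == "random":
        # Seeded so the baseline is reproducible across runs.
        return random.Random(seed).sample(sorted(scores), k)
    raise ValueError(f"unknown condition: {condition}")


for condition in ("positive", "random", "negative"):
    print(condition, select_datasets(INFLUENCE, condition))
```

With these scores, only one dataset has a positive influence, so any `k` greater than 1 in the positive condition would start pulling in negatively scored datasets.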
+## Intended Use
+This model is intended for research on influence-guided training, including:
+- Concept-specific language modeling
+- Data-efficient training
+- Dataset curation research
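For experimentation, the checkpoint can probably be loaded like any other GPT-2 style causal language model. The sketch below assumes the repository `durinn/gpt-2-vuln-code` (the id in the citation URL) contains weights and tokenizer files loadable through the `transformers` Auto classes; the `generate` helper and its defaults are illustrative, not part of the released model.

```python
MODEL_ID = "durinn/gpt-2-vuln-code"  # repo id taken from the citation URL


def generate(prompt, max_new_tokens=40):
    """Generate a continuation for `prompt` (illustrative helper)."""
    # Imported lazily so defining this function does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizers define no pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("A common cause of SQL injection is")` downloads the checkpoint on first use. Given the 100-step training budget, output quality should be treated as a benchmark artifact rather than a usable security assistant.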
+## Limitations
+- Trained on a limited compute budget for benchmarking purposes
+- May not generalize well outside the target concepts listed under Model Description
+- Performance depends on the quality of influence score estimation
+## Citation
+If you use this model or the influence-guided training approach, please cite:
+```bibtex
+@software{influence_guided_training,
+  title = {Influence-Guided Dataset Selection for Language Models},
+  author = {Learning Curator by Durinn},
+  year = {2025},
+  url = {https://huggingface.co/durinn/gpt-2-vuln-code}
+}
+```
+## Model Card Contact
+For questions or feedback, visit [Durinn](https://durinn.ai/contact).
+---
+*Generated by Learning Curator - AI-powered dataset discovery and training plan optimization*