# Sentinel-D spaCy NER Model (Stage 1 — NVD Parsing)
## Model Details
- **Base Model**: spaCy blank English pipeline (`spacy.blank("en")`)
- **Task**: Named Entity Recognition (NER)
- **Training Date**: 2026-03-04T21:49:41.890810
- **Framework**: spaCy 3.x
- **Training Data Size**: 550 descriptions, plus a 50-example held-out test set
- **Training Epochs**: 20
- **Dropout**: 0.35
## Custom NER Labels
1. **VERSION_RANGE**: Semantic version strings or version constraints (e.g., "1.2.3", "< 2.0.0")
2. **API_SYMBOL**: Method, class, or function names (e.g., "queryset.filter()", "X.509")
3. **BREAKING_CHANGE**: References to incompatible API changes or deprecations
4. **FIX_ACTION**: Specific remediation steps or upgrade instructions
## Evaluation Metrics
| Metric | Value |
|--------|-------|
| Precision | 0.9111 |
| Recall | 0.7885 |
| F1 Score | 0.8454 |
| True Positives | 41 |
| False Positives | 4 |
| False Negatives | 11 |
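The precision, recall, and F1 values follow directly from the true positive, false positive, and false negative counts in the table. A minimal sketch of that arithmetic, using the standard definitions (the function name `prf_from_counts` is illustrative and not part of the model package):

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) computed from entity-level counts."""
    precision = tp / (tp + fp)  # share of predicted entities that are correct
    recall = tp / (tp + fn)     # share of gold entities that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Counts taken from the table above.
precision, recall, f1 = prf_from_counts(tp=41, fp=4, fn=11)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.9111 recall=0.7885 f1=0.8454
```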
## Usage
```python
import spacy
nlp = spacy.load("./spacy-nvd-ner-v1")
text = "OpenSSL versions before 1.1.1n contain a buffer overflow in the X.509 verifier."
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
# Output:
# 1.1.1n -> VERSION_RANGE
# X.509 -> API_SYMBOL
```
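Downstream code often wants the entities grouped by label rather than printed one by one. The helper below is a hypothetical sketch (`group_entities` is not shipped with the model); it only assumes the object it receives exposes `.ents` with `.text` and `.label_` attributes, as a spaCy `Doc` does.

```python
from collections import defaultdict


def group_entities(doc) -> dict[str, list[str]]:
    """Map each entity label to the entity texts found in `doc`, in document order."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)
```

With the pipeline loaded as above, `group_entities(nlp(text))` returns a plain dictionary keyed by label, for example `VERSION_RANGE` or `API_SYMBOL`.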
## Installation
1. Extract the zip archive to your project directory
2. Load the model using spaCy:
```python
import spacy
nlp = spacy.load("./spacy-nvd-ner-v1")
```
## Architecture
The model consists of:
- **Input Layer**: Vectorized token representations
- **Hidden Layer**: Feed-forward network with 0.35 dropout
- **Output Layer**: 4-class NER tagger (softmax)
## Training Configuration
- **Optimizer**: SGD
- **Batch Size Range**: 8-32 (compounding)
- **Training Data**: Real NVD descriptions, auto-annotated by a GLiNER teacher model
- **Constraint**: The held-out test set contains exactly 50 examples (Master Document requirement)
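A compounding batch size starts at the lower bound and is multiplied by a fixed rate at each step until it reaches the upper bound. The sketch below is a plain-Python illustration of that idea, similar in spirit to spaCy's compounding schedule. The 8 and 32 bounds come from the configuration above; the growth rate of 1.001 is an assumption, since the rate actually used in training is not stated here.

```python
from itertools import islice
from typing import Iterator


def compounding(start: float, stop: float, compound: float) -> Iterator[float]:
    """Yield an endless series that grows by `compound` each step, capped at `stop`."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Bounds from the training configuration; the 1.001 rate is an assumed value.
sizes = list(islice(compounding(8.0, 32.0, 1.001), 2000))
print(sizes[0], sizes[-1])  # starts at the lower bound, ends at the cap
```

Each yielded value would typically be truncated to an integer before being used as a batch size.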
## Known Limitations
- Model trained on NVD descriptions only; may not generalize to other security domains
- Entity boundaries may not align perfectly with whitespace
- Requires English text input
## License
MIT