---
license: mit
pipeline_tag: text-classification
tags:
- shell-safety
- classifier
- aprender
- rust
- bashrs
model-index:
- name: paiml/shell-safety-classifier
  results:
  - task:
      type: text-classification
    dataset:
      name: bashrs-corpus
      type: custom
    metrics:
    - name: Train Accuracy
      type: accuracy
      value: 0.966
    - name: Validation Accuracy
      type: accuracy
      value: 0.632
    - name: Training Samples
      type: custom
      value: "17942"
---
# Shell Safety Classifier
Classifies shell scripts into 5 safety categories using a lightweight MLP trained on the [bashrs](https://github.com/paiml/bashrs) corpus.
## Labels
| Index | Label | Description |
|-------|-------|-------------|
| 0 | safe | Script is deterministic, idempotent, and properly quoted |
| 1 | needs-quoting | Contains unquoted variables susceptible to word splitting |
| 2 | non-deterministic | Uses `$RANDOM`, timestamps, process IDs, or other non-deterministic sources |
| 3 | non-idempotent | Operations not safe to re-run (e.g. `mkdir` without `-p`, `rm` without `-f`) |
| 4 | unsafe | Security issues (injection vectors, privilege escalation) |
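
The index-to-label mapping in the table can be sketched as a small Rust enum. `Label`, `from_index`, and `name` are hypothetical names used for illustration only; they are not the `SafetyClass` type exported by aprender, whose definition is not shown in this card.

```rust
/// Hypothetical mirror of the label table; not aprender's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Label {
    Safe,
    NeedsQuoting,
    NonDeterministic,
    NonIdempotent,
    Unsafe,
}

impl Label {
    /// Map a model output index (0..=4) to a label; other indices yield None.
    fn from_index(index: usize) -> Option<Label> {
        match index {
            0 => Some(Label::Safe),
            1 => Some(Label::NeedsQuoting),
            2 => Some(Label::NonDeterministic),
            3 => Some(Label::NonIdempotent),
            4 => Some(Label::Unsafe),
            _ => None,
        }
    }

    /// The label string as written in the table.
    fn name(self) -> &'static str {
        match self {
            Label::Safe => "safe",
            Label::NeedsQuoting => "needs-quoting",
            Label::NonDeterministic => "non-deterministic",
            Label::NonIdempotent => "non-idempotent",
            Label::Unsafe => "unsafe",
        }
    }
}

fn main() {
    // Print every index with its label name.
    for index in 0..5 {
        if let Some(label) = Label::from_index(index) {
            println!("{} -> {}", index, label.name());
        }
    }
}
```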
## Architecture
- **Model**: MLP classifier (ShellVocabulary token embeddings -> 128 -> 64 -> 5)
- **Tokenizer**: ShellVocabulary (250 shell-specific tokens, max_seq_len=64)
- **Format**: SafeTensors (model.safetensors) + JSON config + vocab
- **Framework**: [aprender](https://github.com/paiml/aprender) (pure Rust ML, no Python dependencies)
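
As a rough size check on the layer stack above, the parameter count of a fully connected network can be computed from its layer widths. This is a sketch under two assumptions the card does not state: an input width of 64 (one feature per token position, matching `max_seq_len`) and 32-bit float weights. `dense_param_count` is a hypothetical helper, not part of aprender.

```rust
/// Count weights and biases of a fully connected network given its layer widths.
fn dense_param_count(widths: &[usize]) -> usize {
    widths
        .windows(2)
        .map(|pair| pair[0] * pair[1] + pair[1]) // weight matrix + bias vector
        .sum()
}

fn main() {
    // Assumed input width 64, followed by the 128 -> 64 -> 5 stack from the card.
    let widths = [64, 128, 64, 5];
    let params = dense_param_count(&widths);
    let bytes = params * 4; // assuming f32 weights
    println!("{} parameters, about {} bytes as f32", params, bytes);
}
```

Under those assumptions the stack has 16,901 parameters, or 67,604 bytes of raw weights. That is close to the 68 KB listed for `model.safetensors`, though the match does not confirm the assumed input width.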
## Training
- **Corpus**: bashrs v2 corpus (17,942 entries: 16,431 Bash + 804 Makefile + 707 Dockerfile)
- **Split**: 80/20 train/validation (14,353 / 3,589)
- **Epochs**: 50
- **Optimizer**: Adam (lr=0.01)
- **Loss**: CrossEntropyLoss
- **Train accuracy**: 96.6%
- **Validation accuracy**: 63.2%
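
The corpus and split sizes above can be reproduced with integer arithmetic. This sketch assumes the training share is rounded down; the card does not describe how the split was actually drawn (shuffling, stratification), and `split_counts` is a hypothetical helper.

```rust
/// Split `total` samples into (train, validation), giving `train_pct` percent
/// to training and rounding the training share down.
fn split_counts(total: usize, train_pct: usize) -> (usize, usize) {
    let train = total * train_pct / 100;
    (train, total - train)
}

fn main() {
    let total = 16_431 + 804 + 707; // Bash + Makefile + Dockerfile entries
    let (train, validation) = split_counts(total, 80);
    println!("total={} train={} validation={}", total, train, validation);
}
```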
### Class Distribution
| Label | Count | Percentage |
|-------|-------|------------|
| safe | 16,126 | 89.9% |
| needs-quoting | 1,814 | 10.1% |
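| unsafe | 2 | 0.01% |

These counts sum to the full 17,942 entries, so the non-deterministic and non-idempotent labels have no examples in this corpus.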
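
The percentages in the table follow directly from the counts. The sketch below recomputes them and also reports the share of the largest class, which is the accuracy a trivial always-predict-majority model would reach on the full corpus. That baseline is a standard reference point for imbalanced data, not a figure reported by the card, and `share_pct` is a hypothetical helper.

```rust
/// Percentage share of `count` within `total`.
fn share_pct(count: u32, total: u32) -> f64 {
    100.0 * f64::from(count) / f64::from(total)
}

fn main() {
    let counts = [("safe", 16_126u32), ("needs-quoting", 1_814), ("unsafe", 2)];
    let total: u32 = counts.iter().map(|&(_, n)| n).sum();
    for &(label, n) in counts.iter() {
        println!("{:<14} {:>6} {:>6.2}%", label, n, share_pct(n, total));
    }
    // Accuracy of a trivial model that always predicts the largest class.
    let majority = counts.iter().map(|&(_, n)| n).max().unwrap_or(0);
    println!("majority-class baseline: {:.1}%", share_pct(majority, total));
}
```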
## Usage
### With bashrs CLI
```bash
# Classify a single script
bashrs classify script.sh
# Classify with format detection
bashrs classify Makefile --format makefile
# Multi-label classification
bashrs classify script.sh --multi-label
```
### With aprender (Rust)
```rust
use aprender::models::shell_safety::{ShellSafetyClassifier, SafetyClass};
let classifier = ShellSafetyClassifier::load("/path/to/model")?;
let result = classifier.predict("echo $HOME")?;
// result: SafetyClass::NeedsQuoting
```
## Files
| File | Size | Description |
|------|------|-------------|
| model.safetensors | 68 KB | Model weights |
| vocab.json | 3.6 KB | Shell tokenizer vocabulary |
| config.json | 371 B | Model architecture config |
## Limitations
- The v2.0 MLP reaches only 63.2% validation accuracy against 96.6% training accuracy, a gap attributed to class imbalance and the simple architecture
- Best suited for binary safe/unsafe classification (96%+ accuracy when collapsing to 2 classes)
- A Qwen2.5-Coder fine-tuned version is planned for higher accuracy on minority classes
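
Collapsing the five labels to two classes could look like the sketch below. The card does not say how the classes are merged, so the rule used here (index 0 stays safe, indices 1 through 4 all count as not safe) is an assumption, and `Binary` and `collapse` are hypothetical names.

```rust
/// Hypothetical two-class view of the five label indices.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Binary {
    Safe,
    NotSafe,
}

/// Assumption: index 0 (safe) stays safe; indices 1..=4 are all treated as not safe.
/// Indices outside the label table yield None.
fn collapse(index: usize) -> Option<Binary> {
    match index {
        0 => Some(Binary::Safe),
        1..=4 => Some(Binary::NotSafe),
        _ => None,
    }
}

fn main() {
    // Show the two-class view of each of the five label indices.
    for index in 0..5 {
        println!("{} -> {:?}", index, collapse(index));
    }
}
```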
## License
MIT