# Constant Edge 0.5: 1.46 MB Sentiment Analysis for Edge Devices
A 288x compressed sentiment classifier distilled from BERT. Runs on microcontrollers, mobile devices, and edge hardware with 0.14ms inference latency.
| Metric | Value |
|---|---|
| Accuracy | 83.03% (SST-2) |
| F1 | 0.830 |
| Model Size | 1.46 MB (INT8 quantized) |
| Parameters | 383,618 |
| Inference | 0.14ms (ONNX Runtime, CPU) |
| Compression | 288x vs. BERT teacher (420 MB) |
| Teacher Accuracy | 92.32% |
## Quick Start

The simplest path is the Aure SDK, which handles tokenization (simple whitespace splitting plus a vocabulary lookup) internally:

```python
# pip install aure
from aure import Aure

model = Aure("edge")
result = model.predict("I love this product!")
print(result)  # SentimentResult(label='positive', score=0.91)
```
## Standalone ONNX Inference (onnxruntime + NumPy only)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_edge.onnx")

# Input: token IDs as int64 array, shape [batch_size, seq_length]
# Max sequence length: 128
# Vocabulary: pruned to 10,907 tokens (from BERT's 30,522)
input_ids = np.array([[101, 1045, 2293, 2023, 3185, 999, 102]], dtype=np.int64)

logits = session.run(None, {"input_ids": input_ids})[0]

# Softmax over the two class logits
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

labels = ["negative", "positive"]
pred = labels[np.argmax(probs[0])]
confidence = float(probs[0].max())
print(f"{pred} ({confidence:.1%})")
```
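The card describes tokenization as simple whitespace splitting plus a vocabulary lookup. A minimal sketch of that preprocessing is below; note the assumptions: the pruned vocabulary is assumed to ship as a BERT-style `vocab.txt` (one token per line, line number = ID), and the special IDs 101 ([CLS]), 102 ([SEP]), 100 ([UNK]), and 0 ([PAD]) are assumed to follow the BERT convention visible in the example input above.

```python
# Whitespace tokenizer + vocabulary lookup (sketch, not the released tokenizer).
import numpy as np

def load_vocab(path: str) -> dict[str, int]:
    # Assumes a BERT-style vocab file: one token per line, line index = token ID.
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

def encode(text: str, vocab: dict[str, int], max_len: int = 128) -> np.ndarray:
    # Special IDs assumed to match the BERT convention used in the example above.
    cls_id, sep_id, unk_id, pad_id = 101, 102, 100, 0
    tokens = text.lower().split()
    ids = [vocab.get(t, unk_id) for t in tokens][: max_len - 2]
    ids = [cls_id] + ids + [sep_id]
    ids += [pad_id] * (max_len - len(ids))  # pad to the fixed sequence length
    return np.array([ids], dtype=np.int64)
```

The returned array plugs directly into `session.run(None, {"input_ids": ...})` from the snippet above.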
## Architecture

NanoCNN is a compact convolutional architecture optimized for sub-2MB deployment:
- Embedding: 32-dimensional, pruned vocabulary (10,907 tokens)
- Convolutions: 4 parallel Conv1d banks (filter sizes 2, 3, 4, 5), 64 filters each
- Compression: Linear bottleneck (256 → 16)
- Classifier: 16 → 48 → 2 (with dropout 0.3)
- Quantization: INT8 (post-training, ONNX)
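The listed shapes can be sketched in PyTorch as follows. This is a reconstruction from the bullet list, not the released architecture: pooling, padding, and activation choices are assumptions, so the parameter count lands near, but not exactly at, the reported 383,618.

```python
# NanoCNN sketch: embedding -> 4 parallel Conv1d banks -> bottleneck -> classifier.
import torch
import torch.nn as nn

class NanoCNN(nn.Module):
    def __init__(self, vocab_size=10907, emb_dim=32, n_filters=64,
                 kernel_sizes=(2, 3, 4, 5), bottleneck=16, hidden=48,
                 n_classes=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # 4 parallel Conv1d banks, 64 filters each -> 256 channels after concat
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes
        )
        self.bottleneck = nn.Linear(n_filters * len(kernel_sizes), bottleneck)  # 256 -> 16
        self.classifier = nn.Sequential(                                        # 16 -> 48 -> 2
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(hidden, n_classes),
        )

    def forward(self, input_ids):                       # [B, L] int64 token IDs
        x = self.embedding(input_ids).transpose(1, 2)   # [B, emb_dim, L]
        # Global max-pool over time for each conv bank, then concatenate
        feats = torch.cat(
            [torch.relu(c(x)).amax(dim=2) for c in self.convs], dim=1
        )                                               # [B, 256]
        return self.classifier(torch.relu(self.bottleneck(feats)))

model = NanoCNN()
logits = model(torch.randint(0, 10907, (1, 128)))
print(logits.shape)  # torch.Size([1, 2])
```

Almost all of the parameter budget sits in the embedding table (10,907 × 32 ≈ 349K of ~383K parameters), which is why the vocabulary pruning matters so much for the final model size.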
## Distillation Pipeline

Distilled from a BERT-base-uncased teacher through systematic experimentation:

```
BERT Teacher (92.32%, 420 MB)
    → Knowledge Distillation (T=6.39, α=0.69)
    → NanoCNN Student (83.03%, 1.46 MB)
```

Key distillation parameters (optimized via Optuna, 20 trials):
- Temperature: 6.39
- Distillation weight (α): 0.69
- Learning rate: 2e-3
- Epochs: 30
- Batch size: 128
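With these hyperparameters, the training objective can be sketched as the standard Hinton-style combination of a temperature-softened KL term and a hard-label cross-entropy term. The exact loss used in training is not specified in this card, so this formulation is an assumption; it is written in NumPy to match the standalone inference snippet.

```python
# Hinton-style distillation loss with the tuned hyperparameters from above.
# The exact formulation used in training is an assumption.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=6.39, alpha=0.69):
    # Soft-target term: KL(teacher_T || student_T), scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-label term.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1).mean()
    # Hard-label cross-entropy on the ground-truth labels
    log_p = np.log(softmax(student_logits) + 1e-12)
    ce = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

student = np.array([[2.0, -1.0], [0.5, 0.3]])
teacher = np.array([[3.0, -2.0], [1.0, -0.5]])
labels = np.array([0, 0])
print(distillation_loss(student, teacher, labels))
```

A high α (0.69) weights the teacher's soft targets over the hard labels, and the large temperature (6.39) flattens the teacher distribution so the student learns from the relative logit gaps rather than near-one-hot outputs.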
## Ablation Results
We tested multiple compression approaches. Linear projection consistently won:
### Teacher Compression (on BERT)
| Method | Accuracy | Params |
|---|---|---|
| Linear | 92.32% | 49K |
| Graph Laplacian | 92.20% | 639K |
| MLP (2-layer) | 92.09% | 213K |
### Student Compression (NanoCNN)
| Method | FP32 Accuracy | INT8 Accuracy | Size |
|---|---|---|---|
| Linear | 82.04% | 83.03% | 1.46 MB |
| MLP | 81.54% | 82.11% | 1.47 MB |
| Spectral | 81.15% | 82.00% | 1.48 MB |
### Architecture Comparison
| Model | Accuracy | Size | Compression |
|---|---|---|---|
| BERT Teacher | 92.32% | 420 MB | 1x |
| CNN Large | 83.94% | 31.8 MB | 13x |
| CNN TinyML | 83.14% | 3.4 MB | 124x |
| NanoCNN INT8 | 83.03% | 1.46 MB | 288x |
| Tiny Transformer | 80.16% | 6.4 MB | 66x |
The transformer student performs worse despite having 4x more parameters, indicating that CNN inductive biases are better suited to small-scale text classification.
## Multilingual Support

The Aure SDK supports 6 languages. Non-English models are downloaded on first use:

```python
from aure import Aure

# German
model = Aure("edge", lang="de")
model.predict("Das ist wunderbar!")  # "This is wonderful!" -> positive

# Japanese
model = Aure("edge", lang="ja")
model.predict("素晴らしい映画です。")  # "A wonderful movie." -> positive

# French, Spanish, and Chinese are also supported
```

Supported: en, de, fr, es, zh, ja
## Model Variants

| Variant | File | Size | Accuracy | Use Case |
|---|---|---|---|---|
| Edge (this model) | `model_edge.onnx` | 1.46 MB | 83.03% | MCUs, wearables, IoT |
| Edge 3-Class | `model_edge_3class.onnx` | 1.47 MB | ~82% | Positive/neutral/negative classification |
| Mobile | `model_mobile.onnx` | 4.0 MB | 83% | Mobile apps, Raspberry Pi |
## Hardware Targets

Tested on:
- NVIDIA Jetson Nano: 0.08ms inference
- Raspberry Pi 4: 0.9ms inference
- x86 CPU (i7): 0.14ms inference
- ARM Cortex-M7 (STM32H7): target <10ms (ONNX Micro Runtime)
## Training Details

- Dataset: SST-2 (Stanford Sentiment Treebank, binary), 67,349 train / 872 validation
- Teacher: BERT-base-uncased + linear compression head, fine-tuned 12 epochs
- Hardware: NVIDIA RTX 4090 Laptop GPU (16 GB), Windows 11
- Framework: PyTorch 2.x → ONNX export → INT8 quantization
- Reproducibility: 5-seed evaluation with standard deviations reported
## Negative Results (Published for Transparency)

- Graph Laplacian spectral compression provides no benefit over linear projection at either the teacher or the student level
- Progressive distillation (BERT → DistilBERT → Student) does not improve student quality vs. direct distillation
- Transformer students perform worse than CNN students at sub-2MB scale despite using 4x more parameters
## Citation

```bibtex
@misc{constantone2026aure,
  title={Aure: Pareto-Optimal Knowledge Distillation for Sub-2MB Sentiment Classification},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-edge-0.5}
}
```
## License

Apache 2.0. Free to use in commercial and non-commercial projects.