# Constant Edge 0.5: 1.46 MB Sentiment Analysis for Edge Devices
A 288x compressed sentiment classifier distilled from BERT. Runs on microcontrollers, mobile devices, and edge hardware with 0.14ms inference latency.
| Metric | Value |
|---|---|
| Accuracy | 83.03% (SST-2) |
| F1 | 0.830 |
| Model Size | 1.46 MB (INT8 quantized) |
| Parameters | 383,618 |
| Inference | 0.14ms (ONNX Runtime, CPU) |
| Compression | 288x vs. BERT teacher (420 MB) |
| Teacher Accuracy | 92.32% |
## Quick Start

The simplest path is the Aure SDK, which handles tokenization (simple whitespace splitting plus a vocabulary lookup) internally:

```python
# pip install aure
from aure import Aure

model = Aure("edge")
result = model.predict("I love this product!")
print(result)  # SentimentResult(label='positive', score=0.91)
```
## Standalone ONNX Inference (onnxruntime + NumPy only)

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_edge.onnx")

# Input: token IDs as int64 array, shape [batch_size, seq_length]
# Max sequence length: 128
# Vocabulary: pruned to 10,907 tokens (from BERT's 30,522)
input_ids = np.array([[101, 1045, 2293, 2023, 3185, 999, 102]], dtype=np.int64)

logits = session.run(None, {"input_ids": input_ids})[0]

# Softmax over the two class logits
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

labels = ["negative", "positive"]
pred = labels[np.argmax(probs[0])]
confidence = float(probs[0].max())
print(f"{pred} ({confidence:.1%})")
```
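The card describes tokenization as simple whitespace splitting plus a vocabulary lookup. A minimal sketch of that preprocessing is below; note the assumptions: the pruned vocabulary is assumed to ship as a BERT-style `vocab.txt` (one token per line, line number = ID), and the special IDs 101 ([CLS]), 102 ([SEP]), 100 ([UNK]), and 0 ([PAD]) are assumed to follow the BERT convention visible in the example input above.

```python
# Whitespace tokenizer + vocabulary lookup (sketch, not the released tokenizer).
import numpy as np

def load_vocab(path: str) -> dict[str, int]:
    # Assumes a BERT-style vocab file: one token per line, line index = token ID.
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}

def encode(text: str, vocab: dict[str, int], max_len: int = 128) -> np.ndarray:
    # Special IDs assumed to match the BERT convention used in the example above.
    cls_id, sep_id, unk_id, pad_id = 101, 102, 100, 0
    tokens = text.lower().split()
    ids = [vocab.get(t, unk_id) for t in tokens][: max_len - 2]
    ids = [cls_id] + ids + [sep_id]
    ids += [pad_id] * (max_len - len(ids))  # pad to the fixed sequence length
    return np.array([ids], dtype=np.int64)
```

The returned array plugs directly into `session.run(None, {"input_ids": ...})` from the snippet above.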
## Architecture

NanoCNN is a compact convolutional architecture optimized for sub-2MB deployment:
- Embedding: 32-dimensional, pruned vocabulary (10,907 tokens)
- Convolutions: 4 parallel Conv1d banks (filter sizes 2, 3, 4, 5), 64 filters each
- Compression: Linear bottleneck (256 → 16)
- Classifier: 16 → 48 → 2 (with dropout 0.3)
- Quantization: INT8 (post-training, ONNX)
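The listed shapes can be sketched in PyTorch as follows. This is a reconstruction from the bullet list, not the released architecture: pooling, padding, and activation choices are assumptions, so the parameter count lands near, but not exactly at, the reported 383,618.

```python
# NanoCNN sketch: embedding -> 4 parallel Conv1d banks -> bottleneck -> classifier.
import torch
import torch.nn as nn

class NanoCNN(nn.Module):
    def __init__(self, vocab_size=10907, emb_dim=32, n_filters=64,
                 kernel_sizes=(2, 3, 4, 5), bottleneck=16, hidden=48,
                 n_classes=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # 4 parallel Conv1d banks, 64 filters each -> 256 channels after concat
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes
        )
        self.bottleneck = nn.Linear(n_filters * len(kernel_sizes), bottleneck)  # 256 -> 16
        self.classifier = nn.Sequential(                                        # 16 -> 48 -> 2
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(hidden, n_classes),
        )

    def forward(self, input_ids):                       # [B, L] int64 token IDs
        x = self.embedding(input_ids).transpose(1, 2)   # [B, emb_dim, L]
        # Global max-pool over time for each conv bank, then concatenate
        feats = torch.cat(
            [torch.relu(c(x)).amax(dim=2) for c in self.convs], dim=1
        )                                               # [B, 256]
        return self.classifier(torch.relu(self.bottleneck(feats)))

model = NanoCNN()
logits = model(torch.randint(0, 10907, (1, 128)))
print(logits.shape)  # torch.Size([1, 2])
```

Almost all of the parameter budget sits in the embedding table (10,907 × 32 ≈ 349K of ~383K parameters), which is why the vocabulary pruning matters so much for the final model size.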
## Distillation Pipeline

Distilled from a BERT-base-uncased teacher through systematic experimentation:

```
BERT Teacher (92.32%, 420 MB)
    → Knowledge Distillation (T=6.39, α=0.69)
    → NanoCNN Student (83.03%, 1.46 MB)
```

Key distillation parameters (optimized via Optuna, 20 trials):
- Temperature: 6.39
- Distillation weight (α): 0.69
- Learning rate: 2e-3
- Epochs: 30
- Batch size: 128
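With these hyperparameters, the training objective can be sketched as the standard Hinton-style combination of a temperature-softened KL term and a hard-label cross-entropy term. The exact loss used in training is not specified in this card, so this formulation is an assumption; it is written in NumPy to match the standalone inference snippet.

```python
# Hinton-style distillation loss with the tuned hyperparameters from above.
# The exact formulation used in training is an assumption.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=6.39, alpha=0.69):
    # Soft-target term: KL(teacher_T || student_T), scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-label term.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1).mean()
    # Hard-label cross-entropy on the ground-truth labels
    log_p = np.log(softmax(student_logits) + 1e-12)
    ce = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

student = np.array([[2.0, -1.0], [0.5, 0.3]])
teacher = np.array([[3.0, -2.0], [1.0, -0.5]])
labels = np.array([0, 0])
print(distillation_loss(student, teacher, labels))
```

A high α (0.69) weights the teacher's soft targets over the hard labels, and the large temperature (6.39) flattens the teacher distribution so the student learns from the relative logit gaps rather than near-one-hot outputs.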
## Ablation Results
We tested multiple compression approaches. Linear projection consistently won:
### Teacher Compression (on BERT)
| Method | Accuracy | Params |
|---|---|---|
| Linear | 92.32% | 49K |
| Graph Laplacian | 92.20% | 639K |
| MLP (2-layer) | 92.09% | 213K |
### Student Compression (NanoCNN)
| Method | FP32 Accuracy | INT8 Accuracy | Size |
|---|---|---|---|
| Linear | 82.04% | 83.03% | 1.46 MB |
| MLP | 81.54% | 82.11% | 1.47 MB |
| Spectral | 81.15% | 82.00% | 1.48 MB |
### Architecture Comparison
| Model | Accuracy | Size | Compression |
|---|---|---|---|
| BERT Teacher | 92.32% | 420 MB | 1x |
| CNN Large | 83.94% | 31.8 MB | 13x |
| CNN TinyML | 83.14% | 3.4 MB | 124x |
| NanoCNN INT8 | 83.03% | 1.46 MB | 288x |
| Tiny Transformer | 80.16% | 6.4 MB | 66x |
The transformer student performs worse despite having 4x more parameters, indicating that CNN inductive biases are better suited to small-scale text classification.
## Multilingual Support

The Aure SDK supports 6 languages. Non-English models are downloaded on first use:

```python
from aure import Aure

# German
model = Aure("edge", lang="de")
model.predict("Das ist wunderbar!")  # "This is wonderful!" -> positive

# Japanese
model = Aure("edge", lang="ja")
model.predict("素晴らしい映画です。")  # "A wonderful movie." -> positive

# French, Spanish, and Chinese are also supported
```

Supported: en, de, fr, es, zh, ja
## Model Variants

| Variant | File | Size | Accuracy | Use Case |
|---|---|---|---|---|
| Edge (this model) | `model_edge.onnx` | 1.46 MB | 83.03% | MCUs, wearables, IoT |
| Edge 3-Class | `model_edge_3class.onnx` | 1.47 MB | ~82% | Positive/neutral/negative classification |
| Mobile | `model_mobile.onnx` | 4.0 MB | 83% | Mobile apps, Raspberry Pi |
## Hardware Targets

Tested on:
- NVIDIA Jetson Nano: 0.08ms inference
- Raspberry Pi 4: 0.9ms inference
- x86 CPU (i7): 0.14ms inference
- ARM Cortex-M7 (STM32H7): target <10ms (ONNX Micro Runtime)
## Training Details

- Dataset: SST-2 (Stanford Sentiment Treebank, binary), 67,349 train / 872 validation
- Teacher: BERT-base-uncased + linear compression head, fine-tuned 12 epochs
- Hardware: NVIDIA RTX 4090 Laptop GPU (16 GB), Windows 11
- Framework: PyTorch 2.x → ONNX export → INT8 quantization
- Reproducibility: 5-seed evaluation with standard deviations reported
## Negative Results (Published for Transparency)

- Graph Laplacian spectral compression provides no benefit over linear projection at either the teacher or the student level
- Progressive distillation (BERT → DistilBERT → Student) does not improve student quality vs. direct distillation
- Transformer students perform worse than CNN students at sub-2MB scale despite using 4x more parameters
## Citation

```bibtex
@misc{constantone2026aure,
  title={Aure: Pareto-Optimal Knowledge Distillation for Sub-2MB Sentiment Classification},
  author={ConstantOne AI},
  year={2026},
  url={https://huggingface.co/ConstantQJ/constant-edge-0.5}
}
```
## License

Apache 2.0. Free to use in commercial and non-commercial projects.