Upload folder using huggingface_hub

Files changed (5)

README.md +178 -3
config.json +14 -0
model.onnx +3 -0
model.safetensors +3 -0
model_info.json +21 -0

README.md CHANGED

@@ -1,3 +1,178 @@
----
-license: mit
----

+---
+language: multilingual
+license: mit
+library_name: pytorch
+tags:
+- text-classification
+- language-detection
+- byte-level
+- multilingual
+- english-detection
+- cnn
+pipeline_tag: text-classification
+datasets:
+- custom
+metrics:
+- accuracy
+model-index:
+- name: innit
+  results:
+  - task:
+      type: text-classification
+      name: English vs Non-English Detection
+    metrics:
+    - type: accuracy
+      value: 99.94
+      name: Validation Accuracy
+    - type: accuracy
+      value: 100.0
+      name: Challenge Set Accuracy
+---
+# innit: Fast English vs Non-English Text Detection
+A lightweight byte-level CNN for fast binary language detection (English vs Non-English).
+## Model Details
+- **Model Type**: Byte-level Convolutional Neural Network
+- **Task**: Binary text classification (English vs Non-English)
+- **Architecture**: TinyByteCNN_EN with 6 convolutional blocks
+- **Parameters**: 156,642
+- **Input**: Raw UTF-8 bytes (max 256 bytes)
+- **Output**: Binary classification (0=Non-English, 1=English)
+## Performance
+- **Validation Accuracy**: 99.94%
+- **Challenge Set Accuracy**: 100% (14/14 test cases)
+- **Inference Speed**: Sub-millisecond on modern CPUs
+- **Model Size**: ~630 KB (`model.safetensors`: 634,264 bytes; `model.onnx`: 643,861 bytes)
+## Supported Languages
+Trained to distinguish English from 52+ languages across diverse scripts:
+- **Latin scripts**: Spanish, French, German, Italian, Dutch, Portuguese, etc.
+- **CJK scripts**: Chinese (Simplified/Traditional), Japanese, Korean
+- **Cyrillic scripts**: Russian, Ukrainian, Bulgarian, Serbian
+- **Other scripts**: Arabic, Hindi, Bengali, Thai, Hebrew, etc.
+## Architecture
+```
+TinyByteCNN_EN:
+├── Embedding: 257 → 80 dimensions (256 bytes + padding)
+├── 6x Convolutional Blocks:
+│   ├── Conv1D (kernel=3, residual connections)
+│   ├── GELU activation
+│   ├── BatchNorm1D
+│   └── Dropout (0.15)
+├── Enhanced Pooling: mean + max + std
+└── Classification Head: 240 → 80 → 2
+```
+## Training Data
+- **Total samples**: 17,543 balanced samples
+- **English**: 8,772 samples from diverse sources
+- **Non-English**: 8,771 samples across 52+ languages
+- **Text lengths**: 3-276 characters (optimized for short texts)
+- **Special coverage**: Emoji handling, mathematical formulas, scientific notation
+## Quick Start
+### Option 1: ONNX Runtime (Recommended)
+```python
+import onnxruntime as ort
+import numpy as np
+# Load ONNX model
+session = ort.InferenceSession("model.onnx")
+def predict(text):
+    # Prepare input
+    bytes_data = text.encode('utf-8', errors='ignore')[:256]
+    padded = np.zeros(256, dtype=np.int64)
+    padded[:len(bytes_data)] = list(bytes_data)
+    # Run inference
+    outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
+    logits = outputs[0][0]
+    # Apply softmax
+    exp_logits = np.exp(logits - np.max(logits))
+    probs = exp_logits / np.sum(exp_logits)
+    return probs[1]  # English probability
+# Examples
+print(predict("Hello world!"))           # ~1.0 (English)
+print(predict("Bonjour le monde"))       # ~0.0 (French)
+print(predict("Check our sale! 🎉"))     # ~1.0 (English with emoji)
+```
+### Option 2: Python Package
+```bash
+# Install the utility package
+pip install innit-detector
+# Download model first
+innit --download
+# CLI usage
+innit "Hello world!"                    # → English (confidence: 0.974)
+innit "Hello" "Bonjour" "你好"          # Multiple texts
+```
+```python
+# Library usage
+from innit_detector import InnitDetector
+detector = InnitDetector()
+result = detector.predict("Hello world!")
+print(result['is_english'])  # True
+```
+### Option 3: PyTorch (Advanced)
+```python
+import torch
+import torch.nn.functional as F
+from safetensors.torch import load_file
+import numpy as np
+# Load model (requires TinyByteCNN_EN class definition)
+state_dict = load_file("model.safetensors")
+model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
+model.load_state_dict(state_dict)
+model.eval()
+def predict(text):
+    bytes_data = text.encode('utf-8', errors='ignore')[:256]
+    padded = np.zeros(256, dtype=np.int64)  # np.long is not a portable 64-bit dtype; torch expects int64 indices
+    padded[:len(bytes_data)] = list(bytes_data)
+    with torch.no_grad():
+        logits = model(torch.tensor(padded).unsqueeze(0))
+        probs = F.softmax(logits, dim=1)
+        return probs[0][1].item()
+```
+## ONNX Support
+ONNX version available for cross-platform deployment:
+- `model.onnx` - Full precision (FP32) for maximum compatibility
+## Challenge Set Results
+All 14 challenge-set cases are classified correctly, including:
+- Ultra-short texts: "Good morning!" ✅
+- Emoji handling: "Check out our sale! 🎉" ✅
+- Mathematical formulas: "x = (-b ± √(b²-4ac))/2a" ✅
+- Scientific notation: "CO₂ + H₂O → C₆H₁₂O₆" ✅
+- Diverse scripts: Arabic, CJK, Cyrillic, Devanagari ✅
+- English-like languages: Dutch, German ✅
+## Limitations
+- Binary classification only (English vs Non-English)
+- Inputs are truncated to the first 256 UTF-8 bytes, so longer texts are classified from that prefix only
+- May have reduced accuracy on very rare languages not in training data
+- Not suitable for code-mixed text (multiple languages in a single input)
+## License
+MIT License - free for commercial use.
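
Note on Option 3 in the README above: it calls `TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)` but the class definition is not part of this commit. The block below is a hypothetical reconstruction from the architecture diagram only. The class name `TinyByteCNNSketch`, the layer attribute names, the layer order inside each block, and the activation and dropout in the head are all assumptions, so the published `model.safetensors` may not load into it without renaming keys. What can be checked is that this layout has the same parameter count as the README reports (156,642).

```python
import torch
import torch.nn as nn


class ResidualConvBlock(nn.Module):
    """Conv1d(k=3) -> GELU -> BatchNorm1d -> Dropout with a skip connection (assumed order)."""

    def __init__(self, channels: int, dropout: float):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.norm = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(self.norm(self.act(self.conv(x))))


class TinyByteCNNSketch(nn.Module):
    """Hypothetical stand-in for TinyByteCNN_EN; not the author's implementation."""

    def __init__(self, emb=80, blocks=6, dropout=0.15, vocab_size=257, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.blocks = nn.Sequential(*[ResidualConvBlock(emb, dropout) for _ in range(blocks)])
        # mean + max + std pooling triples the feature width: 3 * 80 = 240
        self.head = nn.Sequential(
            nn.Linear(3 * emb, emb),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(emb, num_classes),
        )

    def forward(self, byte_ids):
        x = self.embed(byte_ids).transpose(1, 2)  # (batch, emb, length)
        x = self.blocks(x)
        pooled = torch.cat([x.mean(dim=2), x.amax(dim=2), x.std(dim=2)], dim=1)
        return self.head(pooled)


model = TinyByteCNNSketch().eval()
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.zeros(1, 256, dtype=torch.long))
print(n_params, tuple(logits.shape))
```

The count breaks down as 20,560 (embedding) + 6 × 19,440 (conv plus batch-norm per block) + 19,280 + 162 (head). Matching the total is consistent with the diagram but does not confirm the exact layer order or key names.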

config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "architectures": [
+    "TinyByteCNN_EN"
+  ],
+  "model_type": "byte_cnn",
+  "emb_dim": 80,
+  "num_blocks": 6,
+  "dropout": 0.15,
+  "vocab_size": 257,
+  "num_classes": 2,
+  "max_length": 256,
+  "validation_accuracy": 99.94301994301995,
+  "torch_dtype": "float32"
+}
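
The `max_length` and `vocab_size` fields above describe the input the model expects. A minimal sketch of byte encoding driven by those two values follows; the helper name `encode_bytes` and the `pad_id=0` default are assumptions. The README snippets pad with 0, while its architecture diagram describes the 257th embedding row as padding, so the pad index used in training is not confirmed by this commit.

```python
import json

# Only the two fields needed here, copied from config.json above.
CONFIG = json.loads('{"vocab_size": 257, "max_length": 256}')


def encode_bytes(text, max_length=CONFIG["max_length"], pad_id=0):
    """Encode text as UTF-8 byte values, truncated and right-padded to max_length."""
    raw = text.encode("utf-8", errors="ignore")[:max_length]
    # Truncation is byte-wise, so a multi-byte character can be cut at the boundary.
    return list(raw) + [pad_id] * (max_length - len(raw))


ids = encode_bytes("Hello world!")
long_ids = encode_bytes("é" * 200)  # 400 bytes before truncation
print(len(ids), ids[:5], len(long_ids))
```

Every id stays below `vocab_size`, since byte values are 0 to 255 and the pad id here is 0.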

model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:692e33fc0d94ab5ec9436c8b84853c4662e739b0a6f28110894c383a06f913ac
+size 643861

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dcc8aae0bf9626072b33569b6097c73763029e62eaae3f6b0d571fbb426a061c
+size 634264
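
The two binary files are stored as Git LFS pointers, which record a SHA-256 digest and a byte size for the real payload. A minimal sketch of checking a downloaded file against its pointer is shown below; the helper names `parse_lfs_pointer` and `matches_pointer` and the local path in the trailing comment are assumptions.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(data, pointer):
    """True if the payload has the size and SHA-256 digest recorded in the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError("this sketch only handles sha256 oids")
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["digest"])


# Pointer text copied from the model.onnx diff above.
ONNX_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:692e33fc0d94ab5ec9436c8b84853c4662e739b0a6f28110894c383a06f913ac
size 643861
"""

pointer = parse_lfs_pointer(ONNX_POINTER)
print(pointer["algo"], pointer["size"])

# To verify a local download (path is an assumption):
# with open("model.onnx", "rb") as f:
#     print(matches_pointer(f.read(), pointer))
```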

model_info.json ADDED

	@@ -0,0 +1,21 @@

+{
+  "model_name": "innit",
+  "version": "1.0",
+  "task": "english_detection",
+  "architecture": "TinyByteCNN_EN",
+  "parameters": 156642,
+  "input_format": "utf8_bytes",
+  "max_length": 256,
+  "output_classes": [
+    "NOT-EN",
+    "EN"
+  ],
+  "validation_accuracy": 99.94,
+  "challenge_accuracy": 100.0,
+  "files": {
+    "pytorch": "model.safetensors",
+    "config": "config.json",
+    "onnx": "model.onnx",
+    "readme": "README.md"
+  }
+}
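
The `output_classes` list above gives the label for each logit index (0 is `NOT-EN`, 1 is `EN`), matching the README's output description. A small sketch of turning a pair of logits into a label and confidence using that list; the function name `label_from_logits` and the example logits are illustrative, not model output.

```python
import json
import math

# Only the field needed here, copied from model_info.json above.
MODEL_INFO = json.loads('{"output_classes": ["NOT-EN", "EN"]}')


def label_from_logits(logits, classes):
    """Apply a numerically stable softmax and return (label, probability) of the top class."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


# Made-up logits for illustration only.
label, confidence = label_from_logits([-2.0, 3.0], MODEL_INFO["output_classes"])
print(label, round(confidence, 4))
```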