---
language:
- en
- de
- fr
- es
- it
tags:
- word-segmentation
- onnx
- bilstm-crf
- text-processing
- domain-names
library_name: onnxruntime
pipeline_tag: token-classification
---

# DKSplit

A word segmentation model for concatenated text. It splits domain names, brand names, and phrases written without spaces into individual words.

## Model Description

- **Architecture:** BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
- **Format:** ONNX with INT8 quantization
- **Size:** ~9 MB
- **Input:** Lowercase a-z, 0-9 (max 64 characters)

## Usage

### Install

```bash
pip install dksplit
```

### Python

```python
import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]
```

### Direct ONNX

```python
import onnxruntime as ort
import numpy as np

# Load the INT8-quantized ONNX model.
session = ort.InferenceSession("dksplit-int8.onnx")
# See the GitHub repository for the full inference code.
```
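
Because the architecture is a BiLSTM-CRF, turning per-character scores into word boundaries typically involves a Viterbi pass over the CRF transition scores. The sketch below illustrates that generic decoding step in plain Python. It is not the DKSplit implementation: the two-tag scheme (tag `0` starts a word, tag `1` continues it), the score layout, the toy numbers, and the function names `viterbi_decode` and `tags_to_words` are all assumptions made for illustration. The full inference code is on GitHub.

```python
# Illustrative only: tag scheme, score layout, and names are assumptions,
# not the DKSplit API.

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_tags):
            # Best previous tag for ending in tag j at this position.
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            step_back.append(best_prev)
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
        scores = step_scores
        backpointers.append(step_back)
    # Start from the best final tag and follow the backpointers to the start.
    tag = max(range(num_tags), key=lambda i: scores[i])
    path = [tag]
    for step_back in reversed(backpointers):
        tag = step_back[tag]
        path.append(tag)
    path.reverse()
    return path


def tags_to_words(text, tags, begin_tag=0):
    """Group characters into words; begin_tag marks the first character of a word."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == begin_tag or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy scores for a 4-character input (made up for this example).
emissions = [[2.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
transitions = [[0.0, 0.0], [0.0, 0.0]]
tags = viterbi_decode(emissions, transitions)
print(tags_to_words("abcd", tags))  # ['ab', 'cd']
```

With all-zero transitions the result simply follows the highest emission at each position. Non-zero transition scores let the decoder override a locally attractive tag when the sequence as a whole scores better another way.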

## Files

- `dksplit-int8.onnx` - ONNX model (INT8 quantized)
- `dksplit.npz` - CRF parameters

## Limitations

- Input: a-z, 0-9 only
- Max length: 64 characters
- Non-Latin scripts: use Romanized form
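
One way to respect these limits is to normalize text before passing it to the model. The helper below is a minimal sketch using only the Python standard library. The name `normalize_input` and the choice to strip accents and silently drop unsupported characters are assumptions, not part of the `dksplit` package.

```python
import re
import unicodedata

MAX_LENGTH = 64  # maximum input length listed above


def normalize_input(text):
    """Reduce text to lowercase a-z and 0-9, or raise if the result is too long."""
    # Decompose accented letters (e.g. "é" becomes "e" plus a combining accent)
    # and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then remove everything outside a-z and 0-9
    # (hyphens, dots, spaces, and any remaining non-ASCII letters).
    cleaned = re.sub(r"[^a-z0-9]", "", stripped.lower())
    if len(cleaned) > MAX_LENGTH:
        raise ValueError(
            f"input has {len(cleaned)} characters; the limit is {MAX_LENGTH}"
        )
    return cleaned
```

For example, `normalize_input("Café-Möbel24")` returns `"cafemobel24"`. Letters with no accent decomposition (such as `ß`) and non-Latin scripts are dropped rather than transliterated, so those should be Romanized before calling the helper.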

## Links

- [PyPI](https://pypi.org/project/dksplit/)
- [GitHub](https://github.com/ABTdomain/dksplit)
- [DomainKits](https://domainkits.com)
- [ABTdomain](https://ABTdomain.com)

## License

MIT