---
language:
- en
- de
- fr
- es
- it
tags:
- word-segmentation
- onnx
- bilstm-crf
- text-processing
- domain-names
library_name: onnxruntime
pipeline_tag: token-classification
---
# DKSplit
A word segmentation model for concatenated text. It splits domain names, brand names, and run-together phrases into their component words.
## Model Description
- **Architecture:** BiLSTM-CRF (embedding size 384, hidden size 768, 3 layers)
- **Format:** ONNX with INT8 quantization
- **Size:** ~9MB
- **Input:** Lowercase a-z, 0-9 (max 64 characters)
## Usage
### Install
```bash
pip install dksplit
```
### Python
```python
import dksplit
dksplit.split("chatgptlogin")
# ['chatgpt', 'login']
dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]
```
### Direct ONNX
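```python
import onnxruntime as ort
import numpy as np  # needed by the full inference code, not by this snippet

# Load the INT8-quantized model file listed under "Files"
session = ort.InferenceSession("dksplit-int8.onnx")
# This only creates the session. See GitHub for the full inference code.
```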
## Files
- `dksplit-int8.onnx` - ONNX model (INT8 quantized)
- `dksplit.npz` - CRF parameters
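
Because the CRF parameters ship in a separate file, decoding presumably happens outside the ONNX graph. The sketch below shows generic Viterbi decoding, the standard way to pick the best tag sequence from a CRF. It is not taken from the dksplit source: the function names, the two-tag scheme (0 = word begins here, 1 = word continues), and the layout of the score tables are all assumptions made for illustration.

```python
def viterbi_decode(emissions, transitions, start=None, end=None):
    """Return the highest-scoring tag sequence for one input.

    emissions:   per-character tag scores, shape [T][K]
    transitions: transitions[i][j] is the score of tag i followed by tag j
    start, end:  optional per-tag boundary scores (default to zeros)
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    start = start or [0.0] * num_tags
    end = end or [0.0] * num_tags

    # Best score of any path ending in each tag at the first position
    score = [start[k] + emissions[0][k] for k in range(num_tags)]
    backptrs = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptrs.append(best_i)
        score = new_score
        backptrs.append(ptrs)

    # Follow the back-pointers from the best final tag
    last = max(range(num_tags), key=lambda k: score[k] + end[k])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path


def tags_to_words(text, tags, begin_tag=0):
    """Start a new word at every character tagged begin_tag (assumed tag scheme)."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == begin_tag or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words
```

In a real pipeline the emission scores would come from the ONNX session and the transition scores from `dksplit.npz`; the array names inside that file are not documented here, so check the GitHub repository before relying on any particular key.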
## Limitations
- Input: a-z, 0-9 only
- Max length: 64 characters
- Non-Latin scripts: romanize the text first, since only a-z and 0-9 are accepted
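
The limits above can be enforced before a string reaches the model. The following is a minimal sketch, not part of the dksplit API: `prepare_input` and `MAX_LEN` are hypothetical names, and whether the library already normalizes input itself is not stated in this README.

```python
import re

MAX_LEN = 64  # maximum input length stated in this README
_ALLOWED = re.compile(r"[a-z0-9]+")


def prepare_input(text):
    """Lowercase text, drop everything outside a-z / 0-9, and check the length limit."""
    cleaned = "".join(_ALLOWED.findall(text.lower()))
    if not cleaned:
        raise ValueError("input has no a-z or 0-9 characters")
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input is {len(cleaned)} characters; the limit is {MAX_LEN}")
    return cleaned
```

For example, `prepare_input("ChatGPT-Login")` returns `"chatgptlogin"`, which can then be passed to `dksplit.split`. The helper silently drops separators such as dots and hyphens, so strip a domain's TLD beforehand if it should not be part of the segmentation.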
## Links
- [PyPI](https://pypi.org/project/dksplit/)
- [GitHub](https://github.com/ABTdomain/dksplit)
- [DomainKits](https://domainkits.com)
- [ABTdomain](https://ABTdomain.com)
## License
MIT