	---
	language:
	- en
	- de
	- fr
	- es
	- it
	tags:
	- word-segmentation
	- onnx
	- bilstm-crf
	- text-processing
	- domain-names
	library_name: onnxruntime
	pipeline_tag: token-classification
	---

	# DKSplit

	Word segmentation model for concatenated text. It splits domain names, brand names, and phrases written without spaces into their component words.

	## Model Description

	- Architecture: BiLSTM-CRF (embedding size 384, hidden size 768, 3 layers)
	- Format: ONNX with INT8 quantization
	- Size: ~9 MB
	- Input: lowercase a-z and digits 0-9 (max 64 characters)
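
Because the model is a token classifier over characters, segmentation comes down to labelling each character and cutting the string where a word starts. The sketch below illustrates that last step only. The binary tag scheme (1 = first character of a word, 0 = continuation) and the function name `labels_to_words` are assumptions for illustration; the model's actual tag set is not documented in this card.

```python
from typing import List, Sequence


def labels_to_words(text: str, labels: Sequence[int]) -> List[str]:
    """Cut `text` before every character labelled 1 (hypothetical tag scheme)."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    words: List[str] = []
    start = 0
    for i in range(1, len(text)):
        if labels[i] == 1:  # a new word begins at position i
            words.append(text[start:i])
            start = i
    if text:
        words.append(text[start:])
    return words


print(labels_to_words("chatgptlogin", [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]))
# ['chatgpt', 'login']
```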

	## Usage

	### Install
	```bash
	pip install dksplit
	```

	### Python
	```python
	import dksplit

	dksplit.split("chatgptlogin")
	# ['chatgpt', 'login']

	dksplit.split_batch(["openaikey", "microsoftoffice"])
	# [['openai', 'key'], ['microsoft', 'office']]
	```
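
For domain-name inputs, the label usually has to be isolated before it is passed to the splitter. The helper below is a minimal sketch of that step. `split_domain` is a hypothetical name, not part of the `dksplit` package; it naively treats everything before the first dot as the label, and it takes the splitter as an argument so that `dksplit.split` can be passed in.

```python
from typing import Callable, List


def split_domain(domain: str, splitter: Callable[[str], List[str]]) -> List[str]:
    """Split the first label of `domain` into words using `splitter`."""
    # Naive label extraction: lowercase and keep the text before the first dot.
    label = domain.strip().lower().split(".", 1)[0]
    # Hyphens already mark word boundaries, so split on them first.
    words: List[str] = []
    for part in label.split("-"):
        if part:
            words.extend(splitter(part))
    return words
```

With the package installed, `split_domain("chatgptlogin.com", dksplit.split)` would hand `"chatgptlogin"` to the model. Subdomains such as `www.` are not handled by this sketch.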

	### Direct ONNX
	```python
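	import onnxruntime as ort
	import numpy as np

	# Load the INT8-quantized model.
	session = ort.InferenceSession("dksplit-int8.onnx")
	# Inspect the input names, shapes, and dtypes the model expects.
	for inp in session.get_inputs():
	    print(inp.name, inp.shape, inp.type)
	# See GitHub for full inference code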
	```

	## Files

	- `dksplit-int8.onnx` - ONNX model (INT8 quantized)
	- `dksplit.npz` - CRF parameters
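
The separate CRF parameter file suggests that tag decoding happens outside the ONNX graph. As an illustration of what CRF decoding involves, here is a generic Viterbi sketch in plain Python. It is an assumption, not the package's code: the shapes (`emissions` as one row of tag scores per character, `transitions[i][j]` as the score of moving from tag `i` to tag `j`) and the function name `viterbi_decode` are hypothetical, and the array names stored in `dksplit.npz` are not documented here.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag sequence for one input."""
    if not emissions:
        return []
    num_tags = len(transitions)
    # score[j]: best score of any path ending in tag j at the current step
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        next_score: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            # Best previous tag for landing on tag j
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            next_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = next_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag
    tag = max(range(num_tags), key=lambda j: score[j])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path
```

The returned list holds one tag index per input character. A real pipeline would build `emissions` from the model output and `transitions` from the stored CRF parameters.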

	## Limitations

	- Input: lowercase a-z and digits 0-9 only
	- Max length: 64 characters
	- Non-Latin scripts: not supported directly; use a Romanized form
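
These constraints can be checked before calling the model. The function below is a minimal sketch of such a pre-check. `normalize_input` is a hypothetical helper, not part of the `dksplit` API, and dropping unsupported characters and rejecting over-long input (rather than truncating) are arbitrary choices made for illustration. How the package itself handles out-of-range input is not stated here.

```python
import re

MAX_LEN = 64  # maximum input length stated in this model card


def normalize_input(text: str) -> str:
    """Lowercase `text` and keep only the characters a-z and 0-9."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input has {len(cleaned)} characters; the limit is {MAX_LEN}")
    return cleaned


print(normalize_input("ChatGPT-Login"))
# chatgptlogin
```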

	## Links

	- [PyPI](https://pypi.org/project/dksplit/)
	- [GitHub](https://github.com/ABTdomain/dksplit)
	- [DomainKits](https://domainkits.com)
	- [ABTdomain](https://ABTdomain.com)

	## License

	MIT