Commit fbea672 by ABTdomain · verified · 1 parent: 2d13fa6

Update README.md

Files changed (1)
  1. README.md +76 -3
README.md CHANGED
@@ -1,3 +1,76 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - en
+ - de
+ - fr
+ - es
+ - it
+ tags:
+ - word-segmentation
+ - onnx
+ - bilstm-crf
+ - text-processing
+ - domain-names
+ library_name: onnxruntime
+ pipeline_tag: token-classification
+ ---
+
+ # DKSplit
+
+ Word segmentation model for concatenated text. Split domain names, brand names, and phrases into words.
+
+ ## Model Description
+
+ - **Architecture:** BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
+ - **Format:** ONNX with INT8 quantization
+ - **Size:** ~9MB
+ - **Input:** Lowercase a-z, 0-9 (max 64 characters)
+
+ ## Usage
+
+ ### Install
+ ```bash
+ pip install dksplit
+ ```
+
+ ### Python
+ ```python
+ import dksplit
+
+ dksplit.split("chatgptlogin")
+ # ['chatgpt', 'login']
+
+ dksplit.split_batch(["openaikey", "microsoftoffice"])
+ # [['openai', 'key'], ['microsoft', 'office']]
+ ```
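Reviewer sketch (not part of the commit): the front matter tags this model as `token-classification`, so one way to read the outputs above is as a per-character labeling that is folded back into words. The helper below is a minimal illustration of that idea. The function name `labels_to_words` and the "1 marks the first character of a word" label scheme are assumptions made for this example, not the tag set DKSplit actually uses.

```python
from typing import List, Sequence


def labels_to_words(text: str, starts: Sequence[int]) -> List[str]:
    """Group characters into words from per-character start flags.

    starts[i] is truthy when text[i] begins a new word (hypothetical scheme).
    """
    if len(text) != len(starts):
        raise ValueError("need exactly one label per character")
    words: List[str] = []
    for char, is_start in zip(text, starts):
        # The first character always opens a word, even if its flag is 0.
        if is_start or not words:
            words.append(char)
        else:
            words[-1] += char
    return words


# 'c' (index 0) and 'l' (index 7) are flagged as word starts.
flags = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(labels_to_words("chatgptlogin", flags))
```

Under these assumptions the call reproduces the `['chatgpt', 'login']` split shown in the README example.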
+
+ ### Direct ONNX
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+
+ session = ort.InferenceSession("dksplit-int8.onnx")
+ # See GitHub for full inference code
+ ```
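Reviewer sketch (not part of the commit): the README defers the full inference code to GitHub. Because the architecture is described as BiLSTM-CRF, the step after running the ONNX session is plausibly a Viterbi pass over per-character tag scores. The pure-Python version below shows that generic technique only. The argument layout (`emissions[t][j]`, `transitions[i][j]`) is an assumption for illustration; the actual output names of the ONNX model and the arrays stored in `dksplit.npz` are not stated in this diff.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t (assumed layout)
    transitions[i][j] : score of moving from tag i to tag j (assumed layout)
    """
    if not emissions:
        return []
    num_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers: List[List[int]] = []
    for step in emissions[1:]:
        new_scores: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            # Best previous tag for a path that ends in tag j at this step.
            best_prev = max(
                range(num_tags), key=lambda i: scores[i] + transitions[i][j]
            )
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + step[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Walk back from the best final tag to recover the full path.
    path = [max(range(num_tags), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With all transition scores set to zero, this reduces to taking the best tag independently at each position; non-zero transitions let the decoder penalize implausible tag pairs.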
+
+ ## Files
+
+ - `dksplit-int8.onnx` - ONNX model (INT8 quantized)
+ - `dksplit.npz` - CRF parameters
+
+ ## Limitations
+
+ - Input: a-z, 0-9 only
+ - Max length: 64 characters
+ - Non-Latin scripts: use Romanized form
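Reviewer sketch (not part of the commit): the limitations listed above can be checked before any text reaches the model. The helper below is a minimal example of such a guard. The name `prepare_input` and the choice to lowercase and reject, rather than silently strip, unsupported characters are assumptions; the `dksplit` package may handle out-of-range input differently.

```python
import re

MAX_LENGTH = 64  # maximum input length stated in the README
_ALLOWED = re.compile(r"[a-z0-9]+")  # character set stated in the README


def prepare_input(text: str) -> str:
    """Lowercase the text and verify it fits the documented input limits."""
    candidate = text.strip().lower()
    if not candidate:
        raise ValueError("input is empty")
    if len(candidate) > MAX_LENGTH:
        raise ValueError(f"input exceeds {MAX_LENGTH} characters")
    if not _ALLOWED.fullmatch(candidate):
        raise ValueError("input may contain only a-z and 0-9")
    return candidate


print(prepare_input("ChatGPTLogin"))  # chatgptlogin
```

Rejecting bad input explicitly keeps the failure visible to the caller; a variant that strips dots or hyphens from domain names would be a separate design choice.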
+
+ ## Links
+
+ - [PyPI](https://pypi.org/project/dksplit/)
+ - [GitHub](https://github.com/ABTdomain/dksplit)
+ - [DomainKits](https://domainkits.com)
+ - [ABTdomain](https://ABTdomain.com)
+
+ ## License
+
+ MIT