Sanatbek committed on
Commit 6618b00 · verified · 1 Parent(s): a9cbdc5

Update README.md

Files changed (1)
  1. README.md +59 -56
README.md CHANGED
@@ -1,57 +1,60 @@
- ---
- language:
- - uz
- tags:
- - dependency-parsing
- - pos-tagging
- - stanza
- - uzbek
- - universal-dependencies
- license: mit
- datasets:
- - UD_Uzbek-UzUDT
- metrics:
- - uas
- - las
- - upos
- base_model: elmurod1202/bertbek-news-big-cased
- ---
-
- # UzUDT: Robust Uzbek Neural Dependency Parsing
-
- This repository contains the trained **Stanza-style neural models** for Uzbek morphosyntactic tagging and dependency parsing, as described in the paper *Towards Robust Uzbek Neural Dependency Parsing*.
-
- ## Model Description
- The system is designed to handle the agglutinative morphology and resource scarcity of Uzbek. It utilizes a **Stanza-like pipeline** augmented with:
- 1. **BERTbek Contextual Embeddings**: Utilizing the `elmurod1202/bertbek-news-big-cased` model with subword-to-word "super-token" fusion.
- 2. **Morphology-Aware Preprocessing**: An improved Apertium-based normalization layer to reduce sparsity.
-
- ## Performance (UzUDT Test Set)
- Evaluated on the 3-star **UzUDT treebank** (681 sentences).
-
- | Metric | Score (%) |
- | :--- | :--- |
- | **UPOS** | 86.10 |
- | **XPOS** | 83.96 |
- | **UAS** | 74.21 |
- | **LAS** | 66.90 |
- | **UFeats** | 70.06 |
-
- ## Usage
- Since the models are stored in custom directories (`pos/` and `depparse/`), you must specify the paths when loading the pipeline:
-
- ```python
- import stanza
-
- # configuration to point to the specific model files
- config = {
- 'pos_model_path': './pos/uz_uzudt-base_tagger.pt',
- 'depparse_model_path': './depparse/uz_uzudt_nocharlm_parser.pt',
- 'use_gpu': True
- }
-
- # Initialize the pipeline
- nlp = stanza.Pipeline(lang='uz', processors='tokenize,pos,lemma,depparse', **config)
-
- doc = nlp("Oʻzbekistonning poytaxti Toshkent shahridir.")
+ ---
+ language:
+ - uz
+ tags:
+ - dependency-parsing
+ - pos-tagging
+ - tokenization
+ - stanza
+ - uzbek
+ - universal-dependencies
+ license: mit
+ datasets:
+ - UD_Uzbek-UzUDT
+ metrics:
+ - uas
+ - las
+ - upos
+ base_model: elmurod1202/bertbek-news-big-cased
+ ---
+
+ # UzUDT: Robust Uzbek Neural Dependency Parsing
+
+ This repository contains the trained **Stanza-style neural models** for Uzbek tokenization, morphosyntactic tagging, and dependency parsing, as described in the paper *Towards Robust Uzbek Neural Dependency Parsing*.
+
+ ## Model Description
+ The system is designed to handle the agglutinative morphology and resource scarcity of Uzbek. It utilizes a **Stanza-like pipeline** augmented with:
+ 1. **BERTbek Contextual Embeddings**: Utilizing the `elmurod1202/bertbek-news-big-cased` model with subword-to-word "super-token" fusion.
+ 2. **Morphology-Aware Preprocessing**: An improved Apertium-based normalization layer to reduce sparsity.
+
+ ## Performance (UzUDT Test Set)
+ Evaluated on the 3-star **UzUDT treebank** (681 sentences).
+
+ | Metric | Score (%) |
+ | :--- | :--- |
+ | **UPOS** | 86.10 |
+ | **XPOS** | 83.96 |
+ | **UAS** | 74.21 |
+ | **LAS** | 66.90 |
+ | **UFeats** | 70.06 |
+
+ ## Usage
+ To use these models, download the `.pt` files to your local directory. You must specify the path to each model component (Tokenizer, POS, DepParse) in the configuration.
+
+ ```python
+ import stanza
+
+ # Configuration pointing to the local .pt files
+ config = {
+ 'tokenize_model_path': './uz_uzudt_tokenizer.pt',
+ 'pos_model_path': './uz_uzudt-base_tagger.pt',
+ 'depparse_model_path': './uz_uzudt_nocharlm_parser.pt',
+ 'use_gpu': True
+ }
+
+ # Initialize the pipeline
+ # Note: 'lemma' is excluded as it requires a separate model or external Apertium integration
+ nlp = stanza.Pipeline(lang='uz', processors='tokenize,pos,depparse', **config)
+
+ doc = nlp("Oʻzbekistonning poytaxti Toshkent shahridir.")
  doc.sentences[0].print_dependencies()
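
Review note on the updated usage snippet: the new README expects all three `.pt` files to sit in the working directory, so a missing download would only surface once the pipeline tries to load. A minimal sketch of a pre-flight check is shown below. The helper `build_config` and the `MODEL_FILES` mapping are hypothetical names introduced here for illustration. The file names and config keys are copied from the README, and the flat directory layout is an assumption.

```python
from pathlib import Path

# Config keys and file names as they appear in the updated README.
# Assumption: all files live side by side in one directory.
MODEL_FILES = {
    "tokenize_model_path": "uz_uzudt_tokenizer.pt",
    "pos_model_path": "uz_uzudt-base_tagger.pt",
    "depparse_model_path": "uz_uzudt_nocharlm_parser.pt",
}


def build_config(model_dir=".", use_gpu=True):
    """Hypothetical helper: return a config dict for the pipeline,
    raising early if any expected model file is missing."""
    base = Path(model_dir)
    missing = [name for name in MODEL_FILES.values() if not (base / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing model files in {base}: {', '.join(missing)}"
        )
    config = {key: str(base / name) for key, name in MODEL_FILES.items()}
    config["use_gpu"] = use_gpu
    return config
```

The returned dict could then be passed in place of the hand-written one, for example `stanza.Pipeline(lang='uz', processors='tokenize,pos,depparse', **build_config('.'))`, which is the same call as in the README.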