umutckmk2/tr_tokenizer

umutckmk2 committed on Dec 4, 2024

Commit 2e71ee6 (1 parent: 72772cb)

Refactor README.md to remove licensing section and unsupported tasks, streamlining content for clarity

Files changed (1)

README.md +0 -10

@@ -1,11 +1,8 @@
 ---
-license: cc-by-nc-4.0
 language:
 - tr
 ---
-# TR Tokenizer: Turkish Word Segmentation Tool Based on Semantic Integrity
 ## Tokenizer Summary
 TR Tokenizer is an innovative FastTokenizer that splits Turkish words according to their semantic integrity, using both current natural language processing methods and Turkish grammar rules. This fast and efficient tokenizer provides accurate and detailed results by analyzing words morphologically and semantically. For example, the sentence "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar" (academics and their families are actively working together) is split into the following parts:
@@ -13,13 +10,6 @@ TR Tokenizer is an innovative FastTokenizer that splits Turkish words according
 ['akademisyen', 'ler', 've', 'aile', 'leri', 'ile', 'birlikte', 'aktif', 'çalış', 'ı', 'yor', 'lar']
 ```
-## Supported Tasks and Applications
-TR Tokenizer can be used for the following NLP tasks:
-- **Morphological Analysis**: Analyzes the root and suffix structures of words.
-- **Language Model Training and Fine-tuning**: Processes words according to their semantic integrity during the preprocessing phase of Turkish language model training.
-- **Frequency Analysis**: Assists in determining word frequencies in texts.
-- **Natural Language Processing (NLP) Research**: Used in research studying the morphological structure and word formations of the Turkish language.
 ## Languages
 This tokenizer focuses on the **Turkish** language and is designed to support Turkish's rich morphological structure.

 ---
 language:
 - tr
 ---
 ## Tokenizer Summary
 TR Tokenizer is an innovative FastTokenizer that splits Turkish words according to their semantic integrity, using both current natural language processing methods and Turkish grammar rules. This fast and efficient tokenizer provides accurate and detailed results by analyzing words morphologically and semantically. For example, the sentence "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar" (academics and their families are actively working together) is split into the following parts:
 ['akademisyen', 'ler', 've', 'aile', 'leri', 'ile', 'birlikte', 'aktif', 'çalış', 'ı', 'yor', 'lar']
 ```
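The segmentation listed in the summary can be sanity-checked without loading the tokenizer. The following is a minimal sketch in plain Python, assuming the pieces are plain substrings with no continuation markers (as in the example output). `group_pieces_by_word` is a hypothetical helper written for illustration and is not part of TR Tokenizer.

```python
# Sentence and pieces copied from the Tokenizer Summary example.
sentence = "akademisyenler ve aileleri ile birlikte aktif çalışıyorlar"
pieces = ['akademisyen', 'ler', 've', 'aile', 'leri', 'ile', 'birlikte',
          'aktif', 'çalış', 'ı', 'yor', 'lar']


def group_pieces_by_word(sentence, pieces):
    """Assign consecutive pieces to each whitespace-separated word.

    Hypothetical helper: assumes pieces concatenate exactly to the words,
    in order, with no special prefix or suffix markers.
    """
    remaining = iter(pieces)
    grouped = []
    for word in sentence.split():
        parts, built = [], ""
        while built != word:
            piece = next(remaining, None)
            # Reject missing, empty, or non-matching pieces.
            if not piece or not word.startswith(built + piece):
                raise ValueError(f"pieces do not spell {word!r}")
            parts.append(piece)
            built += piece
        grouped.append((word, parts))
    if next(remaining, None) is not None:
        raise ValueError("unused pieces left over")
    return grouped


grouped = group_pieces_by_word(sentence, pieces)
for word, parts in grouped:
    print(word, "=", " + ".join(parts))
```

Grouping the pieces this way shows which words were kept whole and which were split into a stem followed by suffixes.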
 ## Languages
 This tokenizer focuses on the **Turkish** language and is designed to support Turkish's rich morphological structure.