---
license: mit
datasets:
- lc_quad
---

This repo contains a custom tokenizer for SPARQL queries. The example below compares its output with that of the default T5 tokenizer on the same query.

```
Query: SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}
```

Result from the default T5 tokenizer:
```
['▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{', '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁',
'w', 'd', 't', ':', 'P', '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?', 'X', '▁', 'w', 'd', 't', ':', 'P', '20', '48',
'▁', '?', 'ans', 'wer', '}']
```

Result from this tokenizer:
```
['▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46', '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}']
```
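
To make the difference concrete, the two token lists above can be compared directly. The snippet below is only a small sketch: it does not load either tokenizer, it just copies the lists from this README, checks that both decode back to the same query, and counts the pieces. The names `QUERY`, `T5_TOKENS`, `SPARQL_TOKENS`, `WORD_START`, and `detokenize` are illustrative and not part of this repo.

```python
# Query and token lists copied verbatim from the example above.
QUERY = "SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}"

T5_TOKENS = [
    '▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{',
    '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁',
    'w', 'd', 't', ':', 'P', '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?', 'X',
    '▁', 'w', 'd', 't', ':', 'P', '20', '48',
    '▁', '?', 'ans', 'wer', '}',
]

SPARQL_TOKENS = [
    '▁SELECT', '▁?answer', '▁WHERE', '▁{', '▁wd:Q8', '259', '46',
    '▁wdt:P371', '▁?X', '▁.', '▁?X', '▁wdt:P2048', '▁?answer', '}',
]

# SentencePiece-style marker meaning "a space precedes this piece".
WORD_START = "▁"


def detokenize(tokens):
    """Join the pieces and turn word-start markers back into spaces."""
    return "".join(tokens).replace(WORD_START, " ").strip()


for name, tokens in [("default T5", T5_TOKENS), ("this tokenizer", SPARQL_TOKENS)]:
    # Both segmentations should reconstruct the original query exactly.
    assert detokenize(tokens) == QUERY
    print(f"{name}: {len(tokens)} tokens")
# prints:
# default T5: 49 tokens
# this tokenizer: 14 tokens
```

To regenerate the lists themselves, something along the lines of `AutoTokenizer.from_pretrained(...).tokenize(QUERY)` from the `transformers` library should work, passing this repo's id for the custom tokenizer and a stock T5 checkpoint for the baseline.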