Update README.md
README.md
@@ -4,13 +4,14 @@ datasets:
 - lc_quad
 ---
 
-This repo contains a custom tokenizer for SPARQL. Here is an example.
+This repo contains a custom tokenizer for SPARQL. Here is an example. It is a SentencePieceBPE tokenizer trained on lc_quad.
 
+Original query:
 ```
-
+SELECT ?answer WHERE { wd:Q825946 wdt:P371 ?X . ?X wdt:P2048 ?answer}
 ```
 
-Result from default T5 tokenizer:
+Result from default T5 tokenizer (just as an example):
 ```
 ['▁', 'SEL', 'ECT', '▁', '?', 'ans', 'wer', '▁W', 'HER', 'E', '▁', '{', '▁', 'w', 'd', ':', 'Q', '82', '59', '46', '▁',
 'w', 'd', 't', ':', 'P', '37', '1', '▁', '?', 'X', '▁', '.', '▁', '?', 'X', '▁', 'w', 'd', 't', ':', 'P', '20', '48',