OiQ/daa-tokenizers
OiQ Labs
Tags: Arabic, Latin, tokenizer, moroccan-darija, arabic, bpe, unigram, wordpiece, bbpe, benchmark
License: mit
1 contributor
History: 302 commits
Latest commit: Ouaill, "Upload results/external_datasets_eval.json with huggingface_hub" (765c3cc, verified), 5 days ago
Name | Size | Last commit | Updated
figures/ | - | Upload figures/dataset_comparison.png with huggingface_hub | 5 days ago
plots/ | - | Upload plots/external_comparison.png with huggingface_hub | 12 days ago
results/ | - | Upload results/external_datasets_eval.json with huggingface_hub | 5 days ago
tokenizers/ | - | Upload tokenizers/concat_az_bbpe_55000.json with huggingface_hub | 13 days ago
transformers_tokenizers/ | - | Upload transformers_tokenizers/concat_bbpe_32000_tokenizer_az/tokenizer.json with huggingface_hub | 15 days ago
.gitattributes | 4.34 kB | Upload results/plots/dataset_comparison.png with huggingface_hub | 5 days ago
README.md | 9.58 kB | Upload README.md with huggingface_hub | 6 days ago
benchmark_report.md | 8.47 kB | Upload benchmark_report.md with huggingface_hub | 15 days ago
bootstrap_ci.csv | 3.23 kB | Upload bootstrap_ci.csv with huggingface_hub | 15 days ago
bootstrap_ci_test_set.csv | 3.1 kB | Upload bootstrap_ci_test_set.csv with huggingface_hub | 14 days ago
bootstrap_test_set.py | 6.66 kB | Add bootstrap_test_set.py (test-set-only consistent eval) | 14 days ago
code.md | 13.6 kB | Upload code.md with huggingface_hub | 12 days ago
codeswitch_results.csv | 1.43 kB | Upload codeswitch_results.csv with huggingface_hub | 14 days ago
compare_with_external.py | 10.5 kB | Upload compare_with_external.py with huggingface_hub | 15 days ago
dataset_stats.py | 9.51 kB | Upload dataset_stats.py with huggingface_hub | 5 days ago
doda_independent_results.csv | 1.31 kB | Upload doda_independent_results.csv with huggingface_hub | 14 days ago
eval_all_externals.py | 11.9 kB | Add eval_all_externals.py (12 tokenizer comparison) | 15 days ago
eval_and_compare.py | 11.5 kB | Upload eval_and_compare.py with huggingface_hub | 15 days ago
eval_codeswitch_and_new_baselines.py | 15.1 kB | Upload eval_codeswitch_and_new_baselines.py with huggingface_hub | 6 days ago
eval_darijabert_mix.py | 4.69 kB | Upload eval_darijabert_mix.py with huggingface_hub | 6 days ago
eval_doda_independent.py | 7.72 kB | Upload eval_doda_independent.py with huggingface_hub | 6 days ago
eval_external_datasets.py | 12.8 kB | Upload eval_external_datasets.py with huggingface_hub | 5 days ago
eval_morph_large.py | 11.6 kB | Upload eval_morph_large.py with huggingface_hub | 12 days ago
eval_test_set.py | 8.15 kB | Add eval_test_set.py (test-set-only consistent eval) | 14 days ago
external_comparison.csv | 2.53 kB | Upload external_comparison.csv with huggingface_hub | 13 days ago
external_comparison.json | 5.47 kB | Update external_comparison.json with all 9 external tokenizers | 15 days ago
gen_dataset_figure.py | 4.05 kB | Upload gen_dataset_figure.py with huggingface_hub | 5 days ago
latest_main.pdf | 2.2 MB | Upload latest_main.pdf with huggingface_hub | 6 days ago
latest_main.tex | 49.3 kB | Upload latest_main.tex with huggingface_hub | 6 days ago
morph_large_vocab_results.csv | 1.29 kB | Upload morph_large_vocab_results.csv with huggingface_hub | 12 days ago
script.py | 81 kB | Upload script.py with huggingface_hub | 15 days ago
test_set_results.csv | 11.1 kB | Upload test_set_results.csv with huggingface_hub | 13 days ago
test_set_results.json | 15 kB | Add test_set_results.json (single source of truth for all tables) | 14 days ago
tokenizer_results.csv | 9.19 kB | Upload tokenizer_results.csv with huggingface_hub | 14 days ago
tokenizer_results.json | 20.8 kB | Upload tokenizer_results.json with huggingface_hub | 15 days ago