feat: pin ferrotorch-tokenize parity fixtures v1 (#1168)
f41659a verified
[
"Hello, world!",
"The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. ",
"日本語のテスト 🎉 émoji",
"\n\tindented\n text\n",
"def foo(x):\n return x + 1",
"<|begin_of_text|>Hello<|end_of_text|>",
"[CLS] sentence A [SEP] sentence B [SEP]",
"",
"a",
" leading and trailing ",
"Mixed 123 with NUMBERS 4567 and symbols !@#$%^&*()",
"Newline\n\n\nthree",
"Tab\ttab\ttab",
"Quote \"double\" and 'single' and `backtick`",
"URL: https://example.com/path?query=value&other=1",
"Email: alice@example.com, BOB@FOO.IO",
"中文测试 with English mixed",
"Repeating aaaaaaaaaaaa and bbbbbbbbbbbb",
"Emoji rain 🌈🌈🌈 and stars ✨✨",
"Code: `int main() { return 0; }`"
]