Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"WordPiece tokenization","local":"wordpiece-tokenization","sections":[{"title":"Training algorithm","local":"training-algorithm","sections":[],"depth":2},{"title":"Tokenization algorithm","local":"tokenization-algorithm","sections":[],"depth":2},{"title":"Implementing WordPiece","local":"implementing-wordpiece","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/course/pr_1069/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/entry/start.c5306bb2.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/scheduler.37c15a92.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/singletons.bc78d867.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/index.18351ede.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/paths.76894643.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/entry/app.4264f5f8.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/index.7cb9c9b8.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/nodes/0.f5347c47.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/nodes/73.b6e68db2.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/Tip.d10b3fc9.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/Youtube.8666c400.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/CodeBlock.abae2786.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/CourseFloatingBanner.df82c153.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/en/_app/immutable/chunks/getInferenceSnippets.f9350a3f.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"WordPiece tokenization","local":"wordpiece-tokenization","sections":[{"title":"Training algorithm","local":"training-algorithm","sections":[],"depth":2},{"title":"Tokenization algorithm","local":"tokenization-algorithm","sections":[],"depth":2},{"title":"Implementing WordPiece","local":"implementing-wordpiece","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="wordpiece-tokenization" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#wordpiece-tokenization"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>WordPiece tokenization</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"><a href="https://discuss.huggingface.co/t/chapter-6-questions" target="_blank"><img alt="Ask a Question" class="!m-0" src="https://img.shields.io/badge/Ask%20a%20question-ffcb4c.svg?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgLTEgMTA0IDEwNiI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOiMyMzFmMjA7fS5jbHMtMntmaWxsOiNmZmY5YWU7fS5jbHMtM3tmaWxsOiMwMGFlZWY7fS5jbHMtNHtmaWxsOiMwMGE5NGY7fS5jbHMtNXtmaWxsOiNmMTVkMjI7fS5jbHMtNntmaWxsOiNlMzFiMjM7fTwvc3R5bGU+PC9kZWZzPjx0aXRsZT5EaXNjb3Vyc2VfbG9nbzwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiPjxnIGlkPSJMYXllcl8zIj48cGF0aCBjbGFzcz0iY2xzLTEiIGQ9Ik01MS44NywwQzIzLjcxLDAsMCwyMi44MywwLDUxYzAsLjkxLDAsNTIuODEsMCw1Mi44MWw1MS44Ni0uMDVjMjguMTYsMCw1MS0yMy43MSw1MS01MS44N1M4MCwwLDUxLjg3LDBaIi8+PHBhdGggY2xhc3M9ImNscy0yIiBkPSJNNTIuMzcsMTkuNzRBMzEuNjIsMzEuNjIsMCwwLDAsMjQuNTgsNjYuNDFsLTUuNzIsMTguNEwzOS40LDgwLjE3YTMxLjYxLDMxLjYxLDAsMSwwLDEzLTYwLjQzWiIvPjxwYXRoIGNsYXNzPSJjbHMtMyIgZD0iTTc3LjQ1LDMyLjEyYTMxLjYsMzEuNiwwLDAsMS0zOC4wNSw0OEwxOC44Niw4NC44MmwyMC45MS0yLjQ3QTMxLjYsMzEuNiwwLDAsMCw3Ny40NSwzMi4xMloiLz48cGF0aCBjbGFzcz0iY2xzLTQiIGQ9Ik03MS42MywyNi4yOUEzMS42LDMxLjYsMCwwLDEsMzguOCw3OEwxOC44Niw4NC44MiwzOS40LDgwLjE3QTMxLjYsMzEuNiwwLDAsMCw3MS42MywyNi4yOVoiLz48cGF0aCBjbGFzcz0iY2xzLTUiIGQ9Ik0yNi40Nyw2Ny4xMWEzMS42MSwzMS42MSwwLDAsMSw1MS0zNUEzMS42MSwzMS42MSwwLDAsMCwyNC41OCw2Ni40MWwtNS43MiwxOC40WiIvPjxwYXRoIGNsYXNzPSJjbHMtNiIgZD0iTTI0LjU4LDY2LjQxQTMxLjYxLDMxLjYxLDAsMCwxLDcxLjYzLDI2LjI5YTMxLjYxLDMxLjYxLDAsMCwwLTQ5LDM5LjYzbC0zLjc2LDE4LjlaIi8+PC9nPjwvZz48L3N2Zz4="></a> <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section6.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter6/section6.ipynb" target="_blank"><img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"></a></div> <p data-svelte-h="svelte-1qentm6">WordPiece is the tokenization algorithm Google developed to pretrain BERT. It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNET. It’s very similar to BPE in terms of the training, but the actual tokenization is done differently.</p> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/qpv6ms_t_1A" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-7v3wq0">💡 This section covers WordPiece in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.</p></div> <h2 class="relative group"><a id="training-algorithm" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#training-algorithm"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Training algorithm</span></h2> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-rd0zod">⚠️ Google never open-sourced its implementation of the training algorithm of WordPiece, so what follows is our best guess based on the published literature. It may not be 100% accurate.</p></div> <p data-svelte-h="svelte-103h849">Like BPE, WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like <code>##</code> for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, <code>"word"</code> gets split like this:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->w ##o ##r ##d<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-pvxu6x">Thus, the initial alphabet contains all the characters present at the beginning of a word and the characters present inside a word preceded by the WordPiece prefix.</p> <p>Then, again like BPE, WordPiece learns merge rules. The main difference is the way the pair to be merged is selected. Instead of selecting the most frequent pair, WordPiece computes a score for each pair, using the following formula: | |
| <!-- HTML_TAG_START --><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mrow><mi mathvariant="normal">s</mi><mi mathvariant="normal">c</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">e</mi></mrow><mo>=</mo><mo stretchy="false">(</mo><mrow><mi mathvariant="normal">f</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">q</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">f</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">p</mi><mi mathvariant="normal">a</mi><mi mathvariant="normal">i</mi><mi mathvariant="normal">r</mi></mrow><mo stretchy="false">)</mo><mi mathvariant="normal">/</mi><mo stretchy="false">(</mo><mrow><mi mathvariant="normal">f</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">q</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">f</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">f</mi><mi mathvariant="normal">i</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">s</mi><mi mathvariant="normal">t</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">l</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">m</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">n</mi><mi mathvariant="normal">t</mi></mrow><mo>×</mo><mrow><mi mathvariant="normal">f</mi><mi mathvariant="normal">r</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">q</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">f</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">s</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">c</mi><mi mathvariant="normal">o</mi><mi mathvariant="normal">n</mi><mi mathvariant="normal">d</mi><mi mathvariant="normal">_</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">l</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">m</mi><mi mathvariant="normal">e</mi><mi mathvariant="normal">n</mi><mi mathvariant="normal">t</mi></mrow><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">\mathrm{score} = (\mathrm{freq\_of\_pair}) / (\mathrm{freq\_of\_first\_element} \times \mathrm{freq\_of\_second\_element})</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.4306em;"></span><span class="mord"><span class="mord mathrm">score</span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em;"></span><span class="mopen">(</span><span class="mord"><span class="mord mathrm">freq_of_pair</span></span><span class="mclose">)</span><span class="mord">/</span><span class="mopen">(</span><span class="mord"><span class="mord mathrm">freq_of_first_element</span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1.06em;vertical-align:-0.31em;"></span><span class="mord"><span class="mord mathrm">freq_of_second_element</span></span><span class="mclose">)</span></span></span></span></span><!-- HTML_TAG_END --></p> <p data-svelte-h="svelte-7xl8bf">By dividing the frequency of the pair by the product of the frequencies of each of its parts, the algorithm prioritizes the merging of pairs where the individual parts are less frequent in the vocabulary. For instance, it won’t necessarily merge <code>("un", "##able")</code> even if that pair occurs very frequently in the vocabulary, because the two pairs <code>"un"</code> and <code>"##able"</code> will likely each appear in a lot of other words and have a high frequency. In contrast, a pair like <code>("hu", "##gging")</code> will probably be merged faster (assuming the word “hugging” appears often in the vocabulary) since <code>"hu"</code> and <code>"##gging"</code> are likely to be less frequent individually.</p> <p data-svelte-h="svelte-1reb4z4">Let’s look at the same vocabulary we used in the BPE training example:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"hug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">10</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">12</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"bun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"hugs"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-pb62xd">The splits here will be:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"h"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"h"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#s</span>"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-q0haar">so the initial vocabulary will be <code>["b", "h", "p", "##g", "##n", "##s", "##u"]</code> (if we forget about special tokens for now). The most frequent pair is <code>("##u", "##g")</code> (present 20 times), but the individual frequency of <code>"##u"</code> is very high, so its score is not the highest (it’s 1 / 36). All pairs with a <code>"##u"</code> actually have that same score (1 / 36), so the best score goes to the pair <code>("##g", "##s")</code> — the only one without a <code>"##u"</code> — at 1 / 20, and the first merge learned is <code>("##g", "##s") -> ("##gs")</code>.</p> <p data-svelte-h="svelte-12rhmr5">Note that when we merge, we remove the <code>##</code> between the two tokens, so we add <code>"##gs"</code> to the vocabulary and apply the merge in the words of the corpus:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Vocabulary: [<span class="hljs-string">"b"</span>, <span class="hljs-string">"h"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#s</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#gs</span>"</span>] | |
| Corpus: (<span class="hljs-string">"h"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"h"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#gs</span>"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-151eli3">At this point, <code>"##u"</code> is in all the possible pairs, so they all end up with the same score. Let’s say that in this case, the first pair is merged, so <code>("h", "##u") -> "hu"</code>. This takes us to:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Vocabulary: [<span class="hljs-string">"b"</span>, <span class="hljs-string">"h"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#s</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#gs</span>"</span>, <span class="hljs-string">"hu"</span>] | |
| Corpus: (<span class="hljs-string">"hu"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"hu"</span> <span class="hljs-string">"#<span class="hljs-subst">#gs</span>"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-svusn0">Then the next best score is shared by <code>("hu", "##g")</code> and <code>("hu", "##gs")</code> (with 1/15, compared to 1/21 for all the other pairs), so the first pair with the biggest score is merged:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->Vocabulary: [<span class="hljs-string">"b"</span>, <span class="hljs-string">"h"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#s</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span>, <span class="hljs-string">"#<span class="hljs-subst">#gs</span>"</span>, <span class="hljs-string">"hu"</span>, <span class="hljs-string">"hug"</span>] | |
| Corpus: (<span class="hljs-string">"hug"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#g</span>"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"#<span class="hljs-subst">#u</span>"</span> <span class="hljs-string">"#<span class="hljs-subst">#n</span>"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"hu"</span> <span class="hljs-string">"#<span class="hljs-subst">#gs</span>"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1vaqaxm">and we continue like this until we reach the desired vocabulary size.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-4fg9cy">✏️ <strong>Now your turn!</strong> What will the next merge rule be?</p></div> <h2 class="relative group"><a id="tokenization-algorithm" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#tokenization-algorithm"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Tokenization algorithm</span></h2> <p data-svelte-h="svelte-v4c1z1">Tokenization differs in WordPiece and BPE in that WordPiece only saves the final vocabulary, not the merge rules learned. Starting from the word to tokenize, WordPiece finds the longest subword that is in the vocabulary, then splits on it. For instance, if we use the vocabulary learned in the example above, for the word <code>"hugs"</code> the longest subword starting from the beginning that is inside the vocabulary is <code>"hug"</code>, so we split there and get <code>["hug", "##s"]</code>. We then continue with <code>"##s"</code>, which is in the vocabulary, so the tokenization of <code>"hugs"</code> is <code>["hug", "##s"]</code>.</p> <p data-svelte-h="svelte-x3syme">With BPE, we would have applied the merges learned in order and tokenized this as <code>["hu", "##gs"]</code>, so the encoding is different.</p> <p data-svelte-h="svelte-gkhun5">As another example, let’s see how the word <code>"bugs"</code> would be tokenized. <code>"b"</code> is the longest subword starting at the beginning of the word that is in the vocabulary, so we split there and get <code>["b", "##ugs"]</code>. Then <code>"##u"</code> is the longest subword starting at the beginning of <code>"##ugs"</code> that is in the vocabulary, so we split there and get <code>["b", "##u, "##gs"]</code>. Finally, <code>"##gs"</code> is in the vocabulary, so this last list is the tokenization of <code>"bugs"</code>.</p> <p data-svelte-h="svelte-19tvv0p">When the tokenization gets to a stage where it’s not possible to find a subword in the vocabulary, the whole word is tokenized as unknown — so, for instance, <code>"mug"</code> would be tokenized as <code>["[UNK]"]</code>, as would <code>"bum"</code> (even if we can begin with <code>"b"</code> and <code>"##u"</code>, <code>"##m"</code> is not the vocabulary, and the resulting tokenization will just be <code>["[UNK]"]</code>, not <code>["b", "##u", "[UNK]"]</code>). This is another difference from BPE, which would only classify the individual characters not in the vocabulary as unknown.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1r124bw">✏️ <strong>Now your turn!</strong> How will the word <code>"pugs"</code> be tokenized?</p></div> <h2 class="relative group"><a id="implementing-wordpiece" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#implementing-wordpiece"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Implementing WordPiece</span></h2> <p data-svelte-h="svelte-148jv4w">Now let’s take a look at an implementation of the WordPiece algorithm. Like with BPE, this is just pedagogical, and you won’t able to use this on a big corpus.</p> <p data-svelte-h="svelte-j22yiv">We will use the same corpus as in the BPE example:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->corpus = [ | |
| <span class="hljs-string">"This is the Hugging Face Course."</span>, | |
| <span class="hljs-string">"This chapter is about tokenization."</span>, | |
| <span class="hljs-string">"This section shows several tokenizer algorithms."</span>, | |
| <span class="hljs-string">"Hopefully, you will be able to understand how they are trained and generate tokens."</span>, | |
| ]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1kypsn8">First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the <code>bert-base-cased</code> tokenizer for the pre-tokenization:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"bert-base-cased"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1piuede">Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict | |
| word_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> corpus: | |
| words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text) | |
| new_words = [word <span class="hljs-keyword">for</span> word, offset <span class="hljs-keyword">in</span> words_with_offsets] | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> new_words: | |
| word_freqs[word] += <span class="hljs-number">1</span> | |
| word_freqs<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->defaultdict( | |
| <span class="hljs-built_in">int</span>, {<span class="hljs-string">'This'</span>: <span class="hljs-number">3</span>, <span class="hljs-string">'is'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'the'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Hugging'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Face'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Course'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'.'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'chapter'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'about'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">'tokenization'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'section'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'shows'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'several'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'tokenizer'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'algorithms'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Hopefully'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">','</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'you'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'will'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'be'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'able'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'to'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'understand'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'how'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'they'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'are'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">'trained'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'and'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'generate'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'tokens'</span>: <span class="hljs-number">1</span>})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-g6680z">As we saw before, the alphabet is the unique set composed of all the first letters of words, and all the other letters that appear in words prefixed by <code>##</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->alphabet = [] | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> word_freqs.keys(): | |
| <span class="hljs-keyword">if</span> word[<span class="hljs-number">0</span>] <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> alphabet: | |
| alphabet.append(word[<span class="hljs-number">0</span>]) | |
| <span class="hljs-keyword">for</span> letter <span class="hljs-keyword">in</span> word[<span class="hljs-number">1</span>:]: | |
| <span class="hljs-keyword">if</span> <span class="hljs-string">f"##<span class="hljs-subst">{letter}</span>"</span> <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> alphabet: | |
| alphabet.append(<span class="hljs-string">f"##<span class="hljs-subst">{letter}</span>"</span>) | |
| alphabet.sort() | |
| alphabet | |
| <span class="hljs-built_in">print</span>(alphabet)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'##a'</span>, <span class="hljs-string">'##b'</span>, <span class="hljs-string">'##c'</span>, <span class="hljs-string">'##d'</span>, <span class="hljs-string">'##e'</span>, <span class="hljs-string">'##f'</span>, <span class="hljs-string">'##g'</span>, <span class="hljs-string">'##h'</span>, <span class="hljs-string">'##i'</span>, <span class="hljs-string">'##k'</span>, <span class="hljs-string">'##l'</span>, <span class="hljs-string">'##m'</span>, <span class="hljs-string">'##n'</span>, <span class="hljs-string">'##o'</span>, <span class="hljs-string">'##p'</span>, <span class="hljs-string">'##r'</span>, <span class="hljs-string">'##s'</span>, | |
| <span class="hljs-string">'##t'</span>, <span class="hljs-string">'##u'</span>, <span class="hljs-string">'##v'</span>, <span class="hljs-string">'##w'</span>, <span class="hljs-string">'##y'</span>, <span class="hljs-string">'##z'</span>, <span class="hljs-string">','</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'H'</span>, <span class="hljs-string">'T'</span>, <span class="hljs-string">'a'</span>, <span class="hljs-string">'b'</span>, <span class="hljs-string">'c'</span>, <span class="hljs-string">'g'</span>, <span class="hljs-string">'h'</span>, <span class="hljs-string">'i'</span>, <span class="hljs-string">'s'</span>, <span class="hljs-string">'t'</span>, <span class="hljs-string">'u'</span>, | |
| <span class="hljs-string">'w'</span>, <span class="hljs-string">'y'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-9qkygh">We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it’s the list <code>["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->vocab = [<span class="hljs-string">"[PAD]"</span>, <span class="hljs-string">"[UNK]"</span>, <span class="hljs-string">"[CLS]"</span>, <span class="hljs-string">"[SEP]"</span>, <span class="hljs-string">"[MASK]"</span>] + alphabet.copy()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1k9a5j8">Next we need to split each word, with all the letters that are not the first prefixed by <code>##</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->splits = { | |
| word: [c <span class="hljs-keyword">if</span> i == <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">f"##<span class="hljs-subst">{c}</span>"</span> <span class="hljs-keyword">for</span> i, c <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(word)] | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> word_freqs.keys() | |
| }<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mtlptv">Now that we are ready for training, let’s write a function that computes the score of each pair. We’ll need to use this at each step of the training:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">compute_pair_scores</span>(<span class="hljs-params">splits</span>): | |
| letter_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| pair_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| <span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> word_freqs.items(): | |
| split = splits[word] | |
| <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(split) == <span class="hljs-number">1</span>: | |
| letter_freqs[split[<span class="hljs-number">0</span>]] += freq | |
| <span class="hljs-keyword">continue</span> | |
| <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(split) - <span class="hljs-number">1</span>): | |
| pair = (split[i], split[i + <span class="hljs-number">1</span>]) | |
| letter_freqs[split[i]] += freq | |
| pair_freqs[pair] += freq | |
| letter_freqs[split[-<span class="hljs-number">1</span>]] += freq | |
| scores = { | |
| pair: freq / (letter_freqs[pair[<span class="hljs-number">0</span>]] * letter_freqs[pair[<span class="hljs-number">1</span>]]) | |
| <span class="hljs-keyword">for</span> pair, freq <span class="hljs-keyword">in</span> pair_freqs.items() | |
| } | |
| <span class="hljs-keyword">return</span> scores<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-h8brnl">Let’s have a look at a part of this dictionary after the initial splits:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pair_scores = compute_pair_scores(splits) | |
| <span class="hljs-keyword">for</span> i, key <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(pair_scores.keys()): | |
| <span class="hljs-built_in">print</span>(<span class="hljs-string">f"<span class="hljs-subst">{key}</span>: <span class="hljs-subst">{pair_scores[key]}</span>"</span>) | |
| <span class="hljs-keyword">if</span> i >= <span class="hljs-number">5</span>: | |
| <span class="hljs-keyword">break</span><!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">'T'</span>, <span class="hljs-string">'##h'</span>): <span class="hljs-number">0.125</span> | |
| (<span class="hljs-string">'##h'</span>, <span class="hljs-string">'##i'</span>): <span class="hljs-number">0.03409090909090909</span> | |
| (<span class="hljs-string">'##i'</span>, <span class="hljs-string">'##s'</span>): <span class="hljs-number">0.02727272727272727</span> | |
| (<span class="hljs-string">'i'</span>, <span class="hljs-string">'##s'</span>): <span class="hljs-number">0.1</span> | |
| (<span class="hljs-string">'t'</span>, <span class="hljs-string">'##h'</span>): <span class="hljs-number">0.03571428571428571</span> | |
| (<span class="hljs-string">'##h'</span>, <span class="hljs-string">'##e'</span>): <span class="hljs-number">0.011904761904761904</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-drj1oh">Now, finding the pair with the best score only takes a quick loop:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->best_pair = <span class="hljs-string">""</span> | |
| max_score = <span class="hljs-literal">None</span> | |
| <span class="hljs-keyword">for</span> pair, score <span class="hljs-keyword">in</span> pair_scores.items(): | |
| <span class="hljs-keyword">if</span> max_score <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> max_score < score: | |
| best_pair = pair | |
| max_score = score | |
| <span class="hljs-built_in">print</span>(best_pair, max_score)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">'a'</span>, <span class="hljs-string">'##b'</span>) <span class="hljs-number">0.2</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mufquv">So the first merge to learn is <code>('a', '##b') -> 'ab'</code>, and we add <code>'ab'</code> to the vocabulary:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->vocab.append(<span class="hljs-string">"ab"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1pmzgqr">To continue, we need to apply that merge in our <code>splits</code> dictionary. Let’s write another function for this:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">merge_pair</span>(<span class="hljs-params">a, b, splits</span>): | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> word_freqs: | |
| split = splits[word] | |
| <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(split) == <span class="hljs-number">1</span>: | |
| <span class="hljs-keyword">continue</span> | |
| i = <span class="hljs-number">0</span> | |
| <span class="hljs-keyword">while</span> i < <span class="hljs-built_in">len</span>(split) - <span class="hljs-number">1</span>: | |
| <span class="hljs-keyword">if</span> split[i] == a <span class="hljs-keyword">and</span> split[i + <span class="hljs-number">1</span>] == b: | |
| merge = a + b[<span class="hljs-number">2</span>:] <span class="hljs-keyword">if</span> b.startswith(<span class="hljs-string">"##"</span>) <span class="hljs-keyword">else</span> a + b | |
| split = split[:i] + [merge] + split[i + <span class="hljs-number">2</span> :] | |
| <span class="hljs-keyword">else</span>: | |
| i += <span class="hljs-number">1</span> | |
| splits[word] = split | |
| <span class="hljs-keyword">return</span> splits<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-d7zdjw">And we can have a look at the result of the first merge:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->splits = merge_pair(<span class="hljs-string">"a"</span>, <span class="hljs-string">"##b"</span>, splits) | |
| splits[<span class="hljs-string">"about"</span>]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'ab'</span>, <span class="hljs-string">'##o'</span>, <span class="hljs-string">'##u'</span>, <span class="hljs-string">'##t'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-vl065q">Now we have everything we need to loop until we have learned all the merges we want. Let’s aim for a vocab size of 70:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->vocab_size = <span class="hljs-number">70</span> | |
| <span class="hljs-keyword">while</span> <span class="hljs-built_in">len</span>(vocab) < vocab_size: | |
| scores = compute_pair_scores(splits) | |
| best_pair, max_score = <span class="hljs-string">""</span>, <span class="hljs-literal">None</span> | |
| <span class="hljs-keyword">for</span> pair, score <span class="hljs-keyword">in</span> scores.items(): | |
| <span class="hljs-keyword">if</span> max_score <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> max_score < score: | |
| best_pair = pair | |
| max_score = score | |
| splits = merge_pair(*best_pair, splits) | |
| new_token = ( | |
| best_pair[<span class="hljs-number">0</span>] + best_pair[<span class="hljs-number">1</span>][<span class="hljs-number">2</span>:] | |
| <span class="hljs-keyword">if</span> best_pair[<span class="hljs-number">1</span>].startswith(<span class="hljs-string">"##"</span>) | |
| <span class="hljs-keyword">else</span> best_pair[<span class="hljs-number">0</span>] + best_pair[<span class="hljs-number">1</span>] | |
| ) | |
| vocab.append(new_token)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-15nhah9">We can then look at the generated vocabulary:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(vocab)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'[PAD]'</span>, <span class="hljs-string">'[UNK]'</span>, <span class="hljs-string">'[CLS]'</span>, <span class="hljs-string">'[SEP]'</span>, <span class="hljs-string">'[MASK]'</span>, <span class="hljs-string">'##a'</span>, <span class="hljs-string">'##b'</span>, <span class="hljs-string">'##c'</span>, <span class="hljs-string">'##d'</span>, <span class="hljs-string">'##e'</span>, <span class="hljs-string">'##f'</span>, <span class="hljs-string">'##g'</span>, <span class="hljs-string">'##h'</span>, <span class="hljs-string">'##i'</span>, <span class="hljs-string">'##k'</span>, | |
| <span class="hljs-string">'##l'</span>, <span class="hljs-string">'##m'</span>, <span class="hljs-string">'##n'</span>, <span class="hljs-string">'##o'</span>, <span class="hljs-string">'##p'</span>, <span class="hljs-string">'##r'</span>, <span class="hljs-string">'##s'</span>, <span class="hljs-string">'##t'</span>, <span class="hljs-string">'##u'</span>, <span class="hljs-string">'##v'</span>, <span class="hljs-string">'##w'</span>, <span class="hljs-string">'##y'</span>, <span class="hljs-string">'##z'</span>, <span class="hljs-string">','</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'H'</span>, | |
| <span class="hljs-string">'T'</span>, <span class="hljs-string">'a'</span>, <span class="hljs-string">'b'</span>, <span class="hljs-string">'c'</span>, <span class="hljs-string">'g'</span>, <span class="hljs-string">'h'</span>, <span class="hljs-string">'i'</span>, <span class="hljs-string">'s'</span>, <span class="hljs-string">'t'</span>, <span class="hljs-string">'u'</span>, <span class="hljs-string">'w'</span>, <span class="hljs-string">'y'</span>, <span class="hljs-string">'ab'</span>, <span class="hljs-string">'##fu'</span>, <span class="hljs-string">'Fa'</span>, <span class="hljs-string">'Fac'</span>, <span class="hljs-string">'##ct'</span>, <span class="hljs-string">'##ful'</span>, <span class="hljs-string">'##full'</span>, <span class="hljs-string">'##fully'</span>, | |
| <span class="hljs-string">'Th'</span>, <span class="hljs-string">'ch'</span>, <span class="hljs-string">'##hm'</span>, <span class="hljs-string">'cha'</span>, <span class="hljs-string">'chap'</span>, <span class="hljs-string">'chapt'</span>, <span class="hljs-string">'##thm'</span>, <span class="hljs-string">'Hu'</span>, <span class="hljs-string">'Hug'</span>, <span class="hljs-string">'Hugg'</span>, <span class="hljs-string">'sh'</span>, <span class="hljs-string">'th'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'##thms'</span>, <span class="hljs-string">'##za'</span>, <span class="hljs-string">'##zat'</span>, | |
| <span class="hljs-string">'##ut'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-15fyq8b">As we can see, compared to BPE, this tokenizer learns parts of words as tokens a bit faster.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-166hjxq">💡 Using <code>train_new_from_iterator()</code> on the same corpus won’t result in the exact same vocabulary. This is because the 🤗 Tokenizers library does not implement WordPiece for the training (since we are not completely sure of its internals), but uses BPE instead.</p></div> <p data-svelte-h="svelte-1pemxmn">To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word. That is, we look for the biggest subword starting at the beginning of the first word and split it, then we repeat the process on the second part, and so on for the rest of that word and the following words in the text:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">encode_word</span>(<span class="hljs-params">word</span>): | |
| tokens = [] | |
| <span class="hljs-keyword">while</span> <span class="hljs-built_in">len</span>(word) > <span class="hljs-number">0</span>: | |
| i = <span class="hljs-built_in">len</span>(word) | |
| <span class="hljs-keyword">while</span> i > <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> word[:i] <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> vocab: | |
| i -= <span class="hljs-number">1</span> | |
| <span class="hljs-keyword">if</span> i == <span class="hljs-number">0</span>: | |
| <span class="hljs-keyword">return</span> [<span class="hljs-string">"[UNK]"</span>] | |
| tokens.append(word[:i]) | |
| word = word[i:] | |
| <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(word) > <span class="hljs-number">0</span>: | |
| word = <span class="hljs-string">f"##<span class="hljs-subst">{word}</span>"</span> | |
| <span class="hljs-keyword">return</span> tokens<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1hqbv66">Let’s test it on one word that’s in the vocabulary, and another that isn’t:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(encode_word(<span class="hljs-string">"Hugging"</span>)) | |
| <span class="hljs-built_in">print</span>(encode_word(<span class="hljs-string">"HOgging"</span>))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'Hugg'</span>, <span class="hljs-string">'##i'</span>, <span class="hljs-string">'##n'</span>, <span class="hljs-string">'##g'</span>] | |
| [<span class="hljs-string">'[UNK]'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1euf4wm">Now, let’s write a function that tokenizes a text:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize</span>(<span class="hljs-params">text</span>): | |
| pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text) | |
| pre_tokenized_text = [word <span class="hljs-keyword">for</span> word, offset <span class="hljs-keyword">in</span> pre_tokenize_result] | |
| encoded_words = [encode_word(word) <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> pre_tokenized_text] | |
| <span class="hljs-keyword">return</span> <span class="hljs-built_in">sum</span>(encoded_words, [])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-mvn7u4">We can try it on any text:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->tokenize(<span class="hljs-string">"This is the Hugging Face course!"</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'Th'</span>, <span class="hljs-string">'##i'</span>, <span class="hljs-string">'##s'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'th'</span>, <span class="hljs-string">'##e'</span>, <span class="hljs-string">'Hugg'</span>, <span class="hljs-string">'##i'</span>, <span class="hljs-string">'##n'</span>, <span class="hljs-string">'##g'</span>, <span class="hljs-string">'Fac'</span>, <span class="hljs-string">'##e'</span>, <span class="hljs-string">'c'</span>, <span class="hljs-string">'##o'</span>, <span class="hljs-string">'##u'</span>, <span class="hljs-string">'##r'</span>, <span class="hljs-string">'##s'</span>, | |
| <span class="hljs-string">'##e'</span>, <span class="hljs-string">'[UNK]'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1acligk">That’s it for the WordPiece algorithm! Now let’s take a look at Unigram.</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/course/blob/main/chapters/en/chapter6/6.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_1y0degu = { | |
| assets: "/docs/course/pr_1069/en", | |
| base: "/docs/course/pr_1069/en", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/course/pr_1069/en/_app/immutable/entry/start.c5306bb2.js"), | |
| import("/docs/course/pr_1069/en/_app/immutable/entry/app.4264f5f8.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 73], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 93.8 kB
- Xet hash:
- a1818f4824c540172f3175535c63ff28de37a622508013b6c3111c1a7f788b73
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.