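| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Unigram Tokenization&quot;,&quot;local&quot;:&quot;unigram-tokenization&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Training Algorithm&quot;,&quot;local&quot;:&quot;training-algorithm&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Tokenization Algorithm&quot;,&quot;local&quot;:&quot;tokenization-algorithm&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Back to Training&quot;,&quot;local&quot;:&quot;back-to-training&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Implementing Unigram&quot;,&quot;local&quot;:&quot;implementing-unigram&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Glossary&quot;,&quot;local&quot;:&quot;ဝဟရ-ရငလငခက-glossary&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"> | |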
| <link href="/docs/course/pr_1114/my/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/entry/start.14794ee9.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/scheduler.893fe8c9.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/singletons.10fda3ce.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/index.bce52c8a.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/paths.89c82153.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/entry/app.a133f5c6.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/preload-helper.b1a719fd.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/index.b1df2166.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/nodes/0.510afdc1.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/nodes/52.00394302.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/MermaidChart.svelte_svelte_type_style_lang.762ed9cc.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/Youtube.ec5d7916.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/CodeBlock.6cef0479.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1114/my/_app/immutable/chunks/CourseFloatingBanner.c1c08878.js"> <h1 class="relative group"><a id="unigram-tokenization" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#unigram-tokenization"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Unigram Tokenization</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0" style=""><a href="https://discuss.huggingface.co/t/chapter-6-questions" target="_blank"><img alt="Ask a Question" class="!m-0" 
src="https://img.shields.io/badge/Ask%20a%20question-ffcb4c.svg?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgLTEgMTA0IDEwNiI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOiMyMzFmMjA7fS5jbHMtMntmaWxsOiNmZmY5YWU7fS5jbHMtM3tmaWxsOiMwMGFlZWY7fS5jbHMtNHtmaWxsOiMwMGE5NGY7fS5jbHMtNXtmaWxsOiNmMTVkMjI7fS5jbHMtNntmaWxsOiNlMzFiMjM7fTwvc3R5bGU+PC9kZWZzPjx0aXRsZT5EaXNjb3Vyc2VfbG9nbzwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiPjxnIGlkPSJMYXllcl8zIj48cGF0aCBjbGFzcz0iY2xzLTEiIGQ9Ik01MS44NywwQzIzLjcxLDAsMCwyMi44MywwLDUxYzAsLjkxLDAsNTIuODEsMCw1Mi44MWw1MS44Ni0uMDVjMjguMTYsMCw1MS0yMy43MSw1MS01MS44N1M4MCwwLDUxLjg3LDBaIi8+PHBhdGggY2xhc3M9ImNscy0yIiBkPSJNNTIuMzcsMTkuNzRBMzEuNjIsMzEuNjIsMCwwLDAsMjQuNTgsNjYuNDFsLTUuNzIsMTguNEwzOS40LDgwLjE3YTMxLjYxLDMxLjYxLDAsMSwwLDEzLTYwLjQzWiIvPjxwYXRoIGNsYXNzPSJjbHMtMyIgZD0iTTc3LjQ1LDMyLjEyYTMxLjYsMzEuNiwwLDAsMS0zOC4wNSw0OEwxOC44Niw4NC44MmwyMC45MS0yLjQ3QTMxLjYsMzEuNiwwLDAsMCw3Ny40NSwzMi4xMloiLz48cGF0aCBjbGFzcz0iY2xzLTQiIGQ9Ik03MS42MywyNi4yOUEzMS42LDMxLjYsMCwwLDEsMzguOCw3OEwxOC44Niw4NC44MiwzOS40LDgwLjE3QTMxLjYsMzEuNiwwLDAsMCw3MS42MywyNi4yOVoiLz48cGF0aCBjbGFzcz0iY2xzLTUiIGQ9Ik0yNi40Nyw2Ny4xMWEzMS42MSwzMS42MSwwLDAsMSw1MS0zNUEzMS42MSwzMS42MSwwLDAsMCwyNC41OCw2Ni40MWwtNS43MiwxOC40WiIvPjxwYXRoIGNsYXNzPSJjbHMtNiIgZD0iTTI0LjU4LDY2LjQxQTMxLjYxLDMxLjYxLDAsMCwxLDcxLjYzLDI2LjI5YTMxLjYxLDMxLjYxLDAsMCwwLTQ5LDM5LjYzbC0zLjc2LDE4LjlaIi8+PC9nPjwvZz48L3N2Zz4="></a> <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section7.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter6/section7.ipynb" target="_blank"><img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"></a></div>
<p data-svelte-h="svelte-1ec1otk">The Unigram algorithm is used in combination with <a href="https://huggingface.co/papers/1808.06226" rel="nofollow">SentencePiece</a>, which is the tokenization algorithm used by models like ALBERT, T5, mBART, Big Bird, and XLNet.</p>
<p data-svelte-h="svelte-1niyljy">SentencePiece addresses the fact that not all languages use spaces to separate words. Instead, SentencePiece treats the input as a raw input stream and includes the space in the set of characters to use. It can then use the Unigram algorithm to construct the appropriate vocabulary.</p>
<iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/TGZfZVuF9Yc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<blockquote class="tip" data-svelte-h="svelte-140b825"><p>💡 This section covers Unigram in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.</p></blockquote>
<h2 class="relative group"><a id="training-algorithm" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#training-algorithm"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Training Algorithm</span></h2>
<p data-svelte-h="svelte-qb1qy0">Compared to BPE and WordPiece, Unigram works in the other direction: it starts from a large vocabulary and removes tokens from it until it reaches the desired vocabulary size. There are several ways to build that base vocabulary: for instance, we can take the most common substrings in pre-tokenized words, or apply BPE on the initial corpus with a large vocabulary size.</p>
<p data-svelte-h="svelte-12s4im2">At each step of the training, the Unigram algorithm computes a loss over the whole corpus given the current vocabulary. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if that symbol were removed, and looks for the symbols that would increase it the least. Those symbols have the smallest effect on the overall loss over the corpus, so in a sense they are "less needed" and are the best candidates for removal.</p>
<p>This is all a very costly operation, so we don't just remove the single symbol associated with the lowest loss increase, but the <!-- HTML_TAG_START --><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi></mrow><annotation encoding="application/x-tex">p</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.625em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span></span></span></span><!-- HTML_TAG_END --> percent of the symbols associated with the lowest loss increase, where <em>p</em> is a hyperparameter you control (usually 10 or 20). This process is then repeated until the vocabulary has reached the desired size.</p>
<p data-svelte-h="svelte-enwyi9">Note that we never remove the base characters, to make sure any word can be tokenized.</p>
<p data-svelte-h="svelte-110lprh">Now, this is still a bit vague: the main part of the algorithm is to compute a loss over the corpus and see how it changes when we remove some tokens from the vocabulary, but we haven't explained how to do this yet. This step relies on the tokenization algorithm of a Unigram model, so we'll dive into that next.</p>
<p data-svelte-h="svelte-mp3e7p">We'll reuse the corpus from the previous examples:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"hug"</span><span 
class="hljs-punctuation">,</span> <span class="hljs-number">10</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">12</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"bun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"hugs"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-ri4331">For this example, we will take all the strict substrings as the initial vocabulary:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-selector-attr">[<span class="hljs-string">"h"</span>, <span class="hljs-string">"u"</span>, <span class="hljs-string">"g"</span>, <span class="hljs-string">"hu"</span>, <span class="hljs-string">"ug"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"pu"</span>, <span class="hljs-string">"n"</span>, <span class="hljs-string">"un"</span>, <span class="hljs-string">"b"</span>, <span class="hljs-string">"bu"</span>, <span class="hljs-string">"s"</span>, <span class="hljs-string">"hug"</span>, <span class="hljs-string">"gs"</span>, <span class="hljs-string">"ugs"</span>]</span><!-- HTML_TAG_END --></pre></div>
<h2 class="relative group"><a id="tokenization-algorithm" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#tokenization-algorithm"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Tokenization Algorithm</span></h2>
<p data-svelte-h="svelte-77e8uw">A Unigram model is a type of language model that considers each token to be independent of the tokens before it. It is the simplest language model, in the sense that the probability of token X given the previous context is just the probability of token X. So, if we used a Unigram language model to generate text, we would always predict the most common token.</p>
<p data-svelte-h="svelte-xq0bs5">The probability of a given token is its frequency (the number of times we find it) in the original corpus, divided by the sum of the frequencies of all tokens in the vocabulary (to make sure the probabilities sum to 1). For instance, <code>"ug"</code> is present in <code>"hug"</code>, <code>"pug"</code>, and <code>"hugs"</code>, so it has a frequency of 20 in our corpus.</p>
<p data-svelte-h="svelte-9rnivu">Here are the frequencies of all the possible subwords in the vocabulary:</p>
<div class="code-block relative "><pre 
class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"h"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">15</span>) (<span class="hljs-string">"u"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">36</span>) (<span class="hljs-string">"g"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">20</span>) (<span class="hljs-string">"hu"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">15</span>) (<span class="hljs-string">"ug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">20</span>) (<span class="hljs-string">"p"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">17</span>) (<span class="hljs-string">"pu"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">17</span>) (<span class="hljs-string">"n"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">16</span>)
(<span class="hljs-string">"un"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">16</span>) (<span class="hljs-string">"b"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>) (<span class="hljs-string">"bu"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>) (<span class="hljs-string">"s"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>) (<span class="hljs-string">"hug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">15</span>) (<span class="hljs-string">"gs"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>) (<span class="hljs-string">"ugs"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-1m4zxej">So, the sum of all frequencies is 210, and the probability of the subword <code>"ug"</code> is thus 20/210.</p>
<blockquote class="tip" data-svelte-h="svelte-1glvnsj"><p>✏️ <strong>Your turn!</strong> Write the code to compute the frequencies above and double-check that the results shown are correct, as well as the total sum.</p></blockquote>
<p>Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probabilities of the individual tokens. For instance, the tokenization <code data-svelte-h="svelte-1n2m4po">["p", "u", "g"]</code> of <code data-svelte-h="svelte-1gjdq76">"pug"</code> has the probability:
<!-- HTML_TAG_START --><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mo stretchy="false">[</mo><mi mathvariant="normal">"</mi><mi>p</mi><mi mathvariant="normal">"</mi><mo separator="true">,</mo><mi mathvariant="normal">"</mi><mi>u</mi><mi mathvariant="normal">"</mi><mo separator="true">,</mo><mi mathvariant="normal">"</mi><mi>g</mi><mi mathvariant="normal">"</mi><mo stretchy="false">]</mo><mo stretchy="false">)</mo><mo>=</mo><mi>P</mi><mo stretchy="false">(</mo><mi mathvariant="normal">"</mi><mi>p</mi><mi mathvariant="normal">"</mi><mo stretchy="false">)</mo><mo>×</mo><mi>P</mi><mo stretchy="false">(</mo><mi mathvariant="normal">"</mi><mi>u</mi><mi mathvariant="normal">"</mi><mo stretchy="false">)</mo><mo>×</mo><mi>P</mi><mo stretchy="false">(</mo><mi mathvariant="normal">"</mi><mi>g</mi><mi mathvariant="normal">"</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mn>17</mn><mn>210</mn></mfrac><mo>×</mo><mfrac><mn>36</mn><mn>210</mn></mfrac><mo>×</mo><mfrac><mn>20</mn><mn>210</mn></mfrac><mo>=</mo><mn>0.001322</mn></mrow><annotation encoding="application/x-tex">P(["p", "u", "g"]) = P("p") \times P("u") \times P("g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">([</span><span class="mord">"</span><span class="mord mathnormal">p</span><span class="mord">"</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">"</span><span class="mord mathnormal">u</span><span class="mord">"</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">"</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord">"</span><span class="mclose">])</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord">"</span><span class="mord mathnormal">p</span><span class="mord">"</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord">"</span><span class="mord mathnormal">u</span><span class="mord">"</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord">"</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord">"</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">210</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">17</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">210</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">36</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">210</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">20</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.001322</span></span></span></span></span><!-- HTML_TAG_END --></p>
<p>Comparatively, the tokenization <code data-svelte-h="svelte-42m5r0">["pu", "g"]</code> has the probability:
| <!-- HTML_TAG_START --><span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>P</mi><mo stretchy="false">(</mo><mo stretchy="false">[</mo><mi mathvariant="normal">‘</mi><mi mathvariant="normal">‘</mi><mi>p</mi><mi>u</mi><mi mathvariant="normal">"</mi><mo separator="true">,</mo><mi mathvariant="normal">‘</mi><mi mathvariant="normal">‘</mi><mi>g</mi><mi mathvariant="normal">"</mi><mo stretchy="false">]</mo><mo stretchy="false">)</mo><mo>=</mo><mi>P</mi><mo stretchy="false">(</mo><mi mathvariant="normal">‘</mi><mi mathvariant="normal">‘</mi><mi>p</mi><mi>u</mi><mi mathvariant="normal">"</mi><mo stretchy="false">)</mo><mo>×</mo><mi>P</mi><mo stretchy="false">(</mo><mi mathvariant="normal">‘</mi><mi mathvariant="normal">‘</mi><mi>g</mi><mi mathvariant="normal">"</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mn>5</mn><mn>210</mn></mfrac><mo>×</mo><mfrac><mn>20</mn><mn>210</mn></mfrac><mo>=</mo><mn>0.0022676</mn></mrow><annotation encoding="application/x-tex">P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">([</span><span class="mord">‘‘</span><span class="mord mathnormal">p</span><span class="mord mathnormal">u</span><span class="mord">"</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord">‘‘</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord">"</span><span class="mclose">])</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span 
class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord">‘‘</span><span class="mord mathnormal">p</span><span class="mord mathnormal">u</span><span class="mord">"</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="mopen">(</span><span class="mord">‘‘</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord">"</span><span class="mclose">)</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">210</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">5</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">×</span><span class="mspace" 
style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:2.0074em;vertical-align:-0.686em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.3214em;"><span style="top:-2.314em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">210</span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord">20</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.686em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">0.0022676</span></span></span></span></span><!-- HTML_TAG_END --></p> <p data-svelte-h="svelte-17k816n">So that second tokenization is far more likely. In general, tokenizations with the fewest possible tokens have the highest probability (because the division by 210 is repeated for each token), which matches what we want intuitively: to split a word into as few tokens as possible.</p> <p data-svelte-h="svelte-2d92zz">Tokenizing a word with the Unigram model then means choosing the tokenization with the highest probability. In the example of <code>"pug"</code>, here are the probabilities we would get for each possible segmentation:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button 
class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">"p"</span>, <span class="hljs-string">"u"</span>, <span class="hljs-string">"g"</span>] : 0.000389 | |
| [<span class="hljs-string">"p"</span>, <span class="hljs-string">"ug"</span>] : 0.0022676 | |
| [<span class="hljs-string">"pu"</span>, <span class="hljs-string">"g"</span>] : 0.0022676<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-q7cxcf">So <code>"pug"</code> would be tokenized as <code>["p", "ug"]</code> or <code>["pu", "g"]</code>, since the two segmentations tie (note that in a larger corpus, ties like this will be rare).</p> <p data-svelte-h="svelte-kdnobe">In this case, it was easy to find all the possible segmentations and compute their probabilities, but in general it will be a bit harder. There is a classic algorithm used for this, called the <em>Viterbi algorithm</em>. Essentially, we can build a graph to detect the possible segmentations of a given word: there is a branch from character <em>a</em> to character <em>b</em> if the subword from <em>a</em> to <em>b</em> is in the vocabulary, and that branch is assigned the probability of the subword.</p> <p data-svelte-h="svelte-1ynb9yz">To find the path in that graph with the best score, the Viterbi algorithm determines, for each position in the word, the segmentation with the best score that ends at that position. Since we go from the beginning to the end, that best score can be found by looping through all the subwords ending at the current position and combining each one with the best tokenization score from the position where that subword begins. Then, we just have to unroll the path taken to arrive at the end.</p> <p data-svelte-h="svelte-bydt9v">Let's take a look at an example using our vocabulary and the word <code>"unhug"</code>. For each position, the subwords with the best scores ending there are the following:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 
cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-attribute">Character</span> <span class="hljs-number">0</span> (u): <span class="hljs-string">"u"</span> (score <span class="hljs-number">0</span>.<span class="hljs-number">171429</span>) | |
| <span class="hljs-attribute">Character</span> <span class="hljs-number">1</span> (n): <span class="hljs-string">"un"</span> (score <span class="hljs-number">0</span>.<span class="hljs-number">076191</span>) | |
| <span class="hljs-attribute">Character</span> <span class="hljs-number">2</span> (h): <span class="hljs-string">"un"</span> <span class="hljs-string">"h"</span> (score <span class="hljs-number">0</span>.<span class="hljs-number">005442</span>) | |
| <span class="hljs-attribute">Character</span> <span class="hljs-number">3</span> (u): <span class="hljs-string">"un"</span> <span class="hljs-string">"hu"</span> (score <span class="hljs-number">0</span>.<span class="hljs-number">005442</span>) | |
| <span class="hljs-attribute">Character</span> <span class="hljs-number">4</span> (g): <span class="hljs-string">"un"</span> <span class="hljs-string">"hug"</span> (score <span class="hljs-number">0</span>.<span class="hljs-number">005442</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1acl0d1">Thus <code>"unhug"</code> would be tokenized as <code>["un", "hug"]</code>.</p> <blockquote class="tip" data-svelte-h="svelte-ius2fj"><p>✏️ <strong>Now your turn!</strong> Determine the tokenization of the word <code>"huggun"</code>, and its score.</p></blockquote> <h2 class="relative group"><a id="back-to-training" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#back-to-training"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Back to training</span></h2> <p data-svelte-h="svelte-4931fu">Now that we have seen how tokenization works, we can dive a little deeper into the loss used during training. At any given stage, this loss is computed by tokenizing every word in the corpus, using the current vocabulary and the Unigram model determined by the 
frequencies of each token in the corpus (as seen before).</p> <p data-svelte-h="svelte-ok4dvk">Each word in the corpus has a score, and the loss is the negative log likelihood of those scores: the sum of <code>-log(P(word))</code> over all the words in the corpus, with each word counted as many times as it occurs.</p> <p data-svelte-h="svelte-xwzvj3">Let's go back to our example with the following corpus:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"hug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">10</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pun"</span><span 
class="hljs-punctuation">,</span> <span class="hljs-number">12</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"bun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"hugs"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-iqqwp">The tokenization of each word, with its respective score, is:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-string">"hug"</span>: [<span class="hljs-string">"hug"</span>] <span class="hljs-comment">(score 0.071428)</span> | |
| <span class="hljs-string">"pug"</span>: [<span class="hljs-string">"pu"</span>, <span class="hljs-string">"g"</span>] <span class="hljs-comment">(score 0.007710)</span> | |
| <span class="hljs-string">"pun"</span>: [<span class="hljs-string">"pu"</span>, <span class="hljs-string">"n"</span>] <span class="hljs-comment">(score 0.006168)</span> | |
| <span class="hljs-string">"bun"</span>: [<span class="hljs-string">"bu"</span>, <span class="hljs-string">"n"</span>] <span class="hljs-comment">(score 0.001451)</span> | |
| <span class="hljs-string">"hugs"</span>: [<span class="hljs-string">"hug"</span>, <span class="hljs-string">"s"</span>] <span class="hljs-comment">(score 0.001701)</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-40wkrm">So the loss (computed with the natural logarithm) is:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-attribute">10</span> * (-log(<span class="hljs-number">0</span>.<span class="hljs-number">071428</span>)) + <span class="hljs-number">5</span> * (-log(<span class="hljs-number">0</span>.<span class="hljs-number">007710</span>)) + <span class="hljs-number">12</span> * (-log(<span class="hljs-number">0</span>.<span class="hljs-number">006168</span>)) + <span class="hljs-number">4</span> * (-log(<span class="hljs-number">0</span>.<span class="hljs-number">001451</span>)) + <span class="hljs-number">5</span> 
* (-log(<span class="hljs-number">0</span>.<span class="hljs-number">001701</span>)) = <span class="hljs-number">169</span>.<span class="hljs-number">8</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1x8u81k">Now we need to compute how removing each token affects the loss. This is rather tedious, so we'll just do it for two tokens here and save the whole process for when we have code to help us. In this (very) particular case, the words that use <code>"pu"</code> each have a second, equivalent tokenization: as we saw earlier, <code>"pug"</code> could also be tokenized as <code>["p", "ug"]</code> with the same score. Removing the <code>"pu"</code> token from the vocabulary will therefore give exactly the same loss.</p> <p data-svelte-h="svelte-1eain9w">On the other hand, removing <code>"hug"</code> will make the loss worse, because the tokenizations of <code>"hug"</code> and <code>"hugs"</code> will become:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full 
left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-string">"hug"</span>: [<span class="hljs-string">"hu"</span>, <span class="hljs-string">"g"</span>] <span class="hljs-comment">(score 0.006802)</span> | |
| <span class="hljs-string">"hugs"</span>: [<span class="hljs-string">"hu"</span>, <span class="hljs-string">"gs"</span>] <span class="hljs-comment">(score 0.001701)</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-y3zon5">The score of <code>"hugs"</code> is unchanged, so only the ten occurrences of <code>"hug"</code> matter, and the loss rises by:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->- <span class="hljs-number">10</span> * (<span class="hljs-name">-log</span>(<span class="hljs-number">0.071428</span>)) + <span class="hljs-number">10</span> * (<span class="hljs-name">-log</span>(<span class="hljs-number">0.006802</span>)) = <span class="hljs-number">23.5</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mrg0e6">Therefore, the token <code>"pu"</code> will probably be removed from the vocabulary, but not <code>"hug"</code>.</p> <h2 
class="relative group"><a id="implementing-unigram" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#implementing-unigram"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Implementing Unigram</span></h2> <p data-svelte-h="svelte-1uskoas">Now let's implement everything we've seen so far in code. As with BPE and WordPiece, this is not an efficient implementation of the Unigram algorithm (quite the opposite), but it should help you understand it a bit better.</p> <p data-svelte-h="svelte-iitzql">We will use the same corpus as before as an example:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path 
d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->corpus = [ | |
| <span class="hljs-string">"This is the Hugging Face Course."</span>, | |
| <span class="hljs-string">"This chapter is about tokenization."</span>, | |
| <span class="hljs-string">"This section shows several tokenizer algorithms."</span>, | |
| <span class="hljs-string">"Hopefully, you will be able to understand how they are trained and generate tokens."</span>, | |
| ]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1v6hsdh">This time, we will use <code>xlnet-base-cased</code> as our model:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"xlnet-base-cased"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-12m9nb9">As with BPE and WordPiece, we begin by counting the number of occurrences of each word in the corpus:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict | |
| word_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> corpus: | |
|     words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text) | |
|     new_words = [word <span class="hljs-keyword">for</span> word, offset <span class="hljs-keyword">in</span> words_with_offsets] | |
|     <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> new_words: | |
|         word_freqs[word] += <span class="hljs-number">1</span> | |
| word_freqs<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1chzwhs">Then, we need to initialize our vocabulary to something larger than the vocab size we will want at the end. We have to include all the basic characters (otherwise we won't be able to tokenize every word), but for the bigger substrings we'll only keep the most common ones, so we sort them by frequency:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->char_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| subwords_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| <span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> word_freqs.items(): | |
|     <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(word)): | |
|         char_freqs[word[i]] += freq | |
|         <span class="hljs-comment"># Loop through the subwords of length at least 2</span> | |
|         <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(i + <span class="hljs-number">2</span>, <span class="hljs-built_in">len</span>(word) + <span class="hljs-number">1</span>): | |
|             subwords_freqs[word[i:j]] += freq | |
| <span class="hljs-comment"># Sort subwords by frequency</span> | |
| sorted_subwords = <span class="hljs-built_in">sorted</span>(subwords_freqs.items(), key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>) | |
| sorted_subwords[:<span class="hljs-number">10</span>]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->[(<span class="hljs-string">' t'</span>, <span class="hljs-number">7</span>), (<span class="hljs-string">'is'</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">'er'</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">' a'</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">' to'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">'to'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">'en'</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">' T'</span>, <span class="hljs-number">3</span>), (<span class="hljs-string">' Th'</span>, <span class="hljs-number">3</span>), (<span class="hljs-string">' Thi'</span>, <span class="hljs-number">3</span>)]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1kc7dzs">We group the characters with the best subwords to arrive at an initial vocabulary of size 300:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->token_freqs = <span class="hljs-built_in">list</span>(char_freqs.items()) + sorted_subwords[: <span class="hljs-number">300</span> - <span class="hljs-built_in">len</span>(char_freqs)] | |
| token_freqs = {token: freq <span class="hljs-keyword">for</span> token, freq <span class="hljs-keyword">in</span> token_freqs}<!-- HTML_TAG_END --></pre></div> <blockquote class="tip" data-svelte-h="svelte-i6ed14"><p>💡 SentencePiece uses a more efficient algorithm called Enhanced Suffix Array (ESA) to create the initial vocabulary.</p></blockquote> <p data-svelte-h="svelte-t81q9r">Next, we compute the sum of all frequencies so that we can convert the frequencies into probabilities. For our model we store the negative logarithms of the probabilities (so a lower score means a more likely token), because adding logarithms is more numerically stable than multiplying small numbers, and this will simplify the computation of the loss of the model:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START 
--><span class="hljs-keyword">from</span> math <span class="hljs-keyword">import</span> log | |
| total_sum = <span class="hljs-built_in">sum</span>([freq <span class="hljs-keyword">for</span> token, freq <span class="hljs-keyword">in</span> token_freqs.items()]) | |
| model = {token: -log(freq / total_sum) <span class="hljs-keyword">for</span> token, freq <span class="hljs-keyword">in</span> token_freqs.items()}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-vgwc8p">Now the main function is the one that tokenizes words using the Viterbi algorithm. As we saw before, that algorithm computes the best segmentation of each substring of the word, which we will store in a variable named <code>best_segmentations</code>. We will store one dictionary per position in the word (from 0 to its total length), with two keys: the index of the start of the last token in the best segmentation, and the score of the best segmentation (a sum of negative log probabilities, so lower is better). With the index of the start of the last token, we will be able to retrieve the full segmentation once the list is completely populated.</p> <p data-svelte-h="svelte-9ehsxd">Populating the list is done with just two loops: the main loop goes over each start position, and the second loop tries all substrings beginning at that start position. If the substring is in the vocabulary, we have a new segmentation of the word up to that end position, which we compare to what is already in <code>best_segmentations</code>.</p> <p data-svelte-h="svelte-titbi">Once the main loop is finished, we start from the end and hop from one start position to the next, recording the tokens as we go, until we reach the start of the word:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">encode_word</span>(<span class="hljs-params">word, model</span>): | |
|     best_segmentations = [{<span class="hljs-string">"start"</span>: <span class="hljs-number">0</span>, <span class="hljs-string">"score"</span>: <span class="hljs-number">1</span>}] + [ | |
|         {<span class="hljs-string">"start"</span>: <span class="hljs-literal">None</span>, <span class="hljs-string">"score"</span>: <span class="hljs-literal">None</span>} <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(word)) | |
|     ] | |
|     <span class="hljs-keyword">for</span> start_idx <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(word)): | |
|         <span class="hljs-comment"># This should be properly filled by the previous steps of the loop</span> | |
|         best_score_at_start = best_segmentations[start_idx][<span class="hljs-string">"score"</span>] | |
|         <span class="hljs-keyword">for</span> end_idx <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(start_idx + <span class="hljs-number">1</span>, <span class="hljs-built_in">len</span>(word) + <span class="hljs-number">1</span>): | |
|             token = word[start_idx:end_idx] | |
|             <span class="hljs-keyword">if</span> token <span class="hljs-keyword">in</span> model <span class="hljs-keyword">and</span> best_score_at_start <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>: | |
|                 score = model[token] + best_score_at_start | |
|                 <span class="hljs-comment"># If we have found a better (lower-score) segmentation ending at end_idx, we update</span> | |
|                 <span class="hljs-keyword">if</span> ( | |
|                     best_segmentations[end_idx][<span class="hljs-string">"score"</span>] <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> | |
|                     <span class="hljs-keyword">or</span> best_segmentations[end_idx][<span class="hljs-string">"score"</span>] > score | |
|                 ): | |
|                     best_segmentations[end_idx] = {<span class="hljs-string">"start"</span>: start_idx, <span class="hljs-string">"score"</span>: score} | |
|     segmentation = best_segmentations[-<span class="hljs-number">1</span>] | |
|     <span class="hljs-keyword">if</span> segmentation[<span class="hljs-string">"score"</span>] <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>: | |
|         <span class="hljs-comment"># We did not find a tokenization of the word -> unknown</span> | |
|         <span class="hljs-keyword">return</span> [<span class="hljs-string">"<unk>"</span>], <span class="hljs-literal">None</span> | |
|     score = segmentation[<span class="hljs-string">"score"</span>] | |
|     start = segmentation[<span class="hljs-string">"start"</span>] | |
|     end = <span class="hljs-built_in">len</span>(word) | |
|     tokens = [] | |
|     <span class="hljs-keyword">while</span> start != <span class="hljs-number">0</span>: | |
|         tokens.insert(<span class="hljs-number">0</span>, word[start:end]) | |
|         next_start = best_segmentations[start][<span class="hljs-string">"start"</span>] | |
|         end = start | |
|         start = next_start | |
|     tokens.insert(<span class="hljs-number">0</span>, word[start:end]) | |
|     <span class="hljs-keyword">return</span> tokens, score<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-4usvag">We can already try our initial model on some words:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(encode_word(<span class="hljs-string">"Hopefully"</span>, model)) | |
| <span class="hljs-built_in">print</span>(encode_word(<span class="hljs-string">"This"</span>, model))<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->([<span class="hljs-string">'H'</span>, <span class="hljs-string">'o'</span>, <span class="hljs-string">'p'</span>, <span class="hljs-string">'e'</span>, <span class="hljs-string">'f'</span>, <span class="hljs-string">'u'</span>, <span class="hljs-string">'ll'</span>, <span class="hljs-string">'y'</span>], <span class="hljs-number">41.5157494601402</span>) | |
| ([<span class="hljs-string">'This'</span>], <span class="hljs-number">6.288267030694535</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-11ilz2o">Now it is easy to compute the loss of the model on the corpus:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">compute_loss</span>(<span class="hljs-params">model</span>): | |
|     loss = <span class="hljs-number">0</span> | |
|     <span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> word_freqs.items(): | |
|         _, word_loss = encode_word(word, model) | |
|         loss += freq * word_loss | |
|     <span class="hljs-keyword">return</span> loss<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1n5anfb">We can check that it works on the model we have:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->compute_loss(model)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-number">413.10377642940875</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1a39skm">Computing the scores for each token is not very hard either; we just have to compute the loss for the models obtained by deleting each token:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> copy | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">compute_scores</span>(<span class="hljs-params">model</span>): | |
|     scores = {} | |
|     model_loss = compute_loss(model) | |
|     <span class="hljs-keyword">for</span> token, score <span class="hljs-keyword">in</span> model.items(): | |
|         <span class="hljs-comment"># We always keep tokens of length 1</span> | |
|         <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(token) == <span class="hljs-number">1</span>: | |
|             <span class="hljs-keyword">continue</span> | |
|         model_without_token = copy.deepcopy(model) | |
|         _ = model_without_token.pop(token) | |
|         scores[token] = compute_loss(model_without_token) - model_loss | |
|     <span class="hljs-keyword">return</span> scores<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-s7qzgz">We can try it on a given token:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->scores = compute_scores(model) | |
| <span class="hljs-built_in">print</span>(scores[<span class="hljs-string">"ll"</span>]) | |
| <span class="hljs-built_in">print</span>(scores[<span class="hljs-string">"his"</span>])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1gsnm4v">Since <code>"ll"</code> is used in the tokenization of <code>"Hopefully"</code>, removing it will probably force us to use the token <code>"l"</code> twice instead, so we expect its removal to increase the loss (a positive score). <code>"his"</code> is only used inside the word <code>"This"</code>, which is tokenized as itself, so we expect removing it to leave the loss unchanged (a score of zero). Here are the results:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-number">6.376412403623874</span> | |
| <span class="hljs-number">0.0</span><!-- HTML_TAG_END --></pre></div> <blockquote class="tip" data-svelte-h="svelte-1740c18"><p>💡 This approach is very inefficient, so SentencePiece uses an approximation of the loss of the model without token X: instead of starting from scratch, it just replaces token X by its segmentation in the vocabulary that is left. This way, all the scores can be computed at once, at the same time as the model loss.</p></blockquote> <p data-svelte-h="svelte-1rr8l5x">With all of this in place, the last thing we need to do is add the special tokens used by the model to the vocabulary, then loop until we have pruned enough tokens from the vocabulary to reach our desired size:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->percent_to_remove = <span class="hljs-number">0.1</span> | |
| <span class="hljs-keyword">while</span> <span class="hljs-built_in">len</span>(model) > <span class="hljs-number">100</span>: | |
|     scores = compute_scores(model) | |
|     sorted_scores = <span class="hljs-built_in">sorted</span>(scores.items(), key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>]) | |
|     <span class="hljs-comment"># Remove the percent_to_remove fraction of tokens with the lowest scores.</span> | |
|     <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">int</span>(<span class="hljs-built_in">len</span>(model) * percent_to_remove)): | |
|         _ = token_freqs.pop(sorted_scores[i][<span class="hljs-number">0</span>]) | |
|     total_sum = <span class="hljs-built_in">sum</span>([freq <span class="hljs-keyword">for</span> token, freq <span class="hljs-keyword">in</span> token_freqs.items()]) | |
|     model = {token: -log(freq / total_sum) <span class="hljs-keyword">for</span> token, freq <span class="hljs-keyword">in</span> token_freqs.items()}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1cp6wfh">Then, to tokenize some text, we just need to apply the pre-tokenization and then use our <code>encode_word()</code> function:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize</span>(<span class="hljs-params">text, model</span>): | |
|     words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text) | |
|     pre_tokenized_text = [word <span class="hljs-keyword">for</span> word, offset <span class="hljs-keyword">in</span> words_with_offsets] | |
|     encoded_words = [encode_word(word, model)[<span class="hljs-number">0</span>] <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> pre_tokenized_text] | |
|     <span class="hljs-keyword">return</span> <span class="hljs-built_in">sum</span>(encoded_words, []) | |
| tokenize(<span class="hljs-string">"This is the Hugging Face course."</span>, model)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">' This'</span>, <span class="hljs-string">' is'</span>, <span class="hljs-string">' the'</span>, <span class="hljs-string">' Hugging'</span>, <span class="hljs-string">' Face'</span>, <span class="hljs-string">' '</span>, <span class="hljs-string">'c'</span>, <span class="hljs-string">'ou'</span>, <span class="hljs-string">'r'</span>, <span class="hljs-string">'s'</span>, <span class="hljs-string">'e'</span>, <span class="hljs-string">'.'</span>]<!-- HTML_TAG_END --></pre></div> <blockquote class="tip" data-svelte-h="svelte-aejcqb"><p>The XLNetTokenizer uses SentencePiece, which is why the <code>"_"</code> character is included. To decode with SentencePiece, concatenate all the tokens and replace <code>"_"</code> with a space.</p></blockquote> <p data-svelte-h="svelte-vfbrrx">That's it for Unigram! Hopefully by now you feel like an expert in all things tokenizer. In the next section, we will delve into the building blocks of the 🤗 Tokenizers library and show you how you can use them to build your own tokenizer.</p> <h2 class="relative group"><a id="ဝဟရ-ရငလငခက-glossary" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#ဝဟရ-ရငလငခက-glossary"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Glossary</span></h2> <ul data-svelte-h="svelte-40tn3w"><li><strong>Unigram Algorithm</strong>: A subword tokenization algorithm that starts from a large vocabulary and repeatedly removes the tokens whose removal increases the loss the least.</li> <li><strong>SentencePiece</strong>: An open-source text tokenization library created by Google that works across many languages. It is particularly well suited to languages that do not use spaces to separate words (e.g. Chinese, Japanese).</li> <li><strong>ALBERT</strong>: An AI model that is a lightweight version of BERT.</li> <li><strong>T5</strong>: The Text-to-Text Transfer Transformer model created by Google.</li> <li><strong>mBART</strong>: Multilingual Bidirectional and Auto-Regressive Transformers (a multilingual sequence-to-sequence model).</li> <li><strong>Big Bird</strong>: An efficient variant of the Transformer model for long sequences.</li> <li><strong>XLNet</strong>: An autoregressive Transformer model.</li> <li><strong>Raw Input Stream</strong>: Input data on which no preprocessing has been performed yet.</li> <li><strong>Vocabulary</strong>: The complete set of unique tokens that a tokenizer or model knows and can handle.</li> <li><strong>BPE (Byte-Pair Encoding)</strong>: A subword tokenization algorithm.</li> <li><strong>WordPiece</strong>: A subword tokenization algorithm.</li> <li><strong>Substrings</strong>: Contiguous parts of a string.</li> <li><strong>Pre-tokenized Words</strong>: Words that have been split apart before subword tokenization is applied.</li> <li><strong>Initial Corpus</strong>: The original collection of data used to train a model or tokenizer.</li> <li><strong>Loss</strong>: A value that measures the gap between a model's predictions and the actual labels. For Unigram, it is computed from the negative log probabilities of the tokens used to segment the corpus.</li> <li><strong>Corpus</strong>: A large collection of text (or other data).</li> <li><strong>Symbol</strong>: Refers to a token or subword.</li> <li><strong>Hyperparameter</strong>: A parameter that must be set before model training begins (e.g. learning rate, batch size, percent_to_remove).</li> <li><strong>Base Characters</strong>: The basic characters of a language.</li> <li><strong>Language Model</strong>: An AI model trained to capture the distribution of human language.</li> <li><strong>Probability</strong>: A value expressing how likely something is.</li> <li><strong>Frequency</strong>: The number of times something occurs.</li> <li><strong>Sum of All Frequencies</strong>: The sum of the frequencies of all tokens in the vocabulary.</li> <li><strong>Segmentation</strong>: Splitting a word into subword tokens.</li> <li><strong>Product of Probability</strong>: The value obtained by multiplying probabilities together.</li> <li><strong>Viterbi Algorithm</strong>: A dynamic programming technique used to find the most probable state path (e.g. a tokenization) for a sequence.</li> <li><strong>Graph</strong>: A data structure made up of nodes (vertices) and edges (connections).</li> <li><strong>Subword</strong>: A part of a word.</li> <li><strong>Negative Log Likelihood</strong>: The negative of the logarithm of a probability. It is used as a loss function.</li> <li><strong>XLNetTokenizer</strong>: The tokenizer used for the XLNet model.</li> <li><strong><code>xlnet-base-cased</code></strong>: The checkpoint identifier for the base version of the XLNet model (cased version).</li> <li><strong><code>collections.defaultdict(int)</code></strong>: A Python dictionary type that returns the result of int() (0) as the default value when a missing key is accessed.</li> <li><strong><code>tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)</code></strong>: A method from the 🤗 Tokenizers library that performs pre-tokenization.</li> <li><strong><code>word_freqs</code></strong>: A dictionary that stores the frequencies of the words in the corpus.</li> <li><strong><code>char_freqs</code></strong>: A dictionary that stores the frequencies of the characters in the corpus.</li> <li><strong><code>subwords_freqs</code></strong>: A dictionary that stores the frequencies of the subwords in the corpus.</li> <li><strong><code>lambda x: x[1]</code></strong>: A Python lambda function that returns the value from a key-value pair. It is used as the sort key.</li> <li><strong>Enhanced Suffix Array (ESA)</strong>: initial vocabulary ကို ဖန်တီးရန် 
SentencePiece မှ အသုံးပြုသော algorithm တစ်မျိုး။</li> <li><strong>Numerically Stable</strong>: Floating-point arithmetic ကြောင့် ဖြစ်ပေါ်လာနိုင်သော error များကို လျှော့ချရန် နည်းလမ်း။</li> <li><strong><code>log</code></strong>: Natural logarithm (e base)။</li> <li><strong><code>best_segmentations</code></strong>: Viterbi algorithm တွင် အကောင်းဆုံး segmentations များကို သိမ်းဆည်းထားသော list။</li> <li><strong><code>best_score_at_start</code></strong>: start position တစ်ခုတွင် အကောင်းဆုံး segmentation score။</li> <li><strong><code><unk></code> (Unknown Token)</strong>: vocabulary ထဲမှာ မပါဝင်တဲ့ word တွေအတွက် အစားထိုးအသုံးပြုတဲ့ special token။</li> <li><strong><code>compute_loss(model)</code></strong>: model ရဲ့ loss ကို တွက်ချက်သော function။</li> <li><strong><code>compute_scores(model)</code></strong>: vocabulary ထဲက token တစ်ခုစီကို ဖယ်ရှားလိုက်ရင် loss ဘယ်လောက်ပြောင်းလဲမလဲဆိုတာ တွက်ချက်သော function။</li> <li><strong><code>copy.deepcopy(model)</code></strong>: Python တွင် object တစ်ခု၏ နက်ရှိုင်းသော မိတ္တူ (deep copy) ကို ဖန်တီးခြင်း။</li> <li><strong><code>token_freqs.pop(sorted_scores[i][0])</code></strong>: dictionary မှ key ကို ဖယ်ရှားခြင်း။</li> <li><strong><code>percent_to_remove</code></strong>: training လုပ်နေစဉ် တစ်ကြိမ်တည်းမှာ ဖယ်ရှားမည့် tokens ရာခိုင်နှုန်း။</li> <li><strong><code>tokenize(text, model)</code></strong>: text ကို model အသုံးပြုပြီး tokenize လုပ်သော function။</li> <li><strong><code>sum(encoded_words, [])</code></strong>: list of lists များကို single list တစ်ခုအဖြစ် ပေါင်းစပ်ခြင်း။</li> <li><strong><code>_</code> Character</strong>: SentencePiece တွင် space ကို ကိုယ်စားပြုသော special character။</li></ul> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/course/blob/main/chapters/my/chapter6/7.mdx" target="_blank"><svg class="mr-1" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" 
preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M31,16l-7,7l-1.41-1.41L28.17,16l-5.58-5.59L24,9l7,7z"></path><path d="M1,16l7-7l1.41,1.41L3.83,16l5.58,5.59L8,23l-7-7z"></path><path d="M12.419,25.484L17.639,6.552l1.932,0.518L14.351,26.002z"></path></svg> <span data-svelte-h="svelte-zjs2n5"><span class="underline">Update</span> on GitHub</span></a> <p></p> | |
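<p>The decoding rule stated just before the glossary (concatenate the tokens, then turn the space marker back into a space) can be sketched in a few lines of Python. This is a minimal illustration, not the SentencePiece implementation: the function name <code>decode_tokens</code> and the <code>space_marker</code> parameter are hypothetical, and the marker defaults to <code>"_"</code> as written in this chapter, whereas real SentencePiece vocabularies typically use the character U+2581.</p>

```python
def decode_tokens(tokens, space_marker="_"):
    """Decode SentencePiece-style tokens back into text.

    Hypothetical helper for illustration: it applies only the rule from
    the text (concatenate, then replace the space marker with a space).
    """
    # Step 1: concatenate all the tokens into one string.
    text = "".join(tokens)
    # Step 2: replace every space marker with a real space.
    return text.replace(space_marker, " ")


tokens = ["_This", "_is", "_a", "_to", "ken"]
print(repr(decode_tokens(tokens)))
```

<p>Because the first token usually starts with the marker, the result begins with a space; a caller may choose to strip it, which is a design choice outside the rule described here.</p>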