# Unigram Tokenization[[unigram-tokenization]]

<CourseFloatingBanner chapter={6}
  classNames="absolute z-10 right-0 top-0"
  notebooks={[
    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter6/section7.ipynb"},
    {label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter6/section7.ipynb"},
]} />

The Unigram algorithm is used in combination with [SentencePiece](https://huggingface.co/papers/1808.06226), which is the tokenization algorithm used by models like AlBERT, T5, mBART, Big Bird, and XLNet.

SentencePiece addresses the fact that not all languages use spaces to separate words. Instead, SentencePiece treats the input as a raw input stream and includes the space in the set of characters to use. The Unigram algorithm can then be used to build an appropriate vocabulary.

<Youtube id="TGZfZVuF9Yc"/>

> [!TIP]
> 💡 This section covers Unigram in depth, going as far as showing a full implementation. You can skip to the end if you just want a general overview of the tokenization algorithm.

## Training Algorithm[[training-algorithm]]

Compared to BPE and WordPiece, Unigram works in the other direction: it starts from a large vocabulary and removes tokens from it until it reaches the desired vocabulary size. There are several ways to build that base vocabulary: for instance, we can take the most common substrings in pre-tokenized words, or apply BPE on the initial corpus with a large vocabulary size.

At each step of the training, the Unigram algorithm computes a loss over the whole corpus given the current vocabulary. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if that symbol were removed, and looks for the symbols that would increase it the least. Those symbols have the lowest effect on the overall loss over the corpus, so in a sense they are "less needed" and are the best candidates for removal.

This is a very costly operation, so we don't just remove the single symbol associated with the lowest loss increase, but the $p$ percent of the symbols associated with the lowest loss increase ($p$ being a hyperparameter you can control, usually 10 or 20). This process is then repeated until the vocabulary has reached the desired size.

Note that we never remove the base characters, to make sure any word can be tokenized.
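As a rough illustration of this selection step (a minimal sketch, not SentencePiece's implementation), the function below picks the tokens to drop from a table of loss increases. The name `select_tokens_to_prune` and the numbers in `loss_increase` are made up for the example, and `percent_to_remove` is expressed as a fraction (0.2 for 20 percent).

```python
def select_tokens_to_prune(loss_increase, percent_to_remove=0.1):
    """Return the tokens whose removal would increase the loss the least.

    `loss_increase` maps each token to the increase in corpus loss caused by
    removing it. Single-character tokens are never selected, so that any
    word stays tokenizable.
    """
    candidates = {t: s for t, s in loss_increase.items() if len(t) > 1}
    n_remove = int(len(loss_increase) * percent_to_remove)
    ranked = sorted(candidates, key=candidates.get)  # lowest increase first
    return ranked[:n_remove]


# Hypothetical loss increases, for illustration only
loss_increase = {
    "h": 50.0, "u": 80.0, "g": 40.0, "hu": 0.0, "ug": 2.5,
    "hug": 23.5, "gs": 0.0, "ugs": 1.0, "un": 12.0, "pu": 0.0,
}
pruned = select_tokens_to_prune(loss_increase, percent_to_remove=0.2)
print(pruned)
```

Ties are broken by insertion order here because Python's sort is stable; a real trainer may break them differently.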

Now, this is all still a bit vague: the main part of the algorithm is to compute a loss over the corpus and see how it changes when we remove some tokens from the vocabulary, but we haven't explained how to do this yet. This step relies on the tokenization algorithm of a Unigram model, so we'll look at that next.

We'll reuse the corpus from the previous examples:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

and for this example, we will take all strict substrings for the initial vocabulary:

```
["h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu", "s", "hug", "gs", "ugs"]
```
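The initial vocabulary above can be rebuilt with a few lines of code. This is a minimal sketch that reads "strict substring" as "any contiguous substring of a word other than the word itself"; the helper name `strict_substrings` is ours.

```python
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}


def strict_substrings(word):
    """All contiguous substrings of `word` except `word` itself."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
        if word[i:j] != word
    }


vocab = set()
for word in corpus:
    vocab |= strict_substrings(word)

# Shortest tokens first, then alphabetical
print(sorted(vocab, key=lambda t: (len(t), t)))
```

Note that `"hug"` ends up in the vocabulary only because it is a strict substring of `"hugs"`; `"pug"`, `"pun"`, and `"bun"` are not substrings of any other word, so they are left out.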

## Tokenization Algorithm[[tokenization-algorithm]]

A Unigram model is a type of language model that considers each token to be independent of the tokens before it. It's the simplest language model, in the sense that the probability of token X given the previous context is just the probability of token X. So, if we used a Unigram language model to generate text, we would always predict the most common token.

The probability of a given token is its frequency (the number of times we find it) in the original corpus, divided by the sum of the frequencies of all tokens in the vocabulary (to make sure the probabilities sum to 1). For instance, `"ug"` is present in `"hug"`, `"pug"`, and `"hugs"`, so it has a frequency of 10 + 5 + 5 = 20 in our corpus.

Here are the frequencies of all the possible subwords in the vocabulary:

```
("h", 15) ("u", 36) ("g", 20) ("hu", 15) ("ug", 20) ("p", 17) ("pu", 17) ("n", 16)
("un", 16) ("b", 4) ("bu", 4) ("s", 5) ("hug", 15) ("gs", 5) ("ugs", 5)
```

So, the sum of all frequencies is 210, and the probability of the subword `"ug"` is thus 20/210.

> [!TIP]
> ✏️ **Now your turn!** Write the code to compute the frequencies above and double-check that the results shown are correct, as well as the total sum.

Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probabilities of the individual tokens. For instance, the tokenization `["p", "u", "g"]` of `"pug"` has the probability:

$$P([``p", ``u", ``g"]) = P(``p") \times P(``u") \times P(``g") = \frac{17}{210} \times \frac{36}{210} \times \frac{20}{210} = 0.001322$$

Comparatively, the tokenization `["pu", "g"]` has the probability:

$$P([``pu", ``g"]) = P(``pu") \times P(``g") = \frac{17}{210} \times \frac{20}{210} = 0.007710$$

so that one is much more likely. In general, tokenizations with the fewest tokens possible will have the highest probability (because of the division by 210 repeated for each token), which corresponds to what we want intuitively: to split a word into the smallest possible number of tokens.

The tokenization of a word with the Unigram model is then the tokenization with the highest probability. In the example of `"pug"`, here are the probabilities we would get for each possible segmentation:

```
["p", "u", "g"] : 0.001322
["p", "ug"] : 0.007710
["pu", "g"] : 0.007710
```

So, `"pug"` would be tokenized as `["p", "ug"]` or `["pu", "g"]`, depending on which of those segmentations is encountered first (note that in a larger corpus, ties like this will be rare).
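For a word this short, the brute-force approach can be written directly. The sketch below (helper names are ours) recounts the token frequencies from the toy corpus, enumerates every segmentation of `"pug"`, and multiplies the token probabilities. It assumes a token's frequency is its number of occurrences in each word, weighted by the word count, which matches the table above (for example, `"p"` has frequency 5 + 12 = 17).

```python
from math import prod

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab = ["h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu", "s", "hug", "gs", "ugs"]


def count_occurrences(token, word):
    """Number of positions in `word` at which `token` starts."""
    return sum(word.startswith(token, i) for i in range(len(word)))


# Frequency of a token = occurrences in each word, weighted by the word count
freqs = {
    token: sum(count * count_occurrences(token, word) for word, count in corpus.items())
    for token in vocab
}
total = sum(freqs.values())


def segmentations(word):
    """Every way of splitting `word` into tokens from the vocabulary."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        if word[:end] in freqs:
            results += [[word[:end]] + rest for rest in segmentations(word[end:])]
    return results


def probability(tokens):
    """Unigram probability: product of the individual token probabilities."""
    return prod(freqs[token] / total for token in tokens)


for tokens in segmentations("pug"):
    print(tokens, round(probability(tokens), 6))
```

The number of segmentations grows quickly with word length, which is why a dynamic programming approach is preferred in practice.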

In this case, it was easy to find all the possible segmentations and compute their probabilities, but in general it's going to be a bit harder. There is a classic algorithm used for this, called the *Viterbi algorithm*. Essentially, we can build a graph to detect the possible segmentations of a given word by saying there is a branch from character _a_ to character _b_ if the subword from _a_ to _b_ is in the vocabulary, and attributing to that branch the probability of the subword.

To find the path in that graph that has the best score, the Viterbi algorithm determines, for each position in the word, the segmentation with the best score that ends at that position. Since we go from the beginning to the end, that best score can be found by looping through all subwords ending at the current position and combining each with the best tokenization score from the position where that subword begins. Then, we just have to unroll the path taken to arrive at the end.

Let's take a look at an example using our vocabulary and the word `"unhug"`. For each position, the subwords with the best scores ending there are the following:

```
Character 0 (u): "u" (score 0.171429)
Character 1 (n): "un" (score 0.076190)
Character 2 (h): "un" "h" (score 0.005442)
Character 3 (u): "un" "hu" (score 0.005442)
Character 4 (g): "un" "hug" (score 0.005442)
```

Thus `"unhug"` would be tokenized as `["un", "hug"]`.
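The table above can be reproduced with a compact version of this procedure. This is only a sketch on the toy vocabulary, working directly with probabilities rather than logarithms; the function name `viterbi` and the `(probability, start)` layout of `best` are our own choices.

```python
freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17, "n": 16,
    "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
total = sum(freqs.values())  # 210


def viterbi(word):
    """Return (tokens, probability) for the most likely segmentation of `word`."""
    # best[i] = (probability, start index of the last token) for the prefix word[:i]
    best = [(1.0, 0)] + [(0.0, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            token = word[start:end]
            if token in freqs and best[start][0] > 0:
                score = best[start][0] * freqs[token] / total
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][1] is None:
        return None, 0.0  # no segmentation exists with this vocabulary
    # Walk back from the end of the word to recover the tokens
    tokens, end = [], len(word)
    while end > 0:
        start = best[end][1]
        tokens.insert(0, word[start:end])
        end = start
    return tokens, best[-1][0]


tokens, score = viterbi("unhug")
print(tokens, round(score, 6))
```

Each prefix is scored once per candidate last token, so the cost grows with the square of the word length instead of with the number of segmentations.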

> [!TIP]
> ✏️ **Now your turn!** Determine the tokenization of the word `"huggun"`, and its score.

## Back to Training[[back-to-training]]

Now that we have seen how the tokenization works, we can dive a little more deeply into the loss used during training. At any given stage, this loss is computed by tokenizing every word in the corpus, using the current vocabulary and the Unigram model determined by the frequencies of each token in the corpus (as seen before).

Each word in the corpus has a score, and the loss is the negative log likelihood of those scores: that is, the sum over all the words in the corpus of `-log(P(word))`, with each word counted as many times as it occurs.

Let's go back to our example with the following corpus:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

The tokenization of each word with their respective scores is:

```
"hug": ["hug"] (score 0.071428)
"pug": ["pu", "g"] (score 0.007710)
"pun": ["pu", "n"] (score 0.006168)
"bun": ["bu", "n"] (score 0.001451)
"hugs": ["hug", "s"] (score 0.001701)
```

So the loss is:

```
10 * (-log(0.071428)) + 5 * (-log(0.007710)) + 12 * (-log(0.006168)) + 4 * (-log(0.001451)) + 5 * (-log(0.001701)) = 169.8
```

Now we need to compute how removing each token affects the loss. This is rather tedious, so we'll just do it for two tokens here and save the whole process for when we have code to help us. In this (very) particular case, we had two equivalent tokenizations of all the words: as we saw earlier, for example, `"pug"` could be tokenized as `["p", "ug"]` with the same score. Thus, removing the `"pu"` token from the vocabulary will give the exact same loss.

On the other hand, removing `"hug"` will make the loss worse, because the tokenization of `"hug"` and `"hugs"` will become:

```
"hug": ["hu", "g"] (score 0.006802)
"hugs": ["hu", "gs"] (score 0.001701)
```

These changes will cause the loss to rise by:

```
- 10 * (-log(0.071428)) + 10 * (-log(0.006802)) = 23.5
```

Therefore, the token `"pu"` will probably be removed from the vocabulary, but not `"hug"`.
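These hand computations can be checked with a short script. The sketch below makes the same simplification as the worked example: the token frequencies and their total (210) are kept fixed when a token is removed, and `log` is the natural logarithm. The helper names `word_loss` and `corpus_loss` are ours.

```python
from math import log

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17, "n": 16,
    "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
TOTAL = sum(freqs.values())  # kept fixed at 210, as in the worked example


def word_loss(word, vocab):
    """Smallest -log(P(segmentation)) over all segmentations of `word`."""
    best = [0.0] + [float("inf")] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            token = word[start:end]
            if token in vocab:
                best[end] = min(best[end], best[start] - log(freqs[token] / TOTAL))
    return best[-1]


def corpus_loss(vocab):
    """Negative log likelihood of the corpus, each word weighted by its count."""
    return sum(count * word_loss(word, vocab) for word, count in corpus.items())


full_vocab = set(freqs)
base_loss = corpus_loss(full_vocab)
print(round(base_loss, 1))

increases = {}
for token in ("pu", "hug"):
    increases[token] = corpus_loss(full_vocab - {token}) - base_loss
    print(token, round(increases[token], 1))
```

Removing `"pu"` leaves the loss unchanged because an equally likely segmentation exists for every word that used it, while removing `"hug"` forces `"hug"` into two tokens.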

## Implementing Unigram[[implementing-unigram]]

Now let's implement everything we've seen so far in code. Like with BPE and WordPiece, this is not an efficient implementation of the Unigram algorithm (quite the opposite), but it should help you understand it a bit better.

We will use the same corpus as in the previous sections as an example:

```python
corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]
```

This time, we will use `xlnet-base-cased` as our model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
```

Like for BPE and WordPiece, we begin by counting the number of occurrences of each word in the corpus:

```python
from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs
```
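If you want to follow along without downloading the `xlnet-base-cased` tokenizer, the word counting can be approximated in plain Python. This is a hedged stand-in, not the real pre-tokenizer: it assumes that pre-tokenization here amounts to splitting on whitespace and prefixing each word with SentencePiece's `▁` marker (U+2581), which is consistent with the outputs shown in this section but is not guaranteed to match on every input. The function name `approximate_pre_tokenize` is ours.

```python
from collections import defaultdict

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]


def approximate_pre_tokenize(text):
    """Rough stand-in for the XLNet pre-tokenizer (an assumption, see above)."""
    return ["▁" + word for word in text.split()]


word_freqs = defaultdict(int)
for text in corpus:
    for word in approximate_pre_tokenize(text):
        word_freqs[word] += 1

print(dict(word_freqs))
```

Punctuation stays attached to the preceding word in this approximation, so `"▁Course."` is counted as a single word.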

Then, we need to initialize our vocabulary to something larger than the vocab size we will want at the end. We have to include all the basic characters (otherwise we won't be able to tokenize every word), but for the bigger substrings we'll only keep the most common ones, so we sort them by frequency:

```python
char_freqs = defaultdict(int)
subwords_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq
        # Loop through the subwords of length at least 2
        for j in range(i + 2, len(word) + 1):
            subwords_freqs[word[i:j]] += freq

# Sort subwords by frequency
sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
sorted_subwords[:10]
```

```python out
[('▁t', 7), ('is', 5), ('er', 5), ('▁a', 5), ('▁to', 4), ('to', 4), ('en', 4), ('▁T', 3), ('▁Th', 3), ('▁Thi', 3)]
```

We group the characters with the best subwords to arrive at an initial vocabulary of size 300:

```python
token_freqs = list(char_freqs.items()) + sorted_subwords[: 300 - len(char_freqs)]
token_freqs = {token: freq for token, freq in token_freqs}
```

> [!TIP]
> 💡 SentencePiece uses a more efficient algorithm called Enhanced Suffix Array (ESA) to create the initial vocabulary.

Next, we compute the sum of all frequencies, to convert the frequencies into probabilities. For our model we will store the negative logarithms of the probabilities, because it's more numerically stable to add logarithms than to multiply small numbers, and this will simplify the computation of the loss of the model:

```python
from math import log

total_sum = sum([freq for token, freq in token_freqs.items()])
model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
```
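To see why logarithms are the safer choice, here is a small, self-contained illustration with made-up numbers: multiplying many small probabilities underflows to zero in floating point, while the sum of their negative logarithms stays perfectly representable.

```python
from math import log, prod

# 100 hypothetical tokens, each with a small probability
probabilities = [1e-5] * 100

direct_product = prod(probabilities)  # 1e-500 is too small for a float
negative_log_sum = sum(-log(p) for p in probabilities)  # stays finite

print(direct_product)
print(negative_log_sum)
```

Storing `-log(p)` for each token also means the score of a segmentation is a plain sum, and a lower score means a more likely segmentation.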

Now the main function is the one that tokenizes words using the Viterbi algorithm. As we saw before, that algorithm computes the best segmentation of each substring of the word, which we will store in a variable named `best_segmentations`. We will store one dictionary per position in the word (from 0 to its total length), with two keys: the index of the start of the last token in the best segmentation, and the score of the best segmentation. With the index of the start of the last token, we will be able to retrieve the full segmentation once the list is completely populated.

Populating the list is done with just two loops: the main loop goes over each start position, and the second loop tries all substrings beginning at that start position. If the substring is in the vocabulary, we have a new segmentation of the word up until that end position, which we compare to what is in `best_segmentations`.

Once the main loop is finished, we just start from the end and hop from one start position to the next, recording the tokens as we go, until we reach the start of the word:

```python
def encode_word(word, model):
    best_segmentations = [{"start": 0, "score": 1}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                # Scores are negative log probabilities, so lower is better
                score = model[token] + best_score_at_start
                # If we have found a better segmentation ending at end_idx, we update
                if (
                    best_segmentations[end_idx]["score"] is None
                    or best_segmentations[end_idx]["score"] > score
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}

    segmentation = best_segmentations[-1]
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None

    score = segmentation["score"]
    start = segmentation["start"]
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score
```

We can already try our initial model on some words:

```python
print(encode_word("Hopefully", model))
print(encode_word("This", model))
```

```python out
(['H', 'o', 'p', 'e', 'f', 'u', 'll', 'y'], 41.5157494601402)
(['This'], 6.288267030694535)
```

Now it's easy to compute the loss of the model on the corpus:

```python
def compute_loss(model):
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss
```

We can check it works on the model we have:

```python
compute_loss(model)
```

```python out
413.10377642940875
```

Computing the scores for each token is not very hard either; we just have to compute the loss for the models obtained by deleting each token:

```python
import copy


def compute_scores(model):
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        # We always keep tokens of length 1
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores
```

We can try it on a given token:

```python
scores = compute_scores(model)
print(scores["ll"])
print(scores["his"])
```

Since `"ll"` is used in the tokenization of `"Hopefully"`, and removing it will probably make us use the token `"l"` twice instead, we expect its score (the loss increase) to be positive. `"his"` is only used inside the word `"This"`, which is tokenized as itself, so we expect its score to be zero. Here are the results:

```python out
6.376412403623874
0.0
```

> [!TIP]
> 💡 This approach is very inefficient, so SentencePiece uses an approximation of the loss of the model without token X: instead of starting from scratch, it just replaces token X by its segmentation in the vocabulary that is left. This way, all the scores can be computed at once, at the same time as the model loss.

With all of this in place, the last thing we need to do is add the special tokens used by the model to the vocabulary, then loop until we have pruned enough tokens from the vocabulary to reach our desired size:

```python
percent_to_remove = 0.1
while len(model) > 100:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove the percent_to_remove tokens with the lowest scores
    for i in range(int(len(model) * percent_to_remove)):
        _ = token_freqs.pop(sorted_scores[i][0])

    total_sum = sum([freq for token, freq in token_freqs.items()])
    model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}
```

Then, to tokenize some text, we just need to apply the pre-tokenization and then use our `encode_word()` function:

```python
def tokenize(text, model):
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])


tokenize("This is the Hugging Face course.", model)
```

```python out
['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']
```
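Going the other way, a token list like the one above can be turned back into text with a one-line helper. This is a minimal sketch: it assumes the word-boundary marker in the tokens is SentencePiece's `▁` character (U+2581), the function name `decode` is ours, and stripping the leading space is a choice made for this example.

```python
def decode(tokens):
    """Join the tokens and turn the ▁ word-boundary marker back into spaces."""
    return "".join(tokens).replace("▁", " ").strip()


tokens = ["▁This", "▁is", "▁the", "▁Hugging", "▁Face", "▁", "c", "ou", "r", "s", "e", "."]
print(decode(tokens))
```

Because the space is treated as just another symbol, no language-specific rules are needed to rebuild the text.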

> [!TIP]
> The XLNetTokenizer uses SentencePiece, which is why the `"▁"` character (U+2581, which looks like an underscore) is included. To decode with SentencePiece, concatenate all the tokens and replace `"▁"` with a space.

That's it for Unigram! Hopefully by now you're feeling like an expert in all things tokenizer. In the next section, we will delve into the building blocks of the 🤗 Tokenizers library, and show you how you can use them to build your own tokenizer.

## Glossary

* Unigram Algorithm: A subword tokenization algorithm that starts from a large vocabulary and removes tokens while keeping the loss as low as possible.
* SentencePiece: An open-source text tokenization tool created by Google that works across many languages. It is especially suitable for languages that do not use spaces to separate words (for example Chinese and Japanese).
* AlBERT: A lightweight version of BERT.
* T5: The Text-to-Text Transfer Transformer model created by Google.
* mBART: Multilingual Bidirectional and Auto-Regressive Transformers, a multilingual sequence-to-sequence model.
* Big Bird: An efficient variant of the Transformer model for long sequences.
* XLNet: A type of autoregressive Transformer model.
* Raw Input Stream: Input data on which no preprocessing has been done yet.
* Vocabulary: The full set of distinct tokens that a tokenizer or model knows and can handle.
* BPE (Byte-Pair Encoding): A subword tokenization algorithm.
* WordPiece: A subword tokenization algorithm.
* Substrings: Contiguous parts of a string.
* Pre-tokenized Words: Words split out before subword tokenization is applied.
* Initial Corpus: The original collection of data used to train a model or tokenizer.
* Loss: A value measuring how poorly the model fits the data. In this section, it is the negative log likelihood of the corpus under the Unigram model.
* Corpus: A large collection of text (or other data).
* Symbol: Refers to a token or subword.
* Hyperparameter: A parameter that has to be set before training starts (for example learning rate, batch size, `percent_to_remove`).
* Base Characters: The basic individual characters of a language, out of which every word can be built.
* Language Model: An AI model trained to capture the distribution of human language.
* Probability: A value between 0 and 1 expressing how likely something is.
* Frequency: The number of times something occurs.
* Sum of All Frequencies: The sum of the frequencies of all tokens in the vocabulary.
* Segmentation: Splitting a word into subword tokens.
* Product of Probability: The value obtained by multiplying probabilities together.
* Viterbi Algorithm: A dynamic programming technique used to find the most likely path of states (for example a tokenization) for a sequence.
* Graph: A data structure made of nodes (vertices) and edges (connections).
* Subword: A part of a word.
* Negative Log Likelihood: The negated logarithm of a probability, used as a loss function.
* XLNetTokenizer: The tokenizer used for the XLNet model.
* `xlnet-base-cased`: The checkpoint identifier for the base, cased version of the XLNet model.
* `collections.defaultdict(int)`: A Python dictionary that returns `int()` (that is, 0) as the default value when a missing key is accessed.
* `tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)`: The method from the 🤗 Tokenizers library that performs pre-tokenization.
* `word_freqs`: A dictionary storing the frequency of each word in the corpus.
* `char_freqs`: A dictionary storing the frequency of each character in the corpus.
* `subwords_freqs`: A dictionary storing the frequency of each subword in the corpus.
* `lambda x: x[1]`: A Python lambda function that returns the value from a key-value pair. It is used as the sort key.
* Enhanced Suffix Array (ESA): An algorithm used by SentencePiece to create the initial vocabulary.
* Numerically Stable: A property of a computation that limits the errors introduced by floating-point arithmetic.
* `log`: The natural logarithm (base e).
* `best_segmentations`: The list storing the best segmentations in the Viterbi algorithm.
* `best_score_at_start`: The best segmentation score at a given start position.
* `<unk>` (Unknown Token): A special token used in place of words that cannot be tokenized with the vocabulary.
* `compute_loss(model)`: The function that computes the loss of the model.
* `compute_scores(model)`: The function that computes how much the loss changes when each token is removed from the vocabulary.
* `copy.deepcopy(model)`: Creates a deep copy of an object in Python.
* `token_freqs.pop(sorted_scores[i][0])`: Removes a key from the dictionary.
* `percent_to_remove`: The fraction of tokens removed at each step of training.
* `tokenize(text, model)`: The function that tokenizes text using the model.
* `sum(encoded_words, [])`: Flattens a list of lists into a single list.
* `▁` Character: The special character (U+2581, which looks like an underscore) that represents a space in SentencePiece.

