Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Byte-Pair Encoding tokenization","local":"byte-pair-encoding-tokenization","sections":[{"title":"Thuật toán huấn luyện","local":"thuật-toán-huấn-luyện","sections":[],"depth":2},{"title":"Thuật toán tokenize","local":"thuật-toán-tokenize","sections":[],"depth":2},{"title":"Triển khai BPE","local":"triển-khai-bpe","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/course/pr_1069/vi/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/entry/start.bcd19957.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/scheduler.37c15a92.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/singletons.20a6a839.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/index.18351ede.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/paths.c89f4ad2.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/entry/app.38d32b86.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/index.2bf4358c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/nodes/0.cba642dc.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/nodes/48.0645477e.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/Tip.363c041f.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/Youtube.1e50a667.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/CodeBlock.4e987730.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/DocNotebookDropdown.efc1fb7c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1069/vi/_app/immutable/chunks/getInferenceSnippets.24b50994.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Byte-Pair Encoding tokenization","local":"byte-pair-encoding-tokenization","sections":[{"title":"Thuật toán huấn luyện","local":"thuật-toán-huấn-luyện","sections":[],"depth":2},{"title":"Thuật toán tokenize","local":"thuật-toán-tokenize","sections":[],"depth":2},{"title":"Triển khai BPE","local":"triển-khai-bpe","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="byte-pair-encoding-tokenization" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#byte-pair-encoding-tokenization"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Byte-Pair Encoding tokenization</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"> <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/vi/chapter6/section5.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/vi/chapter6/section5.ipynb" target="_blank"><img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"></a></div> <p data-svelte-h="svelte-11kmw4w">Mã hóa theo cặp (BPE) tiền thân được phát triển như một thuật toán để nén văn bản, sau đó được OpenAI sử dụng để tokenize khi huấn luyện trước mô hình GPT. Nó được sử dụng bởi rất nhiều mô hình Transformer, bao gồm GPT, GPT-2, RoBERTa, BART và DeBERTa.</p> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/HEikzVL-lZU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-qb86qo">💡 Phần này trình bày sâu hơn về BPE, đi xa hơn nữa là trình bày cách triển khai đầy đủ. Bạn có thể bỏ qua phần cuối nếu bạn chỉ muốn có một cái nhìn tổng quan chung về thuật toán tokenize.</p></div> <h2 class="relative group"><a id="thuật-toán-huấn-luyện" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#thuật-toán-huấn-luyện"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Thuật toán huấn luyện</span></h2> <p data-svelte-h="svelte-zg05r7">Huấn luyện BPE bắt đầu bằng cách tính toán tập hợp các từ duy nhất được sử dụng trong kho ngữ liệu (sau khi hoàn thành các bước chuẩn hóa và pre-tokenization), sau đó xây dựng từ vựng bằng cách lấy tất cả các ký hiệu được sử dụng để viết những từ đó. Ví dụ rất đơn giản, giả sử kho dữ liệu của chúng ta sử dụng năm từ sau:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-string">"hug"</span>, <span class="hljs-string">"pug"</span>, <span class="hljs-string">"pun"</span>, <span class="hljs-string">"bun"</span>, <span class="hljs-string">"hugs"</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-gi73yv">Từ vựng cơ sở khi đó sẽ là <code>["b", "g", "h", "n", "p", "s", "u"]</code>. Đối với các trường hợp trong thực tế, từ vựng cơ sở đó sẽ chứa tất cả các ký tự ASCII, ít nhất và có thể là một số ký tự Unicode. Nếu một mẫu bạn đang tokenize sử dụng một ký tự không có trong kho dữ liệu huấn luyện, thì ký tự đó sẽ được chuyển đổi thành token không xác định. Đó là một lý do tại sao nhiều mô hình NLP rất kém trong việc phân tích nội dung bằng biểu tượng cảm xúc.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-ce0g87">GPT-2 và RoBERTa tokenizer (khá giống nhau) có một cách thông minh để giải quyết vấn đề này: chúng không xem các từ được viết bằng các ký tự Unicode mà là các byte. Bằng cách này, từ vựng cơ sở có kích thước nhỏ (256), nhưng mọi ký tự bạn có thể nghĩ đến sẽ vẫn được bao gồm và không bị chuyển đổi thành token không xác định. Thủ thuật này được gọi là <em>BPE cấp byte</em>.</p></div> <p data-svelte-h="svelte-1kt74dv">Sau khi có được bộ từ vựng cơ bản này, chúng ta thêm các token mới cho đến khi đạt được kích thước từ vựng mong muốn bằng cách học <em>hợp nhất</em>, đây là các quy tắc để hợp nhất hai yếu tố của từ vựng hiện có với nhau thành một từ mới. Vì vậy, lúc đầu sự hợp nhất này sẽ tạo ra các token có hai ký tự và sau đó, khi quá trình huấn luyện tiến triển, các từ phụ sẽ dài hơn.</p> <p data-svelte-h="svelte-synn8h">Tại bất kỳ bước nào trong quá trình huấn luyện token, thuật toán BPE sẽ tìm kiếm cặp token hiện có thường xuyên nhất (theo “cặp”, ở đây có nghĩa là hai token liên tiếp trong một từ). Cặp thường xuyên nhất đó là cặp sẽ được hợp nhất, và chúng ta xả và lặp lại cho bước tiếp theo.</p> <p data-svelte-h="svelte-12rd29q">Quay trở lại ví dụ trước, giả sử các từ có tần số như sau:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"hug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">10</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pug"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"pun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">12</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"bun"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"hugs"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-qz03b6">nghĩa là <code>"hug"</code> có mặt 10 lần trong kho ngữ liệu, <code>"pug"</code> 5 lần, <code>"pun"</code> 12 lần, <code>"bun"</code> 4 lần và <code>"hug"</code> 5 lần. Chúng ta bắt đầu huấn luyện bằng cách tách từng từ thành các ký tự (những ký tự hình thành từ vựng ban đầu của chúng ta) để có thể xem mỗi từ như một danh sách các token:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">"h"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"g"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">10</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"p"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"g"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"p"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"n"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">12</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"b"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"n"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">4</span>)<span class="hljs-punctuation">,</span> (<span class="hljs-string">"h"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"g"</span> <span class="hljs-string">"s"</span><span class="hljs-punctuation">,</span> <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1qpj7kn">Sau đó, chúng ta xem xét các cặp. Cặp <code>("h", "u")</code> có trong các từ <code>"hug"</code> và <code>"hugs"</code>, vì vậy tổng cộng là 15 lần trong ngữ liệu. Tuy nhiên, đây không phải là cặp thường xuyên nhất: vinh dự đó thuộc về <code>("u", "g")</code>, có trong <code>"hug"</code>, <code>"pug"</code>, và <code>"hugs"</code>, với tổng cộng 20 lần xuất hiện trong bộ từ vựng.</p> <p data-svelte-h="svelte-veqq88">Do đó, quy tắc hợp nhất đầu tiên được học bởi tokenizer là <code>("u", "g") -> "ug"</code>, có nghĩa là <code>"ug"</code> sẽ được thêm vào từ vựng và cặp này sẽ được hợp nhất trong tất cả các từ của ngữ liệu. Vào cuối giai đoạn này, từ vựng và ngữ liệu sẽ giống như sau:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-symbol">Vocabulary:</span> [<span class="hljs-string">"b"</span>, <span class="hljs-string">"g"</span>, <span class="hljs-string">"h"</span>, <span class="hljs-string">"n"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"s"</span>, <span class="hljs-string">"u"</span>, <span class="hljs-string">"ug"</span>] | |
| <span class="hljs-symbol">Corpus:</span> (<span class="hljs-string">"h"</span> <span class="hljs-string">"ug"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"ug"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"n"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"u"</span> <span class="hljs-string">"n"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"h"</span> <span class="hljs-string">"ug"</span> <span class="hljs-string">"s"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-dbvjcs">Bây giờ chúng ta có một số cặp dẫn đến một token dài hơn hai ký tự: ví dụ: cặp <code>("h", "ug")</code>, (hiện diện 15 lần trong kho ngữ liệu). Cặp thường gặp nhất ở giai đoạn này là <code>("u", "n")</code>, xuất hiện 16 lần trong kho ngữ liệu, vì vậy quy tắc hợp nhất thứ hai đã học là <code>("u", "n") -> "un"</code>. Thêm nó vào bộ từ vựng và hợp nhất tất cả các lần xuất hiện hiện có sẽ dẫn chúng ta đến:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-symbol">Vocabulary:</span> [<span class="hljs-string">"b"</span>, <span class="hljs-string">"g"</span>, <span class="hljs-string">"h"</span>, <span class="hljs-string">"n"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"s"</span>, <span class="hljs-string">"u"</span>, <span class="hljs-string">"ug"</span>, <span class="hljs-string">"un"</span>] | |
| <span class="hljs-symbol">Corpus:</span> (<span class="hljs-string">"h"</span> <span class="hljs-string">"ug"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"ug"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"un"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"un"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"h"</span> <span class="hljs-string">"ug"</span> <span class="hljs-string">"s"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1g0xfcl">Giờ thì cặp xuất hiện nhiều nhất là <code>("h", "ug")</code>, nên chúng ta hợp nhất <code>("h", "ug") -> "hug"</code>, trả về cho chúng ta token gồn ba kí tự đầu tiên. Sau sự hợp nhất này, kho ngữ liệu sẽ như sau:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-symbol">Vocabulary:</span> [<span class="hljs-string">"b"</span>, <span class="hljs-string">"g"</span>, <span class="hljs-string">"h"</span>, <span class="hljs-string">"n"</span>, <span class="hljs-string">"p"</span>, <span class="hljs-string">"s"</span>, <span class="hljs-string">"u"</span>, <span class="hljs-string">"ug"</span>, <span class="hljs-string">"un"</span>, <span class="hljs-string">"hug"</span>] | |
| <span class="hljs-symbol">Corpus:</span> (<span class="hljs-string">"hug"</span>, <span class="hljs-number">10</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"ug"</span>, <span class="hljs-number">5</span>), (<span class="hljs-string">"p"</span> <span class="hljs-string">"un"</span>, <span class="hljs-number">12</span>), (<span class="hljs-string">"b"</span> <span class="hljs-string">"un"</span>, <span class="hljs-number">4</span>), (<span class="hljs-string">"hug"</span> <span class="hljs-string">"s"</span>, <span class="hljs-number">5</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-hfuvhi">Và chúng ta tiếp túc làm vậy cho đến khi chúng ta chạm đến kích thước bộ tự điển ta mong muốn.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1xbl3fa">✏️ <strong>Giờ thì đến lượt bạn!</strong> Bạn nghĩ bước hợp nhất tiếp theo sẽ là gì?</p></div> <h2 class="relative group"><a id="thuật-toán-tokenize" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#thuật-toán-tokenize"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Thuật toán tokenize</span></h2> <p data-svelte-h="svelte-ffv8nq">Tokenize tuân thủ chặt chẽ quá trình huấn luyện, theo nghĩa là các đầu vào mới được tokenize bằng cách áp dụng các bước sau:</p> <ol data-svelte-h="svelte-11lx2za"><li>Chuẩn hoá</li> <li>Pre-tokenization</li> <li>Tách các từ thành các ký tự riêng lẻ</li> <li>Áp dụng các quy tắc hợp nhất đã học theo thứ tự trên các phần tách đó</li></ol> <p data-svelte-h="svelte-gqyamq">Lấy ví dụ mà ta đã sử dụng trong quá trình huấn luyện, với ba quy tắc hợp nhất đã học:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-function"><span class="hljs-params">(<span class="hljs-string">"u"</span>, <span class="hljs-string">"g"</span>)</span> -></span> <span class="hljs-string">"ug"</span> | |
| <span class="hljs-function"><span class="hljs-params">(<span class="hljs-string">"u"</span>, <span class="hljs-string">"n"</span>)</span> -></span> <span class="hljs-string">"un"</span> | |
| <span class="hljs-function"><span class="hljs-params">(<span class="hljs-string">"h"</span>, <span class="hljs-string">"ug"</span>)</span> -></span> <span class="hljs-string">"hug"</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-ogsb8b">Từ <code>"bug"</code> sẽ được tokenize thành <code>["b", "ug"]</code>. <code>"mug"</code>, tuy nhiên, sẽ tokenize thành <code>["[UNK]", "ug"]</code> vì kí tự <code>"m"</code> không có trong bộ tự vựng gốc. Tương tự, từ <code>"thug"</code> sẽ được tokenize thành <code>["[UNK]", "hug"]</code>: kí tự <code>"t"</code> không có trong bộ tự vựng gốc, và áp dụng quy tắc hợp nhất ở <code>"u"</code> và <code>"g"</code> và sau đó <code>"hu"</code> và <code>"g"</code>.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-gq28p5">✏️ <strong>Giờ tới lượt bạn!</strong> Bạn nghĩ rằng <code>"unhug"</code> sẽ được tokenize như thế nào?</p></div> <h2 class="relative group"><a id="triển-khai-bpe" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#triển-khai-bpe"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Triển khai BPE</span></h2> <p data-svelte-h="svelte-1vl7fwd">Hãy cùng xem các thuật toán BPE được triển khai. Đây không phải là phiên bản tối ưu mà bạn có thể thực sự sử dụng cho một kho ngữ liệu lớn; chúng tôi chỉ muốn cho bạn xem đoạn mã để bạn có thể hiểu thuật toán này tốt hơn.</p> <p data-svelte-h="svelte-4d1s7j">Đầu tiên chúng ta cần một kho ngữ liệu, vậy nên hay tạo ra một bản đơn giản với một vài câu:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->corpus = [ | |
| <span class="hljs-string">"This is the Hugging Face Course."</span>, | |
| <span class="hljs-string">"This chapter is about tokenization."</span>, | |
| <span class="hljs-string">"This section shows several tokenizer algorithms."</span>, | |
| <span class="hljs-string">"Hopefully, you will be able to understand how they are trained and generate tokens."</span>, | |
| ]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-56uiq8">Tiếp theo, ta cần tiền tokenize kho ngữ liệu này thành các từ. Vì ta đang sao chép một bản BPE tokenizer (như GPT-2), ta vẫn có thể sử dụng <code>gpt2</code> tokenize cho bước pre-tokenization:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"gpt2"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1gazbec">Sau đó ta tính tần suất của từng từ trong kho ngữ liệu như khi làm với pre-tokenization:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict | |
| word_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| <span class="hljs-keyword">for</span> text <span class="hljs-keyword">in</span> corpus: | |
| words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text) | |
| new_words = [word <span class="hljs-keyword">for</span> word, offset <span class="hljs-keyword">in</span> words_with_offsets] | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> new_words: | |
| word_freqs[word] += <span class="hljs-number">1</span> | |
| <span class="hljs-built_in">print</span>(word_freqs)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->defaultdict(<span class="hljs-built_in">int</span>, {<span class="hljs-string">'This'</span>: <span class="hljs-number">3</span>, <span class="hljs-string">'Ġis'</span>: <span class="hljs-number">2</span>, <span class="hljs-string">'Ġthe'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'ĠHugging'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'ĠFace'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'ĠCourse'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'.'</span>: <span class="hljs-number">4</span>, <span class="hljs-string">'Ġchapter'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">'Ġabout'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġtokenization'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġsection'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġshows'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġseveral'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġtokenizer'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġalgorithms'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">'Hopefully'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">','</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġyou'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġwill'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġbe'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġable'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġto'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġunderstand'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġhow'</span>: <span class="hljs-number">1</span>, | |
| <span class="hljs-string">'Ġthey'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġare'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġtrained'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġand'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġgenerate'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'Ġtokens'</span>: <span class="hljs-number">1</span>})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1q44ilk">Tiếp theo chúng ta sẽ tính bộ từ vựng cơ sở từ các kí tự sử dụng trong kho ngữ liệu:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->alphabet = [] | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> word_freqs.keys(): | |
| <span class="hljs-keyword">for</span> letter <span class="hljs-keyword">in</span> word: | |
| <span class="hljs-keyword">if</span> letter <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> alphabet: | |
| alphabet.append(letter) | |
| alphabet.sort() | |
| <span class="hljs-built_in">print</span>(alphabet)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[ <span class="hljs-string">','</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'H'</span>, <span class="hljs-string">'T'</span>, <span class="hljs-string">'a'</span>, <span class="hljs-string">'b'</span>, <span class="hljs-string">'c'</span>, <span class="hljs-string">'d'</span>, <span class="hljs-string">'e'</span>, <span class="hljs-string">'f'</span>, <span class="hljs-string">'g'</span>, <span class="hljs-string">'h'</span>, <span class="hljs-string">'i'</span>, <span class="hljs-string">'k'</span>, <span class="hljs-string">'l'</span>, <span class="hljs-string">'m'</span>, <span class="hljs-string">'n'</span>, <span class="hljs-string">'o'</span>, <span class="hljs-string">'p'</span>, <span class="hljs-string">'r'</span>, <span class="hljs-string">'s'</span>, | |
| <span class="hljs-string">'t'</span>, <span class="hljs-string">'u'</span>, <span class="hljs-string">'v'</span>, <span class="hljs-string">'w'</span>, <span class="hljs-string">'y'</span>, <span class="hljs-string">'z'</span>, <span class="hljs-string">'Ġ'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1oxyb2m">Ta cũng có thể thêm các token đặc biệt từ mô hình ở đầu của bộ tự vựng. Trong trường hợp của GPT-2, token đặc biệt duy nhất đó là <code>"<|endoftext|>"</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->vocab = [<span class="hljs-string">"<|endoftext|>"</span>] + alphabet.copy()<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-2htgmu">Ta giờ cần phải chia mỗi từ thành các kí tự riêng lẻ để có thể bắt đầu huấn luyện</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->splits = {word: [c <span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> word] <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> word_freqs.keys()}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1a8mxvp">Giờ ta đã sẵn sàng để huấn luyện, hãy cùng viết một hàm tính tần suất mỗi cặp. Ta sẽ cần sử dụng nó ở bước huấn luyện:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">compute_pair_freqs</span>(<span class="hljs-params">splits</span>): | |
| pair_freqs = defaultdict(<span class="hljs-built_in">int</span>) | |
| <span class="hljs-keyword">for</span> word, freq <span class="hljs-keyword">in</span> word_freqs.items(): | |
| split = splits[word] | |
| <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(split) == <span class="hljs-number">1</span>: | |
| <span class="hljs-keyword">continue</span> | |
| <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(split) - <span class="hljs-number">1</span>): | |
| pair = (split[i], split[i + <span class="hljs-number">1</span>]) | |
| pair_freqs[pair] += freq | |
| <span class="hljs-keyword">return</span> pair_freqs<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-e72r5t">Hãy nhìn vào một phần từ điẻn sau khi tách:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->pair_freqs = compute_pair_freqs(splits) | |
| <span class="hljs-keyword">for</span> i, key <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(pair_freqs.keys()): | |
| <span class="hljs-built_in">print</span>(<span class="hljs-string">f"<span class="hljs-subst">{key}</span>: <span class="hljs-subst">{pair_freqs[key]}</span>"</span>) | |
| <span class="hljs-keyword">if</span> i >= <span class="hljs-number">5</span>: | |
| <span class="hljs-keyword">break</span><!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">'T'</span>, <span class="hljs-string">'h'</span>): <span class="hljs-number">3</span> | |
| (<span class="hljs-string">'h'</span>, <span class="hljs-string">'i'</span>): <span class="hljs-number">3</span> | |
| (<span class="hljs-string">'i'</span>, <span class="hljs-string">'s'</span>): <span class="hljs-number">5</span> | |
| (<span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'i'</span>): <span class="hljs-number">2</span> | |
| (<span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'t'</span>): <span class="hljs-number">7</span> | |
| (<span class="hljs-string">'t'</span>, <span class="hljs-string">'h'</span>): <span class="hljs-number">3</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-4rjuk1">Giờ thì, tìm xem cặp xuất hiện nhiều nhất bằng một vòng lặp nhanh:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->best_pair = <span class="hljs-string">""</span> | |
| max_freq = <span class="hljs-literal">None</span> | |
| <span class="hljs-keyword">for</span> pair, freq <span class="hljs-keyword">in</span> pair_freqs.items(): | |
| <span class="hljs-keyword">if</span> max_freq <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> max_freq < freq: | |
| best_pair = pair | |
| max_freq = freq | |
| <span class="hljs-built_in">print</span>(best_pair, max_freq)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->(<span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'t'</span>) <span class="hljs-number">7</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-c1bdjy">Vậy phép hợp nhất đầu tiên là <code>('Ġ', 't') -> 'Ġt'</code>, và ta thêm <code>'Ġt'</code> vào bộ từ vựng:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->merges = {(<span class="hljs-string">"Ġ"</span>, <span class="hljs-string">"t"</span>): <span class="hljs-string">"Ġt"</span>} | |
| vocab.append(<span class="hljs-string">"Ġt"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-18ne55y">Để tiếp tục, ta cần áp dụng sự hợp nhất ở từ điển <code>splits</code>. Hãy cùng viết một hàm khác cho nó:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">merge_pair</span>(<span class="hljs-params">a, b, splits</span>): | |
| <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> word_freqs: | |
| split = splits[word] | |
| <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(split) == <span class="hljs-number">1</span>: | |
| <span class="hljs-keyword">continue</span> | |
| i = <span class="hljs-number">0</span> | |
| <span class="hljs-keyword">while</span> i < <span class="hljs-built_in">len</span>(split) - <span class="hljs-number">1</span>: | |
| <span class="hljs-keyword">if</span> split[i] == a <span class="hljs-keyword">and</span> split[i + <span class="hljs-number">1</span>] == b: | |
| split = split[:i] + [a + b] + split[i + <span class="hljs-number">2</span> :] | |
| <span class="hljs-keyword">else</span>: | |
| i += <span class="hljs-number">1</span> | |
| splits[word] = split | |
| <span class="hljs-keyword">return</span> splits<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-dw7mnu">Giờ ta có thể nhìn xem kết quả của lần hợp nhất đầu tiên:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->splits = merge_pair(<span class="hljs-string">"Ġ"</span>, <span class="hljs-string">"t"</span>, splits) | |
| <span class="hljs-built_in">print</span>(splits[<span class="hljs-string">"Ġtrained"</span>])<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'Ġt'</span>, <span class="hljs-string">'r'</span>, <span class="hljs-string">'a'</span>, <span class="hljs-string">'i'</span>, <span class="hljs-string">'n'</span>, <span class="hljs-string">'e'</span>, <span class="hljs-string">'d'</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1rsvobe">Giờ thì ta có tất cả những gì mình cần để lặp cho đến khi ta học tất các các hợp nhất mà ta muốn. Hãy cũng nhắm tới bộ tự vựng có kích cỡ là 50:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->vocab_size = <span class="hljs-number">50</span> | |
| <span class="hljs-keyword">while</span> <span class="hljs-built_in">len</span>(vocab) < vocab_size: | |
| pair_freqs = compute_pair_freqs(splits) | |
| best_pair = <span class="hljs-string">""</span> | |
| max_freq = <span class="hljs-literal">None</span> | |
| <span class="hljs-keyword">for</span> pair, freq <span class="hljs-keyword">in</span> pair_freqs.items(): | |
| <span class="hljs-keyword">if</span> max_freq <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span> <span class="hljs-keyword">or</span> max_freq < freq: | |
| best_pair = pair | |
| max_freq = freq | |
| splits = merge_pair(*best_pair, splits) | |
| merges[best_pair] = best_pair[<span class="hljs-number">0</span>] + best_pair[<span class="hljs-number">1</span>] | |
| vocab.append(best_pair[<span class="hljs-number">0</span>] + best_pair[<span class="hljs-number">1</span>])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1f46jbe">Kết quả là, chúng ta đã học 19 quy tắc hợp nhất (bộ từ điển gốc có kích cỡ là 31 tương ứng 30 kí tự trong bảng chữ cái cùng một token đặt biệt):</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(merges)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{(<span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'t'</span>): <span class="hljs-string">'Ġt'</span>, (<span class="hljs-string">'i'</span>, <span class="hljs-string">'s'</span>): <span class="hljs-string">'is'</span>, (<span class="hljs-string">'e'</span>, <span class="hljs-string">'r'</span>): <span class="hljs-string">'er'</span>, (<span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'a'</span>): <span class="hljs-string">'Ġa'</span>, (<span class="hljs-string">'Ġt'</span>, <span class="hljs-string">'o'</span>): <span class="hljs-string">'Ġto'</span>, (<span class="hljs-string">'e'</span>, <span class="hljs-string">'n'</span>): <span class="hljs-string">'en'</span>, | |
| (<span class="hljs-string">'T'</span>, <span class="hljs-string">'h'</span>): <span class="hljs-string">'Th'</span>, (<span class="hljs-string">'Th'</span>, <span class="hljs-string">'is'</span>): <span class="hljs-string">'This'</span>, (<span class="hljs-string">'o'</span>, <span class="hljs-string">'u'</span>): <span class="hljs-string">'ou'</span>, (<span class="hljs-string">'s'</span>, <span class="hljs-string">'e'</span>): <span class="hljs-string">'se'</span>, (<span class="hljs-string">'Ġto'</span>, <span class="hljs-string">'k'</span>): <span class="hljs-string">'Ġtok'</span>, | |
| (<span class="hljs-string">'Ġtok'</span>, <span class="hljs-string">'en'</span>): <span class="hljs-string">'Ġtoken'</span>, (<span class="hljs-string">'n'</span>, <span class="hljs-string">'d'</span>): <span class="hljs-string">'nd'</span>, (<span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'is'</span>): <span class="hljs-string">'Ġis'</span>, (<span class="hljs-string">'Ġt'</span>, <span class="hljs-string">'h'</span>): <span class="hljs-string">'Ġth'</span>, (<span class="hljs-string">'Ġth'</span>, <span class="hljs-string">'e'</span>): <span class="hljs-string">'Ġthe'</span>, | |
| (<span class="hljs-string">'i'</span>, <span class="hljs-string">'n'</span>): <span class="hljs-string">'in'</span>, (<span class="hljs-string">'Ġa'</span>, <span class="hljs-string">'b'</span>): <span class="hljs-string">'Ġab'</span>, (<span class="hljs-string">'Ġtoken'</span>, <span class="hljs-string">'i'</span>): <span class="hljs-string">'Ġtokeni'</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1qjs069">Và bộ tự vựng cấu thành bởi token đặc biết, các kí tự trong bảng chữ cái, và tất cả kết quả từ các quy tắc hợp nhất:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">print</span>(vocab)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'<|endoftext|>'</span>, <span class="hljs-string">','</span>, <span class="hljs-string">'.'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'H'</span>, <span class="hljs-string">'T'</span>, <span class="hljs-string">'a'</span>, <span class="hljs-string">'b'</span>, <span class="hljs-string">'c'</span>, <span class="hljs-string">'d'</span>, <span class="hljs-string">'e'</span>, <span class="hljs-string">'f'</span>, <span class="hljs-string">'g'</span>, <span class="hljs-string">'h'</span>, <span class="hljs-string">'i'</span>, <span class="hljs-string">'k'</span>, <span class="hljs-string">'l'</span>, <span class="hljs-string">'m'</span>, <span class="hljs-string">'n'</span>, <span class="hljs-string">'o'</span>, | |
| <span class="hljs-string">'p'</span>, <span class="hljs-string">'r'</span>, <span class="hljs-string">'s'</span>, <span class="hljs-string">'t'</span>, <span class="hljs-string">'u'</span>, <span class="hljs-string">'v'</span>, <span class="hljs-string">'w'</span>, <span class="hljs-string">'y'</span>, <span class="hljs-string">'z'</span>, <span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'Ġt'</span>, <span class="hljs-string">'is'</span>, <span class="hljs-string">'er'</span>, <span class="hljs-string">'Ġa'</span>, <span class="hljs-string">'Ġto'</span>, <span class="hljs-string">'en'</span>, <span class="hljs-string">'Th'</span>, <span class="hljs-string">'This'</span>, <span class="hljs-string">'ou'</span>, <span class="hljs-string">'se'</span>, | |
| <span class="hljs-string">'Ġtok'</span>, <span class="hljs-string">'Ġtoken'</span>, <span class="hljs-string">'nd'</span>, <span class="hljs-string">'Ġis'</span>, <span class="hljs-string">'Ġth'</span>, <span class="hljs-string">'Ġthe'</span>, <span class="hljs-string">'in'</span>, <span class="hljs-string">'Ġab'</span>, <span class="hljs-string">'Ġtokeni'</span>]<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-19s7cjy">💡 Sử dụng <code>train_new_from_iterator()</code> trên cùng kho ngữ liệu sẽ không mang về kết quả kho ngữ liệu y hệt. Đó là bởi khi có sự lựa chọn về cặp có tần suất cao nhất, ta đã chọn cái đầu tiên xuất hiện, trong khi thư viện 🤗 Tokenizers chọn cái đầu tiên dựa trên ID bên trong của nó.</p></div> <p data-svelte-h="svelte-aev4h0">Để tokenize văn bản mới, chúng ta tiền tokenize nó, tách ra, rồi áp dụng quy tắc hợp nhất được học:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize</span>(<span class="hljs-params">text</span>): | |
| pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text) | |
| pre_tokenized_text = [word <span class="hljs-keyword">for</span> word, offset <span class="hljs-keyword">in</span> pre_tokenize_result] | |
| splits = [[l <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> word] <span class="hljs-keyword">for</span> word <span class="hljs-keyword">in</span> pre_tokenized_text] | |
| <span class="hljs-keyword">for</span> pair, merge <span class="hljs-keyword">in</span> merges.items(): | |
| <span class="hljs-keyword">for</span> idx, split <span class="hljs-keyword">in</span> <span class="hljs-built_in">enumerate</span>(splits): | |
| i = <span class="hljs-number">0</span> | |
| <span class="hljs-keyword">while</span> i < <span class="hljs-built_in">len</span>(split) - <span class="hljs-number">1</span>: | |
| <span class="hljs-keyword">if</span> split[i] == pair[<span class="hljs-number">0</span>] <span class="hljs-keyword">and</span> split[i + <span class="hljs-number">1</span>] == pair[<span class="hljs-number">1</span>]: | |
| split = split[:i] + [merge] + split[i + <span class="hljs-number">2</span> :] | |
| <span class="hljs-keyword">else</span>: | |
| i += <span class="hljs-number">1</span> | |
| splits[idx] = split | |
| <span class="hljs-keyword">return</span> <span class="hljs-built_in">sum</span>(splits, [])<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-yx1cbw">t | |
| Ta có thể thử các này với bất kì đoạn văn nào khác được tạo thành từ các kí tự trong bảng chữ cái:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->tokenize(<span class="hljs-string">"This is not a token."</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'This'</span>, <span class="hljs-string">'Ġis'</span>, <span class="hljs-string">'Ġ'</span>, <span class="hljs-string">'n'</span>, <span class="hljs-string">'o'</span>, <span class="hljs-string">'t'</span>, <span class="hljs-string">'Ġa'</span>, <span class="hljs-string">'Ġtoken'</span>, <span class="hljs-string">'.'</span>]<!-- HTML_TAG_END --></pre></div> <div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400"><p data-svelte-h="svelte-dajqb">⚠️ Các triển khai của chúng ta sẽ gặp lỗi nếu có những kí tự vô danh vì chúng ta đã không làm gì để xử lý chúng. GPT-2 không thực sự có những token vô danh (không thể có kí tự vô danh khi sử dụng BPE cấp byte), nhưng nó có thể xảy ra ở đây vì ta không bao gồm tất cả các byte có thể có trong bộ từ vựng gốc. Khía cạnh này của BPE nằm ngoài phạm vi phần này, nên chúng tôi sẽ không đi sau vào chi tiết.</p></div> <p data-svelte-h="svelte-1idci4l">Đó là những gì ta cần biết về thuật toán BPE! Tiếp theo, chúng ta sẽ cùng tìm hiểu về WordPiece.</p> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/course/blob/main/chapters/vi/chapter6/5.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_rdxbtd = { | |
| assets: "/docs/course/pr_1069/vi", | |
| base: "/docs/course/pr_1069/vi", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/course/pr_1069/vi/_app/immutable/entry/start.bcd19957.js"), | |
| import("/docs/course/pr_1069/vi/_app/immutable/entry/app.38d32b86.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 48], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 88.8 kB
- Xet hash:
- 1cddba62b36dc259caa72619220de84169c1a5b03f190fafe42a877a920d1c94
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.