Buckets:

hf-doc-build
/

doc-dev

Files

xet

hf-doc-build/doc-dev / course /pr_1114 /th /chapter6 /3.md

rtrm

about 1 month ago

preview code

download

raw

39.1 kB

	# ความสามารถพิเศษของตัวตัดคำแบบเร็ว (fast tokenizers)

	{#if fw === 'pt'}

	<CourseFloatingBanner chapter={6}
	classNames="absolute z-10 right-0 top-0"
	notebooks={[
	{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/th/chapter6/section3_pt.ipynb"},
	{label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/th/chapter6/section3_pt.ipynb"},
	]} />

	{:else}

	<CourseFloatingBanner chapter={6}
	classNames="absolute z-10 right-0 top-0"
	notebooks={[
	{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/th/chapter6/section3_tf.ipynb"},
	{label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/th/chapter6/section3_tf.ipynb"},
	]} />

	{/if}

	ในบทนี้ เราจะเรียนเกี่ยวกับความสามารถของ tokenizer ต่างๆ ใน 🤗 Transformers library กัน
	ในบทก่อนๆ คุณได้ลองใช้ tokenizer เพื่อแยกข้อความให้เป็นคำๆ และเพื่อแปลง ID ของคำให้กลับไปเป็นข้อความแล้ว
	จริงๆแล้ว tokenizer นั้นยังมีความสามารถอีกหลายอย่าง โดยเฉพาะ tokenizer จาก 🤗 Tokenizers library

	เพื่อให้คุณเห็นภาพได้อย่างชัดเจน เราจะมาลองคำนวณผลลัพธ์ (reproduce) ของ `token-classification` (ซึ่งเราจะเรียกสั้นๆว่า `ner`) และสร้าง pipeline สำหรับ `question-answering` อย่างที่คุณได้เรียนมาแล้ว[บทที่ 1](/course/chapter1)กัน


	<Youtube id="g8quOxoqhHQ"/>

	เราจะแยก tokenizer เป็นแบบช้า (slow) และแบบเร็ว (fast) ซึ่งแบบช้าหมายถึง tokenizer ที่เขียนด้วย Python และมาจาก 🤗 Transformers library ส่วนแบบเร็วหมายถึง tokenizer ที่เขียนด้วย Rust และมาจาก 🤗 Tokenizers library
	ถ้าคุณยังจำตารางจาก[บทที่ 5](/course/chapter5/3)ได้ ซึ่งเป็นตารางเปรียบเทียบเวลา ที่ tokenizer แบบเร็วและช้า ใช้ในการตัดคำชุดข้อมูล Drug Review คุณก็จะเห็นว่า ทำไมเราจึงเรียกพวกมันว่า แบบช้าและเร็ว

	\| Fast tokenizer \| Slow tokenizer
	:--------------:\|:--------------:\|:-------------:
	`batched=True` \| 10.8s \| 4min41s
	`batched=False` \| 59.2s \| 5min3s

	> [!WARNING]
	> ⚠️ ถ้าคุณเปรียบเทียบ tokenizer ทั้งสองแบบ โดยดูจากความเร็วในการตัดคำของประโยคเดียว คุณอาจจะไม่เห็นความแตกต่างมาก และบางที fast tokenizer อาจจะช้ากว่า slow tokenizer ด้วยซ้ำ คุณจะเห็นความแตกต่างที่แท้จริง ก็เมื่อลองรันกับ input ที่มีขนาดใหญ่ระดับหนึ่ง เพราะการทำแบบนี้ จะทำให้การประมวลผลแบบ parallel ถูกเรียกใช้งาน

	## Batch encoding

	<Youtube id="3umI3tm27Vw"/>

	output ที่ได้จากการตัดคำนั้นไม่ใช่ dictionary แต่เป็น Python object ที่เรียกว่า `BatchEncoding` ซึ่งเป็น subclass ของ dictionary อีกที ทำให้เราสามารถ index ตัว output ได้ `BatchEncoding` ต่างจาก dictionary ทั่วไปตรงที่ มันมี method เพิ่มเติม ที่ส่วนมากจะถูกใช้โดย fast tokenizer

	นอกจาก fast tokenizer จะสามารถประมวลผลแบบ parallel ได้แล้ว ความสามารถหลักของของมันก็คือ มันจะบันทึก span (ตำแหน่งเริ่มและจบ) ของแต่ละ token ไว้ด้วย ข้อมูลเกี่ยวกับ span นี้เราเรียกว่า offset mapping

	offset mapping สามารถช่วยให้เราโยง "คำ" ไปหา token ของมันได้ (ในที่นี้ "คำ" หมายถึง กลุ่มของตัวอักษรที่ถูกแบ่งด้วย space ซึ่งหนึ่งคำอาจจะถูกแบ่งออกเป็นหลาย token ได้) และนอกจากนั้น ก็ยังช่วยให้เราสามารถโยงตัวอักษร ไปหา token ได้ด้วยเช่นกัน

	มาดูตัวอย่างกัน:

	```py
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
	example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
	encoding = tokenizer(example)
	print(type(encoding))
	```

	จะเห็นได้ว่า ผลลัพธ์ที่ได้จากตัดคำ คือ `BatchEncoding` อย่างที่เราได้กล่าวไว้ข้างต้น:

	```python out
	<class 'transformers.tokenization_utils_base.BatchEncoding'>
	```

	เนื่องจาก class `AutoTokenizer` จะเรียกใช้ตัว fast tokenizer เป็นค่าเริ่มต้น (by default) ซึ่งแปลว่า output ที่ได้คือ `BatchEncoding` และเราก็จะสามารถใช้ method พิเศษของมันได้
	การจะเช็คว่า ตัวตัดคำเป็นแบบเร็วหรือช้า ทำได้สองวิธี วิธีแรกคือเช็คโดยการใช้ attribute `is_fast` ของ `tokenizer` :

	```python
	tokenizer.is_fast
	```

	```python out
	True
	```

	อีกวิธีคือเช็ค attribute `is_fast` ของ `encoding`:

	```python
	encoding.is_fast
	```

	```python out
	True
	```

	มาดูกันว่า fast tokenizer ทำอะไรได้บ้าง อย่างแรกคือ เราสามารถเรียกดู token ได้ โดยไม่ต้องแปลงแต่ละ ID กลับไปเป็น token

	```py
	encoding.tokens()
	```

	```python out
	['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in',
	'Brooklyn', '.', '[SEP]']
	```

	ในตัวอย่างนี้ token ในตำแหน่งที่ 5 คือ `##yl` ซึ่งเป็นส่วนหนึ่งของคำว่า "Sylvain"

	นอกจากนั้น คุณยังสามารถใช้ method `word_ids()` เพื่อเรียกดูตำแหน่งของคำได้ด้วย

	```py
	encoding.word_ids()
	```

	```python out
	[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]
	```

	คุณจะเห็นว่า token พิเศษอย่าง `[CLS]` และ `[SEP]` จะถูกจับคู่กับค่า `None` ส่วน token อื่นๆก็จะถูกจับคู่กับคำต้นตอของมัน
	ข้อมูลนี้ มีประโยชน์ในการเอาไว้เช็คว่า token ที่เราสนใจนั้น อยู่ในตำแหน่งที่เป็นจุดเริ่มต้นของคำหรือไม่ และเอาไว้เช็คว่า token สองตัว มาจากคำเดียวกันหรือไม่
	คุณสามารถเช็คข้อมูลพวกนี้ได้จากการดูที่สัญลักษณ์ `##` ที่อยู่ข้างหน้า token (ซึ่งแปลว่า token ตัวนี้ ไม่ได้อยู่ตรงจุดเริ่มต้นคำ) แต่วิธีนี้ ใช้ได้แค่กับตัวตัดคำที่มีโครงสร้างแบบโมเดล BERT เท่านั้น อย่างไรก็ตาม วิธีนี้ใช้ได้กับตัวตัดคำทุกประเภทที่เป็นแบบเร็ว

	ในบทหน้า เราจะมาดูกันว่า เราจะใช้ feature นี้เพื่อจับคู่ label กับ token ใน task เช่น entity recognition (NER) and part-of-speech (POS) tagging ได้อย่างไร
	นอกจากนั้น คุณยังสามารถใช้ feature นี้ เพื่อทำการปกปิด(mask) token ทุกตัวที่มาจากคำเดียวกัน เวลาใช้ masked language modeling ได้อีกด้วย (การทำแบบนี้เราเรียกว่า _whole word masking_)

	> [!TIP]
	> นิยามของ "คำ" นั้นค่อนข้างยากที่จะกำหนด ตัวอย่างเช่น "I'll" (เป็นการเขียนแบบสั้นของ "I will" ) ควรนับเป็นหนึ่งหรือสองคำ ?
	> คำตอบของคำถามนี้นั้น ขึ้นกับว่า คุณใช้ตัวตัดคำแบบไหน และมีการปรับแต่งข้อความ input ก่อนที่จะทำการตัดคำหรือไม่
	> ตัวตัดคำบางตัว อาจจะแยกคำด้วย space บางตัวอาจจะแยกคำด้วยเครื่องหมายวรรคตอน (punctuation) ก่อนแล้วจึงแบ่งด้วย space ในกรณีหลังนี้ "I'll" ก็จะถูกแบ่งเป็นสองคำ
	>
	> ✏️ ลองทำดู! ให้คุณลองสร้างตัวตัดคำจาก checkpoint ของ `bert-base-cased` and `roberta-base` แล้วให้ลองตัดคำว่า "81s" คุณสังเกตเห็นอะไรบ้าง และ ID ของคำที่ได้คืออะไร

	นอกจากนั้นยังมี method คล้ายๆกัน ที่ชื่อ `sentence_ids()` ที่เอาไว้ใช้เพื่อโยง token ไปหาประโยคต้นตอ ในตัวอย่างของเรา คุณสามารถใช้ `token_type_ids` ซึ่งเป็นผลลัพธ์จากการรัน tokenizer แทน `sentence_ids()` ได้ เพราะทั้งสองให้ข้อมูลเดียวกัน

	ความสามารถสุดท้าย คือโยง token ไปหาแต่ละตัวอักษร หรือกลับกัน โดยการใช้ method `word_to_chars()` หรือ `token_to_chars()` และ `char_to_word()` หรือ `char_to_token()`

	ตัวอย่างเช่น method `word_ids()` สามารถบอกให้คุณรู้ว่า `##yl` เป็นส่วนหนึ่งของคำในตำแหน่งที่ 3 แต่ถ้าคุณอยากรู้ว่า เป็นคำไหนในประโยค คุณสามารถเช็คได้ดังนี้ :

	```py
	start, end = encoding.word_to_chars(3)
	example[start:end]
	```

	```python out
	Sylvain
	```

	อย่างที่เราได้บอกข้างต้นแล้ว fast tokenizer สามารถทำแบบนี้ได้ เพราะมันเก็บข้อมูลเกี่ยวกับ span ของแต่ละ token เอาไว้ และบันทึกไว้ใน list ของ offsets
	เพื่อที่จะอธิบายการใช้งานของ feature นี้ เรามาลองคำนวณผลลัพธ์ของ pipeline `token-classification` กัน

	> [!TIP]
	> ✏️ ลองทำดู! ให้คุณลองคิดข้อความตัวอย่างขึ้นมา แล้วถามตัวเองว่า token ตัวไหนคู่กับ ID ของคำไหน และ คุณจะหา span ของแต่ละคำได้อย่างไร นอกจากนั้น ให้คุณลองสร้างสองประโยคเพื่อเป็น input ให้กับตัวตัดคำของคุณ แล้วดูว่า ID ของประโยคนั้นเหมาะสมหรือไม่

	## โครงสร้างภายในของ pipeline `token-classification`

	ใน[บทที่ 1](/course/chapter1) คุณได้เรียนเกี่ยวกับการสร้างระบบ NER ซึ่งเป้าหมายหลักของระบบนี้ คือการหาส่วนของข้อความที่เป็น entitiy เช่น ชื่อคน ชื่อสถานที่ หรือชื่อองค์กร โดยการใช้ฟังก์ชัน `pipeline()` จาก 🤗 Transformers
	ส่วนใน[บทที่ 2](/course/chapter2) คุณได้เรียนรู้ว่า pipeline ประกอบไปด้วย 3 ขั้นตอนสำคัญ เริ่มจาก การตัดคำ จากนั้นผลลัพธ์ก็จะถูกส่งไปให้โมเดล และสุดท้ายคือ การการปรับแต่งผลลัพธ์ (post-processing)
	สองขั้นตอนแรกใน pipeline `token-classification` นั้น จะเหมือนกันกับ pipeline อื่นๆ แต่ขั้นตอน post-processing จะค่อนข้างซับซ้อนกว่า เราจะมาดูรายละเอียดกัน


	{#if fw === 'pt'}

	<Youtube id="0E7ltQB7fM8"/>

	{:else}

	<Youtube id="PrX4CjrVnNc"/>

	{/if}

	### การคำนวณผลลัพธ์เบื้องต้นด้วยการใช้ pipeline

	อันดับแรก เราจะใช้ token classification pipeline เพื่อเปรียบเทียบกับ pipeline
	ของเรา โมเดลที่ถูกตั้งเป็นค่าเบื้องต้นคือ [`dbmdz/bert-large-cased-finetuned-conll03-english`](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english) ซึ่งมันจะคำนวณ NER ของแต่ละ ข้อความ input:

	```py
	from transformers import pipeline

	token_classifier = pipeline("token-classification")
	token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
	```

	```python out
	[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
	{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
	{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
	{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
	{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
	{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
	{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
	{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
	```

	คุณจะเห็นว่า โมเดลนี้สามารถบอกได้ว่า token ที่มาจาก คำว่า "Sylvain" นั้นเป็นชื่อคน และ token ที่มาจากคำว่า "Hugging Face" นั้นเป็นชื่อองค์กร และ "Brooklyn" เป็นชื่อสถานที่
	เราสามารถใช้ pipeline นี้เพื่อรวมรวม token ที่มี entity ประเภทเดียวกันได้ด้วย

	```py
	from transformers import pipeline

	token_classifier = pipeline("token-classification", aggregation_strategy="simple")
	token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
	```

	```python out
	[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
	{'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
	{'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
	```

	`aggregation_strategy` ที่เราเลือก จะเปลี่ยนคะแนนของแต่ละกลุ่ม entity ด้วย
	ถ้าเราตั้งค่าให้เป็นแบบ `"simple"` มันจะคำนวณคะแนน โดยการเฉลี่ยคะแนนของแต่ละ token ที่นับเป็น entity เดียวกัน ตัวอย่างเช่น คะแนนของคำว่า "Sylvain" คือค่าเฉลี่ยของคะแนนจาก token ย่อยซึ่งก็คือ `S`, `##yl`, `##va`, and `##in`

	วิธีคำนวณคะแนนรวมแบบอื่น :

	- `"first"` จะใช้คะแนนของ token แรกเท่านั้น เป็นคะแนนรวม (เช่น คะแนนรวมของคำว่า "Sylvain" ก็จะเป็น 0.993828 ซึ่งมาจากคะแนนของ `S`)
	- `"max"` จะใช้คะแนนของ token ที่มีคะแนนมากที่สุด (เช่น คะแนนรวมของคำว่า "Hugging Face" ก็จะเป็น 0.98879766 ซึ่งมาจากคะแนนของ "Face")
	- `"average"` จะใช้ค่าเฉลี่ยของแต่ละ token ที่เป็นส่วนของ entity นั้น เป็นคะแนนรวมของ entity (สำหรับคำว่า "Sylvain" คะแนนรวมแบบเฉลี่ยจะไม่ต่างจากคะแนนรวมแบบ `"simple"` แต่คำว่า "Hugging Face" จำได้คะแนน 0.9819 ซึ่งเป็นค่าเฉลี่ย ของ "Hugging" 0.975 และ "Face" 0.98879)

	มาดูกันว่า คุณจะสร้างผลลัพธ์แบบนี้ได้อย่างไร โดยไม่ใช้ฟังก์ชัน `pipeline()`!

	### จาก input สู่ ผลลัพธ์

	{#if fw === 'pt'}

	อันดับแรก เราจะเอาข้อความ input มาตัดคำก่อน แล้วส่งต่อผลลัพธ์ที่ได้ไปให้กับโมเดล ขั้นตอนนี้จะเหมือนที่เราทำกันใน[บทที่ 2](/course/chapter2) โดยเริ่มจากสร้างตัวตัดคำและโมเดลขึ้นมา โดยใช้คลาส `AutoXxx`
	```py
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
	tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
	model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

	example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
	inputs = tokenizer(example, return_tensors="pt")
	outputs = model(**inputs)
	```

	เนื่องจากเราใช้ `AutoModelForTokenClassification` หลังจากรัน เราจะได้ค่า logit (set of logits) สำหรับแต่ละ token ที่อยู่ในข้อความ input

	```py
	print(inputs["input_ids"].shape)
	print(outputs.logits.shape)
	```

	```python out
	torch.Size([1, 19])
	torch.Size([1, 19, 9])
	```

	{:else}

	อันดับแรก เราจะเอาข้อความ input มาตัดคำก่อน แล้วส่งต่อผลลัพธ์ที่ได้ไปให้กับโมเดล ขั้นตอนนี้จะเหมือนที่เราทำกันใน[บทที่ 2](/course/chapter2) โดยเริ่มจากสร้างตัวตัดคำและโมเดลขึ้นมา โดยใช้คลาส `TFAutoXxx`

	```py
	from transformers import AutoTokenizer, TFAutoModelForTokenClassification

	model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
	tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
	model = TFAutoModelForTokenClassification.from_pretrained(model_checkpoint)

	example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
	inputs = tokenizer(example, return_tensors="tf")
	outputs = model(**inputs)
	```

	เนื่องจากเราใช้ `TFAutoModelForTokenClassification` หลังจากรัน เราจะได้ค่า logit (set of logits) สำหรับแต่ละ token ที่อยู่ในข้อความ input

	```py
	print(inputs["input_ids"].shape)
	print(outputs.logits.shape)
	```

	```python out
	(1, 19)
	(1, 19, 9)
	```

	{/if}

	แต่ละ batch ประกอบไปด้วย 1 ข้อความ ซึ่งมี token 19 ตัว และโมเดลที่เราใช้ สามารถทำนายได้ 9 หมวด (label) ดังนั้นขนาดของ output ที่ได้คือ 1 x 19 x 9
	เช่นเดียวกับตอนที่เราใช้ text classification pipeline คือเราจะใช้ฟังก์ชัน softmax เพื่อที่จะแปลงค่า logits ไปเป็นค่าความเป็นไปได้ (probabilities) จากนั้นเราจะคำนวณค่า argmax เพื่อคำนวณคำทำนายสุดท้าย (เราใช้ argmax ของค่า logits ตรงนี้ได้ ก็เพราะการคำนวณ softmax จากคะแนนของแต่ละหมวด ไม่ได้ทำให้ลำดับของหมวดเปลี่ยน)

	{#if fw === 'pt'}

	```py
	import torch

	probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
	predictions = outputs.logits.argmax(dim=-1)[0].tolist()
	print(predictions)
	```

	{:else}

	```py
	import tensorflow as tf

	probabilities = tf.math.softmax(outputs.logits, axis=-1)[0]
	probabilities = probabilities.numpy().tolist()
	predictions = tf.math.argmax(outputs.logits, axis=-1)[0]
	predictions = predictions.numpy().tolist()
	print(predictions)
	```

	{/if}

	```python out
	[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
	```

	attribute `model.config.id2label` คือตัวที่เก็บข้อมูลเกี่ยวกับ mapping ของ index ไปหา label ที่เราเอาไว้ใช้เพื่อดูว่า ผลลัพธ์นั้นถูกต้องหรือไม่

	```py
	model.config.id2label
	```

	```python out
	{0: 'O',
	1: 'B-MISC',
	2: 'I-MISC',
	3: 'B-PER',
	4: 'I-PER',
	5: 'B-ORG',
	6: 'I-ORG',
	7: 'B-LOC',
	8: 'I-LOC'}
	```

	จาก output ข้างบน คุณจะเห็นว่า เรามีคำทำนายทั้งหมด 9 หมวด (label) โดยที่ label `O` หมายถึง token ที่ไม่ได้เป็น named entity (`O` มาจากคำว่า "outside") และ แต่ละประเภทของ entity จะมีอย่างละสอง label (เรามีทั้งหมด 4 entity: miscellaneous, person, organization, and location)

	ถ้า token ถูก label ด้วย `B-XXX` นั่นแปลว่า token นี้อยู่ข้างหน้าของ entity ประเภท `XXX` ส่วน label `I-XXX` หมายถึง token นี้อยู่ข้างใน entity ประเภท `XXX`
	ถ้าดูจากข้อความตัวอย่างที่เราใช้ข้างบน โมเดลของเราก็ควรจะจับคู่ token `S` กับหมวด `B-PER` (ซึ่งแปลว่า `S` เป็นส่วนข้างหน้าของ entity ประเภทชื่อคน) ส่วน token `##yl`, `##va` และ `##in` ก็ควรจะถูกทำนายให้เป็นหมวด `I-PER` (ซึ่งหมายถึง token ที่อยู่ข้างใน entity ประเภทชื่อคน)

	คุณอาจจะคิดว่า โมเดลของเราทำนายผิดในตัวอย่างนี้ เพราะว่ามันทำนายทั้งสี่ token ให้เป็น `I-PER` แต่จริงๆแล้ว ทำนายแบบนี้ก็ไม่ได้ผิดไปซะทั้งหมด
	เพราะว่า การทำนายโดยใช้ label `B-` และ `I-` ในงาน NER มีสองแบบ คือ IOB1 and IOB2

	IOB1 (สีส้ม) เป็นแบบที่เราใช้ในตัวอย่างข้างบน ส่วน IOB2 (สีม่วง) แตกต่างตรงที่ `B-` เอาไว้ใช้แบ่ง entity สองตัวที่เป็นประเภทเดียว ให้ออกจากกัน
	โมเดลที่เราใช้นั้นถูก fine-tune จากชุดข้อมูลที่มี label แบบ IOB2 ทำให้มันทำนาย `S` เป็น `I-PER`


	<div class="flex justify-center">
	<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/IOB_versions.svg" alt="IOB1 vs IOB2 format"/>
	<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter6/IOB_versions-dark.svg" alt="IOB1 vs IOB2 format"/>
	</div>

	ตอนนี้เราก็พร้อมที่จะคำนวณผลลัพธ์ ให้ได้แบบเดียวกับ pipeline แรกแล้ว โดยที่เราจะใช้คะแนนและ label ของแต่ละ token ที่ไม่ใช่ `O`เท่านั้น

	```py
	results = []
	tokens = inputs.tokens()

	for idx, pred in enumerate(predictions):
	label = model.config.id2label[pred]
	if label != "O":
	results.append(
	{"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
	)

	print(results)
	```

	```python out
	[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S'},
	{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl'},
	{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va'},
	{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in'},
	{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu'},
	{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging'},
	{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face'},
	{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn'}]
	```

	จะเห็นว่าตอนนี้ เราได้ผลลัพธ์ที่คล้ายกับผลลัพธ์จาก pipeline ก่อนหน้านี้แล้ว ข้อแตกต่างเดียวก็คือ ผลลัพธ์จาก pipeline จะให้ข้อมูลเกี่ยวกับ ตำแหน่งเริ่มและจบในข้อความของแต่ละ entity ด้วย
	ขั้นตอนต่อไป เราจะได้เรียกใช้ค่า offset mapping เพื่อตั้งค่าให้โมเดลคำนวณค่า offset เราจะเช็ต `return_offsets_mapping=True` ในตอนที่เราใช้รันตัวตัดคำ


	```py
	inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
	inputs_with_offsets["offset_mapping"]
	```

	```python out
	[(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18), (19, 22), (23, 24), (25, 29), (30, 32),
	(33, 35), (35, 40), (41, 45), (46, 48), (49, 57), (57, 58), (0, 0)]
	```

	แต่ละ tuple ในนี้ จะบันทึกค่า span ของแต่ละ token ไว้ โดยที่พวก token พิเศษ ก็จะมีค่า span เป็น `(0, 0)`
	ก่อนหน้านี้ เราได้เห็นว่า token ในตำแหน่งที่ 5 คือ `##yl` มีค่า offset เป็น `(12, 14)`
	ถ้าคุณลอง slice ข้อความ input ด้วย index สองค่านี้

	```py
	example[12:14]
	```

	คุณจะได้ส่วนของข้อความที่เป็น `yl` โดยที่ `##` จะถูกละออกจากผลลัพธ์:

	```python out
	yl
	```

	เราจะใช้วิธีนี้ เพื่อคำนวณผลลัพธ์ให้ได้ผลลัพธ์เหมือนใน pipeline:

	```py
	results = []
	inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
	tokens = inputs_with_offsets.tokens()
	offsets = inputs_with_offsets["offset_mapping"]

	for idx, pred in enumerate(predictions):
	label = model.config.id2label[pred]
	if label != "O":
	start, end = offsets[idx]
	results.append(
	{
	"entity": label,
	"score": probabilities[idx][pred],
	"word": tokens[idx],
	"start": start,
	"end": end,
	}
	)

	print(results)
	```

	```python out
	[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
	{'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
	{'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
	{'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
	{'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
	{'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
	{'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
	{'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
	```

	ตอนนี้ เราก็ได้ผลลัพธ์แบบเดียวกับผลลัพธ์จาก pipeline แล้ว!

	### การรวม entities

	ข้อมูลเกี่ยวกับ ตำแหน่งเริ่มและตำแหน่งจบของแต่ละ entity ที่เราได้จาก offset อาจจะไม่เป็นประโยชน์มากนัก
	แต่ถ้าคุณต้องการจะรวม token ที่อยู่ในกลุ่ม entity เดียวกันเข้าด้วยกัน ข้อมูลจาก offset พวกนี้ จะมีโยชน์มาก และทำให้คุณไม่ต้องเขียนโค้ดเพิ่มเติมด้วย

	ตัวอย่างเช่น ถ้าคุณต้องการจะรวม `Hu`, `##gging`, และ `Face` เข้าด้วยกัน คุณอาจจะเขียนกฎขึ้นมาว่า ให้รวม token แรกกับ token ที่สองเข้าด้วยกัน โดยลบ `##` ออก
	แล้วให้รวม token ที่สามโดยใช้ช่องว่างในการเชื่อม เพราะ `Face` ไม่ได้เริ่มต้นด้วย `##` อย่างไรก็ตามวิธีนี้ ใช้ได้แค่กับตัวตัดคำบางประเภทเท่านั้น
	สำหรับตัวตัดคำอื่นๆเช่น แบบ SentencePiece หรือ Byte-Pair-Encoding เราก็จะต้องสร้างกฎขึ้นมาใหม่

	การที่เราใช้ค่า offset ทำให้เราไม่จำเป็นต้องเขียนโค้ดเกี่ยวกับกฎพวกนี้ขึ้นมาเอง คุณสามารถใช้ตำแหน่งเริ่มของ token แรก และ ตำแหน่งจบของ token สุดท้าย เป็นค่าในการ slice ข้อความ input เพื่อที่จะคำนวณส่วนของข้อความของ entity ที่คุณสนใจ
	ตัวอย่างเช่น ถ้าเรามี 3 token `Hu`, `##gging`, และ `Face` ซึ่งอยู่ในกลุ่ม entity เดียวกัน เราก็จะใช้ค่า 33 (ตำแหน่งเริ่มของ `Hu`) เป็นจุดเริ่มต้น และ 45 (ตำแหน่งจบของ `Hu`) เป็นจุดจบของ entity นี้


	```py
	example[33:45]
	```

	```python out
	Hugging Face
	```

	ขั้นตอนต่อไป เราจะมาเขียนโค้ดเพื่อปรับแต่งผลลัพธ์จากโมเดลกัน (post-processing) โดยจะทำไปพร้อมๆกับการรวมกลุ่ม entity ด้วยวิธีต่อไปนี้

	เริ่มจาก token แรกของ entity ซึ่งเป็นได้ทั้ง `B-XXX` และ `I-XXX` จากนั้นให้รวม token ตัวถัดๆไป ที่เป็น `I-XXX` ทั้งหมดเข้าด้วยกัน จนกว่าจะเห็น `O` (แปลว่าเรากำลังอ่านถึง token ที่ไม่ใช่ entity ใดๆ ให้เราหยุดการรวม) หรือ `B-XXX` (ให้เราเริ่มต้นรวม entity ตัวใหม่)

	```py
	import numpy as np

	results = []
	inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
	tokens = inputs_with_offsets.tokens()
	offsets = inputs_with_offsets["offset_mapping"]

	idx = 0
	while idx < len(predictions):
	pred = predictions[idx]
	label = model.config.id2label[pred]
	if label != "O":
	# Remove the B- or I-
	label = label[2:]
	start, _ = offsets[idx]

	# Grab all the tokens labeled with I-label
	all_scores = []
	while (
	idx < len(predictions)
	and model.config.id2label[predictions[idx]] == f"I-{label}"
	):
	all_scores.append(probabilities[idx][pred])
	_, end = offsets[idx]
	idx += 1

	# The score is the mean of all the scores of the tokens in that grouped entity
	score = np.mean(all_scores).item()
	word = example[start:end]
	results.append(
	{
	"entity_group": label,
	"score": score,
	"word": word,
	"start": start,
	"end": end,
	}
	)
	idx += 1

	print(results)
	```

	ตอนนี้ เราก็ได้ผลลัพธ์แบบเดียวกันกับ pipeline แล้ว!

	```python out
	[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
	{'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
	{'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
	```

	อีกตัวอย่างหนึ่งของงานที่ค่า offset เป็นโยชน์มาก ก็คือ question answering
	ในบทถัดไป คุณจะได้เรียนเกี่ยวกับ pipeline แบบเจาะลึกมากขึ้น ซึ่งคุณจะได้รู้จักกับ feature สุดท้ายของ tokenizers ใน 🤗 Transformers library ซึ่งก็คือ การจัดการกับ tokens ที่ถูกตัดออกจากข้อความ input

	<EditOnGithub source="https://github.com/huggingface/course/blob/main/chapters/th/chapter6/3.mdx" />

Xet Storage Details

Size:: 39.1 kB
Xet hash:: 9dc4b7151b17a96f9f0bc633dd7bcdd71d3758541ba3e1cc5f45c7c06b9f8957

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.