	# Putting it all together

	{#if fw === 'pt'}

	<CourseFloatingBanner chapter={2}
	classNames="absolute z-10 right-0 top-0"
	notebooks={[
	{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/th/chapter2/section6_pt.ipynb"},
	{label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/th/chapter2/section6_pt.ipynb"},
	]} />

	{:else}

	<CourseFloatingBanner chapter={2}
	classNames="absolute z-10 right-0 top-0"
	notebooks={[
	{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/th/chapter2/section6_tf.ipynb"},
	{label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/th/chapter2/section6_tf.ipynb"},
	]} />

	{/if}

	In the last few sections, we have been doing most of the work by hand. We explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

	However, as we saw in section 2, the 🤗 Transformers API can handle all of this for us with a high-level function that we will dive into here. When you call your `tokenizer` directly on a sentence, you get back inputs that are ready to pass through your model:

	```py
	from transformers import AutoTokenizer

	checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)

	sequence = "I've been waiting for a HuggingFace course my whole life."

	model_inputs = tokenizer(sequence)
	```

	Here, the `model_inputs` variable contains everything the model needs to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. For models that accept additional inputs, the `tokenizer` object returns those as well.

	As we will see in the examples below, this method is very powerful. First, it can tokenize a single sequence:

	```py
	sequence = "I've been waiting for a HuggingFace course my whole life."

	model_inputs = tokenizer(sequence)
	```

	It also handles multiple sequences at a time, with no change in the API:

	```py
	sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

	model_inputs = tokenizer(sequences)
	```

	It can pad according to several objectives:

	```py
	# Pads the sequences up to the length of the longest sequence in the batch
	model_inputs = tokenizer(sequences, padding="longest")

	# Pads the sequences up to the model max length
	# (512 for BERT or DistilBERT)
	model_inputs = tokenizer(sequences, padding="max_length")

	# Pads the sequences up to the specified max length
	model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
	```
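
To make the difference between these strategies concrete, here is a minimal pure-Python sketch of right-padding with a matching attention mask. It is an illustration only, not the 🤗 Transformers implementation: the `pad_batch` helper, the padding ID `0`, and the token IDs in `batch` are all assumptions made for the example.

```python
# Illustrative sketch only; NOT how 🤗 Transformers implements padding.
# PAD_ID = 0 and the token IDs below are hypothetical values.
PAD_ID = 0


def pad_batch(batch, strategy="longest", max_length=None, pad_id=PAD_ID):
    """Right-pad lists of token IDs and build the matching attention masks."""
    if strategy == "longest":
        target = max(len(ids) for ids in batch)
    elif strategy == "max_length":
        if max_length is None:
            raise ValueError("this sketch needs an explicit max_length")
        target = max_length
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    input_ids, attention_mask = [], []
    for ids in batch:
        # Padding never shortens a sequence; shortening is truncation's job.
        n_pad = max(target - len(ids), 0)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7, 8, 9, 102], [101, 7, 102]]  # hypothetical token IDs

longest = pad_batch(batch, strategy="longest")
fixed = pad_batch(batch, strategy="max_length", max_length=8)
too_small = pad_batch(batch, strategy="max_length", max_length=4)

print(longest["input_ids"])
print(longest["attention_mask"])
print([len(ids) for ids in fixed["input_ids"]])
print([len(ids) for ids in too_small["input_ids"]])
```

Note the last case: when the requested length is shorter than a sequence, padding alone leaves that sequence untouched, which is why padding is often combined with truncation.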

	It can also truncate sequences:

	```py
	sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

	# Truncates the sequences that are longer than the model max length
	# (512 for BERT or DistilBERT)
	model_inputs = tokenizer(sequences, truncation=True)

	# Truncates the sequences that are longer than the specified max length
	model_inputs = tokenizer(sequences, max_length=8, truncation=True)
	```
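
As a rough mental model of truncation, the sketch below cuts a list of token IDs so that the result still fits within `max_length` once special tokens are added. The `truncate` helper and its `num_special_tokens` parameter are hypothetical names for this illustration, and reserving two slots is an assumption that matches BERT-style models; the long list of IDs is the one printed in the special tokens section of this page.

```python
# Illustrative sketch only; NOT the 🤗 Transformers truncation code.
def truncate(token_ids, max_length, num_special_tokens=2):
    """Keep at most max_length tokens, leaving room for special tokens."""
    budget = max_length - num_special_tokens
    if budget < 0:
        raise ValueError("max_length is too small to hold the special tokens")
    # Slicing leaves sequences that are already short enough unchanged.
    return token_ids[:budget]


# Token IDs of the example sentence, without special tokens (14 IDs).
long_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
            17662, 12172, 2607, 2026, 2878, 2166, 1012]
short_ids = [7, 8]  # hypothetical IDs for a short sentence

truncated_long = truncate(long_ids, max_length=8)
truncated_short = truncate(short_ids, max_length=8)

print(truncated_long)   # six IDs remain, leaving two slots for special tokens
print(truncated_short)  # unchanged
```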

	The `tokenizer` object can handle the conversion to framework-specific tensors, which can then be sent directly to the model. For example, in the following code sample we ask the tokenizer to return tensors from the different frameworks: `"pt"` returns PyTorch tensors, `"tf"` returns TensorFlow tensors, and `"np"` returns NumPy arrays:

	```py
	sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

	# Returns PyTorch tensors
	model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

	# Returns TensorFlow tensors
	model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

	# Returns NumPy arrays
	model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
	```
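
Each call above passes `padding=True` because a tensor is rectangular: every row of the batch needs the same length before it can be converted. The sketch below shows that check in plain Python; `batch_shape` is a hypothetical helper and the token IDs are made up for the example.

```python
# Illustrative sketch only: a plain-Python check that a batch is rectangular.
def batch_shape(batch):
    """Return (rows, columns) for a rectangular batch, else raise ValueError."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


ragged = [[101, 7, 8, 9, 102], [101, 7, 102]]        # hypothetical token IDs
padded = [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]  # same batch, padded with 0

print(batch_shape(padded))

try:
    batch_shape(ragged)
    ragged_ok = True
except ValueError as err:
    ragged_ok = False
    print(err)
```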

	## Special tokens

	If we look at the input IDs returned by the tokenizer, we will see that they differ slightly from what we had earlier:

	```py
	sequence = "I've been waiting for a HuggingFace course my whole life."

	model_inputs = tokenizer(sequence)
	print(model_inputs["input_ids"])

	tokens = tokenizer.tokenize(sequence)
	ids = tokenizer.convert_tokens_to_ids(tokens)
	print(ids)
	```

	```python out
	[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
	[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
	```

	One token ID was added at the beginning and one at the end. Let's decode the two sequences of IDs above to see what this is about:

	```py
	print(tokenizer.decode(model_inputs["input_ids"]))
	print(tokenizer.decode(ids))
	```

	```python out
	"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
	"i've been waiting for a huggingface course my whole life."
	```

	The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with them, so to get the same results at inference we need to add them as well. Note that some models don't add special words, or add different ones; a model may also add these special words only at the beginning or only at the end. In any case, the tokenizer knows which ones are expected and will handle this for you.
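
The relationship between the two printed lists can be checked by hand. In the sketch below, the IDs `101` and `102` are read off the output shown above for this DistilBERT checkpoint; other models may use different IDs or no special tokens at all, and `add_special_tokens` is a hypothetical helper, not a library function.

```python
# Illustrative sketch only: wrap plain token IDs with [CLS] and [SEP].
CLS_ID, SEP_ID = 101, 102  # values read from the output above (this checkpoint only)


def add_special_tokens(token_ids):
    """Return a new list with [CLS] in front and [SEP] at the end."""
    return [CLS_ID] + token_ids + [SEP_ID]


# Second list printed above: IDs from tokenize() + convert_tokens_to_ids().
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
       17662, 12172, 2607, 2026, 2878, 2166, 1012]

# First list printed above: IDs returned by calling the tokenizer directly.
expected = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
            17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

with_special = add_special_tokens(ids)
print(with_special == expected)
```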

	## Wrapping up: From tokenizer to model

	Now that we have seen all the individual steps the `tokenizer` object uses when applied to texts, let's see one final time how it handles multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

	{#if fw === 'pt'}
	```py
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
	sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

	tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
	output = model(**tokens)
	```
	{:else}
	```py
	import tensorflow as tf
	from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

	checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
	tokenizer = AutoTokenizer.from_pretrained(checkpoint)
	model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
	sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

	tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
	output = model(**tokens)
	```
	{/if}


