Buckets:

hf-doc-build
/

doc-dev

Files

xet

hf-doc-build/doc-dev / course /pr_1114 /th /chapter2 /5.md

rtrm

about 1 month ago

preview code

download

raw

19.2 kB

การจัดการกับหลายๆประโยค(multiple sequences)

{#if fw === 'pt'}

{:else}

{/if}

{#if fw === 'pt'} {:else} {/if}

ใน section ที่แล้ว เราได้ลองทำตัวอย่างการใช้งานที่ง่ายที่สุด: คือการอนุมาน(inference)โดยใช้ประโยคสั้นๆ เพียงประโยคเดียว อย่างไรก็ตาม ก็มีบางคำถามเกิดขึ้นมา:

เราจะจัดการกับประโยคหลายๆประโยคอย่างไร?
เราจะจัดการกับประโยคหลายๆประโยคที่มี ความยาวต่างกัน อย่างไร?
ดัชนีคำศัพท์(vocabulary indices) เป็นเพียงข้อมูลประเภทเดียวที่ทำให้โมเดลทำงานได้ดีหรือไม่?
มีประโยคที่เป็นประโยคที่ยาวเกินไปหรือไม่?

มาดูกันว่าปัญหาประเภทใดบ้างที่เกิดขึ้นจากคำถามเหล่านี้ และเราจะแก้ไขมันโดยใช้ 🤗 Transformers API ได้อย่างไร

โมเดลคาดหวังที่จะได้ชุด(batch)ของข้อมูลเป็นอินพุต

ในแบบฝึกหัดที่ผ่านมาคุณได้เห็นแล้วว่าประโยคถูกแปลงไปเป็นลิสท์ของตัวเลขอย่างไร ทีนี้เรามาลองแปลงลิสท์ของตัวเลขนี้ไปเป็น tensor และใส่มันเข้าไปในโมเดล:

{#if fw === 'pt'}

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

{:else}

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tf.constant(ids)
# This line will fail.
model(input_ids)

InvalidArgumentError: Input to reshape is a tensor with 14 values, but the requested shape has 196 [Op:Reshape]

{/if}

โอ้วไม่นะ! ทำไมมันมีข้อผิดพลาดละ? ในเมื่อเราก็ทำตามขั้นตอนต่างๆ จาก pipeline ใน section 2

ปัญหาก็คือเราใส่ประโยคเพียงประโยคเดียวเข้าไปในโมเดล แต่ในขณะที่โมเดล 🤗 Transformers นั้นต้องการประโยคหลายๆประโยค ในที่นี้เราได้ลองทำทุกอย่างที่ tokenizer ทำอยู่เบื้องหลังเมื่อเราใช้มันกับ ประโยค(sequence) แต่ถ้าคุณลองสังเกตุดีๆ คุณจะเห็นว่ามันไม่ได้เพียงแค่แปลง input IDs ไปเป็น tensor เท่านั้น แต่มันยังเพิ่มมิติ(dimension) ของ tensor อีกด้วย:

{#if fw === 'pt'}

tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])

{:else}

tokenized_inputs = tokenizer(sequence, return_tensors="tf")
print(tokenized_inputs["input_ids"])

<tf.Tensor: shape=(1, 16), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102]], dtype=int32)>

{/if}

ลองอีกครั้งและเพิ่มมิติ(dimension)ใหม่ด้วย:

{#if fw === 'pt'}

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

{:else}

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = tf.constant([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

{/if}

เราแสดง input IDs พร้อมทั้งผล logits และนี่คือผลลัพธ์:

{#if fw === 'pt'}

Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
Logits: [[-2.7276,  2.8789]]

{:else}

Input IDs: tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]], shape=(1, 14), dtype=int32)
Logits: tf.Tensor([[-2.7276208  2.8789377]], shape=(1, 2), dtype=float32)

{/if}

Batching คือ การส่งประโยคหลายๆประโยคเข้าไปยังโมเดลพร้อมๆกันทีเดียว ถ้าคุณมีแค่หนึ่งประโยค คุณก็สามารถสร้างชุด(batch)ของประโยค ที่มีเพียงแค่หนึ่งประโยคได้:

batched_ids = [ids, ids]

นี่คือชุด(batch)ของข้อมูลที่ประด้วยสองประโยคที่เหมือนกัน!

✏️ Try it out! แปลงลิสท์ของ batched_ids นี้ไปเป็น tensor และใส่เข้าไปในโมเดลของคุณ แล้วลองตรวจสอบดูว่าคุณได้ logits เหมือนกับที่ได้ก่อนหน้านี้ไหม(แต่ได้สองค่า)!

ฺBatching นั้นทำให้โมเดลสามารถทำงานได้เมื่อคุณใส่ประโยคหลายๆประโยคเข้าไป การใช้ประโยคหลายๆประโยคนั้นก็สามารถทำได้ง่ายเหมือนกับที่ทำกับประโยคเดียว แต่ก็ยังมีปัญหาที่สอง เมื่อคุณพยายามรวมประโยคตั้งแต่สองประโยคขึ้นไปเป็นชุดข้อมูลเดียวกัน แต่ประโยคเหล่านั้นอาจจะมีความยาวที่แตกต่างกัน ถ้าคุณเคยทำ tensors มาก่อนหน้านี้ คุณจะรู้ว่ามันจำเป็นต้องมีขนาดจตุรัส(rectangular) ดังนั้นคุณจะไม่สามารถแปลงลิสท์ของ input IDs ไปเป็น tensor ได้โดยตรง เราสามารถแก้ปัญหานี้ได้ด้วยการ pad อินพุต

การเติม(Padding) อินพุต

ลิสท์ของลิสท์ต่อไปนี้ไม่สามารถถูกแปลงไปเป็น tensor ได้:

batched_ids = [
    [200, 200, 200],
    [200, 200]
]

ในการแก้ปัญหานี้ เราจะใช้ padding เพื่อทำให้ tensor ของเรามีขนาดเป็นจตุรัส Padding จะทำให้ทุกประโยคของเรามีความยาวเท่ากันโดยการเพิ่มคำพิเศษ ที่เรียกว่า padding token ไปในประโยค ยกตัวอย่างเช่น ถ้าคุณมี 10 ประโยคที่แต่ละประโยคมี 10 คำ และมี 1 ประโยคที่มี 20 คำ padding ทำให้ทุกประโยคนั้นมี 20 คำเหมือนกัน ในตัวอย่างของเรา tensor ที่เป็นผลลัพท์ของเราเป็นแบบนี้:

padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

padding token ID สามารถหาได้ใน tokenizer.pad_token_id เรามาลองใช้มันดูและใส่ประโยคสองประโยคของเราเข้าไปในโมเดลทีละอันและใส่แบบเป็นชุด(batch)ด้วย:

{#if fw === 'pt'}

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

{:else}

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(tf.constant(sequence1_ids)).logits)
print(model(tf.constant(sequence2_ids)).logits)
print(model(tf.constant(batched_ids)).logits)

tf.Tensor([[ 1.5693678 -1.3894581]], shape=(1, 2), dtype=float32)
tf.Tensor([[ 0.5803005  -0.41252428]], shape=(1, 2), dtype=float32)
tf.Tensor(
[[ 1.5693681 -1.3894582]
 [ 1.3373486 -1.2163193]], shape=(2, 2), dtype=float32)

{/if}

มีอะไรบางอย่างผิดปกติเกิดขึ้นกับ logits ในการทำนายแบบเป็นชุด(batch) ของเรา: แถวที่สองควรจะได้ logits เหมือนกันกับประโยคที่สอง แต่เราได้ที่ต่างกันโดยสิ้นเชิง!

นี่เป็นเพราะว่าคุณลักษณะสำคัญของโมเดล Transformer คือ attention layers ที่พิจารณาบริบทของแต่ละ token โดยจะมีการพิจารณา padding token ด้วย เนื่องจาก attention layers พิจารณาทุก tokens ของประโยค เพื่อให้ได้ผลลัพท์เดียวกันไม่ว่าจะใส่ประโยคที่ความยาวต่างกันเข้าไปทีละประโยคหรือใส่ประโยคเดียวกันนี้ที่มีการ padding ไปพร้อมกันทีละหลายๆประโยค(batch) เราจำเป็นที่จะต้องบอก attention layers เหล่านั้นให้ไม่ต้องพิจารณา padding tokens โดยสามารถทำได้ด้วยการใช้ attention mask

Attention masks

Attention masks คือ tensors ที่มีขนาดเท่ากับ tensor ของ input IDs และประกอบด้วย 0 และ 1: 1 บ่งบอกว่า tokens ในตำแหน่งนั้นๆ จำเป็นต้องพิจารณา, และ 0 บ่งบอกว่า tokens ในตำแหน่งนั้นๆ ไม่ต้องพิจารณา(กล่าวคือ มันควรจะถูกละเลยโดย attention layers ของโมเดล)

มาทำตัวอย่างที่แล้วโดยใช้ attention mask กัน:

{#if fw === 'pt'}

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

{:else}

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask))
print(outputs.logits)

tf.Tensor(
[[ 1.5693681  -1.3894582 ]
 [ 0.5803021  -0.41252586]], shape=(2, 2), dtype=float32)

{/if}

ตอนนี้เราได้ logits เดียวกันสำหรับประโยคที่สองในชุด(batch) ของข้อมูล

สังเกตุว่าค่าสุดท้ายของประโยคที่สองนั้นเป็นอย่างไรใน padding ID ซึ่งก็คือค่า 0 ใน attention mask

✏️ ลองเลย! ทำ tokenization กับสองประโยคใน section 2 ("I've been waiting for a HuggingFace course my whole life." และ "I hate this so much!") แล้วใส่เข้าไปในโมเดลและตรวจสอบดูว่าคุณได้ logits เหมือนกับใน section 2 หรือไม่ หลังจากนั้นให้จับประโยครวมกันเป็นชุด(batch) โดยใช้ padding token แล้วสร้าง attention mask ที่ถูกต้อง และตรวจสอบดูว่าคุณได้ผลลัพท์เหมือนกันหรือไม่หลังจากใส่เข้าไปในโมเดลแล้ว!

ประโยคที่ยาวขึ้น

การใช้โมเดล Transformer นั้นมีข้อจำกัดเรื่องความยาวของประโยคที่เราสามารถใส่เข้าไปได้ โมเดลส่วนใหญ่จะสามารถรองรับประโยคได้อย่างมาก 512 หรือ 1024 tokens และก็จะทำงานไม่ได้ถ้าให้มันประมวลผลประโยคที่ยาวขึ้น มีวีธีแก้ปัญหานี้อยู่ 2 วิธี:

ใช้โมเดลที่รองรับประโยคที่ยาวขึ้น
ตัดประโยคของคุณให้สั้นลง

โมเดลต่างๆ รองรับความยาวของประโยคที่แตกต่างกัน และบางโมเดลนั้นเชี่ยวชาญในการจัดการกับประโยคที่ยาวๆ Longformer เป็นหนึ่งตัวอย่าง และอีกตัวอย่างก็คือ LED ถ้าคุณกำลังทำงานที่ต้องการประโยคที่ยาวมากๆ เราแนะนำให้คุณลองดูโมเดลเหล่านั้น

หรือในทางตรงกันข้าม เราแนะนำให้คุณตัดประโยคของคุณให้สั้นลงโดยการระบุตัวแปร `max_sequence_length:

sequence = sequence[:max_sequence_length]

Xet Storage Details

Size:: 19.2 kB
Xet hash:: b1441462ea9cb0f232c1cbd457d3eac9f34b1f676067fc7288c371bff42edf08

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.