	# End-of-chapter quiz[[end-of-chapter-quiz]]

	<CourseFloatingBanner
	chapter={2}
	classNames="absolute z-10 right-0 top-0"
	/>

	### 1. What is the order of the language modeling pipeline?

	<Question
	choices={[
	{
	text: "First, the model, which handles text and returns raw predictions. The tokenizer then makes sense of these predictions and converts them back to text when needed.",
	explain: "The model cannot understand text! The tokenizer must first tokenize the text and convert it to IDs so that it is understandable by the model."
	},
	{
	text: "First, the tokenizer, which handles text and returns IDs. The model handles these IDs and outputs a prediction, which can be some text.",
	explain: "The model's prediction cannot be text straight away. The tokenizer has to be used in order to convert the prediction back to text!"
	},
	{
	text: "The tokenizer handles text and returns IDs. The model handles these IDs and outputs a prediction. The tokenizer can then be used once again to convert these predictions back to some text.",
	explain: "The tokenizer can be used for both tokenizing and de-tokenizing.",
	correct: true
	}
	]}
	/>
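
As a rough illustration of that ordering, here is a minimal sketch in plain Python. Everything in it is hypothetical: the four-entry vocabulary, the whitespace-splitting `encode`, and `toy_model`, which merely reverses its input, stand in for a real tokenizer and a real Transformer. The only point is that the model sees and returns IDs, and the tokenizer sits on both sides of it.

```python
# Toy vocabulary (hypothetical): a real tokenizer ships a learned vocabulary.
VOCAB = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def encode(text):
    """Tokenizer, first pass: text -> IDs (whitespace split, illustration only)."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]


def toy_model(ids):
    """Stand-in for a model: it takes IDs and returns IDs, never text."""
    return list(reversed(ids))


def decode(ids):
    """Tokenizer, second pass: IDs -> text."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("hello world !")       # text -> IDs
predicted_ids = toy_model(ids)      # IDs -> predicted IDs
text = decode(predicted_ids)        # predicted IDs -> text
print(ids, predicted_ids, text)
```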

	### 2. How many dimensions does the tensor output by the base Transformer model have, and what are they?

	<Question
	choices={[
	{
	text: "2: The sequence length and the batch size",
	explain: "False! The tensor output by the model has a third dimension: hidden size."
	},
	{
	text: "2: The sequence length and the hidden size",
	explain: "False! All Transformer models handle batches, even with a single sequence; that would be a batch size of 1!"
	},
	{
	text: "3: The sequence length, the batch size, and the hidden size",
	explain: "Nicely done!",
	correct: true
	}
	]}
	/>

	### 3. Which of the following is an example of subword tokenization?

	<Question
	choices={[
	{
	text: "WordPiece",
	explain: "Yes, that's one example of subword tokenization!",
	correct: true
	},
	{
	text: "Character-based tokenization",
	explain: "Character-based tokenization is not a type of subword tokenization."
	},
	{
	text: "Splitting on whitespace and punctuation",
	explain: "That's a word-based tokenization scheme!"
	},
	{
	text: "BPE",
	explain: "Yes, that's one example of subword tokenization!",
	correct: true
	},
	{
	text: "Unigram",
	explain: "Yes, that's one example of subword tokenization!",
	correct: true
	},
	{
	text: "None of the above",
	explain: "Incorrect! WordPiece, BPE, and Unigram are all subword tokenization algorithms."
	}
	]}
	/>
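
To make "subword" concrete, the sketch below splits a word with a greedy longest-match-first lookup, in the style of WordPiece inference. The vocabulary is a made-up assumption, and the function name `wordpiece_tokenize` is hypothetical; how a real vocabulary is learned during training is not shown here.

```python
# Hypothetical vocabulary; "##" marks a piece that continues a word.
VOCAB = {"token", "##ization", "##ize", "un", "##related", "[UNK]"}


def wordpiece_tokenize(word, vocab=VOCAB):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_tokenize("tokenization"))
print(wordpiece_tokenize("unrelated"))
```

A word that is absent from the vocabulary as a whole can still be represented by pieces that are present, which is what distinguishes this scheme from word-based and character-based tokenization.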

	### 4. What is a model head?

	<Question
	choices={[
	{
	text: "A component of the base Transformer network that redirects tensors to their correct layers",
	explain: "There's no such component."
	},
	{
	text: "Also known as the self-attention mechanism, it adapts the representation of a token according to the other tokens of the sequence",
	explain: "The self-attention layer does contain attention \"heads,\" but these are not adaptation heads."
	},
	{
	text: "An additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output",
	explain: "That's right. Adaptation heads, also known simply as heads, come in different forms: language modeling heads, question answering heads, sequence classification heads...",
	correct: true
	}
	]}
	/>

	### 5. What is an AutoModel?

	<Question
	choices={[
	{
	text: "A model that automatically trains on your data",
	explain: "Are you confusing this with our <a href='https://huggingface.co/autotrain'>AutoTrain</a> product?"
	},
	{
	text: "An object that returns the correct architecture based on the checkpoint",
	explain: "Exactly: the <code>AutoModel</code> only needs to know the checkpoint from which to initialize to return the correct architecture.",
	correct: true
	},
	{
	text: "A model that automatically detects the language used for its inputs to load the correct weights",
	explain: "While some checkpoints and models are capable of handling multiple languages, there are no built-in tools for automatic checkpoint selection according to language. You should head over to the <a href='https://huggingface.co/models'>Model Hub</a> to find the best checkpoint for your task!"
	}
	]}
	/>

	### 6. What are the techniques to be aware of when batching sequences of different lengths together?

	<Question
	choices={[
	{
	text: "Truncating",
	explain: "Yes, truncation is a correct way of evening out sequences so that they fit in a rectangular shape. Is it the only one, though?",
	correct: true
	},
	{
	text: "Returning tensors",
	explain: "While the other techniques allow you to return rectangular tensors, returning tensors isn't helpful when batching sequences together."
	},
	{
	text: "Padding",
	explain: "Yes, padding is a correct way of evening out sequences so that they fit in a rectangular shape. Is it the only one, though?",
	correct: true
	},
	{
	text: "Attention masking",
	explain: "Absolutely! Attention masks are of prime importance when handling sequences of different lengths. That's not the only technique to be aware of, however.",
	correct: true
	}
	]}
	/>
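
The three techniques fit together as in the following sketch, written in plain Python rather than with a real tokenizer. The helper name `pad_and_truncate`, the padding ID of 0, and the sample IDs are all assumptions made for illustration; a real tokenizer also takes care of special tokens, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Return rectangular input IDs plus the matching attention mask."""
    input_ids, attention_mask = [], []
    # Pad up to the longest sequence, but never beyond max_length.
    target = min(max_length, max(len(seq) for seq in batch))
    for seq in batch:
        kept = seq[:target]                # truncation
        n_pad = target - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)              # padding
        attention_mask.append([1] * len(kept) + [0] * n_pad)   # 0 = ignore
    return input_ids, attention_mask


batch = [[5, 6, 7, 8, 9, 10], [5, 6, 7]]
ids, mask = pad_and_truncate(batch, max_length=4)
print(ids)
print(mask)
```

The mask tells the attention layers which positions hold real tokens and which hold padding, so the padded sequence is treated like its unpadded original.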

	### 7. What is the point of applying a SoftMax function to the logits output by a sequence classification model?

	<Question
	choices={[
	{
	text: "It softens the logits so that they're more reliable.",
	explain: "No, the SoftMax function does not affect the reliability of results."
	},
	{
	text: "It applies a lower and upper bound so that they're understandable.",
	explain: "The resulting values are bounded between 0 and 1. That's not the only reason we use a SoftMax function, though.",
	correct: true
	},
	{
	text: "The total sum of the output is then 1, resulting in a possible probabilistic interpretation.",
	explain: "Correct! That's not the only reason we use a SoftMax function, though.",
	correct: true
	}
	]}
	/>
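
Both properties can be checked with a few lines of standard-library Python. This is a minimal sketch: the logits are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something the quiz prescribes.

```python
import math


def softmax(logits):
    """Map raw logits to values in (0, 1) that sum to 1."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.5, 2.0, 0.3]  # hypothetical model output for three classes
probs = softmax(logits)
print(probs, sum(probs))
```

The ordering of the values is preserved, so the class with the largest logit is still the class with the largest probability.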

	### 8. What method is most of the tokenizer API centered around?

	<Question
	choices={[
	{
	text: "<code>encode</code>, as it can encode text into IDs and IDs into predictions",
	explain: "Wrong! While the <code>encode</code> method does exist on tokenizers, it does not exist on models."
	},
	{
	text: "Calling the tokenizer object directly.",
	explain: "Exactly! The tokenizer's <code>__call__</code> method is very powerful and can handle pretty much anything. Calling a model object directly (its own <code>__call__</code> method) is likewise how you retrieve predictions from a model.",
	correct: true
	},
	{
	text: "<code>pad</code>",
	explain: "Wrong! Padding is very useful, but it's just one part of the tokenizer API."
	},
	{
	text: "<code>tokenize</code>",
	explain: "The <code>tokenize</code> method is arguably one of the most useful methods, but it isn't the core of the tokenizer API."
	}
	]}
	/>

	### 9. What does the `result` variable contain in this code sample?

	```py
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
	result = tokenizer.tokenize("Hello!")
	```

	<Question
	choices={[
	{
	text: "A list of strings, each string being a token",
	explain: "Absolutely! Convert this to IDs, and send them to a model!",
	correct: true
	},
	{
	text: "A list of IDs",
	explain: "Incorrect; that's what the <code>__call__</code> or <code>convert_tokens_to_ids</code> method is for!"
	},
	{
	text: "A string containing all of the tokens",
	explain: "This would be suboptimal, as the goal is to split the string into multiple tokens."
	}
	]}
	/>

	### 10. Is there something wrong with the following code?

	```py
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
	model = AutoModel.from_pretrained("gpt2")

	encoded = tokenizer("Hey!", return_tensors="pt")
	result = model(**encoded)
	```

	<Question
	choices={[
	{
	text: "No, it seems correct.",
	explain: "Unfortunately, pairing a model with a tokenizer from a different checkpoint is rarely a good idea. The model was not trained to make sense of this tokenizer's output, so the model output (if it can even run!) will not make any sense."
	},
	{
	text: "The tokenizer and model should always be from the same checkpoint.",
	explain: "Right!",
	correct: true
	},
	{
	text: "It's good practice to pad and truncate with the tokenizer as every input is a batch.",
	explain: "It's true that every model input needs to be a batch. However, truncating or padding this sequence wouldn't necessarily make sense, as there is only one sequence here, and those are techniques for batching together a list of sentences."
	}
	]}
	/>

