
Audio classification[[audio_classification]]

[[open-in-colab]]

์˜ค๋””์˜ค ๋ถ„๋ฅ˜๋Š” ํ…์ŠคํŠธ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์— ํด๋ž˜์Šค ๋ ˆ์ด๋ธ” ์ถœ๋ ฅ์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค. ์œ ์ผํ•œ ์ฐจ์ด์ ์€ ํ…์ŠคํŠธ ์ž…๋ ฅ ๋Œ€์‹  ์›์‹œ ์˜ค๋””์˜ค ํŒŒํ˜•์ด ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ค๋””์˜ค ๋ถ„๋ฅ˜์˜ ์‹ค์ œ ์ ์šฉ ๋ถ„์•ผ์—๋Š” ํ™”์ž์˜ ์˜๋„ ํŒŒ์•…, ์–ธ์–ด ๋ถ„๋ฅ˜, ์†Œ๋ฆฌ๋กœ ๋™๋ฌผ ์ข…์„ ์‹๋ณ„ํ•˜๋Š” ๊ฒƒ ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋ฌธ์„œ์—์„œ ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  1. MInDS-14 ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ Wav2Vec2๋กœ ๋ฏธ์„ธ ์กฐ์ •ํ•˜์—ฌ ํ™”์ž์˜ ์˜๋„๋ฅผ ๋ถ„๋ฅ˜ํ•ฉ๋‹ˆ๋‹ค.
  2. ์ถ”๋ก ์— ๋ฏธ์„ธ ์กฐ์ •๋œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์„ธ์š”.

์ด ์ž‘์—…๊ณผ ํ˜ธํ™˜๋˜๋Š” ๋ชจ๋“  ์•„ํ‚คํ…์ฒ˜์™€ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๋ณด๋ ค๋ฉด ์ž‘์—… ํŽ˜์ด์ง€๋ฅผ ํ™•์ธํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

๋ชจ๋ธ์„ ์—…๋กœ๋“œํ•˜๊ณ  ์ปค๋ฎค๋‹ˆํ‹ฐ์™€ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ—ˆ๊น…ํŽ˜์ด์Šค ๊ณ„์ •์— ๋กœ๊ทธ์ธํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค. ๋ฉ”์‹œ์ง€๊ฐ€ ํ‘œ์‹œ๋˜๋ฉด ํ† ํฐ์„ ์ž…๋ ฅํ•˜์—ฌ ๋กœ๊ทธ์ธํ•ฉ๋‹ˆ๋‹ค:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

MInDS-14 ๋ฐ์ดํ„ฐ์…‹ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ[[load_minds_14_dataset]]

Start by loading the MInDS-14 dataset from the ๐Ÿค— Datasets library:

>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ train ๋ถ„ํ• ์„ [~datasets.Dataset.train_test_split] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋” ์ž‘์€ ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธ ์ง‘ํ•ฉ์œผ๋กœ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ๋” ๋งŽ์€ ์‹œ๊ฐ„์„ ์†Œ๋น„ํ•˜๊ธฐ ์ „์— ๋ชจ๋“  ๊ฒƒ์ด ์ž‘๋™ํ•˜๋Š”์ง€ ์‹คํ—˜ํ•˜๊ณ  ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

>>> minds = minds.train_test_split(test_size=0.2)

์ด์ œ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์„ ์‚ดํŽด๋ณผ๊ฒŒ์š”:

>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})

๋ฐ์ดํ„ฐ ์„ธํŠธ์—๋Š” lang_id ๋ฐ english_transcription๊ณผ ๊ฐ™์€ ์œ ์šฉํ•œ ์ •๋ณด๊ฐ€ ๋งŽ์ด ํฌํ•จ๋˜์–ด ์žˆ์ง€๋งŒ ์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” audio ๋ฐ intent_class์— ์ค‘์ ์„ ๋‘˜ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์—ด์€ [~datasets.Dataset.remove_columns] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:

>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Take a look at an example:

>>> minds["train"][0]
{'audio': {'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00048828,
         -0.00024414, -0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 8000},
 'intent_class': 2}

๋‘ ๊ฐœ์˜ ํ•„๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • audio: ์˜ค๋””์˜ค ํŒŒ์ผ์„ ๊ฐ€์ ธ์˜ค๊ณ  ๋ฆฌ์ƒ˜ํ”Œ๋งํ•˜๊ธฐ ์œ„ํ•ด ํ˜ธ์ถœํ•ด์•ผ ํ•˜๋Š” ์Œ์„ฑ ์‹ ํ˜ธ์˜ 1์ฐจ์› ๋ฐฐ์—ด์ž…๋‹ˆ๋‹ค.
  • intent_class: ํ™”์ž์˜ ์˜๋„์— ๋Œ€ํ•œ ํด๋ž˜์Šค ID๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

๋ชจ๋ธ์ด ๋ ˆ์ด๋ธ” ID์—์„œ ๋ ˆ์ด๋ธ” ์ด๋ฆ„์„ ์‰ฝ๊ฒŒ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋„๋ก ๋ ˆ์ด๋ธ” ์ด๋ฆ„์„ ์ •์ˆ˜๋กœ ๋งคํ•‘ํ•˜๋Š” ์‚ฌ์ „์„ ๋งŒ๋“ค๊ฑฐ๋‚˜ ๊ทธ ๋ฐ˜๋Œ€๋กœ ๋งคํ•‘ํ•˜๋Š” ์‚ฌ์ „์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค:

>>> labels = minds["train"].features["intent_class"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label

์ด์ œ ๋ ˆ์ด๋ธ” ID๋ฅผ ๋ ˆ์ด๋ธ” ์ด๋ฆ„์œผ๋กœ ๋ณ€ํ™˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

>>> id2label[str(2)]
'app_error'
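
The two loops above can equivalently be written as dict comprehensions. A minimal self-contained sketch, using a hypothetical three-label subset of the MInDS-14 intents for illustration:

```python
# Hypothetical subset of the MInDS-14 intent names, for illustration only
labels = ["abroad", "address", "app_error"]

# Map label names to (string) IDs and back, as the loops above do
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

print(id2label["2"])  # -> app_error
```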

Preprocess[[preprocess]]

๋‹ค์Œ ๋‹จ๊ณ„๋Š” ์˜ค๋””์˜ค ์‹ ํ˜ธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด Wav2Vec2 ํŠน์ง• ์ถ”์ถœ๊ธฐ๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

MinDS-14 ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ƒ˜ํ”Œ๋ง ์†๋„๋Š” 8khz์ด๋ฏ€๋กœ(์ด ์ •๋ณด๋Š” ๋ฐ์ดํ„ฐ์„ธํŠธ ์นด๋“œ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค), ์‚ฌ์ „ ํ›ˆ๋ จ๋œ Wav2Vec2 ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ 16kHz๋กœ ๋ฆฌ์ƒ˜ํ”Œ๋งํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([ 2.2098757e-05,  4.6582241e-05, -2.2803260e-05, ...,
         -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 16000},
 'intent_class': 2}
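
A quick sanity check after resampling: going from 8kHz to 16kHz doubles the number of samples, but the clip duration in seconds is unchanged. A framework-free sketch with a hypothetical clip:

```python
import numpy as np

# Hypothetical 2-second clip at the original 8kHz rate
original_rate = 8_000
target_rate = 16_000
clip_8k = np.zeros(original_rate * 2, dtype=np.float32)

# Resampling to 16kHz doubles the number of samples...
num_samples_16k = len(clip_8k) * target_rate // original_rate

# ...but the duration in seconds stays the same
assert num_samples_16k / target_rate == len(clip_8k) / original_rate == 2.0
```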

Now create a preprocessing function that:

  1. Calls the audio column to load, and if necessary, resample the audio file.
  2. Checks if the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 model card.
  3. Sets a maximum input length so longer inputs are batched without being truncated.

>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
...     )
...     return inputs
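
With max_length=16000 and truncation=True, any clip longer than one second at 16kHz is cut to exactly 16,000 samples, while shorter clips pass through untouched. The truncation step can be sketched in plain NumPy (an illustration, not the feature extractor's actual implementation):

```python
import numpy as np

def truncate(array, max_length=16000):
    # Keep at most max_length samples, mirroring truncation=True above
    return array[:max_length]

long_clip = np.zeros(48_000, dtype=np.float32)   # hypothetical 3-second clip
short_clip = np.zeros(8_000, dtype=np.float32)   # hypothetical 0.5-second clip

print(len(truncate(long_clip)))   # -> 16000
print(len(truncate(short_clip)))  # -> 8000 (shorter clips are left as-is)
```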

์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ์— ์ „์ฒ˜๋ฆฌ ๊ธฐ๋Šฅ์„ ์ ์šฉํ•˜๋ ค๋ฉด ๐Ÿค— Datasets [~datasets.Dataset.map] ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. batched=True๋ฅผ ์„ค์ •ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ์˜ ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋ฉด map์˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•„์š”ํ•˜์ง€ ์•Š์€ ์—ด์„ ์ œ๊ฑฐํ•˜๊ณ  intent_class์˜ ์ด๋ฆ„์„ ๋ชจ๋ธ์ด ์˜ˆ์ƒํ•˜๋Š” ์ด๋ฆ„์ธ label๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค:

>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
>>> encoded_minds = encoded_minds.rename_column("intent_class", "label")

Evaluate[[evaluate]]

ํ›ˆ๋ จ ์ค‘์— ๋ฉ”ํŠธ๋ฆญ์„ ํฌํ•จํ•˜๋ฉด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ๐Ÿค— Evaluate ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•์„ ๋น ๋ฅด๊ฒŒ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์ž‘์—…์—์„œ๋Š” accuracy(์ •ํ™•๋„) ๋ฉ”ํŠธ๋ฆญ์„ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค(๋ฉ”ํŠธ๋ฆญ์„ ๊ฐ€์ ธ์˜ค๊ณ  ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ ๐Ÿค— Evalutate ๋น ๋ฅธ ๋‘˜๋Ÿฌ๋ณด๊ธฐ ์ฐธ์กฐํ•˜์„ธ์š”):

>>> import evaluate

>>> accuracy = evaluate.load("accuracy")

๊ทธ๋Ÿฐ ๋‹ค์Œ ์˜ˆ์ธก๊ณผ ๋ ˆ์ด๋ธ”์„ [~evaluate.EvaluationModule.compute]์— ์ „๋‹ฌํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค:

>>> import numpy as np


>>> def compute_metrics(eval_pred):
...     predictions = np.argmax(eval_pred.predictions, axis=1)
...     return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
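
Under the hood, accuracy here is just the fraction of argmax predictions that match the references. A self-contained NumPy sketch of what compute_metrics computes, without the evaluate dependency:

```python
import numpy as np

def accuracy_from_logits(logits, label_ids):
    # argmax over the class axis, then the fraction of matches
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == label_ids).mean())}

# Hypothetical logits for 4 examples over 3 classes
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.1, 0.2, 3.0],
                   [1.0, 0.9, 0.8]])
label_ids = np.array([0, 1, 2, 1])

print(accuracy_from_logits(logits, label_ids))  # -> {'accuracy': 0.75}
```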

Your compute_metrics function is ready to go now, and you will return to it when you set up your training.

Train[[train]]

[Trainer]๋กœ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ•˜๋Š” ๋ฐ ์ต์ˆ™ํ•˜์ง€ ์•Š๋‹ค๋ฉด ๊ธฐ๋ณธ ํŠœํ† ๋ฆฌ์–ผ ์—ฌ๊ธฐ์„ ์‚ดํŽด๋ณด์„ธ์š”!

์ด์ œ ๋ชจ๋ธ ํ›ˆ๋ จ์„ ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! [AutoModelForAudioClassification]์„ ์ด์šฉํ•ด์„œ Wav2Vec2๋ฅผ ๋ถˆ๋Ÿฌ์˜ต๋‹ˆ๋‹ค. ์˜ˆ์ƒ๋˜๋Š” ๋ ˆ์ด๋ธ” ์ˆ˜์™€ ๋ ˆ์ด๋ธ” ๋งคํ•‘์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค:

>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

>>> num_labels = len(id2label)
>>> model = AutoModelForAudioClassification.from_pretrained(
...     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
... )

At this point, only three steps remain:

  1. Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the accuracy and save the training checkpoint.
  2. Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  3. Call [~Trainer.train] to finetune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_mind_model",
...     eval_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=3e-5,
...     per_device_train_batch_size=32,
...     gradient_accumulation_steps=4,
...     per_device_eval_batch_size=32,
...     num_train_epochs=10,
...     warmup_ratio=0.1,
...     logging_steps=10,
...     load_best_model_at_end=True,
...     metric_for_best_model="accuracy",
...     push_to_hub=True,
... )
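
Note that gradient_accumulation_steps multiplies the effective batch size: with the values above, gradients are accumulated over 4 steps of 32 examples before each optimizer update. A quick arithmetic sketch:

```python
per_device_train_batch_size = 32
gradient_accumulation_steps = 4

# Effective train batch size per device, per optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # -> 128
```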

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     processing_class=feature_extractor,
...     compute_metrics=compute_metrics,
... )

>>> trainer.train()

ํ›ˆ๋ จ์ด ์™„๋ฃŒ๋˜๋ฉด ๋ชจ๋“  ์‚ฌ๋žŒ์ด ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก [~transformers.Trainer.push_to_hub] ๋ฉ”์†Œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ—ˆ๋ธŒ์— ๊ณต์œ ํ•˜์„ธ์š”:

>>> trainer.push_to_hub()

For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding PyTorch notebook.

Inference[[inference]]

์ด์ œ ๋ชจ๋ธ์„ ๋ฏธ์„ธ ์กฐ์ •ํ–ˆ์œผ๋‹ˆ ์ถ”๋ก ์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

์ถ”๋ก ์„ ์‹คํ–‰ํ•  ์˜ค๋””์˜ค ํŒŒ์ผ์„ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ํ•„์š”ํ•œ ๊ฒฝ์šฐ ์˜ค๋””์˜ค ํŒŒ์ผ์˜ ์ƒ˜ํ”Œ๋ง ์†๋„๋ฅผ ๋ชจ๋ธ์˜ ์ƒ˜ํ”Œ๋ง ์†๋„์™€ ์ผ์น˜ํ•˜๋„๋ก ๋ฆฌ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๊ฒƒ์„ ์žŠ์ง€ ๋งˆ์„ธ์š”!

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> audio_file = dataset[0]["audio"]["path"]

์ถ”๋ก ์„ ์œ„ํ•ด ๋ฏธ์„ธ ์กฐ์ •ํ•œ ๋ชจ๋ธ์„ ์‹œํ—˜ํ•ด ๋ณด๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ [pipeline]์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์˜ค๋””์˜ค ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ pipeline์„ ์ธ์Šคํ„ด์Šคํ™”ํ•˜๊ณ  ์˜ค๋””์˜ค ํŒŒ์ผ์„ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค:

>>> from transformers import pipeline

>>> classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model")
>>> classifier(audio_file)
[
    {'score': 0.09766869246959686, 'label': 'cash_deposit'},
    {'score': 0.07998877018690109, 'label': 'app_error'},
    {'score': 0.0781070664525032, 'label': 'joint_account'},
    {'score': 0.07667109370231628, 'label': 'pay_bill'},
    {'score': 0.0755252093076706, 'label': 'balance'}
]
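
The pipeline returns the top labels sorted by score (the scores are probabilities over all 14 intents, which is why even the best one is small). Picking out the best label from such a result is a one-liner; a sketch with the scores above hard-coded:

```python
# Output copied from the pipeline call above
results = [
    {"score": 0.09766869246959686, "label": "cash_deposit"},
    {"score": 0.07998877018690109, "label": "app_error"},
    {"score": 0.0781070664525032, "label": "joint_account"},
    {"score": 0.07667109370231628, "label": "pay_bill"},
    {"score": 0.0755252093076706, "label": "balance"},
]

best = max(results, key=lambda r: r["score"])
print(best["label"])  # -> cash_deposit
```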

์›ํ•˜๋Š” ๊ฒฝ์šฐ pipeline์˜ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜๋™์œผ๋กœ ๋ณต์ œํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:

ํŠน์ง• ์ถ”์ถœ๊ธฐ๋ฅผ ๊ฐ€์ ธ์™€์„œ ์˜ค๋””์˜ค ํŒŒ์ผ์„ ์ „์ฒ˜๋ฆฌํ•˜๊ณ  `์ž…๋ ฅ`์„ PyTorch ํ…์„œ๋กœ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("stevhliu/my_awesome_minds_model")
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

๋ชจ๋ธ์— ์ž…๋ ฅ์„ ์ „๋‹ฌํ•˜๊ณ  ๋กœ์ง“์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

>>> from transformers import AutoModelForAudioClassification

>>> model = AutoModelForAudioClassification.from_pretrained("stevhliu/my_awesome_minds_model")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

ํ™•๋ฅ ์ด ๊ฐ€์žฅ ๋†’์€ ํด๋ž˜์Šค๋ฅผ ๊ฐ€์ ธ์˜จ ๋‹ค์Œ ๋ชจ๋ธ์˜ id2label ๋งคํ•‘์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฅผ ๋ ˆ์ด๋ธ”๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

>>> import torch

>>> predicted_class_ids = torch.argmax(logits).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
'cash_deposit'
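
The final argmax step is framework-agnostic: taking the largest logit picks the same class as taking the largest softmax probability. A NumPy sketch with hypothetical logits:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([0.3, 1.1, 2.4])  # hypothetical 3-class logits
probs = softmax(logits)

# argmax over logits and over probabilities always agrees
assert int(np.argmax(logits)) == int(np.argmax(probs)) == 2
```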