Processors[[processors]]

In the Transformers library, "processor" can mean two different things:

  • objects that preprocess inputs for multi-modal models such as Wav2Vec2 (speech and text) or CLIP (text and vision)
  • deprecated objects that older versions of the library used to preprocess data for GLUE or SQuAD

Multi-modal processors[[transformers.ProcessorMixin]]

Any multi-modal model requires an object to encode or decode data that groups several modalities (among text, vision, and audio). This is handled by objects called processors, which group together two or more processing objects such as tokenizers (for the text modality), image processors (for vision), and feature extractors (for audio).

These processors inherit from the following base class, which implements the saving and loading functionality:

[[autodoc]] ProcessorMixin
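To make the grouping idea concrete, here is a minimal, library-independent sketch of a processor that bundles a text component with an audio component and saves or reloads both together. `ToyTokenizer`, `ToyFeatureExtractor`, `ToyProcessor`, and the `toy_processor.json` file name are hypothetical names invented for this illustration. They are not part of Transformers and do not reflect how `ProcessorMixin` is actually implemented.

```python
import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Hypothetical text component: maps whitespace-separated words to ids."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)

    def __call__(self, text):
        # Words missing from the vocabulary fall back to id 0.
        return [self.vocab.get(word, 0) for word in text.lower().split()]


class ToyFeatureExtractor:
    """Hypothetical audio component: rescales raw samples by a fixed factor."""

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, samples):
        return [sample * self.scale for sample in samples]


class ToyProcessor:
    """Bundles both components and saves or reloads them as one unit."""

    def __init__(self, tokenizer, feature_extractor):
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def __call__(self, text=None, audio=None):
        # Route each modality to the component that knows how to handle it.
        outputs = {}
        if text is not None:
            outputs["input_ids"] = self.tokenizer(text)
        if audio is not None:
            outputs["input_values"] = self.feature_extractor(audio)
        return outputs

    def save(self, directory):
        config = {"vocab": self.tokenizer.vocab, "scale": self.feature_extractor.scale}
        Path(directory, "toy_processor.json").write_text(json.dumps(config))

    @classmethod
    def load(cls, directory):
        config = json.loads(Path(directory, "toy_processor.json").read_text())
        return cls(ToyTokenizer(config["vocab"]), ToyFeatureExtractor(config["scale"]))


processor = ToyProcessor(ToyTokenizer({"hello": 1, "world": 2}), ToyFeatureExtractor(0.5))
with tempfile.TemporaryDirectory() as directory:
    processor.save(directory)
    reloaded = ToyProcessor.load(directory)

batch = reloaded(text="Hello world", audio=[2.0, 4.0])
```

The point of the pattern is that callers deal with a single object for preprocessing and persistence, while each modality keeps its own specialized component.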

Deprecated processors[[transformers.DataProcessor]]

All processors follow the same architecture, which is that of the [~data.processors.utils.DataProcessor]. The processor returns a list of [~data.processors.utils.InputExample]. These [~data.processors.utils.InputExample] can then be converted to [~data.processors.utils.InputFeatures] to be fed to the model.

[[autodoc]] data.processors.utils.DataProcessor

[[autodoc]] data.processors.utils.InputExample

[[autodoc]] data.processors.utils.InputFeatures
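The first half of that pipeline, reading raw rows into example objects, can be sketched without the library. `ToyInputExample` and `ToyPairProcessor` are hypothetical stand-ins written for this illustration. The field names `guid`, `text_a`, `text_b`, and `label` mirror those of `InputExample`, but the file layout and the `get_examples` signature are assumptions.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyInputExample:
    """Simplified stand-in for an input example (one row of a dataset)."""

    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None


class ToyPairProcessor:
    """Hypothetical processor for a tab-separated sentence-pair file."""

    def get_labels(self):
        return ["0", "1"]

    def get_examples(self, handle, set_type):
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # Skip the header row.
        examples = []
        for index, (label, text_a, text_b) in enumerate(reader):
            guid = f"{set_type}-{index}"
            examples.append(ToyInputExample(guid, text_a, text_b, label))
        return examples


# An in-memory file keeps the sketch self-contained.
tsv = (
    "label\tsentence1\tsentence2\n"
    "1\tA cat sleeps.\tA cat is asleep.\n"
    "0\tIt rains.\tThe sun shines.\n"
)
examples = ToyPairProcessor().get_examples(io.StringIO(tsv), "dev")
```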

GLUE[[transformers.glue_convert_examples_to_features]]

General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding.

This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE, and WNLI.

Those processors are:

  • [~data.processors.utils.MrpcProcessor]
  • [~data.processors.utils.MnliProcessor]
  • [~data.processors.utils.MnliMismatchedProcessor]
  • [~data.processors.utils.ColaProcessor]
  • [~data.processors.utils.Sst2Processor]
  • [~data.processors.utils.StsbProcessor]
  • [~data.processors.utils.QqpProcessor]
  • [~data.processors.utils.QnliProcessor]
  • [~data.processors.utils.RteProcessor]
  • [~data.processors.utils.WnliProcessor]

Additionally, the following method can be used to take values loaded from a data file as a list of [~data.processors.utils.InputExample] and convert them to [~data.processors.utils.InputFeatures] that can be fed to a model.

[[autodoc]] data.processors.glue.glue_convert_examples_to_features
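The kind of work such a conversion step performs (mapping labels to integers, tokenizing, truncating, and padding to a fixed length) can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: `ToyInputFeatures`, `toy_convert_examples_to_features`, the whitespace tokenizer, and the tiny vocabulary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyInputFeatures:
    """Simplified stand-in for model-ready features."""

    input_ids: List[int]
    attention_mask: List[int]
    label: int


def toy_convert_examples_to_features(
    examples: List[Tuple[str, str]],
    vocab: Dict[str, int],
    label_list: List[str],
    max_length: int,
) -> List[ToyInputFeatures]:
    pad_id, unk_id = vocab["<pad>"], vocab["<unk>"]
    # Labels become integer class indices, in the order given by label_list.
    label_map = {label: index for index, label in enumerate(label_list)}
    features = []
    for text, label in examples:
        # Whitespace tokenization, then truncation to max_length tokens.
        ids = [vocab.get(word, unk_id) for word in text.lower().split()][:max_length]
        padding = max_length - len(ids)
        features.append(
            ToyInputFeatures(
                input_ids=ids + [pad_id] * padding,
                # 1 marks real tokens, 0 marks padding.
                attention_mask=[1] * len(ids) + [0] * padding,
                label=label_map[label],
            )
        )
    return features


vocab = {"<pad>": 0, "<unk>": 1, "good": 2, "movie": 3, "bad": 4}
features = toy_convert_examples_to_features(
    examples=[("Good movie", "positive"), ("Bad boring movie indeed", "negative")],
    vocab=vocab,
    label_list=["negative", "positive"],
    max_length=3,
)
```

Padding every sequence to the same length is what lets the resulting features be stacked into a batch.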

XNLI[[xnli]]

The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is a crowd-sourced dataset based on MultiNLI: pairs of text are labeled with textual entailment annotations for 15 different languages (ranging from high-resource languages such as English to low-resource languages such as Swahili).

It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations.

This library hosts the processor to load the XNLI data:

  • [~data.processors.utils.XnliProcessor]

Because gold labels are available for the test set, evaluation is performed on the test set.

An example using these processors is given in the run_xnli.py script.

SQuAD[[squad]]

The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Don't Know: Unanswerable Questions for SQuAD.

This library hosts a processor for each of the two versions:

Processors[[transformers.data.processors.squad.SquadProcessor]]

Those processors are:

  • [~data.processors.utils.SquadV1Processor]
  • [~data.processors.utils.SquadV2Processor]

Both inherit from the abstract class [~data.processors.utils.SquadProcessor].

[[autodoc]] data.processors.squad.SquadProcessor - all

Additionally, the following method can be used to convert SQuAD examples into [~data.processors.utils.SquadFeatures] that can be used as model inputs.

[[autodoc]] data.processors.squad.squad_convert_examples_to_features
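One idea behind a parameter such as `doc_stride` is that a context longer than the model's input limit is split into overlapping windows, so an answer near a window boundary still appears whole in at least one window. The sketch below illustrates only that windowing idea on a plain token list. `sliding_windows` and `max_tokens` are hypothetical names, and the library's real conversion additionally handles tokenization, special tokens, the question, and answer alignment.

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    tokens: Sequence[str], max_tokens: int, doc_stride: int
) -> List[Tuple[int, List[str]]]:
    """Split tokens into overlapping windows of at most max_tokens tokens.

    Each window starts doc_stride tokens after the previous one. Returns
    (start_offset, window_tokens) pairs.
    """
    if max_tokens <= 0 or doc_stride <= 0:
        raise ValueError("max_tokens and doc_stride must be positive")
    windows = []
    start = 0
    while start < len(tokens):
        length = min(len(tokens) - start, max_tokens)
        windows.append((start, list(tokens[start : start + length])))
        if start + length == len(tokens):
            break  # The last window reached the end of the document.
        start += min(length, doc_stride)
    return windows


context = "the quick brown fox jumps over the lazy dog today".split()
windows = sliding_windows(context, max_tokens=4, doc_stride=2)
starts = [start for start, _ in windows]
```

With a stride smaller than the window size, consecutive windows share tokens. A larger stride yields fewer windows at the cost of less overlap.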

These processors, as well as the aforementioned method, can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.

Example usage[[example-usage]]

Here is an example that uses the processors and the conversion method with data files:

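from transformers.data.processors.squad import (
    SquadV1Processor,
    SquadV2Processor,
    squad_convert_examples_to_features,
)

# tokenizer, args, max_seq_length, max_query_length, evaluate, and the data
# directories are assumed to be defined by the surrounding script.

# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)

# Loading a V1 processor (use either this block or the V2 block above)
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)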

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)

Using tensorflow_datasets is as easy as using a data file:

import tensorflow_datasets as tfds

# tensorflow_datasets only handles SQuAD V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=max_query_length,
    is_training=not evaluate,
)

Another example using these processors is given in the run_squad.py script.