
CodeGen[[Codegen]]

PyTorch

Overview[[Overview]]

The CodeGen model was proposed in the paper A Conversational Paradigm for Program Synthesis by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.

CodeGen is an autoregressive language model for program synthesis, trained sequentially on The Pile, BigQuery, and BigPython.

The abstract from the paper is the following:

Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI's Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: this https URL.

This model was contributed by Hiroaki Hayashi. The original code can be found here.

Checkpoint Naming[[Checkpoint Naming]]

  • CodeGen model checkpoints are available for different pretraining data and in several sizes.
  • The checkpoint name has the following format: Salesforce/codegen-{size}-{data}
    • size: 350M, 2B, 6B, 16B
    • data:
      • nl: pretrained on The Pile
      • multi: initialized from the nl model, then further trained on data in multiple programming languages
      • mono: initialized from the multi model, then further trained on Python data
  • For example, Salesforce/codegen-350M-mono is a checkpoint with 350 million (350M) parameters, trained in stages on The Pile, multiple programming languages, and then Python data.
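
The naming rule above can be expressed as a small helper. This is only a minimal sketch: the function names `checkpoint_name` and `parse_checkpoint` are hypothetical and not part of transformers, and the helper checks the naming pattern only, not whether a given checkpoint actually exists on the Hub.

```python
import re
from typing import List, Tuple

# Sizes and data tags taken from the naming rule described above.
SIZES = ("350M", "2B", "6B", "16B")

# Each data tag maps to the corpora the checkpoint has seen, in training order.
DATA_STAGES = {
    "nl": ["The Pile"],
    "multi": ["The Pile", "multiple programming languages"],
    "mono": ["The Pile", "multiple programming languages", "Python"],
}

_PATTERN = re.compile(r"^Salesforce/codegen-(?P<size>[^-]+)-(?P<data>[^-]+)$")


def checkpoint_name(size: str, data: str) -> str:
    """Build a name of the form Salesforce/codegen-{size}-{data} (hypothetical helper)."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if data not in DATA_STAGES:
        raise ValueError(f"unknown data tag {data!r}; expected one of {tuple(DATA_STAGES)}")
    return f"Salesforce/codegen-{size}-{data}"


def parse_checkpoint(name: str) -> Tuple[str, List[str]]:
    """Split a checkpoint name into its size and its ordered training stages."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow Salesforce/codegen-{{size}}-{{data}}")
    size, data = match.group("size"), match.group("data")
    if size not in SIZES or data not in DATA_STAGES:
        raise ValueError(f"{name!r} uses an unknown size or data tag")
    return size, list(DATA_STAGES[data])


if __name__ == "__main__":
    name = checkpoint_name("350M", "mono")
    print(name)
    print(parse_checkpoint(name))
```

Keeping the stages as an ordered list makes the sequential nature of training explicit: a mono checkpoint includes everything an nl and a multi checkpoint were trained on.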

Usage example[[Usage example]]

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> checkpoint = "Salesforce/codegen-350M-mono"
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)

>>> text = "def hello_world():"

>>> completion = model.generate(**tokenizer(text, return_tensors="pt"))

>>> print(tokenizer.decode(completion[0]))
def hello_world():
    print("Hello World")

hello_world()

Resources[[Resources]]

CodeGenConfig

[[autodoc]] CodeGenConfig - all

CodeGenTokenizer

[[autodoc]] CodeGenTokenizer - create_token_type_ids_from_sequences - save_vocabulary

CodeGenTokenizerFast

[[autodoc]] CodeGenTokenizerFast

CodeGenModel

[[autodoc]] CodeGenModel - forward

CodeGenForCausalLM

[[autodoc]] CodeGenForCausalLM - forward