ESM [[esm]]

Overview [[overview]]

This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research team. It covers the state-of-the-art ESMFold and ESM-2, as well as the previously released ESM-1b and ESM-1v. Transformer protein language models were introduced in the paper Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. The first version of this paper was released as a preprint in 2019.

ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks, and enables atomic resolution structure prediction. It was released with the paper Language models of protein sequences at the scale of evolution enable accurate structure prediction by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, and Alexander Rives.

ESMFold was introduced in the same paper. It uses an ESM-2 stem with a head that can predict folded protein structures with state-of-the-art accuracy. Unlike AlphaFold2, it relies on the token embeddings from the large pre-trained protein language model stem and does not perform a multiple sequence alignment (MSA) step at inference time. This means that ESMFold checkpoints are fully "standalone": they do not require a database of known protein sequences and structures, or the associated external query tools, to make predictions. As a result, they are also much faster.
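As a rough illustration of what "standalone" means in practice, the sketch below folds a single sequence using only a checkpoint and its tokenizer. The checkpoint name `facebook/esmfold_v1` and the helper name `fold_sequence` are assumptions made for this example, not something stated on this page. The imports are placed inside the function so that defining it does not require `torch` or `transformers`, and does not trigger the large checkpoint download.

```python
def fold_sequence(sequence, checkpoint="facebook/esmfold_v1"):
    """Predict atom coordinates for one protein sequence with ESMFold.

    `checkpoint` is an assumed Hub identifier; swap in whichever ESMFold
    checkpoint you actually use. Calling this function downloads the
    weights, which are several gigabytes in size.
    """
    # Lazy imports: only needed when the function is actually called.
    import torch
    from transformers import AutoTokenizer, EsmForProteinFolding

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EsmForProteinFolding.from_pretrained(checkpoint)
    model.eval()

    # Tokenize the raw residues only, without <cls>/<eos> special tokens.
    inputs = tokenizer([sequence], return_tensors="pt", add_special_tokens=False)

    # Inference only: no gradients, no MSA, no external sequence database.
    with torch.no_grad():
        outputs = model(**inputs)

    # Predicted atom coordinates from the folding head.
    return outputs.positions
```

Calling `fold_sequence("MLKNVQVQLV")` would return the predicted coordinates as a tensor. Note that the only inputs are the sequence itself and the checkpoint; nothing is looked up in a sequence or structure database.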

The abstract from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" is:

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.

The abstract from "Language models of protein sequences at the scale of evolution enable accurate structure prediction" is:

Large language models have recently been shown to develop emergent capabilities with scale, going beyond simple pattern matching to perform higher level reasoning and generate lifelike images and text. While language models trained on protein sequences have been studied at a smaller scale, little is known about what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters, the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn information enabling the prediction of the three-dimensional structure of a protein at the resolution of individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for sequences with low perplexity that are well understood by the language model. ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.

The original code can be found here and was developed by the Fundamental AI Research team at Meta AI. ESM-1b, ESM-1v, and ESM-2 were contributed to HuggingFace by jasonliu and Matt.

ESMFold was contributed to HuggingFace by Matt and Sylvain, with a big thank you to Nikita Smetanin, Roshan Rao, and Tom Sercu for their help throughout the process!

Usage tips [[usage-tips]]

  • ESM models are trained with a masked language modeling (MLM) objective.
  • The HuggingFace port of ESMFold uses portions of the openfold library. The openfold library is licensed under the Apache License 2.0.
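To make the MLM objective mentioned above concrete, here is a minimal, library-free sketch of the input corruption step on a protein sequence. It is a simplification under stated assumptions: the helper `mask_sequence` and the 15% masking rate are illustrative choices, each residue is treated as one token, and every selected position is replaced with `<mask>` (BERT-style recipes sometimes keep or randomize the selected token instead).

```python
import random

# Assumed mask token string for illustration; check your tokenizer's mask_token.
MASK_TOKEN = "<mask>"


def mask_sequence(sequence, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for one protein sequence.

    labels[i] holds the original residue where position i was masked and
    None elsewhere, so a loss would only be computed at masked positions.
    """
    if not sequence:
        raise ValueError("sequence must contain at least one residue")

    rng = random.Random(seed)
    tokens = list(sequence)  # one token per amino-acid letter

    # Mask at least one residue so every sequence has a prediction target.
    n_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))

    corrupted = [
        MASK_TOKEN if i in masked_positions else tok
        for i, tok in enumerate(tokens)
    ]
    labels = [
        tok if i in masked_positions else None
        for i, tok in enumerate(tokens)
    ]
    return corrupted, labels


example = "MKTAYIAKQRQISFVKSHFSRQ"
corrupted, labels = mask_sequence(example, mask_prob=0.15, seed=0)
print(" ".join(corrupted))
```

During training, the model sees the corrupted tokens and is asked to recover the residues stored in `labels`; positions whose label is `None` do not contribute to the loss.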

Resources [[resources]]

EsmConfig [[transformers.EsmConfig]]

[[autodoc]] EsmConfig - all

EsmTokenizer [[transformers.EsmTokenizer]]

[[autodoc]] EsmTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

EsmModel [[transformers.EsmModel]]

[[autodoc]] EsmModel - forward

EsmForMaskedLM [[transformers.EsmForMaskedLM]]

[[autodoc]] EsmForMaskedLM - forward

EsmForSequenceClassification [[transformers.EsmForSequenceClassification]]

[[autodoc]] EsmForSequenceClassification - forward

EsmForTokenClassification [[transformers.EsmForTokenClassification]]

[[autodoc]] EsmForTokenClassification - forward

EsmForProteinFolding [[transformers.EsmForProteinFolding]]

[[autodoc]] EsmForProteinFolding - forward

TFEsmModel [[transformers.TFEsmModel]]

[[autodoc]] TFEsmModel - call

TFEsmForMaskedLM [[transformers.TFEsmForMaskedLM]]

[[autodoc]] TFEsmForMaskedLM - call

TFEsmForSequenceClassification [[transformers.TFEsmForSequenceClassification]]

[[autodoc]] TFEsmForSequenceClassification - call

TFEsmForTokenClassification [[transformers.TFEsmForTokenClassification]]

[[autodoc]] TFEsmForTokenClassification - call