---
title: Transformer Demo
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
python_version: "3.10"
app_file: app.py
pinned: false
license: mit
---

# Transformer: Paper Reproduction Demo

**Paper**: [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., NIPS 2017)

> This Space reproduces from scratch the Transformer paper, which drops both RNNs and CNNs
> and builds its encoder-decoder **from attention alone**, and lets you try the trained model yourself.

---

## What can you do?

Enter a sequence of digits and **the Transformer reverses it**.

```
Input  : 1 2 3 4 5
Output : 5 4 3 2 1
```

More interesting still, the demo visualizes the decoder's **cross-attention weights**, so you can see directly which input positions the model looked at when it produced the i-th output position. On the reversal task, a clear **anti-diagonal pattern** appears.

---

## Why digit reversal instead of translation?

The paper validated the model on English→German translation, but that takes 12 hours of training on 8× P100 GPUs. That is not feasible on a free Space, so we picked a **toy task whose training finishes within 30 seconds at boot**.

Advantages of digit reversal:
- Small vocabulary (0~9 + special tokens = 13 tokens)
- Input and output have the same length, and the correct answer is unambiguous
- Forces **long-range dependencies**: the first output token must attend to the last input token
- Striking visualization (anti-diagonal pattern)

---

## Project structure

```
├── app.py            # Gradio demo (training + inference + visualization)
├── transformer.py    # The Transformer itself, reproduced as described in the paper
├── requirements.txt  # Package list
└── README.md         # This file
```

---

## Model configuration

This demo is a scaled-down version of the paper's base model, with `d_model` at **1/8** of the original (64 vs. 512). The architecture is the same; only the sizes are reduced.

| Item | Paper base | This demo |
|------|------------|-----------|
| d_model | 512 | **64** |
| Layers N | 6 | **2** |
| Heads h | 8 | **4** |
| d_ff | 2048 | **128** |
| Vocabulary size | 37K (BPE) | **13** |
| Parameters | 65M | **~80K** |

---

## Training setup

```python
import torch

# Assumes `model` and the `PAD` token id are defined elsewhere (e.g. in app.py).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98), eps=1e-9)  # betas, eps: paper §5.3
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD, label_smoothing=0.1)
steps = 2000       # training steps
batch_size = 128   # sequences per step
```

- Every step generates fresh random digit sequences of length 3~10 (saves memory)
- Gradient clipping = 1.0
- Inference uses greedy decoding

Training runs automatically at boot, and the trained model is cached as `model.pt`.

---

## Mapping key parts of the paper to code

| Paper location | Code location |
|----------------|---------------|
| Eq. (1) `softmax(QKᵀ/√d_k)V` | `transformer.py :: scaled_dot_product_attention` |
| §3.2.2 Multi-Head | `MultiHeadAttention` |
| §3.5 Positional Encoding | `PositionalEncoding` |
| Eq. (2) FFN | `FeedForward` |
| §3.1 Encoder layer | `EncoderLayer` (Post-LN) |
| §3.1 Decoder layer | `DecoderLayer` (Post-LN) |
| §3.4 Embedding × √d_model | Inside `Transformer.encode` |

---

## How should I read it? (Interpreting the visualization)

**Cross-attention heatmap**:
- Horizontal axis: encoder positions (input tokens, with the start of the sequence on the left)
- Vertical axis: decoder positions (output tokens, with the earliest generated at the top)
- Brighter color means stronger attention

On the reversal task, a well-trained model behaves like this:

```
Output position 0 (right after BOS, the first output token) → attends to the last input token
Output position 1                                           → attends to the second-to-last input token
...
```

So training has succeeded if, instead of the **top-left → bottom-right diagonal**, you see its mirror image: an **anti-diagonal running from top-right to bottom-left**.

---

## Notes on deploying to Hugging Face Spaces

The problems encountered when deploying the ResNet demo can occur here as well:

### 1. YAML front matter is required
Without the `--- ... ---` block at the very top of this README.md, the Space will not build.

### 2. `colorFrom`/`colorTo` accept only 8 fixed colors
Allowed colors: `red, yellow, green, blue, indigo, purple, pink, gray`

### 3. Avoid Python 3.13
The `audioop` standard-library module was removed in 3.13, which makes some packages fail to build. **3.10** is recommended.

### 4. PyTorch CPU build
Free Spaces provide CPU only by default. If installing `torch` pulls in the CUDA build, it can exceed the disk quota, so specify `torch --index-url https://download.pytorch.org/whl/cpu` if needed.

---

## Running locally

```bash
# 1) Install dependencies
pip install -r requirements.txt

# 2) Run the demo (trains automatically on first run)
python app.py
```

By default it opens at `http://127.0.0.1:7860`.

---

## If training does not go well

Checklist:
- [ ] Is the PyTorch version 2.0 or later?
- [ ] Does training run for at least 2000 steps? (Check the console for logs at step 200, 400, ...)
- [ ] Is `token_acc` at 0.95 or higher by around step 1000?
- [ ] If the output always repeats the same token, the model has barely trained. Increase the number of steps or adjust the learning rate.
- [ ] If cross-attention is uniform, the model needs more training.

---

## References

```bibtex
@inproceedings{vaswani2017attention,
  title     = {Attention Is All You Need},
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
               Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
               Kaiser, {\L}ukasz and Polosukhin, Illia},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2017}
}
```

- 📄 Paper: [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)
- 📝 The Annotated Transformer:
- 🎥 The Illustrated Transformer (Jay Alammar):
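
---

## Appendix: sketch of the reversal data

The training setup above generates fresh random digit sequences of length 3~10 at every step. The following is a minimal sketch of what such a generator could look like, not the code in `app.py`. The token ids (`PAD=0`, `BOS=1`, `EOS=2`, digits shifted by 3), the presence of an `EOS` token, and the helper names `make_example` and `make_batch` are all assumptions made for illustration; this README only states that the vocabulary has 13 tokens.

```python
import random

# Hypothetical token ids: the README only says 10 digits + special tokens = 13.
PAD, BOS, EOS = 0, 1, 2
DIGIT_OFFSET = 3   # digit d is encoded as d + 3, giving ids 3..12
VOCAB_SIZE = 13

def make_example(rng, min_len=3, max_len=10):
    """Return (src, tgt) token-id lists for one random reversal example."""
    length = rng.randint(min_len, max_len)
    digits = [rng.randint(0, 9) for _ in range(length)]
    src = [d + DIGIT_OFFSET for d in digits]
    # Target is the reversed input, wrapped in BOS ... EOS.
    tgt = [BOS] + [d + DIGIT_OFFSET for d in reversed(digits)] + [EOS]
    return src, tgt

def make_batch(rng, batch_size=128):
    """Build one batch; rows are right-padded with PAD to the batch maximum."""
    examples = [make_example(rng) for _ in range(batch_size)]
    max_src = max(len(s) for s, _ in examples)
    max_tgt = max(len(t) for _, t in examples)
    src = [s + [PAD] * (max_src - len(s)) for s, _ in examples]
    tgt = [t + [PAD] * (max_tgt - len(t)) for _, t in examples]
    return src, tgt

rng = random.Random(0)
src_batch, tgt_batch = make_batch(rng, batch_size=4)
```

Because each batch is sampled on the fly, nothing has to be stored between steps. The `PAD` positions are the ones a loss with `ignore_index=PAD` would skip.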