---
license: apache-2.0
tags: [gpt2]
language: ko
---

# DLM-GPT2-small

| Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
| :---: | :---: | :---: | :---: | :---: | :---: |
| GPT2 | 64 | BPE | 30,000 | 1024 | 108M |

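The table's values can be checked directly against the published checkpoint. A quick sanity check using standard GPT-2 config attributes:

```python
from transformers import AutoConfig, AutoTokenizer, GPT2LMHeadModel

config = AutoConfig.from_pretrained('dataslab/DLM-GPT2-small')
tokenizer = AutoTokenizer.from_pretrained('dataslab/DLM-GPT2-small')
model = GPT2LMHeadModel.from_pretrained('dataslab/DLM-GPT2-small')

print(config.n_positions)                      # max length   -> 1024
print(tokenizer.vocab_size)                    # BPE vocab    -> 30000
print(f"{model.num_parameters() / 1e6:.0f}M")  # parameters   -> ~108M
```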

# Dataset
- AIhub - Korean corpus built from web data (4.8M)
- KoWiki dump 230701 (1.4M)
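The card does not describe the preprocessing pipeline. A common recipe for pretraining corpora like these is to tokenize the raw text with the model's BPE tokenizer and pack the token stream into fixed blocks of the 1024-token context length from the table above. A minimal sketch, assuming the corpus has been exported to a plain-text file (`corpus.txt` is a placeholder path, not from the card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dataslab/DLM-GPT2-small')
block_size = 1024  # Max Length from the model table

# Tokenize the whole file, then split the id stream into contiguous blocks.
ids = []
with open('corpus.txt', encoding='utf-8') as f:  # placeholder path
    for line in f:
        ids.extend(tokenizer.encode(line))

blocks = [ids[i:i + block_size]
          for i in range(0, len(ids) - block_size + 1, block_size)]
```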

# Inference Example

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

text = "운동이 힘들면?"  # "What if exercise is tiring?"

tokenizer = AutoTokenizer.from_pretrained('dataslab/DLM-GPT2-small')
model = GPT2LMHeadModel.from_pretrained('dataslab/DLM-GPT2-small')

# Encode the prompt without adding special tokens.
inputs = tokenizer(text, return_tensors='pt', add_special_tokens=False)

outputs = model.generate(inputs['input_ids'],
                         max_length=128,
                         do_sample=True,  # required for temperature to take effect
                         temperature=0.5,
                         repetition_penalty=2.0,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         bos_token_id=tokenizer.bos_token_id,
                         use_cache=True)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)

# Sample output: '운동이 힘들면 운동을 하지 않는 것이 좋다. 하지만 운동 시간을 늦추는 것은 오히려 건강에 좋지 않다.. 특히나 장시간의 운동으로 인해 피로가 쌓이고 면역력이 떨어지면, 피로감이 심해져서 잠들기 어려운 경우가 많다. 이런 경우라면 평소보다 더 많은 양으로 과식을 하거나 무리한 다이어트를 할 수 있다. 따라서 식단 조절과 함께 영양 보충에 신경 써야 한다. 또한 과도한 음식이 체중 감량에 도움을 주므로 적절한 운동량을 유지하는 것도 중요하다.'
# (Roughly: "If exercise is tiring, it is better not to exercise. But putting exercise off is actually not good for your health. ...")
```
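
For quick experiments, the high-level `pipeline` API gives the same behaviour with less boilerplate; the sampling settings below simply mirror the example above and are not prescribed by the card:

```python
from transformers import pipeline

generator = pipeline('text-generation', model='dataslab/DLM-GPT2-small')
result = generator('운동이 힘들면?', max_length=128, do_sample=True,
                   temperature=0.5, repetition_penalty=2.0)
print(result[0]['generated_text'])
```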