---
library_name: transformers
license: mit
datasets:
- shangrilar/ko_text2sql
base_model:
- EleutherAI/polyglot-ko-1.3b
---
# polyglot-ko-1b-txt2sql
`polyglot-ko-1b-txt2sql`์€ ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด ์งˆ๋ฌธ์„ SQL ์ฟผ๋ฆฌ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด ํŒŒ์ธํŠœ๋‹๋œ ํ…์ŠคํŠธ ์ƒ์„ฑ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
๊ธฐ๋ฐ˜ ๋ชจ๋ธ์€ [`EleutherAI/polyglot-ko-1.3b`](https://huggingface.co/EleutherAI/polyglot-ko-1.3b)๋ฅผ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, LoRA๋ฅผ ํ†ตํ•ด ๊ฒฝ๋Ÿ‰ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
ํŒŒ์ธํŠœ๋‹์„ ์ฒ˜์Œ ํ•ด๋ณธ ๊ธ€์“ด์ด๊ฐ€ ์‹ค์Šต์šฉ์œผ๋กœ ๋งŒ๋“  ์ฒซ ๋ชจ๋ธ๋กœ ์„ฑ๋Šฅ์„ ๋ณด์žฅํ•  ์ˆœ ์—†์œผ๋‹ˆ ์ฐธ๊ณ ๋ฐ”๋ž๋‹ˆ๋‹ค.
---
## ๋ชจ๋ธ ์ •๋ณด
- **Base model**: EleutherAI/polyglot-ko-1.3b
- **Fine-tuning**: QLoRA (4bit quantization + PEFT)
- **Task**: Text2SQL (์ž์—ฐ์–ด โ†’ SQL ๋ณ€ํ™˜)
- **Tokenizer**: ๋™์ผํ•œ ํ† ํฌ๋‚˜์ด์ € ์‚ฌ์šฉ
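
The QLoRA setup above can be sketched as follows. This is a minimal sketch, not the author's exact configuration: the LoRA rank, alpha, and dropout are not stated in this card and are assumed typical values, while `target_modules=["query_key_value"]` matches the GPT-NeoX architecture that polyglot-ko uses.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/polyglot-ko-1.3b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Trainable LoRA adapters on the fused attention projection (GPT-NeoX naming)
lora_config = LoraConfig(
    r=8,                                 # assumed rank
    lora_alpha=16,                       # assumed scaling
    lora_dropout=0.05,                   # assumed dropout
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Only the small adapter matrices are trained; the 4-bit base weights stay frozen, which is what makes fine-tuning a 1.3B model feasible on a single consumer GPU.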
---
## ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹
๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด SQL ๋ณ€ํ™˜ ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ž์—ฐ์–ด ์งˆ๋ฌธ-์ฟผ๋ฆฌ ํŽ˜์–ด๋กœ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
- [shangrilar/ko_text2sql](https://huggingface.co/datasets/shangrilar/ko_text2sql) ๋ฐ์ดํ„ฐ์…‹ ์ผ๋ถ€
- ์ „์ฒ˜๋ฆฌ: DDL-Question-SQL ๊ตฌ์กฐ๋กœ prompt ๊ตฌ์„ฑ
- ํฌ๊ธฐ: ์•ฝ 25,000๊ฑด์˜ DDL + ์ž์—ฐ์–ด ์งˆ๋ฌธ + SQL ์ •๋‹ต ์Œ
---
## Evaluation
- Method: GPT-4.1-nano is given gen_sql and gt_sql and asked to compare them
- Criterion: a yes/no judgment of whether the two queries would return the same result (JSON response: {"resolve_yn": "yes"})
- Results:
  - **Base model accuracy**: 68%
  - **Fine-tuned model accuracy**: 19%
---
## Known Issues
- The baseline model often failed to produce an SQL query in gen_sql at all, instead repeating the question or emitting meaningless text.
- The fine-tuned model mimicked SQL syntax but produced logically incorrect queries, e.g. referencing column or table names that do not exist in the DDL.
- The judge model (GPT-4.1-nano) frequently mislabeled the baseline's faulty outputs with "resolve_yn": "yes".
  - For example, even when gen_sql did not follow SQL syntax at all, it was sometimes still scored resolve_yn = yes.
  - Queries with nonexistent or wrong column and table names were likewise misclassified as resolve_yn = yes in some cases.
- The judge tended to decide based on surface text similarity rather than properly checking syntactic validity or whether the table schema was respected.
---
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Replace "your-username" with the repo id this model is actually hosted under.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/polyglot-ko-1b-txt2sql", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/polyglot-ko-1b-txt2sql")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# The prompt must follow the DDL-Question-SQL layout used during fine-tuning.
# The Korean lines read: "You are an SQL expert." and
# "How many accounts have a username containing 'admin'?"
prompt = """
당신은 SQL 전문가입니다.
### DDL:
CREATE TABLE players (
  player_id INT PRIMARY KEY AUTO_INCREMENT,
  username VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL,
  password_hash VARCHAR(255) NOT NULL,
  date_joined DATETIME NOT NULL,
  last_login DATETIME
);
### Question:
사용자 이름에 'admin'이 포함된 계정 수는?
### SQL:
"""

# Greedy decoding; the generated query appears after the "### SQL:" marker.
outputs = generator(prompt, do_sample=False, max_new_tokens=128)
print(outputs[0]["generated_text"])
```