---
library_name: transformers
license: mit
datasets:
- shangrilar/ko_text2sql
base_model:
- EleutherAI/polyglot-ko-1.3b
---

# polyglot-ko-1b-txt2sql

`polyglot-ko-1b-txt2sql` is a text-generation model fine-tuned to translate natural-language questions in Korean into SQL queries.  
It is based on [`EleutherAI/polyglot-ko-1.3b`](https://huggingface.co/EleutherAI/polyglot-ko-1.3b) and was fine-tuned with the lightweight LoRA method.

This is a first practice model by an author new to fine-tuning, so no performance guarantees are made; please keep that in mind.

---

## ๋ชจ๋ธ ์ •๋ณด

- **Base model**: EleutherAI/polyglot-ko-1.3b
- **Fine-tuning**: QLoRA (4bit quantization + PEFT)
- **Task**: Text2SQL (์ž์—ฐ์–ด โ†’ SQL ๋ณ€ํ™˜)
- **Tokenizer**: ๋™์ผํ•œ ํ† ํฌ๋‚˜์ด์ € ์‚ฌ์šฉ
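
The card does not state the exact QLoRA hyperparameters, so the following is only a minimal sketch of a typical 4-bit QLoRA setup with PEFT; all specific values (`r`, `lora_alpha`, the NF4 settings) are assumptions, not the actual training configuration:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization config for the frozen base weights
# (passed to AutoModelForCausalLM.from_pretrained via quantization_config=...).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter config (passed to peft.get_peft_model). polyglot-ko is a
# GPT-NeoX architecture, whose fused attention projection is "query_key_value".
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
```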

---

## ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹

๋ชจ๋ธ์€ ํ•œ๊ตญ์–ด SQL ๋ณ€ํ™˜ ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ž์—ฐ์–ด ์งˆ๋ฌธ-์ฟผ๋ฆฌ ํŽ˜์–ด๋กœ ํŒŒ์ธํŠœ๋‹๋˜์—ˆ์Šต๋‹ˆ๋‹ค.  
- [shangrilar/ko_text2sql](https://huggingface.co/datasets/shangrilar/ko_text2sql) ๋ฐ์ดํ„ฐ์…‹ ์ผ๋ถ€

- ์ „์ฒ˜๋ฆฌ: DDL-Question-SQL ๊ตฌ์กฐ๋กœ prompt ๊ตฌ์„ฑ
- ํฌ๊ธฐ: ์•ฝ 25,000๊ฑด์˜ DDL + ์ž์—ฐ์–ด ์งˆ๋ฌธ + SQL ์ •๋‹ต ์Œ
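
The DDL-Question-SQL layout described above matches the template in the usage example further down; a minimal sketch of the prompt-assembly step (the helper name `build_prompt` is illustrative, not from the card):

```python
def build_prompt(ddl: str, question: str, sql: str = "") -> str:
    """Assemble one DDL-Question-SQL training or inference prompt.

    During training the reference SQL is appended; at inference time
    `sql` is left empty so the model completes the query itself.
    """
    return (
        "๋‹น์‹ ์€ SQL ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.\n\n"  # "You are a SQL expert."
        f"### DDL:\n{ddl}\n\n"
        f"### Question:\n{question}\n\n"
        f"### SQL:\n{sql}"
    )
```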

---

## Evaluation results

- Method: GPT-4.1-nano is asked to compare each generated query (gen_sql) against the ground truth (gt_sql)
- Criterion: yes/no judgement of whether the two queries would return the same result (JSON response: {"resolve_yn": "yes"})
- Results:
  - **Base model accuracy**: 68%
  - **Fine-tuned model accuracy**: 19%

---

## Known issues

- The baseline model often failed to produce a SQL query in gen_sql at all, instead repeating the question or emitting meaningless text.
- The fine-tuned model mimicked SQL syntax, but produced logically incorrect queries, e.g. referencing columns or tables that do not exist.
- The judge model (GPT-4.1-nano) frequently marked queries the baseline got wrong as "resolve_yn": "yes".
  - For example, outputs that did not follow SQL syntax at all were sometimes still judged resolve_yn = yes.
  - Queries referencing nonexistent or incorrect column and table names were likewise misclassified as resolve_yn = yes.
- The judge tended to decide based on surface text similarity rather than properly checking syntactic validity or consistency with the table schema.

This judge leniency also explains the counterintuitive scores above: the baseline's 68% is inflated by wrong answers graded as correct, rather than reflecting genuinely better SQL than the fine-tuned model's 19%.

---

## Usage example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model = AutoModelForCausalLM.from_pretrained("your-username/polyglot-ko-1b-txt2sql", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("your-username/polyglot-ko-1b-txt2sql")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """
๋‹น์‹ ์€ SQL ์ „๋ฌธ๊ฐ€์ž…๋‹ˆ๋‹ค.

### DDL:
CREATE TABLE players (
  player_id INT PRIMARY KEY AUTO_INCREMENT,
  username VARCHAR(255) UNIQUE NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL,
  password_hash VARCHAR(255) NOT NULL,
  date_joined DATETIME NOT NULL,
  last_login DATETIME
);

### Question:
์‚ฌ์šฉ์ž ์ด๋ฆ„์— 'admin'์ด ํฌํ•จ๋œ ๊ณ„์ • ์ˆ˜๋Š”?

### SQL:
"""

outputs = generator(prompt, do_sample=False, max_new_tokens=128)
print(outputs[0]["generated_text"])