lanny xu committed on

Commit dbd527a · 1 Parent(s): 0d85198

modify reranker
vectorization_implementation_steps.py ADDED
@@ -0,0 +1,555 @@
"""
Text-to-vector conversion: the concrete implementation steps (code level).
Shows what actually happens inside HuggingFace Embeddings.
"""

print("=" * 80)
print("Text → vector: the concrete implementation steps")
print("=" * 80)

# ============================================================================
# Preparation: walking through the full vectorization process
# ============================================================================
print("\n" + "=" * 80)
print("🔧 Preparation: install and import the required libraries")
print("=" * 80)

print("""
Required libraries:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
pip install transformers torch sentence-transformers

Imports:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
""")
# ============================================================================
# Step 1: Load the model and tokenizer
# ============================================================================
print("\n" + "=" * 80)
print("Step 1: Load the pretrained model and tokenizer")
print("=" * 80)

print("""
Code:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"

# 1. Load the tokenizer (handles text → IDs)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load the model (handles IDs → vectors)
model = AutoModel.from_pretrained(model_name)
model.eval()  # evaluation mode (no training)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What do these two objects do?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Tokenizer:
├─ Vocabulary: 30,000+ tokens
│  e.g. {"hello": 1, "world": 2, "machine": 3456, ...}
└─ Tokenization rules: how to split text into tokens

Model:
├─ Embedding layer: vocabulary → initial vectors
│  a 30,000 × 384 matrix (one 384-dim vector per token)
├─ Transformer layers: a 6-layer BERT encoder
│  each layer has Self-Attention + Feed Forward
└─ Parameter count: 22M (22 million numbers)
""")
# ============================================================================
# Step 2: Tokenization
# ============================================================================
print("\n" + "=" * 80)
print("Step 2: Tokenization - turning text into token IDs")
print("=" * 80)

print("""
Input text:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
text = "Machine learning is a subset of artificial intelligence"

Code:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Tokenize and convert to the model's input format
encoded_input = tokenizer(
    text,
    padding=True,          # pad to equal length
    truncation=True,       # truncate if too long
    max_length=512,        # maximum length
    return_tensors='pt'    # return PyTorch tensors
)

print(encoded_input)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Output (encoded_input contains):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
{
    'input_ids': tensor([[
        101,    # [CLS] special token
        3698,   # "machine"
        4083,   # "learning"
        2003,   # "is"
        1037,   # "a"
        2042,   # "subset"
        1997,   # "of"
        7976,   # "artificial"
        4454,   # "intelligence"
        102     # [SEP] special token
    ]]),

    'attention_mask': tensor([[
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1   # every position is valid (1 = attend, 0 = ignore)
    ]])
}

In detail:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

input_ids:
    Each number corresponds to one token
    101  = [CLS] (sentence-start marker)
    3698 = "machine"
    102  = [SEP] (sentence-end marker)

attention_mask:
    Tells the model which positions are real content (1) and which are padding (0)
    e.g. [1, 1, 1, 0, 0] means the first 3 positions are real tokens and the last 2 are padding
""")
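The tokenizer's job in Step 2 can be imitated without transformers at all. A minimal pure-Python sketch with a made-up toy vocabulary (the IDs below only mimic the BERT ones, and a real WordPiece tokenizer also splits unknown words into sub-word pieces):

```python
# Toy stand-in for a WordPiece tokenizer: the vocabulary and IDs are invented,
# only to illustrate the input_ids / attention_mask structure.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "machine": 3698, "learning": 4083, "is": 2003, "a": 1037,
         "subset": 2042, "of": 1997}

def toy_encode(text, max_length=12):
    # Wrap the words in [CLS] ... [SEP], map each to its ID
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    mask = [1] * len(ids)
    # Pad to max_length: padded positions get ID 0 and mask 0
    while len(ids) < max_length:
        ids.append(VOCAB["[PAD]"])
        mask.append(0)
    return {"input_ids": ids, "attention_mask": mask}

enc = toy_encode("machine learning is a subset of")
print(enc["input_ids"])       # starts with 101, ends with 0-padding
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding
```

The attention mask produced here is exactly what the mean-pooling step later uses to exclude padding positions.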
# ============================================================================
# Step 3: Get initial vectors from the Embedding layer
# ============================================================================
print("\n" + "=" * 80)
print("Step 3: Token IDs → initial vectors (the Embedding layer)")
print("=" * 80)

print("""
This step happens inside the model:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

input_ids = [101, 3698, 4083, 2003, ...]
         ↓
  Embedding table lookup
         ↓

The Embedding table (simplified):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
It is one huge matrix: 30,522 × 384
(30,522 is the vocabulary size, 384 is the vector dimension)

  ID   | dim 1  dim 2  dim 3  ...  dim 384
  ─────────────────────────────────────────
  101  |  0.12  -0.34   0.56  ...   0.78    ← [CLS]
  3698 |  0.23   0.45  -0.67  ...   0.89    ← "machine"
  4083 |  0.34  -0.56   0.78  ...  -0.90    ← "learning"
  2003 |  0.45   0.67  -0.89  ...   0.12    ← "is"
  ...

The lookup works like a dictionary lookup:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ID 101  → look up row → [0.12, -0.34, 0.56, ..., 0.78]
ID 3698 → look up row → [0.23, 0.45, -0.67, ..., 0.89]
ID 4083 → look up row → [0.34, -0.56, 0.78, ..., -0.90]
...

Result:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
token_embeddings = [
    [0.12, -0.34, 0.56, ..., 0.78],    # [CLS]
    [0.23, 0.45, -0.67, ..., 0.89],    # "machine"
    [0.34, -0.56, 0.78, ..., -0.90],   # "learning"
    [0.45, 0.67, -0.89, ..., 0.12],    # "is"
    ...
]
Shape: (10, 384)   # 10 tokens, 384 dims each

⚠️  Note: these are NOT the final vectors yet! They still have to go through the Transformer!
""")
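The lookup in Step 3 is nothing more than row indexing. A toy sketch with an invented 3-row, 4-dimension table standing in for the real 30,522 × 384 matrix:

```python
# Tiny made-up embedding table: token ID → row of numbers.
# The real table is 30,522 × 384; this one is 3 rows × 4 dims.
EMBEDDING_TABLE = {
    101:  [0.12, -0.34, 0.56, 0.78],    # [CLS]
    3698: [0.23, 0.45, -0.67, 0.89],    # "machine"
    4083: [0.34, -0.56, 0.78, -0.90],   # "learning"
}

def embed_lookup(input_ids):
    # One row per token ID — no computation at all, just indexing
    return [EMBEDDING_TABLE[i] for i in input_ids]

token_embeddings = embed_lookup([101, 3698, 4083])
print(len(token_embeddings), len(token_embeddings[0]))  # 3 tokens, 4 dims each
```

This is why the lookup runs in under a millisecond: it is a memory read, not a matrix multiplication.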
# ============================================================================
# Step 4: Transformer processing (the core!)
# ============================================================================
print("\n" + "=" * 80)
print("Step 4: Transformer processing - Self-Attention (the key step)")
print("=" * 80)

print("""
Code:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
with torch.no_grad():   # no gradients (we are not training)
    outputs = model(**encoded_input)

    # outputs.last_hidden_state is the Transformer's output
    token_embeddings = outputs.last_hidden_state
    print(token_embeddings.shape)   # torch.Size([1, 10, 384])
                                    #   batch  tokens  dims

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What does the Transformer do internally? (6 layers of processing)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input: initial embeddings
    [CLS]:    [0.12, -0.34, 0.56, ...]
    machine:  [0.23, 0.45, -0.67, ...]
    learning: [0.34, -0.56, 0.78, ...]
    is:       [0.45, 0.67, -0.89, ...]
    ...

         ↓
┌──────────────────────────────────────────────────────┐
│ Layer 1: Self-Attention                              │
│ ──────────────────────────────────────────────────── │
│                                                      │
│ Every token "looks at" every other token and         │
│ updates its own vector:                              │
│                                                      │
│ "machine" sees "learning"    → it's a phrase         │
│ "learning" sees "artificial" → related to AI         │
│ "is" sees its neighbors      → it's a linking word   │
│                                                      │
│ The updated vectors now carry context                │
└──────────────────────────────────────────────────────┘
         ↓
┌──────────────────────────────────────────────────────┐
│ Layer 2: Self-Attention                              │
│ ──────────────────────────────────────────────────── │
│ Understanding keeps deepening...                     │
│ "machine learning" is now treated as one unit        │
└──────────────────────────────────────────────────────┘
         ↓
   ... (Layers 3, 4, 5) ...
         ↓
┌──────────────────────────────────────────────────────┐
│ Layer 6: Self-Attention (the last layer)             │
│ ──────────────────────────────────────────────────── │
│ Each token's vector now contains:                    │
│ - its own meaning                                    │
│ - its context                                        │
│ - the meaning of the whole sentence                  │
└──────────────────────────────────────────────────────┘
         ↓
Final output:
    [CLS]:    [0.234, 0.567, -0.890, ...]   # updated; carries whole-sentence info
    machine:  [0.345, -0.678, 0.123, ...]   # carries info from "learning"
    learning: [0.456, 0.789, -0.234, ...]   # carries info from "machine"
    ...

Shape: (1, 10, 384)
        batch tokens dims
""")
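The attention mechanism inside each layer can itself be sketched in a few lines of pure Python. This is a bare-bones scaled dot-product self-attention with Q = K = V = input (a real layer learns separate Q/K/V projection matrices, multiple heads, and a feed-forward block), but it is enough to show each token's output becoming a weighted mix of every token's vector:

```python
import math

def softmax(xs):
    # Numerically stable softmax: weights are positive and sum to 1
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = input
    (real layers learn separate Q/K/V projections)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Output = attention-weighted mix of ALL token vectors
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(d)]
        out.append(mixed)
    return out

# Two similar tokens and one different token, 2 dims each
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
updated = self_attention(tokens)
print(updated[0])  # token 0's vector now carries context from the others
```

Token 0 attends more strongly to the similar token 1 than to token 2, so after mixing, its vector drifts toward its neighbors — that is the "each token looks at the others" step in miniature.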
# ============================================================================
# Step 5: Mean Pooling - merge into one sentence vector
# ============================================================================
print("\n" + "=" * 80)
print("Step 5: Mean Pooling - combining the token vectors into one sentence vector")
print("=" * 80)

print("""
The problem: we now have 10 tokens, one vector each.
How do we get a single sentence vector?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Code:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
def mean_pooling(token_embeddings, attention_mask):
    \"\"\"
    Average all token vectors (respecting the attention_mask)
    \"\"\"
    # token_embeddings: (1, 10, 384)
    # attention_mask:   (1, 10)

    # Expand the mask's dimensions to match the embeddings
    # (1, 10) → (1, 10, 1) → (1, 10, 384)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(
        token_embeddings.size()
    ).float()

    # Multiply embeddings by the mask (so padding is ignored),
    # then sum over all tokens
    sum_embeddings = torch.sum(
        token_embeddings * input_mask_expanded,
        dim=1   # sum over the token dimension
    )

    # Count the valid tokens
    sum_mask = torch.clamp(
        input_mask_expanded.sum(dim=1),
        min=1e-9   # avoid division by zero
    )

    # Take the average
    mean_embeddings = sum_embeddings / sum_mask

    return mean_embeddings

# Usage
sentence_embedding = mean_pooling(
    token_embeddings,
    encoded_input['attention_mask']
)

print(sentence_embedding.shape)   # torch.Size([1, 384])
                                  #   batch  dims

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The computation, concretely (simplified example):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

10 token vectors, 384 dims each:
    Token 1:  [0.234, 0.567, -0.890, ..., 0.123]
    Token 2:  [0.345, -0.678, 0.123, ..., 0.234]
    Token 3:  [0.456, 0.789, -0.234, ..., 0.345]
    ...
    Token 10: [0.567, 0.890, 0.345, ..., 0.456]

Average each dimension separately:
    dim 1:   (0.234 + 0.345 + 0.456 + ... + 0.567) / 10 = 0.412
    dim 2:   (0.567 - 0.678 + 0.789 + ... + 0.890) / 10 = 0.523
    dim 3:   (-0.890 + 0.123 - 0.234 + ... + 0.345) / 10 = -0.089
    ...
    dim 384: (0.123 + 0.234 + 0.345 + ... + 0.456) / 10 = 0.289

Sentence vector = [0.412, 0.523, -0.089, ..., 0.289]   (384 dims)
""")
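The same masked averaging can be checked in pure Python with plain lists instead of tensors (the numbers are made up):

```python
def mean_pool(token_embeddings, attention_mask):
    """Masked mean: average only the positions where the mask is 1,
    dimension by dimension — the plain-list version of the torch code."""
    dims = len(token_embeddings[0])
    n_valid = max(sum(attention_mask), 1)  # avoid division by zero
    pooled = []
    for d in range(dims):
        total = sum(vec[d]
                    for vec, m in zip(token_embeddings, attention_mask)
                    if m == 1)
        pooled.append(total / n_valid)
    return pooled

# 3 real tokens + 1 padding token (mask 0) — padding must not affect the mean
embs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [99.0, 99.0]]
mask = [1, 1, 1, 0]
print(mean_pool(embs, mask))  # [3.0, 4.0]
```

The padded [99.0, 99.0] row is fully ignored, which is exactly what the `attention_mask` multiplication achieves in the tensor version.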
# ============================================================================
# Step 6: Normalization
# ============================================================================
print("\n" + "=" * 80)
print("Step 6: L2 normalization - scaling the vector to length 1")
print("=" * 80)

print("""
Code:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
import torch.nn.functional as F

# L2 normalization
sentence_embedding = F.normalize(
    sentence_embedding,
    p=2,     # L2 norm
    dim=1    # normalize over the feature dimension
)

print(sentence_embedding.shape)   # torch.Size([1, 384])

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What normalization does:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Before normalization:
    v = [0.412, 0.523, -0.089, ..., 0.289]
    length ||v|| = √(0.412² + 0.523² + ... + 0.289²) = 2.37

After normalization:
    v_norm = v / ||v||
    v_norm = [0.412/2.37, 0.523/2.37, ..., 0.289/2.37]
           = [0.174, 0.221, -0.038, ..., 0.122]
    length ||v_norm|| = 1 ✓

Benefits:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Every vector has the same length (1), so vectors are easy to compare
✅ Cosine similarity becomes a plain dot product (faster to compute):
     cos_sim(a, b) = a·b / (||a|| × ||b||)
     after normalization: cos_sim(a, b) = a·b   ← simplified!

✅ The effect of vector length is removed; only the direction matters
""")
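The claim that cosine similarity collapses to a dot product after L2 normalization is easy to verify in pure Python (with hypothetical small vectors):

```python
import math

def l2_normalize(v):
    # Divide every component by the vector's L2 norm
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Full cosine-similarity formula: a·b / (||a|| × ||b||)
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [0.412, 0.523, -0.089], [0.3, -0.1, 0.7]   # hypothetical vectors
na, nb = l2_normalize(a), l2_normalize(b)

print(abs(dot(na, na) - 1.0) < 1e-9)             # True — unit length
print(abs(cosine(a, b) - dot(na, nb)) < 1e-9)    # True — cos(a, b) == na·nb
```

Dividing each vector by its own norm moves the two `||·||` factors out of the cosine formula, which is why vector databases can rank normalized embeddings with a single dot product.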
# ============================================================================
# Step 7: Final output
# ============================================================================
print("\n" + "=" * 80)
print("Step 7: Obtaining the final sentence vector")
print("=" * 80)

print("""
Final result:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

# Convert to a numpy array (easier to work with)
final_vector = sentence_embedding.cpu().numpy()[0]

print(final_vector.shape)   # (384,)
print(final_vector[:5])     # the first 5 numbers
                            # [0.174, 0.221, -0.038, 0.095, 0.312]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

This is the final sentence vector!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input:  "Machine learning is a subset of artificial intelligence"
Output: [0.174, 0.221, -0.038, ..., 0.122]   (384 numbers)

The vector encodes:
✅ the meaning of each token
✅ the relationships between tokens
✅ the meaning of the whole sentence

It can be used to:
✅ compute similarity with other sentences
✅ store in a vector database
✅ run semantic search
""")
# ============================================================================
# Complete code summary
# ============================================================================
print("\n" + "=" * 80)
print("📝 Complete code summary (actually runnable)")
print("=" * 80)

print("""
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import numpy as np

def text_to_vector(text):
    \"\"\"
    The full text-to-vector pipeline
    \"\"\"
    # Step 1: load the model
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Step 2: tokenize
    encoded_input = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors='pt'
    )

    # Steps 3 & 4: run the model (Embedding + Transformer)
    with torch.no_grad():
        outputs = model(**encoded_input)
        token_embeddings = outputs.last_hidden_state

    # Step 5: mean pooling
    attention_mask = encoded_input['attention_mask']
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(
        token_embeddings.size()
    ).float()

    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    sentence_embedding = sum_embeddings / sum_mask

    # Step 6: normalize
    sentence_embedding = F.normalize(sentence_embedding, p=2, dim=1)

    # Step 7: convert to numpy
    return sentence_embedding.cpu().numpy()[0]


# Usage example:
text = "Machine learning is a subset of artificial intelligence"
vector = text_to_vector(text)

print(f"Input: {text}")
print(f"Vector shape: {vector.shape}")            # (384,)
print(f"First 10 numbers: {vector[:10]}")
print(f"Vector norm: {np.linalg.norm(vector)}")   # should be 1.0

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The simplified call in your project:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vector = embeddings.embed_query(text)
# ↑ this one line runs all 7 steps above internally!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")
# ============================================================================
# Timing of the key steps
# ============================================================================
print("\n" + "=" * 80)
print("⏱️  Per-step timing")
print("=" * 80)

print("""
Assume one sentence (10 tokens):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: load the model        0.5-2 s    (done once, then reused)
Step 2: tokenize              <1 ms      (very fast)
Step 3: embedding lookup      <1 ms      (matrix indexing)
Step 4: Transformer           10-50 ms   (6 layers of computation; the slowest step)
Step 5: mean pooling          <1 ms      (a simple average)
Step 6: normalization         <1 ms      (a simple division)
Step 7: format conversion     <1 ms

Total: 10-50 ms (GPU) or 50-200 ms (CPU)

Batch processing (20 sentences):
    one at a time: 20 × 50 ms = 1000 ms
    batched:       ~100 ms   ← about 10× faster! (GPU parallelism)

This is why embedding should be done in batches!
""")
print("\n" + "=" * 80)
print("✅ Text-to-vector implementation walkthrough complete!")
print("=" * 80)
print("""
Core steps recap:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Text
  ↓ Step 1: load the model
Tokenizer + Model
  ↓ Step 2: tokenize
Token IDs: [101, 3698, 4083, ...]
  ↓ Step 3: embedding lookup
Initial vectors: (10, 384)
  ↓ Step 4: Transformer processing
Updated vectors: (10, 384), now carrying context
  ↓ Step 5: mean pooling
Sentence vector: (1, 384)
  ↓ Step 6: normalization
Normalized vector: (1, 384), length = 1
  ↓ Step 7: output
Final vector: [0.174, 0.221, ..., 0.122]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Now you know exactly what each step does!
""")
print()
vectorization_process_explained.py ADDED
@@ -0,0 +1,528 @@
"""
Vectorization and Chroma storage, explained in detail:
the full pipeline from split documents to the vector database
"""

print("=" * 80)
print("Vectorization and Chroma storage, explained in detail")
print("=" * 80)
# ============================================================================
# Part 1: Pipeline overview
# ============================================================================
print("\n" + "=" * 80)
print("📊 Part 1: Pipeline overview")
print("=" * 80)

print("""
The full pipeline from document splitting to the vector database:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Document splitting
    raw document  →  RecursiveCharacterTextSplitter  →  20 chunks
    (5000 tokens)                                       (250 tokens each)

Step 2: Vectorization (embedding)
    each chunk  →  HuggingFace model  →  a 384-dim vector
    "Artificial intelligence is..."  →  [0.12, -0.34, 0.56, ...]

Step 3: Storage in Chroma
    vector + original text + metadata  →  Chroma database
    └─ persisted to disk

Step 4: Index construction
    Chroma  →  HNSW index  →  fast approximate retrieval
    (a hierarchical graph structure)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")
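Steps 2–4 of the pipeline above can be sketched end to end with a toy in-memory store standing in for Chroma. Made-up 3-dimensional "embeddings" replace the real 384-dim vectors, and retrieval is exact dot-product search rather than an HNSW index, but the store-then-query flow is the same:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy vector store: (vector, text, metadata) triples. Dot product equals
# cosine similarity here because every stored vector is normalized.
store = []

def add(vector, text, metadata):
    store.append((normalize(vector), text, metadata))

def query(vector, k=1):
    qv = normalize(vector)
    scored = sorted(store,
                    key=lambda item: sum(a * b for a, b in zip(qv, item[0])),
                    reverse=True)
    return [(text, meta) for _, text, meta in scored[:k]]

# Made-up 3-dim "embeddings" for three chunks
add([0.9, 0.1, 0.0], "AI is a branch of computer science", {"chunk": 1})
add([0.7, 0.7, 0.0], "Machine learning is a subfield of AI", {"chunk": 2})
add([0.0, 0.2, 0.9], "Deep learning uses neural networks", {"chunk": 3})

# A query vector pointing near chunk 3's direction retrieves chunk 3 first
print(query([0.1, 0.1, 1.0], k=1))
```

Chroma adds what this toy omits: persistence to disk and an HNSW graph so the search is approximate but sublinear instead of scanning every stored vector.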
# ============================================================================
# Part 2: The embedding model - HuggingFaceEmbeddings
# ============================================================================
print("\n" + "=" * 80)
print("🤖 Part 2: The embedding model - HuggingFaceEmbeddings")
print("=" * 80)

print("""
Your project's configuration:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

self.embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': device},              # CPU or GPU
    encode_kwargs={'normalize_embeddings': True}  # normalization
)

About the model:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Model name: all-MiniLM-L6-v2
├─ Type: Sentence-BERT (bi-encoder)
├─ Parameters: 22M (lightweight)
├─ Output: 384-dim vectors
├─ Training data: 1B+ sentence pairs
└─ Highlights: fast, accurate, well suited to semantic retrieval

How it works:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input text: "Artificial intelligence is a branch of computer science"
        ↓
Tokenization
        ↓
Token IDs: [101, 782, 1435, 1819, 2510, 3221, ...]
        ↓
BERT encoder (6 Transformer layers)
        ↓
Mean pooling over the token vectors
        ↓
384-dim vector: [0.123, -0.456, 0.789, ...]
        ↓
L2 normalization (normalize_embeddings=True)
        ↓
Final vector: ||v|| = 1 (a unit vector)
""")
# ============================================================================
# Part 3: The vectorization process, step by step
# ============================================================================
print("\n" + "=" * 80)
print("🔍 Part 3: The vectorization process - step by step")
print("=" * 80)

print("""
Suppose we have 3 chunks:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Chunk 1: "Artificial intelligence is a branch of computer science. It aims to..."
Chunk 2: "Machine learning is a subfield of AI. It enables computers..."
Chunk 3: "Deep learning uses multi-layer neural networks to handle complex..."

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The vectorization process (batched):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

embeddings.embed_documents([chunk1, chunk2, chunk3])
        ↓
┌──────────────────────────────────────────────────┐
│ HuggingFace embedding model                      │
│ (sentence-transformers/all-MiniLM-L6-v2)         │
└──────────────────────────────────────────────────┘
        ↓
Internal processing (for each chunk):
        ↓
┌────────────────────────────────────────────┐
│ Step 1: Tokenization                       │
│ "Artificial..." → [101, 782, 1435, ...]    │
└────────────────────────────────────────────┘
        ↓
┌────────────────────────────────────────────┐
│ Step 2: Token embeddings                   │
│ Token IDs → one initial vector per token   │
└────────────────────────────────────────────┘
        ↓
┌────────────────────────────────────────────┐
│ Step 3: BERT encoder (6 layers)            │
│ Self-Attention + Feed Forward;             │
│ each layer extracts deeper semantics       │
└────────────────────────────────────────────┘
        ↓
┌────────────────────────────────────────────┐
│ Step 4: Mean pooling                       │
│ average of all token vectors               │
│ → sentence vector                          │
└────────────────────────────────────────────┘
        ↓
┌────────────────────────────────────────────┐
│ Step 5: L2 normalization                   │
│ scale the vector to unit length            │
└────────────────────────────────────────────┘
        ↓
Output: 3 vectors
        ↓
┌─────────────────────────────────────────────────────────┐
│ Vector 1: [0.123, -0.456, 0.789, ..., 0.321]   (384-d)  │
│ Vector 2: [0.234, 0.567, -0.890, ..., 0.432]   (384-d)  │
│ Vector 3: [-0.345, 0.678, 0.901, ..., -0.543]  (384-d)  │
└─────────────────────────────────────────────────────────┘

Key points:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
154
+ โœ… ๆฏไธช chunk โ†’ 1 ไธชๅ›บๅฎš็ปดๅบฆ็š„ๅ‘้‡ (384็ปด)
155
+ โœ… ่ฏญไน‰็›ธไผผ็š„ๆ–‡ๆœฌ โ†’ ๅ‘้‡่ท็ฆป่ฟ‘
156
+ โœ… ๅฝ’ไธ€ๅŒ–ๅŽๅฏ็”จไฝ™ๅผฆ็›ธไผผๅบฆๅฟซ้€Ÿๆฏ”่พƒ
157
+ """)
158
+
159
+
160
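Steps 4 and 5 above can be sketched in plain Python. This is a toy illustration with 4-dim "token vectors" standing in for the model's real 384-dim hidden states; the numbers are made up for the example:

```python
import math

def mean_pool(token_vectors):
    """Step 4: average the per-token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def l2_normalize(vec):
    """Step 5: scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy "token embeddings" for a 3-token sentence (4-dim instead of 384-dim)
tokens = [[1.0, 2.0, 0.0, 1.0],
          [3.0, 0.0, 1.0, 1.0],
          [2.0, 1.0, 2.0, 1.0]]

sentence_vec = l2_normalize(mean_pool(tokens))
length = math.sqrt(sum(x * x for x in sentence_vec))
print(sentence_vec)      # one unit-length sentence vector
print(round(length, 6))  # 1.0
```

After this pass, every sentence lands on the unit sphere, which is what makes the dot-product similarity trick in the key points valid.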
# ============================================================================
# Part 4: Chroma database storage structure
# ============================================================================
print("\n" + "=" * 80)
print("💾 Part 4: Chroma database storage structure")
print("=" * 80)

print("""
What Chroma.from_documents() does:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Chroma.from_documents(
    documents=doc_splits,            # 20 chunks
    collection_name="rag-chroma",    # collection name
    embedding=self.embeddings        # embedding function
)

Internal flow:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Create/open the collection
┌──────────────────────────────────────┐
│ Collection: "rag-chroma"             │
│ Metadata: embedding_dimension=384    │
└──────────────────────────────────────┘

Step 2: Batch vectorization
    vectors = embeddings.embed_documents(
        [chunk.page_content for chunk in doc_splits])
        ↓

Step 3: Store the data (one record per chunk)
┌───────────────────────────────────────────────────────────────┐
│ ID: "chunk_1"                                                 │
│ ├─ Vector: [0.123, -0.456, ..., 0.321]  (384-dim)             │
│ ├─ Document: "Artificial intelligence is a branch of..."      │
│ └─ Metadata: {                                                │
│      "source": "https://...",                                 │
│      "chunk_index": 0,                                        │
│      "total_chunks": 20                                       │
│    }                                                          │
├───────────────────────────────────────────────────────────────┤
│ ID: "chunk_2"                                                 │
│ ├─ Vector: [0.234, 0.567, ..., 0.432]                         │
│ ├─ Document: "Machine learning is a subfield of AI..."        │
│ └─ Metadata: {...}                                            │
├───────────────────────────────────────────────────────────────┤
│ ID: "chunk_3"                                                 │
│ ├─ Vector: [-0.345, 0.678, ..., -0.543]                       │
│ ├─ Document: "Deep learning uses multi-layer neural..."       │
│ └─ Metadata: {...}                                            │
└───────────────────────────────────────────────────────────────┘

Step 4: Build the HNSW index
    vectors → HNSW graph structure → fast retrieval
    (Hierarchical Navigable Small World)

Storage location:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Default path: ./chroma/ (local directory; exact layout varies by Chroma version)
├─ <segment-uuid>/           # HNSW index files
└─ chroma.sqlite3            # SQLite database (documents, metadata)
""")

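What Steps 2-3 amount to can be mimicked with a toy in-memory store. This is a sketch of the idea, not Chroma's actual implementation (real Chroma also persists to disk and answers queries through the HNSW index rather than a brute-force scan):

```python
import math

class ToyVectorStore:
    """Minimal stand-in for a vector DB: stores (id, vector, text, metadata)."""

    def __init__(self):
        self.records = []

    def add(self, doc_id, vector, text, metadata=None):
        self.records.append((doc_id, vector, text, metadata or {}))

    def similarity_search(self, query_vec, k=2):
        """Brute-force cosine similarity (the part an HNSW index accelerates)."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        scored = [(cosine(query_vec, v), doc_id, text)
                  for doc_id, v, text, _ in self.records]
        return sorted(scored, reverse=True)[:k]

store = ToyVectorStore()
store.add("chunk_1", [1.0, 0.0, 0.0], "AI is a branch of computer science")
store.add("chunk_2", [0.9, 0.1, 0.0], "Machine learning is a subfield of AI")
store.add("chunk_3", [0.0, 0.0, 1.0], "Totally unrelated text")

top = store.similarity_search([1.0, 0.05, 0.0], k=2)
print([doc_id for _, doc_id, _ in top])  # ['chunk_1', 'chunk_2']
```

The record shape here (ID + vector + original text + metadata) mirrors the per-chunk layout in the diagram above.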
# ============================================================================
# Part 5: How the HNSW index works
# ============================================================================
print("\n" + "=" * 80)
print("🔗 Part 5: HNSW index - the secret of fast retrieval")
print("=" * 80)

print("""
HNSW = Hierarchical Navigable Small World
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Why do we need an index?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Brute-force search: O(n) - compute the distance from the query to every vector
└─ 10000 vectors → 10000 distance computations
   └─ Too slow!

HNSW index: roughly O(log n) - navigate a hierarchical graph
└─ 10000 vectors → inspect only ~20-30 nodes
   └─ 100x+ faster!

HNSW structure (simplified example):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Layer 2 (sparsest)
    V₁ ←──────→ V₅ ←──────→ V₁₂
    ↓           ↓            ↓

Layer 1
    V₁ ←→ V₃ ←→ V₅ ←→ V₈ ←→ V₁₂
    ↓     ↓     ↓     ↓      ↓

Layer 0 (densest)
    V₁ ← V₂ ← V₃ ← V₄ ← V₅ ← V₆ ← ... ← V₁₂
    every vector lives in this layer

Search process:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Query vector: Q = [0.2, -0.3, 0.5, ...]

Step 1: Start at Layer 2 (coarse search)
    Entry point: V₁
    → compute dist(Q, V₁), dist(Q, V₅), dist(Q, V₁₂)
    → V₅ is closest → jump to V₅

Step 2: Descend to Layer 1 (medium precision)
    Start from V₅
    → check neighbors V₃, V₈
    → V₈ is closest → jump to V₈

Step 3: Descend to Layer 0 (high precision)
    Start from V₈
    → check all neighbors
    → find the K nearest vectors

Result: the Top K most similar chunks

Speed comparison (illustrative):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Brute-force: 10000 distance computations → ~100ms
HNSW index:  20-30 distance computations → ~1ms  ← ~100x faster!
""")

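The layer-by-layer greedy descent can be sketched on a hand-built graph. This is only an illustration of the navigation idea: real HNSW assigns nodes to layers probabilistically and keeps a candidate list instead of a single current node. The graph below hard-codes the V₁/V₃/V₅/V₈/V₁₂ layout from the diagram, with 1-D points so closeness is easy to eyeball:

```python
import math

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)

# 1-D toy vectors: V1..V12 sit at x = 1..12
points = {i: [float(i)] for i in range(1, 13)}

# Hand-built layers: upper layers are sparser subsets with neighbor links
layers = [
    {1: [5], 5: [1, 12], 12: [5]},                        # Layer 2 (sparsest)
    {1: [3], 3: [1, 5], 5: [3, 8], 8: [5, 12], 12: [8]},  # Layer 1
    {i: [j for j in (i - 1, i + 1) if 1 <= j <= 12]       # Layer 0 (chain)
     for i in range(1, 13)},
]

def greedy_search(query, entry=1):
    """Descend the layers, greedily hopping to the closest neighbor."""
    comparisons = 0
    current = entry
    for graph in layers:
        while True:
            candidates = graph[current]
            comparisons += len(candidates)
            best_nb = min(candidates, key=lambda n: dist(points[n], query))
            if dist(points[best_nb], query) < dist(points[current], query):
                current = best_nb  # keep hopping within this layer
            else:
                break  # local minimum: drop to the next, denser layer
    return current, comparisons

best, comparisons = greedy_search([9.2])
print(best)         # 9: the closest of V1..V12 to x = 9.2
print(comparisons)  # 11 here; the gap versus brute force grows with n
```

At 12 points the savings are tiny, but the same descent on millions of points touches only a logarithmic number of nodes, which is where the "100x+ faster" claim comes from.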
# ============================================================================
# Part 6: The retrieval process in detail
# ============================================================================
print("\n" + "=" * 80)
print("🔍 Part 6: Retrieval - from query to results")
print("=" * 80)

print("""
User query: "What is machine learning?"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Vectorize the query
─────────────────────────────────────────────────────────
"What is machine learning?"
        ↓
embeddings.embed_query("What is machine learning?")
        ↓
Query vector: [0.345, -0.678, 0.234, ...] (384-dim)


Step 2: HNSW approximate search
─────────────────────────────────────────────────────────
vectorstore.similarity_search(
    query="What is machine learning?",
    k=20   # return the Top 20
)
        ↓
Inside Chroma:
1. Vectorize the query
2. Navigate the HNSW graph
3. Compute cosine similarity
        ↓
Returns the Top 20 chunks:
┌──────────┬───────┬────────────────────────────────────────┐
│ Chunk ID │ Score │ Content                                │
├──────────┼───────┼────────────────────────────────────────┤
│ chunk_5  │ 0.92  │ "Machine learning is a subfield of..." │
│ chunk_2  │ 0.88  │ "AI includes machine learning..."      │
│ chunk_11 │ 0.85  │ "Supervised learning is a kind of..."  │
│ ...      │ ...   │ ...                                    │
└──────────┴───────┴────────────────────────────────────────┘


Step 3: CrossEncoder reranking (this project's distinguishing step)
─────────────────────────────────────────────────────────
reranker.rerank(query, top_20_chunks, top_k=5)
        ↓
Each chunk is rescored (deep query-document interaction)
        ↓
Final Top 5:
┌──────────┬───────┬────────────────────────────────────────┐
│ Chunk ID │ Score │ Content                                │
├──────────┼───────┼────────────────────────────────────────┤
│ chunk_5  │ 8.45  │ "Machine learning is a subfield of..." │
│ chunk_11 │ 7.89  │ "Supervised learning is a kind of..."  │
│ chunk_2  │ 7.23  │ "AI includes machine learning..."      │
│ chunk_14 │ 6.78  │ "Deep learning is a branch of..."      │
│ chunk_8  │ 6.12  │ "Reinforcement learning allows..."     │
└──────────┴───────┴────────────────────────────────────────┘


Step 4: Hand off to the LLM
─────────────────────────────────────────────────────────
context = "\\n\\n".join([chunk.page_content for chunk in top_5])
        ↓
LLM generates the answer
""")

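The retrieve-then-rerank shape of Steps 2-3 can be sketched with a stand-in scorer. The word-overlap score below is only a placeholder for illustration: the real reranker is a CrossEncoder model that feeds each (query, chunk) pair jointly through a transformer and emits one relevance logit:

```python
def crossencoder_score(query, chunk):
    """Stand-in for a CrossEncoder: toy word-overlap score.
    A real cross-encoder scores the (query, chunk) pair jointly."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words)

def rerank(query, candidates, top_k=5):
    """Rescore the first-stage candidates and keep the best top_k."""
    scored = sorted(candidates,
                    key=lambda chunk: crossencoder_score(query, chunk),
                    reverse=True)
    return scored[:top_k]

# Candidates as a fast first-stage retriever might return them
top_20 = [
    "reinforcement learning allows agents to learn from rewards",
    "machine learning is a subfield of artificial intelligence",
    "deep learning uses multi layer neural networks",
]
query = "what is machine learning"
print(rerank(query, top_20, top_k=2)[0])
# "machine learning is a subfield of artificial intelligence"
```

The key structural point survives the toy scorer: the first stage is cheap and recall-oriented (Top 20), the second stage is expensive but precise (Top 5), so the expensive model only ever sees a handful of candidates.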
# ============================================================================
# Part 7: Key technical details
# ============================================================================
print("\n" + "=" * 80)
print("⚙️ Part 7: Key technical details")
print("=" * 80)

print("""
1. Why normalize the vectors?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
encode_kwargs={'normalize_embeddings': True}

Raw vector:        [1.23, -4.56, 7.89, ...]  # lengths vary
After normalizing: [0.12, -0.45, 0.78, ...]  # length = 1

Benefits:
✅ Cosine similarity reduces to a dot product (faster to compute)
✅ All vectors live on the same scale
✅ Vector length no longer distorts similarity


2. Cosine similarity vs. Euclidean distance
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Cosine similarity (used by this project) ⭐:
    similarity = v₁ · v₂ / (||v₁|| × ||v₂||)
    Range: [-1, 1], where 1 means identical direction
    Property: cares about direction, ignores magnitude

Euclidean distance:
    distance = √Σ(v₁ᵢ - v₂ᵢ)²
    Range: [0, ∞), where 0 means identical vectors
    Property: cares about absolute position differences

After normalization the two produce the same ranking,
since for unit vectors distance² = 2 - 2 × similarity.


3. Batch processing
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Not recommended (slow):
    for chunk in chunks:
        vector = embed_documents([chunk])  # one call per chunk

Recommended (~10x faster) ⭐:
    vectors = embed_documents(chunks)      # one batched call
    └─ parallel computation on the GPU
    └─ less per-call overhead


4. Memory considerations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Choosing a vector dimension:
    384-dim (all-MiniLM-L6-v2) ← this project ⭐
    └─ balances accuracy vs. storage

    768-dim (BERT-base)
    └─ more accurate, but double the storage

    1024-dim (large models)
    └─ most accurate, but ~2.7x the storage

Storage math:
    20 chunks × 384 dims × 4 bytes = ~30KB
    1000 chunks × 384 dims × 4 bytes = ~1.5MB
    └─ very compact!
""")

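The claim in point 2 - that for unit vectors Euclidean distance and cosine similarity give the same ranking - follows from the identity ||a - b||² = ||a||² + ||b||² - 2(a·b) = 2 - 2(a·b). A quick numeric check:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = l2_normalize([1.23, -4.56, 7.89])
b = l2_normalize([0.50, 2.00, 6.50])

cos_sim = dot(a, b)           # for unit vectors, cosine = dot product
dist_sq = euclidean(a, b) ** 2

# ||a - b||^2 = 2 - 2 * cos_sim holds for unit-length a, b
print(round(dist_sq, 10) == round(2 - 2 * cos_sim, 10))  # True
```

Because distance² is a monotonically decreasing function of similarity, sorting by either metric returns the same neighbors - which is why normalizing at encode time lets the store use whichever is cheapest.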
# ============================================================================
# Part 8: Complete code flow
# ============================================================================
print("\n" + "=" * 80)
print("💻 Part 8: Complete code flow summary")
print("=" * 80)

print("""
This project's complete pipeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

# 1. Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

# 2. Split the documents
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250,
    chunk_overlap=50  # ← the value changed in this commit
)
doc_splits = text_splitter.split_documents(docs)

# 3. Vectorize + store in Chroma
vectorstore = Chroma.from_documents(
    documents=doc_splits,            # input: 20 chunks
    collection_name="rag-chroma",
    embedding=embeddings             # embedding function
)
# ↓ handled automatically inside:
#   - batch vectorization: chunks → 384-dim vectors
#   - storage: vectors + original text + metadata
#   - HNSW index construction

# 4. Create the retriever
retriever = vectorstore.as_retriever()

# 5. Retrieve
docs = retriever.get_relevant_documents("What is machine learning?")
# ↓ internal flow:
#   - vectorize the query
#   - fast HNSW search
#   - return the Top K chunks

# 6. CrossEncoder reranking (optional; this project uses it)
reranked = crossencoder.rerank(query, docs, top_k=5)

# 7. Feed the LLM to generate the answer
answer = llm.generate(context=docs, question=query)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")

# ============================================================================
# Part 9: Performance tuning suggestions
# ============================================================================
print("\n" + "=" * 80)
print("🚀 Part 9: Performance tuning suggestions")
print("=" * 80)

print("""
Current configuration scorecard:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Embedding model: all-MiniLM-L6-v2 (lightweight, efficient) ⭐⭐⭐⭐⭐
✅ Vector normalization: True (optimizes cosine similarity)   ⭐⭐⭐⭐⭐
✅ Index type: HNSW (fast retrieval)                          ⭐⭐⭐⭐⭐
✅ Chunk overlap: 50 (preserves context)                      ⭐⭐⭐⭐⭐
✅ CrossEncoder reranking (precise ordering)                  ⭐⭐⭐⭐⭐

Verdict: 🏆 production-grade configuration!

Optional optimizations (for further gains):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. GPU acceleration
   model_kwargs={'device': 'cuda'}  # ~10x faster vectorization

2. A larger embedding model (if higher accuracy is needed)
   "BAAI/bge-large-en-v1.5"         # 1024-dim, roughly +5% accuracy

3. Tune the batch size
   batch_size=32                    # speeds up vectorization

4. Persist the Chroma store
   persist_directory="./chroma_db"  # avoid re-vectorizing on restart
""")

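The storage trade-off behind suggestion 2 is just the arithmetic from Part 7. A small helper makes the raw float32 footprint explicit (a sketch; a real store adds index and metadata overhead on top of this):

```python
def embedding_storage_bytes(num_chunks, dims=384, bytes_per_float=4):
    """Raw float32 storage for the vectors alone (no index/metadata overhead)."""
    return num_chunks * dims * bytes_per_float

print(embedding_storage_bytes(20))               # 30720   (~30KB)
print(embedding_storage_bytes(1000))             # 1536000 (~1.5MB)
print(embedding_storage_bytes(1000, dims=1024))  # 4096000 - ~2.7x the 384-dim cost
```

So even the jump to a 1024-dim model keeps a thousand chunks around 4MB of vector data; the cost of larger models shows up mostly in encode time, not disk.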
print("\n" + "=" * 80)
print("✅ Walkthrough complete! You now understand the full pipeline from chunking to the vector database")
print("=" * 80)
print()