WangKaiLin committed on
Commit
2b2ade4
·
verified ·
1 Parent(s): e594efd

Upload 6 files

Files changed (6)
  1. README.md +106 -0
  2. config.json +6 -0
  3. engine.py +549 -0
  4. pipeowl.safetensors +3 -0
  5. quickstart.py +38 -0
  6. tokenizer.json +0 -0
README.md CHANGED
@@ -1,3 +1,109 @@
 ---
+ language:
+ - multilingual
+ tags:
+ - embeddings
+ - retrieval
+ - transformer-free
+ - safetensors
+ - edge-ai
 license: mit
 ---
+
+ # Pipeowl-1.10-multilingual (Geometric Embedding)
+
+ A transformer-free semantic retrieval engine.
+
+ Features:
+ - O(n) scoring over the vocabulary.
+ - No attention.
+ - No transformer weights.
+
+ ## Architecture
+
+ - Static embedding table (V × D)
+ - Aligned vocabulary index
+ - Linear scoring
+ - Pluggable decoder stage
+
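The pipeline these pieces form can be sketched with toy NumPy arrays (illustrative only: the table below is 4×3 rather than V×D, and the token names are invented):

```python
import numpy as np

# Toy semantic field: V=4 tokens, D=3 dims (sizes invented for illustration).
vocab = ["cat", "dog", "car", "sleep"]
emb = np.array([
    [1.0, 0.0, 0.0],   # cat
    [0.9, 0.1, 0.0],   # dog
    [0.0, 1.0, 0.0],   # car
    [0.0, 0.0, 1.0],   # sleep
], dtype=np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # row-normalize: dot == cosine

# Encode: the query is the normalized centroid of its token vectors.
q = emb[[0, 1]].mean(axis=0)
q /= np.linalg.norm(q) + 1e-12

# Linear scoring: one matrix-vector product over the whole vocabulary.
scores = emb @ q
top = np.argsort(-scores)[:2]
print(sorted(vocab[i] for i in top))   # -> ['cat', 'dog']
```

No attention and no per-query matrix products beyond the single `emb @ q` GEMV, which is where the O(n)-over-vocabulary claim comes from.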
+ ## Model Specs
+
+ | item | value |
+ |------|-------|
+ | vocab size | 734803 |
+ | embedding dim | 512 |
+ | storage format | safetensors (FP16) |
+ | data size | ~754 MB |
+ | languages | multilingual |
+ | startup time | ~912 ms |
+ | query latency | ~65-72 ms |
+
+ ## Quickstart
+
+ ```bash
+ git clone https://huggingface.co/WangKaiLin/PipeOwl-1.10-multilingual
+ cd PipeOwl-1.10-multilingual
+
+ pip install numpy safetensors
+
+ python quickstart.py
+ ```
+
+ ## Example
+
+ Example semantic retrieval results (interactive session from `quickstart.py`):
+
+ ```bash
+ 請輸入句子: 確實
+
+ Top-K Tokens:
+ 1.000 | 確實
+ 0.871 | 的確
+ 0.848 | 确实
+ 0.825 | 確かに
+ 0.796 | дійсно
+
+ 請輸入句子: 今天好想睡覺
+
+ Top-K Tokens:
+ 0.711 | 今天
+ 0.691 | 今天的
+ 0.677 | 睡觉
+ 0.658 | 睡覺
+ 0.653 | 今日は
+
+ 請輸入句子: i want to sleep
+
+ Top-K Tokens:
+ 0.735 | sleep
+ 0.686 | спать
+ 0.671 | schlafen
+ 0.642 | tidur
+ 0.638 | want
+
+ 請輸入句子: 哈囉你好阿
+
+ Top-K Tokens:
+ 0.823 | 哈囉
+ 0.808 | 你好
+ 0.777 | こんにちは
+ 0.767 | 嘿
+ 0.765 | 嗨
+ ```
+
+ ## Repository Structure
+
+ ```bash
+ PipeOwl-1.10-multilingual/
+ ├ README.md
+ ├ config.json
+ ├ LICENSE
+ ├ quickstart.py
+ ├ engine.py
+ ├ tokenizer.json
+ └ pipeowl.safetensors
+ ```
+
+ ## LICENSE
+
+ MIT
config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "model_type": "pipeowl",
+   "architecture": "semantic-field-retrieval",
+   "embedding_dim": 512,
+   "vocab_size": 734803
+ }
engine.py ADDED
@@ -0,0 +1,549 @@
+ from __future__ import annotations
+
+ import json
+ import math
+ import os
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Dict, List, Optional, Tuple
+
+ import numpy as np
+ from safetensors.numpy import load_file
+
+ BASE_DIR = Path(__file__).resolve()
+ # Load relative to this file so the engine works regardless of the CWD.
+ data = load_file(str(BASE_DIR.parent / "pipeowl.safetensors"))
+
+ @dataclass
+ class PipeOwlConfig:
+     """
+     Global configuration.
+
+     embeddings_path:
+         Base vector matrix of the semantic field (V, D).
+         V = vocabulary size
+         D = embedding dimension
+
+     delta_scalar_path:
+         One-dimensional field offset per token (V,).
+         Used as a score offset (currently a static bias).
+
+     vocab_path:
+         Vocab list; must be exactly aligned with the embedding order:
+         index i <-> emb[i] <-> delta[i]
+
+     alpha:
+         Weight of the base similarity.
+
+     beta:
+         Weight of delta (currently a logit bias, not a dynamic loss).
+
+     top_k:
+         Default number of retrieval results.
+
+     temperature:
+         Sampling temperature for the decode stage.
+
+     max_new_tokens:
+         Maximum generation length for decode.
+     """
+     ROOT_DIR = BASE_DIR.parent
+     vocab_path: str = str(ROOT_DIR / "tokenizer.json")
+
+     normalize_rows: bool = False    # True: enforce row-wise normalization so cosine == dot
+     ensure_contiguous: bool = True  # True: make emb C-contiguous for faster GEMV
+     max_token_len_cap: int = 32     # cap tokenizer max token length to avoid slow paths / garbage vocab entries
+
+     # =============================
+     alpha: float = 1.0
+     # =============================
+
+     # =============================
+     # scoring mode
+     # =============================
+     score_mode: str = "residual"
+     # options:
+     #   "linear"   -> α*base + β*delta + γ*syntax
+     #   "residual" -> α*base + (1 - α*base)*delta
+     # =============================
+
+     # =============================
+     # "linear"
+     #   score = α*base + β*delta + γ*syntax
+     # =============================
+     beta: float = 0.05
+     # gamma: float = 1.5
+     # =============================
+     # if linear:
+     #   α=0.97, β=0.03 performs well
+     #   α=1, β=0.00 is identical to the raw model
+     # =============================
+
+     # =============================
+     # "residual"
+     #   score = α*base + (1 - α*base)*delta
+     # =============================
+     # if residual:
+     #   α=1 is identical to the raw model
+     #   α=0.9 performs well
+     #   α=0.5 surfaces more loosely related meanings
+     # =============================
+
+     # =============================
+     # retrieval
+     # =============================
+     top_k: int = 16
+     # =============================
+
+     # =============================
+     # decode
+     # =============================
+     temperature: float = 0.13
+     # temperature = 0.13 performs well
+     max_new_tokens: int = 64
+     # =============================
+
+ # Reference implementation; PipeOwlEngine.logits_to_probs performs the same computation.
+ # def softmax(scores, temperature=0.3):
+ #     scores = np.array(scores) / temperature
+ #     exp = np.exp(scores - np.max(scores))
+ #     return exp / exp.sum()
+
+ def eval_token_nll(engine, text):
+     """Average negative log-likelihood (bits per token) of `text` under the engine."""
+     tokens = engine.tokenizer.tokenize(text)
+     if len(tokens) < 2:
+         return float("inf")
+
+     total_bits = 0.0
+     count = 0
+
+     for i in range(len(tokens) - 1):
+         context = "".join(tokens[:i + 1])
+         target_token = tokens[i + 1]
+
+         q = engine.encode(context)
+         logits = engine.score_vocab(q)
+         probs = engine.logits_to_probs(logits)
+
+         idx = engine.token_to_id.get(target_token)
+         p = float(probs[idx]) if idx is not None else 1e-9
+         p = max(p, 1e-9)
+
+         total_bits += -math.log2(p)
+         count += 1
+
+     return total_bits / count
+
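`eval_token_nll` reduces to averaging `-log2 p` over next-token transitions; the core arithmetic, standalone with made-up probabilities (no engine required):

```python
import math

# Toy next-token probabilities for three transitions (made-up numbers).
step_probs = [0.5, 0.25, 0.125]

# -log2(p) is the "surprise" in bits; averaging gives bits per token.
total_bits = sum(-math.log2(p) for p in step_probs)
avg_bits = total_bits / len(step_probs)
print(avg_bits)   # (1 + 2 + 3) / 3 = 2.0 bits per token
```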
+ # semanticizer
+ class VocabTokenizer:
+     """
+     Greedy longest-match string tokenizer.
+
+     Goal:
+         Split input text into tokens that exist in the vocab.
+
+     Method:
+         - Longest-match-first over substrings.
+
+     Suitable when:
+         The vocab is character/word level and already aligned with the embeddings.
+     """
+     def __init__(self, vocab_list, *, max_len_cap: Optional[int] = None):
+         self.vocab_set = set(vocab_list)
+
+         mx = max(len(t) for t in vocab_list)
+         if max_len_cap is not None:
+             mx = min(mx, int(max_len_cap))
+         self.max_len = mx
+
+     def tokenize(self, text):
+         text = text.lower().strip()
+
+         tokens = []
+         i = 0
+         n = len(text)
+
+         while i < n:
+             matched = False
+
+             for L in range(self.max_len, 0, -1):
+                 if i + L <= n:
+                     piece = text[i:i+L]
+
+                     if piece in self.vocab_set:
+                         tokens.append(piece)
+                         i += L
+                         matched = True
+                         break
+
+             if not matched:
+                 # fallback: emit a single character (last resort)
+                 tokens.append(text[i])
+                 i += 1
+
+         return tokens
+
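A standalone trace of the longest-match-first loop (a minimal re-implementation for illustration, mirroring `tokenize` above, on an invented toy vocab):

```python
def greedy_tokenize(text, vocab, max_len):
    # Longest-match-first: try the longest candidate substring at each position;
    # fall back to a single character when nothing in the vocab matches.
    tokens, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in vocab:
                tokens.append(text[i:i + L])
                i += L
                break
        else:  # no break: no vocab entry matched at position i
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"sleep", "want", "i", "to"}
print(greedy_tokenize("iwanttosleepz", vocab, max_len=5))
# "z" is not in the vocab, so it falls through as a single character
```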
+ class PipeOwlEngine:
+     """
+     Core of the PipeOwl geometric semantic engine.
+
+     Design philosophy:
+         index = coordinate in the semantic field
+
+         emb[i]   -> token vector
+         delta[i] -> field offset of the token
+         vocab[i] -> the token itself
+
+     Core flow:
+         text -> tokenize -> mean embedding -> score = alpha*base + beta*delta -> top-k -> decode
+
+     In short: a field-based retrieval language system.
+     """
+
+     def __init__(self, cfg: PipeOwlConfig):
+         self.cfg = cfg
+
+         self.emb = data["embeddings"].astype(np.float32)     # (V, D) float32
+         self.delta = data["delta_field"].astype(np.float32)  # (V,) float32
+         self.token_to_id: Dict[str, int] = {}
+         self.id_to_token: List[str] = []
+
+         # Decoder (optional)
+         self.decoder = MicroGPTDecoder()  # inference-only stub; plug your trained weights later
+
+         self._load_assets()
+
+     # -------------------------
+     # asset loading
+     # -------------------------
+
+     def _load_assets(self) -> None:
+         """
+         Load the semantic-field assets.
+
+         Loads:
+             1. embeddings (V, D)
+             2. delta scalar (V,)
+             3. vocab list (V,)
+
+         Key assumption:
+             All three must be exactly index-aligned.
+
+         Geometric meaning:
+             Every index i corresponds to a fixed field point in semantic space.
+         """
+         if not os.path.exists(self.cfg.vocab_path):
+             raise FileNotFoundError(self.cfg.vocab_path)
+
+         emb = self.emb
+
+         # embeddings: (V, D)
+         if emb.dtype != np.float32:
+             emb = emb.astype(np.float32, copy=False)
+
+         # make C-contiguous for faster GEMV
+         if self.cfg.ensure_contiguous and not emb.flags["C_CONTIGUOUS"]:
+             emb = np.ascontiguousarray(emb)
+
+         if self.cfg.normalize_rows:
+             norms = np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12
+             emb = emb / norms
+
+         self.emb = emb  # keep the processed (contiguous / normalized) matrix
+
+         # delta: (V,)
+         if self.delta.dtype != np.float32:
+             self.delta = self.delta.astype(np.float32, copy=False)
+
+         if self.emb.ndim != 2:
+             raise ValueError(f"embeddings must be 2D (V, D), got shape={self.emb.shape}")
+
+         # (V, D)
+         V, _ = self.emb.shape
+
+         if self.delta.ndim != 1 or self.delta.shape[0] != V:
+             raise ValueError(f"delta must be shape (V,), got {self.delta.shape}, expected ({V},)")
+
+         # vocab json: build token_to_id and id_to_token
+         with open(self.cfg.vocab_path, "r", encoding="utf-8-sig") as f:
+             vocab_list = json.load(f)
+
+         if not isinstance(vocab_list, list):
+             raise ValueError("vocab must be a list for geometric field mode")
+
+         if len(vocab_list) != V:
+             raise ValueError(f"vocab size {len(vocab_list)} != embeddings V {V}")
+
+         self.vocab = vocab_list
+         self.id_to_token = vocab_list
+         self.token_to_id = {t: i for i, t in enumerate(vocab_list)}
+
+         self.tokenizer = VocabTokenizer(self.vocab, max_len_cap=self.cfg.max_token_len_cap)
+
+     # -------------------------
+     # encode (from vector library)
+     # -------------------------
+
+     def encode(self, text: str):
+         """
+         Project text into the semantic field.
+
+         Flow:
+             1. tokenize -> token list
+             2. look up emb for each token
+             3. mean pooling (length-weighted)
+             4. normalize
+
+         Mathematical form:
+             q = normalize( mean( emb[token_i] ) )
+
+         Geometric meaning:
+             Compute the centroid in the semantic field.
+
+         Risk:
+             - mean pooling dilutes directionality
+         """
+         # exact-token fast path (prevents "貓頭鷹 = mean(貓, 頭, 鷹)" pollution)
+         idx0 = self.token_to_id.get(text)
+         if idx0 is not None:
+             v = self.emb[idx0].astype(np.float32, copy=False)
+             # rows are already normalized when cfg.normalize_rows=True; normalize again to be safe
+             v = v / (np.linalg.norm(v) + 1e-12)
+             return v
+
+         tokens = self.tokenizer.tokenize(text)
+         if not tokens:
+             return np.zeros(self.emb.shape[1], dtype=np.float32)
+
+         vecs = []
+         wts = []
+
+         for t in tokens:
+             idx = self.token_to_id.get(t)
+             if idx is None:
+                 continue
+
+             vecs.append(self.emb[idx])
+             wts.append(max(1, len(t)))
+
+         if not vecs:
+             return np.zeros(self.emb.shape[1], dtype=np.float32)
+
+         vecs = np.stack(vecs, axis=0).astype(np.float32, copy=False)
+         wts = np.asarray(wts, dtype=np.float32)
+         q = np.average(vecs, axis=0, weights=wts)
+         q /= (np.linalg.norm(q) + 1e-12)
+         return q
+
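The length-weighted pooling in `encode` is `np.average` with token lengths as weights; a standalone sketch on two invented 2-D vectors:

```python
import numpy as np

# Two toy token vectors; the longer token gets proportionally more weight,
# mirroring wts.append(max(1, len(t))) in encode().
vecs = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
wts = np.array([3.0, 1.0], dtype=np.float32)  # e.g. a 3-char token vs a 1-char token

q = np.average(vecs, axis=0, weights=wts)  # weighted centroid: [0.75, 0.25]
q /= np.linalg.norm(q) + 1e-12             # unit-normalize, as encode() does
print(q)  # direction (3, 1) / sqrt(10), roughly [0.9487, 0.3162]
```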
+     # -------------------------
+     # probs (decode)
+     # -------------------------
+
+     def logits_to_probs(self, logits: np.ndarray, temperature: Optional[float] = None) -> np.ndarray:
+         T = self.cfg.temperature if temperature is None else float(temperature)
+         x = logits.astype(np.float64) / max(T, 1e-8)
+         x = x - np.max(x)  # subtract max for numerical stability
+         exp_x = np.exp(x)
+         return (exp_x / np.sum(exp_x)).astype(np.float32)
+
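The effect of the temperature divisor can be checked on toy logits: a low T (such as the default 0.13) makes sampling near-greedy, while a high T flattens the distribution. A standalone sketch, independent of the engine:

```python
import numpy as np

def softmax_t(logits, T):
    x = np.asarray(logits, dtype=np.float64) / T
    x -= x.max()          # stability shift; does not change the result
    e = np.exp(x)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
p_cool = softmax_t(logits, T=0.13)  # near-greedy: mass collapses onto the top logit
p_warm = softmax_t(logits, T=2.0)   # flatter distribution
print(p_cool[0], p_warm[0])
```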
+     # -------------------------
+     # loss / scoring (delta)
+     # -------------------------
+     def score_vocab(self, q: np.ndarray, alpha: Optional[float] = None, beta: Optional[float] = None) -> np.ndarray:
+         """
+         Compute the field score of every vocab token.
+
+         base:
+             emb @ q
+             If emb and q are normalized, this is cosine similarity.
+
+         delta:
+             Static field offset of each token.
+
+         Current semantics:
+             delta is a logit bias.
+             It is not a loss and not an energy gradient.
+         """
+         a = self.cfg.alpha if alpha is None else float(alpha)
+         b = self.cfg.beta if beta is None else float(beta)
+
+         base = self.emb @ q
+
+         if self.cfg.score_mode == "linear":
+             score = a * base + b * self.delta
+
+         elif self.cfg.score_mode == "residual":
+             score = a * base + (1 - a * base) * self.delta
+
+         else:
+             raise ValueError(f"Unknown score_mode: {self.cfg.score_mode}")
+
+         return score.astype(np.float32, copy=False)
+
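A standalone comparison of the two score modes, mirroring the formulas above (the similarity and offset values are made up):

```python
import numpy as np

base = np.array([0.9, 0.5, 0.1])   # toy cosine similarities
delta = np.array([0.0, 0.2, 0.2])  # toy static field offsets
alpha, beta = 1.0, 0.05

linear = alpha * base + beta * delta
residual = alpha * base + (1 - alpha * base) * delta

print(linear)    # delta adds a small constant bias: [0.9, 0.51, 0.11]
print(residual)  # delta matters more where base is low: [0.9, 0.6, 0.28]
```

In residual mode the `(1 - a*base)` factor scales the offset by the remaining headroom, so tokens with weak base similarity receive a larger delta boost than in linear mode.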
+     def topk(self, score: np.ndarray, k: Optional[int] = None) -> List[Tuple[str, float]]:
+         """
+         Take the k highest-scoring tokens.
+
+         Uses argpartition for efficiency.
+
+         Returns:
+             [(token_string, score), ...]
+
+         Geometric meaning:
+             Find the field points closest to the query vector (including the field offset).
+
+         Note:
+             score can exceed 1 (because delta is added).
+         """
+         k = self.cfg.top_k if k is None else int(k)
+         k = max(1, min(k, score.shape[0]))
+
+         # argpartition for speed; then sort only the k selected entries
+         idx = np.argpartition(-score, k - 1)[:k]
+         idx = idx[np.argsort(-score[idx])]
+
+         out = []
+         for i in idx:
+             tok = self.id_to_token[i] if i < len(self.id_to_token) else str(i)
+             out.append((tok, float(score[i])))
+         return out
+
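The two-step selection above (partition first, then sort only the k survivors) can be verified standalone; `np.argpartition` runs in O(V) versus O(V log V) for a full sort, which matters at V = 734803:

```python
import numpy as np

score = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
k = 3

# Step 1: argpartition moves the k largest scores' indices to the front, unordered.
idx = np.argpartition(-score, k - 1)[:k]
# Step 2: fully sort just those k indices by descending score.
idx = idx[np.argsort(-score[idx])]

print(idx.tolist())   # -> [1, 3, 2], i.e. scores 0.9, 0.7, 0.4
```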
+     # -------------------------
+     # decode (microgpt inference-only)
+     # -------------------------
+     def decode(self, prompt_tokens: List[str], *, temperature: Optional[float] = None, max_new_tokens: Optional[int] = None) -> str:
+         """
+         Decode stage.
+
+         Current behavior:
+             Join the top tokens into a prompt string and hand it to the microgpt stub.
+
+         Design intent:
+             Keep retrieval and generation separate.
+
+         Status:
+             microgpt has no real weights attached yet;
+             this is only a pipeline placeholder.
+         """
+         prompt = " ".join([t for t in prompt_tokens if t])
+         return self.decoder.generate(
+             prompt=prompt,
+             temperature=self.cfg.temperature if temperature is None else float(temperature),
+             max_new_tokens=self.cfg.max_new_tokens if max_new_tokens is None else int(max_new_tokens),
+         )
+
+     # -------------------------
+     # one-shot pipeline
+     # -------------------------
+     def pipeowl(
+         self,
+         text: str,
+         *,
+         top_k: Optional[int] = None,
+         alpha: Optional[float] = None,
+         beta: Optional[float] = None,
+         temperature: Optional[float] = None,
+         max_new_tokens: Optional[int] = None,
+     ) -> Dict[str, object]:
+         """
+         One complete pipeline pass.
+
+         Flow:
+             text -> encode -> score_vocab -> topk -> decode
+
+         Returns:
+             {
+                 "query": the original text,
+                 "retrieved": top-k tokens + scores,
+                 "prompt": the token string used for decode,
+                 "decoded": the generated result,
+             }
+
+         One complete observation of the semantic field for a query.
+         """
+         q = self.encode(text)
+         s = self.score_vocab(q, alpha=alpha, beta=beta)
+         retrieved = self.topk(s, k=top_k)
+
+         # build a prompt from top tokens (simple & deterministic)
+         prompt_tokens = [t for (t, _) in retrieved[: min(len(retrieved), 8)]]
+
+         # pass overrides through instead of mutating self.cfg for later calls
+         decoded = self.decode(prompt_tokens, temperature=temperature, max_new_tokens=max_new_tokens)
+         return {
+             "query": text,
+             "retrieved": retrieved,
+             "prompt": " ".join(prompt_tokens),
+             "decoded": decoded,
+         }
+
+
+ # ----------------------------------------------------------------------
+ # microgpt inference-only stub
+ # ----------------------------------------------------------------------
+ class MicroGPTDecoder:
+     """
+     Placeholder decoder for the inference stage.
+
+     Purpose:
+         Keep the pipeline runnable; later it can be replaced with:
+         - a trained microGPT
+         - an external LLM
+         - or a field-driven sampling model
+
+     For now it is only scaffolding.
+
+     Inference-only placeholder.
+
+     Why placeholder?
+     - Your pasted microGPT file trains its own weights in-process.
+     - For a real decode stage, you want:
+       (A) load a trained state_dict from disk, OR
+       (B) keep a tiny trained model in memory, OR
+       (C) use microGPT purely as a sampler over a learned char vocab.
+
+     This class is the stable interface. Plug your implementation later.
+     """
+
+     def __init__(self):
+         # If you already have trained weights, add:
+         # self.state_dict = load(...)
+         pass
+
+     def generate(self, prompt: str, temperature: float = 0.8, max_new_tokens: int = 64) -> str:
+         # Minimal safe fallback: return the prompt as the "decoded" scaffold.
+         # Replace this with your microgpt forward+sampling once you have weights.
+         # (This keeps the pipeline callable today.)
+         return f"[microgpt_stub] {prompt}"
pipeowl.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7aad2dab21d1d1e7322ad1304d8e38f16e1ab92e42b714337c7d1701bf3d6b96
+ size 753908174
quickstart.py ADDED
@@ -0,0 +1,38 @@
+ from engine import PipeOwlEngine, PipeOwlConfig
+ import time
+
+ # === timetest ===
+ # t0 = time.perf_counter()
+ # ================
+
+ engine = PipeOwlEngine(PipeOwlConfig())
+
+ # === timetest ===
+ # t1 = time.perf_counter()
+ # print(f"\n🚀 Cold start time: {(t1 - t0)*1000:.2f} ms\n")
+ #
+ # for _ in range(20):
+ #     t0 = time.perf_counter()
+ #     engine.pipeowl("雪鴞")
+ #     print((time.perf_counter() - t0) * 1000, "ms")
+ # ================
+
+ while True:
+
+     print()
+     query = input("請輸入句子: ")
+
+     out = engine.pipeowl(query, top_k=5)
+
+     print("\nTop-K Tokens:")
+     for text, score in out["retrieved"]:
+         print(f"{score:.3f} | {text}")
+
+     # print("\nDecoded:")
+     # print(out["decoded"])
+
+     print()
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff