LisaMegaWatts committed (verified)
Commit 0df9b5d · Parent(s): 39b44a3

Initial space setup: GPT-2 style OpenAI-compatible server

Files changed (4)
  1. Dockerfile +16 -0
  2. README.md +38 -5
  3. requirements.txt +6 -0
  4. server.py +429 -0
Dockerfile ADDED
@@ -0,0 +1,16 @@
+FROM python:3.11-slim
+
+RUN useradd -m -u 1000 user
+
+WORKDIR /home/user/app
+COPY --chown=user requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+COPY --chown=user server.py .
+
+USER user
+ENV HOME=/home/user
+
+EXPOSE 7860
+
+CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "7860"]
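A quick way to smoke-test this image locally before pushing is to build and run it, then hit the health endpoint. A minimal sketch — the image tag and the `requests` dependency are illustrative assumptions, not part of this commit:

```python
# Build and run locally first (shell):
#   docker build -t juliagpt-v2 .
#   docker run -p 7860:7860 juliagpt-v2
import requests  # assumed installed in the local dev environment

# GET / returns the health/info payload defined in server.py's root() handler.
resp = requests.get("http://localhost:7860/", timeout=10)
resp.raise_for_status()
info = resp.json()
print(info["name"], "-", info["model"]["params"], "params")
```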
README.md CHANGED
@@ -1,10 +1,43 @@
 ---
-title: JuliaGPT V2 Space
-emoji: 💻
-colorFrom: indigo
-colorTo: gray
+title: JuliaGPT-v2
+emoji: "🧠"
+colorFrom: blue
+colorTo: purple
 sdk: docker
+app_port: 7860
 pinned: false
+license: mit
+tags:
+- julia
+- flux-jl
+- gpt2-style
+- philosophy
+- openai-compatible
+- char-level
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# JuliaGPT-v2 Space
+
+GPT-2 style decoder model (384d, 6L, 6H) trained on classical philosophy. Character-level tokenizer (38 chars). Trained in Julia/Flux.jl, served via PyTorch.
+
+## Endpoints
+
+- `GET /` — Health check and model info
+- `GET /v1/models` — List available models
+- `POST /v1/chat/completions` — Generate text (supports streaming)
+
+## Usage
+
+```bash
+curl -X POST https://LisaMegaWatts-JuliaGPT-v2-space.hf.space/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
+```
+
+## Architecture
+
+- **Model**: 384d embed, 6 layers, 6 heads, ~4.7M params
+- **Tokenizer**: Character-level (38 chars)
+- **Normalization**: LayerNorm (pre-norm)
+- **Feed-forward**: GELU activation
+- **Framework**: Flux.jl (training) / PyTorch (serving)
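The endpoint also streams Server-Sent Events when `"stream": true` is passed. A minimal Python client sketch for consuming that stream — the `requests` package is an assumption; any SSE-capable HTTP client works:

```python
import json
import requests

url = "https://LisaMegaWatts-JuliaGPT-v2-space.hf.space/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "stream": True,
}

# Each SSE event is a "data: {...}" line; the stream ends with "data: [DONE]".
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
print()
```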
requirements.txt ADDED
@@ -0,0 +1,6 @@
+fastapi>=0.110.0
+uvicorn>=0.29.0
+torch>=2.0.0
+h5py>=3.10.0
+huggingface_hub>=0.20.0
+pydantic>=2.0.0
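For a quick sanity check that an installed environment meets these version floors, one option is a standard-library-only sketch:

```python
# Print installed versions of each pinned dependency (stdlib only, Python 3.8+).
from importlib.metadata import version

for pkg in ["fastapi", "uvicorn", "torch", "h5py", "huggingface_hub", "pydantic"]:
    print(pkg, version(pkg))
```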
server.py ADDED
@@ -0,0 +1,429 @@
+"""
+server.py — JuliaGPT-v2 OpenAI-compatible inference server
+Serves POST /v1/chat/completions (streaming + non-streaming) and GET /v1/models.
+
+Loads the Flux.jl GPT-2 model from best_model.jld2 on HF Hub.
+Architecture: GPT-2 style — LayerNorm, GELU, combined QKV, learned position embeddings.
+6 layers, 384-dim, 6 heads, 38-char vocab, val_loss=2.91.
+
+Weights are extracted from JLD2 (HDF5-based) via h5py, loaded into PyTorch.
+Follows the RandyGPT FastAPI/uvicorn pattern for proven HF Spaces compatibility.
+"""
+
+import json
+import math
+import time
+import uuid
+import os
+import h5py
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from pathlib import Path
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.responses import JSONResponse, StreamingResponse
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.exceptions import RequestValidationError
+from pydantic import BaseModel
+from typing import List, Optional
+from huggingface_hub import hf_hub_download
+
+
+# ── Model definition (GPT-2 style, matches Flux training) ────────────────────
+
+class CausalSelfAttention(nn.Module):
+    def __init__(self, n_embd, n_head):
+        super().__init__()
+        self.n_head = n_head
+        self.head_dim = n_embd // n_head
+        self.scale = 1.0 / math.sqrt(self.head_dim)
+        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
+        self.proj = nn.Linear(n_embd, n_embd, bias=False)
+
+    def forward(self, x):
+        B, T, C = x.shape
+        qkv = self.qkv(x)
+        q, k, v = qkv.split(C, dim=-1)
+        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
+        scores = q @ k.transpose(-2, -1) * self.scale
+        mask = torch.full((T, T), float('-inf'), device=x.device).triu(1)
+        attn = F.softmax(scores + mask, dim=-1)
+        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
+        return self.proj(out)
+
+
+class FeedForward(nn.Module):
+    def __init__(self, n_embd):
+        super().__init__()
+        self.fc1 = nn.Linear(n_embd, 4 * n_embd, bias=False)
+        self.fc2 = nn.Linear(4 * n_embd, n_embd, bias=False)
+
+    def forward(self, x):
+        return self.fc2(F.gelu(self.fc1(x)))
+
+
+class TransformerBlock(nn.Module):
+    def __init__(self, n_embd, n_head):
+        super().__init__()
+        self.ln1 = nn.LayerNorm(n_embd)
+        self.attn = CausalSelfAttention(n_embd, n_head)
+        self.ln2 = nn.LayerNorm(n_embd)
+        self.ffwd = FeedForward(n_embd)
+
+    def forward(self, x):
+        x = x + self.attn(self.ln1(x))
+        x = x + self.ffwd(self.ln2(x))
+        return x
+
+
+class GPT(nn.Module):
+    def __init__(self, vocab_size, n_embd, n_head, n_layer, block_size):
+        super().__init__()
+        self.block_size = block_size
+        self.wte = nn.Embedding(vocab_size, n_embd)
+        self.wpe = nn.Embedding(block_size, n_embd)
+        self.blocks = nn.ModuleList([TransformerBlock(n_embd, n_head) for _ in range(n_layer)])
+        self.ln_f = nn.LayerNorm(n_embd)
+        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
+
+    def forward(self, ids):
+        B, T = ids.shape
+        x = self.wte(ids) + self.wpe(torch.arange(T, device=ids.device).unsqueeze(0))
+        for block in self.blocks:
+            x = block(x)
+        x = self.ln_f(x)
+        return self.lm_head(x)
+
+    @torch.no_grad()
+    def generate_stream(self, ids, max_new_tokens=200, temperature=0.1,
+                        top_k=8, repetition_penalty=1.3):
+        self.eval()
+        generated = []
+        for i in range(max_new_tokens):
+            ctx = ids[:, -self.block_size:]
+            logits = self(ctx)[:, -1, :]
+            logits = logits[0]
+
+            if repetition_penalty > 1.0:
+                seen = set()
+                for t in generated[-self.block_size:]:
+                    seen.add(t)
+                for t in ctx[0].tolist():
+                    seen.add(t)
+                for t in seen:
+                    if 0 <= t < logits.shape[0]:
+                        if logits[t] > 0:
+                            logits[t] /= repetition_penalty
+                        else:
+                            logits[t] *= repetition_penalty
+
+            logits = logits / max(temperature, 0.01)
+
+            if top_k > 0 and top_k < logits.shape[0]:
+                topk_vals, _ = torch.topk(logits, top_k)
+                logits[logits < topk_vals[-1]] = float('-inf')
+
+            probs = F.softmax(logits, dim=-1)
+            nxt = torch.multinomial(probs, 1)
+            ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
+            token_id = nxt.item()
+            generated.append(token_id)
+            is_last = (i == max_new_tokens - 1)
+            yield token_id, is_last
+
+    @torch.no_grad()
+    def generate(self, ids, max_new_tokens=200, temperature=0.1,
+                 top_k=8, repetition_penalty=1.3):
+        self.eval()
+        generated = []
+        for token_id, _ in self.generate_stream(ids, max_new_tokens, temperature,
+                                                top_k, repetition_penalty):
+            generated.append(token_id)
+        return generated
+
+
+# ── Char-level tokenizer ──────────────────────────────────────────────────────
+
+class CharTokenizer:
+    def __init__(self, uchars):
+        self.uchars = uchars
+        self.stoi = {c: i for i, c in enumerate(uchars)}
+        self.itos = {i: c for i, c in enumerate(uchars)}
+        self.vocab_size = len(uchars)
+
+    def encode(self, text):
+        return [self.stoi[c] for c in text.lower() if c in self.stoi]
+
+    def decode(self, ids):
+        return "".join(self.itos.get(i, "?") for i in ids)
+
+
+# ── Load JLD2 weights via h5py ───────────────────────────────────────────────
+
+def load_jld2_gpt2(jld2_path, vocab_path=None):
+    """Load Flux GPT-2 weights from JLD2, build PyTorch model."""
+    print(f"Loading JLD2 from {jld2_path} ...")
+    f = h5py.File(jld2_path, "r")
+    ms = f["model_state"][()]
+
+    def deref(ref):
+        return np.array(f[ref])
+
+    # Get architecture params
+    b1 = ms["blocks"]["layers"]["1"]
+    n_head = int(b1["attn"]["n_head"])
+    wte_w = deref(ms["wte"]["weight"])
+    vocab_size, n_embd = wte_w.shape
+    wpe_w = deref(ms["wpe"]["weight"])
+    block_size = wpe_w.shape[0]
+
+    layer_names = sorted(ms["blocks"]["layers"].dtype.names, key=int)
+    n_layer = len(layer_names)
+
+    step = int(f["step"][()])
+    best_val = float(f["best_val_loss"][()])
+
+    print(f" vocab={vocab_size}, embd={n_embd}, heads={n_head}, layers={n_layer}, block={block_size}")
+    print(f" step={step}, best_val_loss={best_val:.4f}")
+
+    # Build PyTorch model
+    model = GPT(vocab_size, n_embd, n_head, n_layer, block_size)
+
+    state = {}
+    # Embeddings: h5py (vocab, embd) = PyTorch (vocab, embd), no transpose
+    state["wte.weight"] = torch.tensor(wte_w, dtype=torch.float32)
+    state["wpe.weight"] = torch.tensor(wpe_w, dtype=torch.float32)
+
+    # Dense weights: h5py gives (in, out) due to Julia column-major → need .T for PyTorch (out, in)
+    for i, lname in enumerate(layer_names):
+        layer = ms["blocks"]["layers"][lname]
+
+        # LayerNorm (1D, no transpose)
+        state[f"blocks.{i}.ln1.weight"] = torch.tensor(deref(layer["ln1"]["diag"]["scale"]), dtype=torch.float32)
+        state[f"blocks.{i}.ln1.bias"] = torch.tensor(deref(layer["ln1"]["diag"]["bias"]), dtype=torch.float32)
+        state[f"blocks.{i}.ln2.weight"] = torch.tensor(deref(layer["ln2"]["diag"]["scale"]), dtype=torch.float32)
+        state[f"blocks.{i}.ln2.bias"] = torch.tensor(deref(layer["ln2"]["diag"]["bias"]), dtype=torch.float32)
+
+        # Attention QKV + proj (transpose Dense weights)
+        state[f"blocks.{i}.attn.qkv.weight"] = torch.tensor(deref(layer["attn"]["qkv"]["weight"]).T.copy(), dtype=torch.float32)
+        state[f"blocks.{i}.attn.proj.weight"] = torch.tensor(deref(layer["attn"]["proj"]["weight"]).T.copy(), dtype=torch.float32)
+
+        # FeedForward (transpose Dense weights)
+        state[f"blocks.{i}.ffwd.fc1.weight"] = torch.tensor(deref(layer["ffwd"]["net"]["layers"]["1"]["weight"]).T.copy(), dtype=torch.float32)
+        state[f"blocks.{i}.ffwd.fc2.weight"] = torch.tensor(deref(layer["ffwd"]["net"]["layers"]["3"]["weight"]).T.copy(), dtype=torch.float32)
+
+    # Final LayerNorm
+    state["ln_f.weight"] = torch.tensor(deref(ms["ln_f"]["diag"]["scale"]), dtype=torch.float32)
+    state["ln_f.bias"] = torch.tensor(deref(ms["ln_f"]["diag"]["bias"]), dtype=torch.float32)
+
+    # Output projection (transpose Dense weight)
+    state["lm_head.weight"] = torch.tensor(deref(ms["lm_head"]["weight"]).T.copy(), dtype=torch.float32)
+
+    model.load_state_dict(state)
+    model.eval()
+    f.close()
+
+    params = sum(p.numel() for p in model.parameters())
+    print(f" PyTorch model loaded: {params:,} params")
+
+    # Load char vocab
+    tok = None
+    if vocab_path and os.path.exists(vocab_path):
+        uchars = json.loads(Path(vocab_path).read_text())
+        tok = CharTokenizer(uchars)
+        print(f" Loaded char vocab: {tok.vocab_size} chars")
+
+    return model, tok, {
+        "vocab_size": vocab_size, "n_embd": n_embd, "n_head": n_head,
+        "n_layer": n_layer, "block_size": block_size, "step": step,
+        "best_val_loss": best_val, "params": params,
+    }
+
+
+# ── Load model at startup ────────────────────────────────────────────────────
+
+REPO = os.environ.get("HF_REPO", "LisaMegaWatts/JuliaGPT-v2")
+MODEL_ID = "juliagpt-v2-philosophy"
+
+print(f"Downloading model from {REPO} ...")
+jld2_path = hf_hub_download(repo_id=REPO, filename="best_model.jld2")
+try:
+    vocab_path = hf_hub_download(repo_id=REPO, filename="vocab.json")
+except Exception:
+    vocab_path = None
+
+model, tok, hp = load_jld2_gpt2(jld2_path, vocab_path)
+n_embd = hp["n_embd"]
+n_head = hp["n_head"]
+n_layer = hp["n_layer"]
+block_size = hp["block_size"]
+vocab_size = hp["vocab_size"]
+
+# Fallback tokenizer if vocab.json missing
+if tok is None:
+    chars = [" ","!","\"","'","(",")",",","-",".",":",";","?","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
+    tok = CharTokenizer(chars)
+    print(f" Built fallback char vocab: {tok.vocab_size} chars")
+
+print(f"\nModel ready — {hp['params']:,} params, vocab={tok.vocab_size}, val_loss={hp['best_val_loss']:.4f}")
+
+
+# ── FastAPI app ───────────────────────────────────────────────────────────────
+
+app = FastAPI(title="JuliaGPT-v2", version="1.0.0")
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+
+def _openai_error(status, message, err_type="invalid_request_error", code=None):
+    body = {"error": {"message": message, "type": err_type}}
+    if code:
+        body["error"]["code"] = code
+    return JSONResponse(status_code=status, content=body)
+
+
+@app.exception_handler(HTTPException)
+async def http_exc(request, exc):
+    return _openai_error(exc.status_code, str(exc.detail))
+
+
+@app.exception_handler(RequestValidationError)
+async def val_exc(request, exc):
+    msg = "; ".join(f"{e['loc'][-1]}: {e['msg']}" for e in exc.errors())
+    return _openai_error(422, msg, code="invalid_request_error")
+
+
+@app.get("/")
+def root():
+    return {
+        "name": "JuliaGPT-v2",
+        "version": "1.0.0",
+        "description": "Flux.jl GPT-2 trained on classical philosophy — v2 (384d, 6L, 6H)",
+        "architecture": "GPT-2 (LayerNorm, GELU, combined QKV)",
+        "model": {
+            "vocab_size": tok.vocab_size, "n_embd": n_embd,
+            "n_layer": n_layer, "n_head": n_head,
+            "block_size": block_size, "params": hp["params"],
+        },
+        "endpoints": ["/v1/models", "/v1/chat/completions"],
+        "features": ["streaming", "OpenAI-compatible"],
+    }
+
+
+@app.get("/v1/models")
+def list_models():
+    return {
+        "object": "list",
+        "data": [{"id": MODEL_ID, "object": "model",
+                  "created": 1700000000, "owned_by": "juliagpt"}]
+    }
+
+
+class Message(BaseModel):
+    role: str
+    content: str
+
+class ChatRequest(BaseModel):
+    model: Optional[str] = MODEL_ID
+    messages: List[Message]
+    max_tokens: Optional[int] = 200
+    temperature: Optional[float] = 0.8
+    top_k: Optional[int] = 20
+    repetition_penalty: Optional[float] = 1.3
+    n: Optional[int] = 1
+    stream: Optional[bool] = False
+
+
+def _sse(data):
+    return f"data: {json.dumps(data)}\n\n"
+
+
+def _stream_completion(ids, max_tokens, temperature, top_k, rep_penalty,
+                       completion_id, _model, _tok):
+    yield _sse({
+        "id": completion_id, "object": "chat.completion.chunk",
+        "created": int(time.time()), "model": MODEL_ID,
+        "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""},
+                     "finish_reason": None}],
+    })
+
+    token_count = 0
+    for token_id, is_last in _model.generate_stream(
+        ids, max_new_tokens=max_tokens, temperature=temperature,
+        top_k=top_k, repetition_penalty=rep_penalty
+    ):
+        token_text = _tok.decode([token_id])
+        token_count += 1
+        finish_reason = ("length" if token_count >= max_tokens else "stop") if is_last else None
+        yield _sse({
+            "id": completion_id, "object": "chat.completion.chunk",
+            "created": int(time.time()), "model": MODEL_ID,
+            "choices": [{"index": 0, "delta": {"content": token_text},
+                         "finish_reason": finish_reason}],
+        })
+
+    yield "data: [DONE]\n\n"
+
+
+@app.post("/v1/chat/completions")
+def chat_completions(req: ChatRequest):
+    _m, _t = model, tok
+
+    prompt = req.messages[-1].content.strip() if req.messages else ""
+    if not prompt:
+        raise HTTPException(status_code=400, detail="No content in messages")
+
+    ids = _t.encode(prompt)
+    if not ids:
+        ids = [0]
+
+    max_tokens = max(1, min(req.max_tokens or 200, block_size))
+    temperature = max(0.01, min(req.temperature or 0.8, 2.0))
+    top_k = max(1, min(req.top_k or 20, tok.vocab_size))
+    rep_penalty = max(1.0, min(req.repetition_penalty or 1.3, 3.0))
+    n = max(1, min(req.n or 1, 4))
+    completion_id = f"chatcmpl-{uuid.uuid4().hex[:8]}"
+
+    tensor = torch.tensor([ids], dtype=torch.long)
+
+    if req.stream:
+        return StreamingResponse(
+            _stream_completion(tensor, max_tokens, temperature, top_k,
+                               rep_penalty, completion_id, _m, _t),
+            media_type="text/event-stream",
+            headers={"X-Accel-Buffering": "no"},
+        )
+
+    choices = []
+    total_completion_tokens = 0
+    for i in range(n):
+        generated = _m.generate(tensor.clone(), max_new_tokens=max_tokens,
+                                temperature=temperature, top_k=top_k,
+                                repetition_penalty=rep_penalty)
+        text = _t.decode(generated)
+        total_completion_tokens += len(generated)
+        choices.append({
+            "index": i,
+            "message": {"role": "assistant", "content": text},
+            "finish_reason": "length" if len(generated) >= max_tokens else "stop",
+        })
+
+    return {
+        "id": completion_id, "object": "chat.completion",
+        "created": int(time.time()), "model": MODEL_ID,
+        "system_fingerprint": "juliagpt-v2",
+        "choices": choices,
+        "usage": {
+            "prompt_tokens": len(ids),
+            "completion_tokens": total_completion_tokens,
+            "total_tokens": len(ids) + total_completion_tokens,
+        },
+    }
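Since the routes and payloads mirror the OpenAI chat completions API, the official `openai` Python client (v1+) should work against the Space as-is. A minimal sketch — the dummy `api_key` is an assumption; the server performs no authentication:

```python
from openai import OpenAI  # openai>=1.0 client, installed separately

client = OpenAI(
    base_url="https://LisaMegaWatts-JuliaGPT-v2-space.hf.space/v1",
    api_key="not-needed",  # server ignores auth, but the client requires a value
)

# MODEL_ID from server.py; extra sampling knobs like top_k would need raw HTTP.
resp = client.chat.completions.create(
    model="juliagpt-v2-philosophy",
    messages=[{"role": "user", "content": "the nature of"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```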