LisaMegaWatts committed on
Commit 6f2e71d · verified · 1 parent: b85c7dc

Restore Julia-native server (replace Python/FastAPI with Flux.jl + HTTP.jl)

Files changed (8)
  1. Dockerfile +31 -6
  2. Project.toml +7 -0
  3. README.md +39 -30
  4. checkpoint.jl +222 -0
  5. model.jl +290 -0
  6. requirements.txt +0 -7
  7. server.jl +312 -0
  8. server.py +0 -708
Dockerfile CHANGED

```diff
@@ -1,10 +1,35 @@
-FROM python:3.11-slim
+FROM julia:1.10-bookworm
+
+# HuggingFace Spaces requires user ID 1000
 RUN useradd -m -u 1000 user
-WORKDIR /home/user/app
-COPY --chown=user requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
-COPY --chown=user server.py .
+
+# Shared Julia depot for package caching
+ENV JULIA_DEPOT_PATH=/opt/julia-depot
+RUN mkdir -p /opt/julia-depot && chmod 777 /opt/julia-depot
+
+# Copy project file first for dependency caching
+COPY --chown=user Project.toml /home/user/app/
+
+# Install and precompile Julia packages
+RUN julia --project=/home/user/app -e ' \
+    using Pkg; \
+    Pkg.instantiate(); \
+    Pkg.precompile(); \
+    println("Precompile done")'
+
+# Copy application code
+COPY --chown=user model.jl /home/user/app/
+COPY --chown=user checkpoint.jl /home/user/app/
+COPY --chown=user server.jl /home/user/app/
+
+# Create checkpoints directory (model downloads from HF at runtime)
+RUN mkdir -p /home/user/app/checkpoints && chown user:user /home/user/app/checkpoints
+
+# Switch to non-root user
 USER user
 ENV HOME=/home/user
+WORKDIR /home/user/app
+
 EXPOSE 7860
+
+CMD ["julia", "--project=/home/user/app", "/home/user/app/server.jl"]
-CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "7860"]
```
Project.toml ADDED

```diff
@@ -0,0 +1,7 @@
+[deps]
+Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
+Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
+HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
+JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
+JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
+NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
```
README.md CHANGED

````diff
@@ -1,51 +1,60 @@
 ---
 title: JuliaFluxGPT
-emoji: 🏛️
-colorFrom: purple
-colorTo: red
+emoji: "\U0001F9E0"
+colorFrom: blue
+colorTo: purple
 sdk: docker
+app_port: 7860
 pinned: false
 license: mit
-short_description: LLaMA-style GPT in Flux.jl — philosophy text generation
-app_port: 7860
+tags:
+  - julia
+  - flux-jl
+  - llama-style
+  - rope
+  - swiglu
+  - gqa
+  - rmsnorm
+  - bpe
+  - philosophy
+  - openai-compatible
 ---
 
 # JuliaFluxGPT
 
-A LLaMA-style small language model built in Flux.jl, trained on classical philosophy and mathematics texts with a pre-punctuation 28-character vocabulary.
+A LLaMA-style decoder-only model (RoPE, GQA, RMSNorm, SwiGLU, weight-tied) trained on classical philosophy and mathematics texts, implemented in Julia with Flux.jl. Serves an OpenAI-compatible API with streaming support.
 
-**100% Julia — no Python dependencies.**
+## Endpoints
 
-## API
+- `GET /` — Health check and model info
+- `GET /v1/models` — List available models
+- `POST /v1/chat/completions` — Generate text (supports streaming, top-k, top-p)
 
-OpenAI-compatible inference endpoint:
+## Usage
 
 ```bash
-curl -X POST https://lisamegawatts-juliafluxgpt.hf.space/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"messages":[{"role":"user","content":"the nature of"}],"temperature":0.8,"max_tokens":200}'
+# Non-streaming
+curl -X POST https://your-space.hf.space/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "the nature of"}], "max_tokens": 200}'
+
+# Streaming
+curl -X POST https://your-space.hf.space/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "the nature of"}], "stream": true, "temperature": 0.7, "top_k": 40}'
 ```
 
-### Endpoints
-
-| Method | Path | Description |
-|--------|------|-------------|
-| GET | `/` | Health check + API info |
-| GET | `/v1/models` | List available models |
-| POST | `/v1/chat/completions` | Generate text (OpenAI format) |
-
 ## Architecture
 
-- LLaMA-style decoder-only transformer
-- RoPE (Rotary Positional Embeddings)
-- SwiGLU feed-forward blocks
-- GQA (Grouped Query Attention)
-- RMSNorm (pre-norm)
-- Weight tying (embedding = output projection)
-- BPE tokenizer with character-level fallback
+- **Model**: ~10M params, 512d embed, 8 layers, 8Q/2KV heads (GQA)
+- **Sequence mixing**: Grouped Query Attention + RoPE
+- **Tokenizer**: BPE (2000 tokens)
+- **Framework**: Flux.jl
+- **Normalization**: RMSNorm (pre-norm)
+- **Feed-forward**: SwiGLU activation
+- **Weight tying**: Shared embedding/output projection
 
-## Links
+## Environment Variables
 
-- [Training data](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus)
-- [Source code](https://github.com/DavinciDreams/JuliaGPT)
-- [JuliaGPT (autograd version)](https://huggingface.co/spaces/LisaMegaWatts/JuliaGPT)
+- `HF_REPO` — HuggingFace model repo (default: `LisaMegaWatts/JuliaFluxGPT`)
+- `PORT` — Server port (default: `7860`)
````
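The streaming mode the README describes emits standard OpenAI-style server-sent events: one `data: {...}` line per chunk, terminated by `data: [DONE]`. A minimal Python sketch of a client-side parser for that format (illustrative only — `parse_sse_deltas` is a hypothetical helper, not part of this repo):

```python
import json

def parse_sse_deltas(raw: str) -> str:
    """Collect `delta.content` pieces from an OpenAI-style SSE stream body.

    Each event is a line of the form `data: {...}`; the stream ends with
    the sentinel `data: [DONE]`.
    """
    out = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        out.append(delta.get("content", ""))
    return "".join(out)

# Simulated stream body in the shape server.jl produces.
sample = (
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "the nature"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " of being"}}]}\n\n'
    'data: [DONE]\n\n'
)
print(parse_sse_deltas(sample))  # → the nature of being
```

In practice a client would feed `iter_lines()` from a chunked HTTP response into the same loop instead of a pre-collected string.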
checkpoint.jl ADDED

@@ -0,0 +1,222 @@

```julia
#=
checkpoint.jl — Load Flux model checkpoints for JuliaFluxGPT

Loads JLD2 checkpoints saved by the juliaflux_v2 training notebook.
Supports BPE tokenizer (tokenizer.json format) with character-level fallback.

NOTE: The GPT struct no longer has TiedDense — weight tying is done in the
forward pass. This simplifies checkpoint loading: we load all components
normally and skip any lm_head key in the checkpoint (it's redundant since
the output projection uses wte.weight directly).
=#

include("model.jl")
using JLD2
using JSON3

# ═══════════════════════════════════════════════════════════════════════════════
# BPE Tokenizer (loaded from tokenizer.json — HuggingFace format)
# ═══════════════════════════════════════════════════════════════════════════════

struct BPETokenizer
    vocab::Dict{String, Int}
    id_to_token::Dict{Int, String}
    merges::Vector{Tuple{String, String}}
    merge_rank::Dict{Tuple{String, String}, Int}
    byte_to_unicode::Dict{UInt8, String}
    unicode_to_byte::Dict{Char, UInt8}
    word_cache::Dict{String, Vector{Int}}
    gpt2_pattern::Regex
end

function build_byte_to_unicode()
    bs = UInt8[]
    cs = Char[]
    for r in [0x21:0x7e, 0xa1:0xac, 0xae:0xff]
        for b in r
            push!(bs, b)
            push!(cs, Char(b))
        end
    end
    n = 0
    for b in 0x00:0xff
        if b ∉ bs
            push!(bs, b)
            push!(cs, Char(256 + n))
            n += 1
        end
    end
    b2u = Dict(bs[i] => string(cs[i]) for i in eachindex(bs))
    u2b = Dict(v[1] => k for (k, v) in b2u)
    return b2u, u2b
end

function load_bpe_tokenizer(path::String)
    tok_json = JSON3.read(read(path, String))

    vocab = Dict{String, Int}()
    for (tok_str, id) in pairs(tok_json.model.vocab)
        vocab[string(tok_str)] = Int(id) + 1  # +1 for Julia 1-indexing
    end

    merges = Tuple{String, String}[]
    for merge_entry in tok_json.model.merges
        if merge_entry isa AbstractVector && length(merge_entry) >= 2
            push!(merges, (String(merge_entry[1]), String(merge_entry[2])))
        else
            parts = split(string(merge_entry), " ", limit=2)
            if length(parts) == 2
                push!(merges, (String(parts[1]), String(parts[2])))
            end
        end
    end

    id_to_token = Dict{Int, String}(id => tok for (tok, id) in vocab)
    merge_rank = Dict{Tuple{String, String}, Int}(
        (a, b) => i for (i, (a, b)) in enumerate(merges)
    )
    b2u, u2b = build_byte_to_unicode()
    gpt2_pat = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

    BPETokenizer(vocab, id_to_token, merges, merge_rank, b2u, u2b,
                 Dict{String, Vector{Int}}(), gpt2_pat)
end

function bpe_encode_word(tok::BPETokenizer, word::Vector{String})
    tokens = copy(word)
    while length(tokens) >= 2
        best_rank = typemax(Int)
        best_pair = ("", "")
        for i in 1:length(tokens)-1
            rank = get(tok.merge_rank, (tokens[i], tokens[i+1]), typemax(Int))
            if rank < best_rank
                best_rank = rank
                best_pair = (tokens[i], tokens[i+1])
            end
        end
        best_rank == typemax(Int) && break
        a, b = best_pair
        new_tokens = String[]
        i = 1
        while i <= length(tokens)
            if i < length(tokens) && tokens[i] == a && tokens[i+1] == b
                push!(new_tokens, a * b)
                i += 2
            else
                push!(new_tokens, tokens[i])
                i += 1
            end
        end
        tokens = new_tokens
    end
    return tokens
end

function encode_bpe(tok::BPETokenizer, s::String)
    ids = Int[]
    for m in eachmatch(tok.gpt2_pattern, s)
        word = m.match
        cached = get(tok.word_cache, word, nothing)
        if cached !== nothing
            append!(ids, cached)
        else
            word_bytes = Vector{UInt8}(word)
            chars = [tok.byte_to_unicode[b] for b in word_bytes]
            tokens = bpe_encode_word(tok, chars)
            word_ids = Int[]
            for t in tokens
                id = get(tok.vocab, t, nothing)
                id !== nothing && push!(word_ids, id)
            end
            tok.word_cache[word] = word_ids
            append!(ids, word_ids)
        end
    end
    return ids
end

function decode_bpe(tok::BPETokenizer, ids::Vector{Int})
    text = join(get(tok.id_to_token, id, "") for id in ids)
    bytes = UInt8[tok.unicode_to_byte[c] for c in text if haskey(tok.unicode_to_byte, c)]
    return String(bytes)
end

# ═══════════════════════════════════════════════════════════════════════════════
# Checkpoint loading
# ═══════════════════════════════════════════════════════════════════════════════

function load_flux_checkpoint(checkpoint_path::String; tokenizer_path::String="")
    println("Loading checkpoint from $checkpoint_path ...")
    data = JLD2.load(checkpoint_path)

    hp = data["hyperparams"]
    vocab_size = Int(hp["vocab_size"])
    n_embd = Int(hp["n_embd"])
    block_size = Int(hp["block_size"])
    n_layer = Int(hp["n_layer"])
    n_head = Int(hp["n_head"])
    n_kv_head = Int(get(hp, "n_kv_head", hp["n_head"]))
    dropout_val = Float64(get(hp, "dropout", 0.0))

    model = GPT(;
        vocab_size = vocab_size,
        n_embd = n_embd,
        block_size = block_size,
        n_layer = n_layer,
        n_head = n_head,
        n_kv_head = n_kv_head,
        dropout = 0.0  # No dropout at inference
    )

    # Load weights component-by-component
    ms = data["model_state"]
    Flux.loadmodel!(model.wte, ms[:wte])
    Flux.loadmodel!(model.drop, ms[:drop])
    Flux.loadmodel!(model.blocks, ms[:blocks])
    Flux.loadmodel!(model.ln_f, ms[:ln_f])

    # Set to test mode (disables dropout)
    Flux.testmode!(model)

    step = get(data, "step", 0)
    best_val = get(data, "best_val_loss", Inf)

    println("  Model loaded: vocab=$vocab_size, embd=$n_embd, layers=$n_layer, " *
            "heads=$(n_head)Q/$(n_kv_head)KV, block=$block_size")
    println("  Step=$step, best_val=$(round(best_val, digits=4))")

    # Load tokenizer
    encode_fn = nothing
    decode_fn = nothing

    if !isempty(tokenizer_path) && isfile(tokenizer_path)
        println("  Loading BPE tokenizer from $tokenizer_path")
        bpe = load_bpe_tokenizer(tokenizer_path)
        tok_vocab_size = length(bpe.vocab)

        if tok_vocab_size != vocab_size
            @warn "Vocab mismatch! Model expects vocab_size=$vocab_size but tokenizer has $tok_vocab_size tokens. " *
                  "Token IDs above $vocab_size will be clamped."
        end

        encode_fn = function(s)
            ids = encode_bpe(bpe, s)
            return [clamp(id, 1, vocab_size) for id in ids]
        end
        decode_fn = ids -> decode_bpe(bpe, ids)
        println("  BPE tokenizer loaded: $(tok_vocab_size) tokens (model vocab: $vocab_size)")
    else
        # Character-level fallback
        chars = vcat(collect('a':'z'), [' ', '.'])
        stoi = Dict(c => i for (i, c) in enumerate(chars))
        itos = Dict(i => c for (i, c) in enumerate(chars))
        encode_fn = s -> [get(stoi, c, 1) for c in s]
        decode_fn = ids -> join(get(itos, id, '?') for id in ids)
        println("  No tokenizer.json found, using character-level fallback ($(length(chars)) chars)")
    end

    return (;
        model, vocab_size, n_embd, block_size, n_layer, n_head, n_kv_head,
        step, best_val, encode_fn, decode_fn
    )
end
```
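The merge loop in `bpe_encode_word` above is standard greedy BPE: at each step, find the adjacent token pair with the lowest merge rank and fuse it, until no ranked pair remains. A Python re-sketch of that loop (illustrative only — not code from this repo):

```python
def bpe_merge(tokens, merge_rank):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    tokens = list(tokens)
    while len(tokens) >= 2:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = list(zip(tokens, tokens[1:]))
        best = min(pairs, key=lambda p: merge_rank.get(p, float("inf")))
        if best not in merge_rank:
            break  # no mergeable pair left
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Ranks as learned during training: lower rank = earlier, more frequent merge.
ranks = {("t", "h"): 0, ("th", "e"): 1}
print(bpe_merge(["t", "h", "e"], ranks))  # → ['the']
```

The Julia version additionally caches results per regex-split "word" (`word_cache`), which amortizes this quadratic loop across repeated words.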
model.jl ADDED

@@ -0,0 +1,290 @@

```julia
#=
model.jl — LLaMA-style GPT model in Flux.jl for JuliaFluxGPT

Contains: RMSNorm, SwiGLU, CausalSelfAttention (GQA + RoPE),
TransformerBlock, GPT, and generation utilities.

Same architecture as juliaflux_v2.ipynb — extracted for inference serving.

NOTE: Weight tying is done by computing the output projection directly using
m.wte.weight in the forward pass. This matches the training notebooks and
ensures Flux.loadmodel! works without needing to skip lm_head.
=#

using Flux
using NNlib
using NNlib: batched_mul
using Statistics
using Random
using LinearAlgebra

# ═══════════════════════════════════════════════════════════════════════════════
# RoPE — Rotary Positional Embeddings
# ═══════════════════════════════════════════════════════════════════════════════

function precompute_rope_freqs(head_dim::Int, max_seq_len::Int; base::Float32 = 10000.0f0)
    half_dim = head_dim ÷ 2
    freqs = Float32[1.0f0 / (base ^ (Float32(2 * (i - 1)) / Float32(head_dim))) for i in 1:half_dim]
    positions = Float32.(collect(0:max_seq_len-1))
    angles = freqs * positions'
    return cos.(angles), sin.(angles)
end

function apply_rope(x, cos_f, sin_f, T::Int)
    d = size(x, 1) ÷ 2
    x1 = x[1:d, :, :]
    x2 = x[d+1:2d, :, :]
    c = cos_f[:, 1:T]
    s = sin_f[:, 1:T]
    return vcat(x1 .* c .- x2 .* s, x1 .* s .+ x2 .* c)
end

# ═══════════════════════════════════════════════════════════════════════════════
# Model components
# ═══════════════════════════════════════════════════════════════════════════════

struct RMSNorm{W <: AbstractVector}
    weight::W
    eps::Float32
end

Flux.@layer RMSNorm

RMSNorm(dim::Int; eps::Float32 = 1.0f-6) = RMSNorm(ones(Float32, dim), eps)

function (rn::RMSNorm)(x)
    rms = sqrt.(mean(x .^ 2, dims=1) .+ rn.eps)
    return (x ./ rms) .* rn.weight
end

struct SwiGLUFFN
    w_gate::Dense
    w_up::Dense
    w_down::Dense
    drop::Dropout
end

Flux.@layer SwiGLUFFN

function SwiGLUFFN(n_embd::Int; bias=false, dropout=0.0)
    raw_inner = Int(floor(4 * n_embd * 2 / 3))
    inner_dim = max(64, 64 * div(raw_inner + 32, 64))
    SwiGLUFFN(
        Dense(n_embd => inner_dim; bias),
        Dense(n_embd => inner_dim; bias),
        Dense(inner_dim => n_embd; bias),
        Dropout(dropout)
    )
end

function (ff::SwiGLUFFN)(x)
    ff.drop(ff.w_down(NNlib.swish(ff.w_gate(x)) .* ff.w_up(x)))
end

struct CausalSelfAttention
    wq::Dense
    wkv::Dense
    proj::Dense
    n_head::Int
    n_kv_head::Int
end

Flux.@layer CausalSelfAttention trainable=(wq, wkv, proj)

function CausalSelfAttention(n_embd::Int, n_head::Int, n_kv_head::Int; bias=false)
    head_dim = n_embd ÷ n_head
    kv_dim = head_dim * n_kv_head
    CausalSelfAttention(
        Dense(n_embd => n_embd; bias),
        Dense(n_embd => 2 * kv_dim; bias),
        Dense(n_embd => n_embd; bias),
        n_head,
        n_kv_head
    )
end

function (attn::CausalSelfAttention)(x, causal_mask, rope_cos, rope_sin)
    C, T, B = size(x)
    nh = attn.n_head
    nkv = attn.n_kv_head
    hs = C ÷ nh
    kv_dim = hs * nkv
    groups = nh ÷ nkv

    q = attn.wq(x)
    kv = attn.wkv(x)
    k = kv[1:kv_dim, :, :]
    v = kv[kv_dim+1:2*kv_dim, :, :]

    q = reshape(permutedims(reshape(q, hs, nh, T, B), (1, 3, 2, 4)), hs, T, nh * B)
    k = reshape(permutedims(reshape(k, hs, nkv, T, B), (1, 3, 2, 4)), hs, T, nkv * B)
    v = reshape(permutedims(reshape(v, hs, nkv, T, B), (1, 3, 2, 4)), hs, T, nkv * B)

    q = apply_rope(q, rope_cos, rope_sin, T)
    k = apply_rope(k, rope_cos, rope_sin, T)

    if groups > 1
        k_4d = reshape(k, hs, T, nkv, B)
        v_4d = reshape(v, hs, T, nkv, B)
        k_rep = repeat(reshape(k_4d, hs, T, nkv, 1, B), 1, 1, 1, groups, 1)
        v_rep = repeat(reshape(v_4d, hs, T, nkv, 1, B), 1, 1, 1, groups, 1)
        k = reshape(permutedims(k_rep, (1, 2, 4, 3, 5)), hs, T, nh * B)
        v = reshape(permutedims(v_rep, (1, 2, 4, 3, 5)), hs, T, nh * B)
    end

    scale = Float32(1 / sqrt(hs))
    wei = batched_mul(permutedims(q, (2, 1, 3)), k) .* scale
    wei = wei .+ causal_mask[1:T, 1:T]
    wei = softmax(wei; dims=2)

    out = batched_mul(v, permutedims(wei, (2, 1, 3)))
    out = reshape(permutedims(reshape(out, hs, T, nh, B), (1, 3, 2, 4)), C, T, B)

    attn.proj(out)
end

struct TransformerBlock
    ln1::RMSNorm
    attn::CausalSelfAttention
    ln2::RMSNorm
    ffwd::SwiGLUFFN
end

Flux.@layer TransformerBlock

function TransformerBlock(n_embd::Int, n_head::Int, n_kv_head::Int; dropout=0.0)
    TransformerBlock(
        RMSNorm(n_embd),
        CausalSelfAttention(n_embd, n_head, n_kv_head),
        RMSNorm(n_embd),
        SwiGLUFFN(n_embd; dropout)
    )
end

# ═══════════════════════════════════════════════════════════════════════════════
# GPT — weight-tied output projection (matches training notebooks)
# ═══════════════════════════════════════════════════════════════════════════════

struct GPT
    wte::Embedding
    drop::Dropout
    blocks::Chain
    ln_f::RMSNorm
    # Precomputed constants (not trainable)
    causal_mask::Matrix{Float32}
    rope_cos::Matrix{Float32}
    rope_sin::Matrix{Float32}
    n_head::Int
    n_kv_head::Int
    block_size::Int
end

Flux.@layer GPT trainable=(wte, drop, blocks, ln_f)

function GPT(; vocab_size, n_embd, block_size, n_layer, n_head, n_kv_head, dropout=0.0)
    head_dim = n_embd ÷ n_head
    wte = Embedding(vocab_size => n_embd)
    causal_mask = triu(fill(typemin(Float32), block_size, block_size), 1)
    rope_cos, rope_sin = precompute_rope_freqs(head_dim, block_size)
    GPT(
        wte,
        Dropout(dropout),
        Chain([TransformerBlock(n_embd, n_head, n_kv_head; dropout) for _ in 1:n_layer]...),
        RMSNorm(n_embd),
        causal_mask,
        rope_cos,
        rope_sin,
        n_head,
        n_kv_head,
        block_size
    )
end

function (m::GPT)(idx)
    B, T = size(idx)
    tok = permutedims(m.wte(idx), (1, 3, 2))  # (C, T, B)
    x = m.drop(tok)
    for block in m.blocks
        x = x .+ block.attn(block.ln1(x), m.causal_mask, m.rope_cos, m.rope_sin)
        x = x .+ block.ffwd(block.ln2(x))
    end
    x = m.ln_f(x)
    # Weight-tied output projection — same weight as embedding
    W = m.wte.weight
    C = size(x, 1)
    x_flat = reshape(x, C, T * B)
    out = W' * x_flat
    reshape(out, size(W, 2), T, B)
end

# ═══════════════════════════════════════════════════════════════════════════════
# Text generation with streaming support
# ═══════════════════════════════════════════════════════════════════════════════

function generate_streaming(model, encode_fn, decode_fn, vocab_size::Int, block_size::Int;
                            prompt::String="", max_tokens::Int=200, temperature::Float64=0.8,
                            top_k::Int=40, top_p::Float64=1.0, on_token=nothing)
    if !isempty(prompt)
        prompt_ids = encode_fn(prompt)
        idx = reshape(prompt_ids, 1, :)
    else
        idx = reshape([rand(1:vocab_size)], 1, 1)
    end

    generated_ids = Int[]

    for _ in 1:max_tokens
        idx_cond = idx[:, max(1, end-block_size+1):end]
        logits = model(idx_cond)
        logits_last = Vector{Float32}(logits[:, end, 1])

        # Temperature scaling
        logits_last ./= Float32(max(temperature, 0.01))

        # Top-k filtering
        if top_k > 0 && top_k < length(logits_last)
            threshold = partialsort(logits_last, top_k; rev=true)
            for i in eachindex(logits_last)
                if logits_last[i] < threshold
                    logits_last[i] = -Inf32
                end
            end
        end

        # Top-p (nucleus) filtering
        if top_p < 1.0
            sorted_indices = sortperm(logits_last; rev=true)
            sorted_logits = logits_last[sorted_indices]
            probs_sorted = NNlib.softmax(sorted_logits)
            cumprobs = cumsum(Array(probs_sorted))
            cutoff = something(findfirst(>=(Float32(top_p)), cumprobs), length(probs_sorted))
            for i in (cutoff+1):length(sorted_indices)
                logits_last[sorted_indices[i]] = -Inf32
            end
        end

        probs = NNlib.softmax(logits_last)
        probs_cpu = Float64.(probs)

        r = rand()
        cum = 0.0
        next_id = length(probs_cpu)
        for (i, p) in enumerate(probs_cpu)
            cum += p
            if r <= cum
                next_id = i
                break
            end
        end

        push!(generated_ids, next_id)
        idx = hcat(idx, reshape([next_id], 1, 1))

        if on_token !== nothing
            token_str = decode_fn([next_id])
            on_token(token_str)
        end
    end

    return decode_fn(generated_ids)
end
```
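`generate_streaming` above applies temperature, then top-k, then top-p (nucleus) filtering before sampling from the renormalized softmax. A pure-Python sketch of the same filter pipeline (illustrative only — not code from this repo):

```python
import math

def filter_logits(logits, top_k=0, top_p=1.0):
    """Mask logits outside top-k / nucleus top-p to -inf, mirroring the Julia loop."""
    logits = list(logits)
    if 0 < top_k < len(logits):
        # Keep everything at or above the k-th largest logit (ties survive).
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else -math.inf for x in logits]
    if top_p < 1.0:
        # Keep the smallest prefix (by descending probability) whose
        # cumulative softmax mass reaches top_p.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        total = sum(math.exp(logits[i]) for i in order if logits[i] != -math.inf)
        cum, keep = 0.0, set()
        for i in order:
            if logits[i] == -math.inf:
                break
            keep.add(i)
            cum += math.exp(logits[i]) / total
            if cum >= top_p:
                break
        logits = [x if i in keep else -math.inf for i, x in enumerate(logits)]
    return logits

print(filter_logits([3.0, 2.0, 1.0, 0.0], top_k=2))  # → [3.0, 2.0, -inf, -inf]
```

After filtering, both versions renormalize with softmax and draw via inverse-CDF sampling (`r = rand(); walk the cumulative sum`), so masked entries have exactly zero probability.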
requirements.txt DELETED

```diff
@@ -1,7 +0,0 @@
-fastapi>=0.110.0
-uvicorn>=0.29.0
-torch>=2.0.0
-h5py>=3.10.0
-huggingface_hub>=0.20.0
-pydantic>=2.0.0
-tokenizers>=0.15.0
```
server.jl ADDED
@@ -0,0 +1,312 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #=
2
+ server.jl — OpenAI-compatible inference server for JuliaFluxGPT
3
+
4
+ Serves a Flux.jl trained LLaMA-style GPT model (RoPE, GQA, RMSNorm, SwiGLU).
5
+ Downloads checkpoint and tokenizer from HuggingFace model repo on first run.
6
+
7
+ Endpoints:
8
+ GET / -> health check / API info
9
+ GET /v1/models -> list available models
10
+ POST /v1/chat/completions -> generate text (OpenAI format, streaming supported)
11
+ =#
12
+
13
+ include("checkpoint.jl")
14
+ using HTTP
15
+ using UUIDs
16
+ using Downloads
17
+
18
+ # ═══════════════════════════════════════════════════════════════════
19
+ # Download artifacts from HuggingFace
20
+ # ═══════════════════════════════════════════════════════════════════
21
+
22
+ const CKPT_DIR = "checkpoints"
23
+ const CKPT_PATH = joinpath(CKPT_DIR, "best_model.jld2")
24
+ const TOKENIZER_PATH = joinpath(CKPT_DIR, "tokenizer.json")
25
+ const HF_REPO = get(ENV, "HF_REPO", "LisaMegaWatts/JuliaFluxGPT")
26
+ const PORT = parse(Int, get(ENV, "PORT", "7860"))
27
+
28
+ function download_from_hf(repo::String, filename::String, local_path::String)
29
+ url = "https://huggingface.co/$repo/resolve/main/$filename"
30
+ println("Downloading $url ...")
31
+ mkpath(dirname(local_path))
32
+ Downloads.download(url, local_path)
33
+ sz = round(filesize(local_path) / 1024^2, digits=1)
34
+ println(" -> $local_path ($sz MB)")
35
+ end
36
+
37
+ function ensure_artifacts()
38
+ for (localpath, remote) in [(CKPT_PATH, "best_model.jld2"),
39
+ (TOKENIZER_PATH, "tokenizer.json")]
40
+ if !isfile(localpath)
41
+ println("No local $remote found, downloading from $HF_REPO ...")
42
+ try
43
+ download_from_hf(HF_REPO, remote, localpath)
44
+ catch e
45
+ println("Download failed for $remote: $e")
46
+ println("Place $remote at $localpath manually.")
47
+ exit(1)
48
+ end
49
+ end
50
+ end
51
+ end
52
+
53
+ # ═══════════════════════════════════════════════════════════════════
54
+ # Download and load model
55
+ # ═══════════════════════════════════════════════════════════════════
56
+
57
+ ensure_artifacts()
58
+
59
+ println("\nLoading model...")
60
+ const CKPT = load_flux_checkpoint(CKPT_PATH; tokenizer_path=TOKENIZER_PATH)
61
+ const MODEL = CKPT.model
62
+ const VOCAB_SIZE = CKPT.vocab_size
63
+ const BLOCK_SIZE = CKPT.block_size
64
+ const ENCODE_FN = CKPT.encode_fn
65
+ const DECODE_FN = CKPT.decode_fn
66
+ const MODEL_CREATED_AT = Int(floor(time()))
67
+
68
+ println("\nModel ready: vocab=$(VOCAB_SIZE), embd=$(CKPT.n_embd), " *
69
+ "layers=$(CKPT.n_layer), heads=$(CKPT.n_head)Q/$(CKPT.n_kv_head)KV, " *
70
+ "block=$(BLOCK_SIZE)")
71
+
72
+ # ═══════════════════════════════════════════════════════════════════
73
+ # HTTP helpers
74
+ # ═══════════════════════════════════════════════════════════════════
75
+
76
+ const CORS_HEADERS = [
77
+ "Access-Control-Allow-Origin" => "*",
78
+ "Access-Control-Allow-Methods" => "GET, POST, OPTIONS",
79
+ "Access-Control-Allow-Headers" => "Content-Type, Authorization",
80
+ ]
81
+
82
+ function json_response(status::Int, body; extra_headers=[])
83
+ json_bytes = JSON3.write(body)
84
+ headers = [
85
+ "Content-Type" => "application/json",
86
+ CORS_HEADERS...,
87
+ extra_headers...
88
+ ]
89
+ return HTTP.Response(status, headers, json_bytes)
90
+ end
91
+
92
+ function cors_preflight()
93
+ return HTTP.Response(204, CORS_HEADERS)
94
+ end
95
+
96
+ # ═══════════════════════════════════════════════════════════════════
97
+ # Extract prompt from OpenAI chat messages
98
+ # ═══════════════════════════════════════════════════════════════════
99
+
100
+ function extract_prompt(messages)
101
+ if isempty(messages)
102
+ return ""
103
+ end
104
+ for i in length(messages):-1:1
105
+ role = string(get(messages[i], :role, ""))
106
+ if role == "user"
107
+ return string(get(messages[i], :content, ""))
108
+ end
109
+ end
110
+ return string(get(messages[end], :content, ""))
111
+ end
112
+
113
+ # ═════════════════════════════════════════════════════���═════════════
114
+ # SSE helpers
115
+ # ═══════════════════════════════════════════════════════════════════
116
+
117
+ function sse_line(data)
118
+ return "data: $(JSON3.write(data))\n\n"
119
+ end
120
+
121
+ # ═══════════════════════════════════════════════════════════════════
122
+ # Request handler
123
+ # ═══════════════════════════════════════════════════════════════════
124
+
125
+ function handle_request(request::HTTP.Request)
126
+ method = request.method
127
+ target = request.target
128
+
129
+ # CORS preflight
130
+ if method == "OPTIONS"
131
+ return cors_preflight()
132
+ end
133
+
134
+ # GET / — health check and model info
135
+ if method == "GET" && target == "/"
136
+ return json_response(200, Dict(
137
+ "name" => "JuliaFluxGPT",
138
+ "version" => "1.0.0",
139
+ "description" => "LLaMA-style GPT in Flux.jl — trained on philosophy and mathematics",
140
+ "architecture" => "RoPE + SwiGLU + GQA + RMSNorm + weight tying",
141
+ "model" => Dict(
142
+ "vocab_size" => VOCAB_SIZE,
143
+ "n_embd" => CKPT.n_embd,
144
+ "n_layer" => CKPT.n_layer,
145
+ "n_head" => CKPT.n_head,
146
+ "n_kv_head" => CKPT.n_kv_head,
147
+ "block_size" => BLOCK_SIZE
148
+ ),
149
+ "endpoints" => ["/v1/models", "/v1/chat/completions"],
150
+ "features" => ["streaming", "OpenAI-compatible", "top-k", "top-p"],
151
+ "compatible_with" => ["OpenAI API", "OpenRouter"]
152
+ ))
153
+ end
154
+
155
+ # GET /v1/models — list available models
156
+ if method == "GET" && target == "/v1/models"
157
+ return json_response(200, Dict(
158
+ "object" => "list",
159
+ "data" => [Dict(
160
+ "id" => "juliafluxgpt-philosophy",
161
+ "object" => "model",
162
+ "created" => MODEL_CREATED_AT,
163
+ "owned_by" => "juliafluxgpt"
164
+ )]
165
+ ))
166
+ end
167
+
168
+ # POST /v1/chat/completions — generate text
169
+ if method == "POST" && target == "/v1/chat/completions"
170
+ local body
171
+ try
172
+ body = JSON3.read(String(request.body))
173
+ catch e
174
+ return json_response(400, Dict("error" => Dict(
175
+ "message" => "Invalid JSON in request body",
176
+ "type" => "invalid_request_error",
177
+ "code" => "invalid_json")))
178
+ end
179
+
180
+ temperature = Float64(clamp(get(body, :temperature, 0.8), 0.01, 2.0))
181
+ max_tokens = Int(clamp(get(body, :max_tokens, 200), 1, BLOCK_SIZE))
182
+ top_k_val = Int(clamp(get(body, :top_k, 40), 0, VOCAB_SIZE))
183
+ top_p_val = Float64(clamp(get(body, :top_p, 1.0), 0.0, 1.0))
184
+ stream = Bool(get(body, :stream, false))
185
+
186
+ messages = get(body, :messages, [])
187
+ prompt_text = extract_prompt(messages)
188
+
189
+ if stream
190
+ # ── SSE streaming response (buffered) ──
191
+ completion_id = "chatcmpl-" * string(uuid4())
192
+ created = Int(floor(time()))
193
+
194
+ buf = IOBuffer()
195
+
196
+ # Initial chunk with role
197
+ initial_chunk = Dict(
198
+ "id" => completion_id,
199
+ "object" => "chat.completion.chunk",
200
+ "created" => created,
201
+ "model" => "juliafluxgpt-philosophy",
202
+ "choices" => [Dict(
203
+ "index" => 0,
204
+ "delta" => Dict("role" => "assistant", "content" => ""),
205
+ "finish_reason" => nothing
206
+ )]
207
+ )
208
+ write(buf, sse_line(initial_chunk))
209
+
210
+ token_count = Ref(0)
211
+
212
+ generate_streaming(MODEL, ENCODE_FN, DECODE_FN, VOCAB_SIZE, BLOCK_SIZE;
213
+ prompt=prompt_text, max_tokens=max_tokens,
214
+ temperature=temperature, top_k=top_k_val, top_p=top_p_val,
215
+ on_token = function(token_str)
216
+ token_count[] += 1
217
+ chunk = Dict(
218
+ "id" => completion_id,
219
+ "object" => "chat.completion.chunk",
220
+ "created" => created,
221
+ "model" => "juliafluxgpt-philosophy",
222
+ "choices" => [Dict(
223
+ "index" => 0,
224
+ "delta" => Dict("content" => token_str),
225
+ "finish_reason" => nothing
226
+ )]
227
+ )
228
+ write(buf, sse_line(chunk))
229
+ end)
230
+
231
+ # Final chunk with finish_reason
232
+ prompt_tokens = length(ENCODE_FN(prompt_text))
233
+ finish_chunk = Dict(
234
+ "id" => completion_id,
235
+ "object" => "chat.completion.chunk",
236
+ "created" => created,
237
+ "model" => "juliafluxgpt-philosophy",
238
+ "choices" => [Dict(
239
+ "index" => 0,
240
+ "delta" => Dict(),
241
+ "finish_reason" => token_count[] >= max_tokens ? "length" : "stop"
242
+ )],
243
+ "usage" => Dict(
244
+ "prompt_tokens" => prompt_tokens,
245
+ "completion_tokens" => token_count[],
246
+ "total_tokens" => prompt_tokens + token_count[]
247
+ )
248
+ )
249
+ write(buf, sse_line(finish_chunk))
250
+ write(buf, "data: [DONE]\n\n")
251
+
252
+ sse_body = take!(buf)
253
+ headers = [
254
+ "Content-Type" => "text/event-stream",
255
+ "Cache-Control" => "no-cache",
256
+ "X-Accel-Buffering" => "no",
257
+ CORS_HEADERS...
258
+ ]
259
+ return HTTP.Response(200, headers, sse_body)
260
+
261
+ else
262
+ # ── Standard (non-streaming) response ──
263
+ n_completions = Int(clamp(get(body, :n, 1), 1, 4))
264
+
265
+ choices = []
266
+ total_completion_tokens = 0
267
+ for i in 1:n_completions
268
+ text = generate_streaming(MODEL, ENCODE_FN, DECODE_FN, VOCAB_SIZE, BLOCK_SIZE;
269
+ prompt=prompt_text, max_tokens=max_tokens,
270
+ temperature=temperature, top_k=top_k_val, top_p=top_p_val)
271
+ completion_tokens = length(ENCODE_FN(text))  # approximate token count via re-encoding; length(text) counts characters
+ finish_reason = completion_tokens >= max_tokens ? "length" : "stop"
272
+ push!(choices, Dict(
273
+ "index" => i - 1,
274
+ "message" => Dict("role" => "assistant", "content" => text),
275
+ "finish_reason" => finish_reason))
276
+ total_completion_tokens += length(ENCODE_FN(text))  # count tokens, not characters, for the usage block
277
+ end
278
+
279
+ prompt_tokens = length(ENCODE_FN(prompt_text))
280
+ return json_response(200, Dict(
281
+ "id" => "chatcmpl-" * string(uuid4()),
282
+ "object" => "chat.completion",
283
+ "created" => Int(floor(time())),
284
+ "model" => "juliafluxgpt-philosophy",
285
+ "choices" => choices,
286
+ "usage" => Dict(
287
+ "prompt_tokens" => prompt_tokens,
288
+ "completion_tokens" => total_completion_tokens,
289
+ "total_tokens" => prompt_tokens + total_completion_tokens),
290
+ "system_fingerprint" => "juliafluxgpt-flux-v1"))
291
+ end
292
+ end
293
+
294
+ # 404 fallback
295
+ return json_response(404, Dict("error" => Dict(
296
+ "message" => "Not found: $method $target",
297
+ "type" => "invalid_request_error",
298
+ "code" => "not_found")))
299
+ end
300
+
301
+ # ═══════════════════════════════════════════════════════════════════
302
+ # Start server
303
+ # ═══════════════════════════════════════════════════════════════════
304
+
305
+ println("\nJuliaFluxGPT server starting on 0.0.0.0:$PORT ...")
306
+ println(" GET http://localhost:$PORT/")
307
+ println(" GET http://localhost:$PORT/v1/models")
308
+ println(" POST http://localhost:$PORT/v1/chat/completions")
309
+ println(" POST http://localhost:$PORT/v1/chat/completions (stream=true)")
310
+ println()
311
+
312
+ HTTP.serve(handle_request, "0.0.0.0", PORT)
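The streaming branch above emits OpenAI-style `chat.completion.chunk` lines framed as SSE `data:` records, ending with a `data: [DONE]` sentinel. A minimal client-side sketch shows how such a stream reassembles into the final assistant text; the sample payload below is illustrative (hand-written to match the chunk shape built in `server.jl`), not captured server output:

```python
import json

def collect_sse_content(raw: str) -> str:
    """Reassemble assistant text from an SSE stream of chat.completion.chunk records."""
    parts = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue                      # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))  # final chunk has an empty delta
    return "".join(parts)

# Illustrative two-token stream (shape matches the chunks built above)
sample = (
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":" world"},"finish_reason":null}]}\n\n'
    'data: [DONE]\n\n'
)
print(collect_sse_content(sample))  # → Hello world
```

Because the Julia server buffers the whole SSE body before responding, a client written this way still works unchanged; it simply receives all chunks at once.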
server.py DELETED
@@ -1,708 +0,0 @@
1
- """
2
- server.py — OpenAI-compatible FastAPI inference server for JuliaFluxGPT
3
-
4
- Endpoints:
5
- GET / -> model info
6
- GET /v1/models -> list available models
7
- POST /v1/chat/completions -> generate text (streaming via SSE)
8
-
9
- Weights are loaded from HuggingFace Hub at startup:
10
- repo: LisaMegaWatts/JuliaFluxGPT
11
- files: best_model.jld2, tokenizer.json
12
-
13
- Architecture: LLaMA-style GPT
14
- - RMSNorm (weight only, no bias)
15
- - RoPE (Rotary Positional Embeddings, base=10000)
16
- - GQA (Grouped Query Attention, 8 query heads / 2 KV heads)
17
- - SwiGLU FFN
18
- - Weight-tied output projection (lm_head shares wte weights)
19
- """
20
-
21
- from __future__ import annotations
22
-
23
- import json
24
- import math
25
- import os
26
- import time
27
- import uuid
28
- from typing import List, Optional
29
-
30
- import h5py
31
- import numpy as np
32
- import torch
33
- import torch.nn as nn
34
- import torch.nn.functional as F
35
- import uvicorn
36
- from fastapi import FastAPI, HTTPException, Request
37
- from fastapi.middleware.cors import CORSMiddleware
38
- from fastapi.responses import JSONResponse, StreamingResponse
39
- from fastapi.exceptions import RequestValidationError
40
- from huggingface_hub import hf_hub_download
41
- from pydantic import BaseModel
42
- from tokenizers import Tokenizer
43
-
44
- # ---------------------------------------------------------------------------
45
- # Hyperparameters (must match training checkpoint)
46
- # ---------------------------------------------------------------------------
47
-
48
- VOCAB_SIZE = 2000
49
- N_EMBD = 512
50
- N_HEAD = 8
51
- N_KV_HEAD = 2
52
- N_LAYER = 8
53
- BLOCK_SIZE = 256
54
- ROPE_BASE = 10000.0
55
- RMS_EPS = 1e-6
56
-
57
- MODEL_ID = "juliafluxgpt-philosophy"
58
- HF_REPO = "LisaMegaWatts/JuliaFluxGPT"
59
- HF_WEIGHTS = "best_model.jld2"
60
- HF_TOKENIZER = "tokenizer.json"
61
-
62
- DEVICE = torch.device("cpu") # HF Spaces free tier = CPU only
63
-
64
- # ---------------------------------------------------------------------------
65
- # RoPE helpers
66
- # ---------------------------------------------------------------------------
67
-
68
- def precompute_rope(head_dim: int, max_seq_len: int, base: float = 10000.0):
69
- """
70
- Returns (cos, sin) each of shape (max_seq_len, head_dim // 2).
71
- Sliced to actual sequence length in apply_rope.
72
- """
73
- half = head_dim // 2
74
- freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
75
- positions = torch.arange(max_seq_len).float()
76
- angles = positions.unsqueeze(1) * freqs.unsqueeze(0) # (T, half)
77
- return torch.cos(angles), torch.sin(angles) # (T, half)
78
-
79
-
80
- def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
81
- """
82
- x : (B, n_head, T, head_dim)
83
- cos : (T, head_dim // 2)
84
- sin : (T, head_dim // 2)
85
- """
86
- T = x.shape[2]
87
- cos = cos[:T].unsqueeze(0).unsqueeze(0) # (1, 1, T, half)
88
- sin = sin[:T].unsqueeze(0).unsqueeze(0)
89
- d = x.shape[-1] // 2
90
- x1, x2 = x[..., :d], x[..., d:]
91
- return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
92
-
93
-
94
- # ---------------------------------------------------------------------------
95
- # Model components
96
- # ---------------------------------------------------------------------------
97
-
98
- class RMSNorm(nn.Module):
99
- def __init__(self, dim: int, eps: float = 1e-6):
100
- super().__init__()
101
- self.eps = eps
102
- self.weight = nn.Parameter(torch.ones(dim))
103
-
104
- def forward(self, x: torch.Tensor) -> torch.Tensor:
105
- # x: (B, T, C)
106
- rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
107
- return x / rms * self.weight
108
-
109
-
110
- class GQAttention(nn.Module):
111
- """
112
- Grouped Query Attention.
113
- wq : (n_embd, n_embd) — query projection
114
- wkv : (n_embd, 2 * kv_dim) — combined K+V projection
115
- proj: (n_embd, n_embd) — output projection
116
- """
117
-
118
- def __init__(self, n_embd: int, n_head: int, n_kv_head: int):
119
- super().__init__()
120
- assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
121
- assert n_head % n_kv_head == 0, "n_head must be divisible by n_kv_head"
122
- self.n_head = n_head
123
- self.n_kv_head = n_kv_head
124
- self.head_dim = n_embd // n_head
125
- kv_dim = self.head_dim * n_kv_head
126
-
127
- self.wq = nn.Linear(n_embd, n_embd, bias=False)
128
- self.wkv = nn.Linear(n_embd, 2 * kv_dim, bias=False)
129
- self.proj = nn.Linear(n_embd, n_embd, bias=False)
130
-
131
- def forward(
132
- self,
133
- x: torch.Tensor,
134
- rope_cos: torch.Tensor,
135
- rope_sin: torch.Tensor,
136
- ) -> torch.Tensor:
137
- B, T, C = x.shape
138
- nh, nkv, hd = self.n_head, self.n_kv_head, self.head_dim
139
- groups = nh // nkv
140
-
141
- # Project
142
- q = self.wq(x) # (B, T, n_embd)
143
- kv = self.wkv(x) # (B, T, 2*kv_dim)
144
- k, v = kv.split(hd * nkv, dim=-1) # each (B, T, kv_dim)
145
-
146
- # Reshape to (B, heads, T, head_dim)
147
- q = q.view(B, T, nh, hd).transpose(1, 2) # (B, nh, T, hd)
148
- k = k.view(B, T, nkv, hd).transpose(1, 2) # (B, nkv, T, hd)
149
- v = v.view(B, T, nkv, hd).transpose(1, 2) # (B, nkv, T, hd)
150
-
151
- # Apply RoPE to queries and keys
152
- q = apply_rope(q, rope_cos, rope_sin)
153
- k = apply_rope(k, rope_cos, rope_sin)
154
-
155
- # Expand KV heads to match query heads (GQA)
156
- if groups > 1:
157
- k = k.repeat_interleave(groups, dim=1) # (B, nh, T, hd)
158
- v = v.repeat_interleave(groups, dim=1)
159
-
160
- # Scaled dot-product attention with causal mask
161
- scale = math.sqrt(hd)
162
- attn = torch.matmul(q, k.transpose(-2, -1)) / scale # (B, nh, T, T)
163
-
164
- # Causal mask: upper triangle = -inf
165
- causal = torch.triu(
166
- torch.full((T, T), float("-inf"), device=x.device, dtype=x.dtype),
167
- diagonal=1,
168
- )
169
- attn = attn + causal
170
- attn = F.softmax(attn, dim=-1)
171
-
172
- # Weighted sum and reshape
173
- out = torch.matmul(attn, v) # (B, nh, T, hd)
174
- out = out.transpose(1, 2).contiguous().view(B, T, C)
175
-
176
- return self.proj(out)
177
-
178
-
179
- class SwiGLUFFN(nn.Module):
180
- """
181
- SwiGLU feed-forward network.
182
- forward: w_down(swish(w_gate(x)) * w_up(x))
183
- """
184
-
185
- def __init__(self, n_embd: int, inner_dim: int):
186
- super().__init__()
187
- self.w_gate = nn.Linear(n_embd, inner_dim, bias=False)
188
- self.w_up = nn.Linear(n_embd, inner_dim, bias=False)
189
- self.w_down = nn.Linear(inner_dim, n_embd, bias=False)
190
-
191
- def forward(self, x: torch.Tensor) -> torch.Tensor:
192
- return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
193
-
194
-
195
- class Block(nn.Module):
196
- def __init__(self, n_embd: int, n_head: int, n_kv_head: int, inner_dim: int):
197
- super().__init__()
198
- self.ln1 = RMSNorm(n_embd)
199
- self.attn = GQAttention(n_embd, n_head, n_kv_head)
200
- self.ln2 = RMSNorm(n_embd)
201
- self.ffwd = SwiGLUFFN(n_embd, inner_dim)
202
-
203
- def forward(
204
- self,
205
- x: torch.Tensor,
206
- rope_cos: torch.Tensor,
207
- rope_sin: torch.Tensor,
208
- ) -> torch.Tensor:
209
- x = x + self.attn(self.ln1(x), rope_cos, rope_sin)
210
- x = x + self.ffwd(self.ln2(x))
211
- return x
212
-
213
-
214
- class GPT(nn.Module):
215
- def __init__(
216
- self,
217
- vocab_size: int,
218
- n_embd: int,
219
- n_head: int,
220
- n_kv_head: int,
221
- n_layer: int,
222
- block_size: int,
223
- inner_dim: int,
224
- rope_base: float = 10000.0,
225
- ):
226
- super().__init__()
227
- self.block_size = block_size
228
- head_dim = n_embd // n_head
229
-
230
- self.wte = nn.Embedding(vocab_size, n_embd)
231
- self.blocks = nn.ModuleList(
232
- [Block(n_embd, n_head, n_kv_head, inner_dim) for _ in range(n_layer)]
233
- )
234
- self.ln_f = RMSNorm(n_embd)
235
-
236
- # lm_head shares weights with wte (weight tying)
237
- self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
238
- self.lm_head.weight = self.wte.weight
239
-
240
- # Precompute RoPE frequencies — registered as buffers (not parameters)
241
- rope_cos, rope_sin = precompute_rope(head_dim, block_size, rope_base)
242
- self.register_buffer("rope_cos", rope_cos) # (block_size, head_dim//2)
243
- self.register_buffer("rope_sin", rope_sin)
244
-
245
- def forward(self, idx: torch.Tensor) -> torch.Tensor:
246
- """
247
- idx: (B, T) of token ids
248
- Returns logits: (B, T, vocab_size)
249
- """
250
- B, T = idx.shape
251
- assert T <= self.block_size, f"Sequence length {T} > block_size {self.block_size}"
252
-
253
- x = self.wte(idx) # (B, T, n_embd)
254
- cos = self.rope_cos[:T] # (T, head_dim//2)
255
- sin = self.rope_sin[:T]
256
-
257
- for block in self.blocks:
258
- x = block(x, cos, sin)
259
-
260
- x = self.ln_f(x)
261
- return self.lm_head(x) # (B, T, vocab_size)
262
-
263
-
264
- # ---------------------------------------------------------------------------
265
- # JLD2 weight loader
266
- # ---------------------------------------------------------------------------
267
-
268
- def _deref(f: h5py.File, ref):
269
- """Dereference an HDF5 object reference."""
270
- obj = f[ref]
271
- return obj[()] if isinstance(obj, h5py.Dataset) else obj
272
-
273
-
274
- def _get_weight(f: h5py.File, struct, *path):
275
- """
276
- Walk a numpy.void struct following *path, dereferencing HDF5 references
277
- at each step, and return the final value as a numpy array.
278
- """
279
- val = struct
280
- for p in path:
281
- val = val[p]
282
- if isinstance(val, h5py.h5r.Reference):
283
- val = _deref(f, val)
284
- if isinstance(val, np.ndarray):
285
- return val
286
- return np.array(f[val])
287
-
288
-
289
- def load_weights_from_jld2(path: str, model: GPT) -> None:
290
- """
291
- Read weights from a JLD2 (HDF5) file produced by Julia's Flux.jl and
292
- copy them into the PyTorch GPT model.
293
-
294
- Julia is column-major. h5py reads in row-major order, so:
295
- - Embedding (2-D, vocab x embd): h5py gives (vocab, embd) -> use as-is
296
- - Dense/Linear (2-D, in x out): h5py gives (in, out) -> transpose to (out, in)
297
- - 1-D vectors (RMSNorm weight): no transpose needed
298
- """
299
- print(f"Loading JLD2 weights from {path} ...")
300
- with h5py.File(path, "r") as f:
301
-
302
- ms = f["model_state"][()]
303
-
304
- # ── top-level embedding ──────────────────────────────────────────────
305
- wte_w = _get_weight(f, ms, "wte", "weight") # h5py: (vocab, embd)
306
- # No transpose: Julia Embedding stores (embd, vocab) internally,
307
- # but HDF5 row-major flip already gives us (vocab, embd) which is
308
- # what PyTorch Embedding expects.
309
- model.wte.weight.data.copy_(
310
- torch.from_numpy(wte_w.copy()).float()
311
- )
312
-
313
- # ── final layer norm ─────────────────────────────────────────────────
314
- ln_f_w = _get_weight(f, ms, "ln_f", "weight") # (embd,)
315
- model.ln_f.weight.data.copy_(
316
- torch.from_numpy(ln_f_w.copy()).float()
317
- )
318
-
319
- # ── transformer blocks ───────────────────────────────────────────────
320
- blocks_ref = ms["blocks"]
321
- if isinstance(blocks_ref, h5py.h5r.Reference):
322
- blocks_ref = _deref(f, blocks_ref)
323
- layers_ref = blocks_ref["layers"]
324
- if isinstance(layers_ref, h5py.h5r.Reference):
325
- layers_ref = _deref(f, layers_ref)
326
-
327
- for layer_idx, block in enumerate(model.blocks):
328
- # Julia layers are 1-indexed
329
- jl_key = str(layer_idx + 1)
330
- l = layers_ref[jl_key]
331
-
332
- def gw(*path):
333
- return _get_weight(f, l, *path)
334
-
335
- # Attention weights — h5py gives (in, out), need (out, in)
336
- wq_np = gw("attn", "wq", "weight") # (512, 512)
337
- wkv_np = gw("attn", "wkv", "weight") # (512, 256)
338
- proj_np = gw("attn", "proj", "weight") # (512, 512)
339
-
340
- block.attn.wq.weight.data.copy_(
341
- torch.from_numpy(wq_np.T.copy()).float()
342
- )
343
- block.attn.wkv.weight.data.copy_(
344
- torch.from_numpy(wkv_np.T.copy()).float()
345
- )
346
- block.attn.proj.weight.data.copy_(
347
- torch.from_numpy(proj_np.T.copy()).float()
348
- )
349
-
350
- # FFN weights — h5py gives (in, out), need (out, in)
351
- w_gate_np = gw("ffwd", "w_gate", "weight") # (512, 1344)
352
- w_up_np = gw("ffwd", "w_up", "weight") # (512, 1344)
353
- w_down_np = gw("ffwd", "w_down", "weight") # (1344, 512)
354
-
355
- block.ffwd.w_gate.weight.data.copy_(
356
- torch.from_numpy(w_gate_np.T.copy()).float()
357
- )
358
- block.ffwd.w_up.weight.data.copy_(
359
- torch.from_numpy(w_up_np.T.copy()).float()
360
- )
361
- block.ffwd.w_down.weight.data.copy_(
362
- torch.from_numpy(w_down_np.T.copy()).float()
363
- )
364
-
365
- # Layer norms — 1-D, no transpose
366
- ln1_np = gw("ln1", "weight") # (512,)
367
- ln2_np = gw("ln2", "weight") # (512,)
368
- block.ln1.weight.data.copy_(
369
- torch.from_numpy(ln1_np.copy()).float()
370
- )
371
- block.ln2.weight.data.copy_(
372
- torch.from_numpy(ln2_np.copy()).float()
373
- )
374
-
375
- # Weight tying: lm_head must share wte's storage
376
- model.lm_head.weight = model.wte.weight
377
- print("Weights loaded successfully.")
378
-
379
-
380
- # ---------------------------------------------------------------------------
381
- # Sampling helpers
382
- # ---------------------------------------------------------------------------
383
-
384
- @torch.inference_mode()
385
- def _sample_next_token(
386
- logits: torch.Tensor, # (vocab_size,) on CPU
387
- temperature: float,
388
- top_k: int,
389
- seen_ids: list[int],
390
- repetition_penalty: float,
391
- ) -> int:
392
- """
393
- Apply repetition penalty, temperature scaling, top-k filtering, then sample.
394
- """
395
- logits = logits.clone().float()
396
-
397
- # Repetition penalty
398
- if repetition_penalty != 1.0 and seen_ids:
399
- for tok_id in set(seen_ids):
400
- if logits[tok_id] > 0:
401
- logits[tok_id] /= repetition_penalty
402
- else:
403
- logits[tok_id] *= repetition_penalty
404
-
405
- # Temperature
406
- logits = logits / max(temperature, 1e-6)
407
-
408
- # Top-k
409
- if 0 < top_k < logits.size(0):
410
- topk_vals, _ = torch.topk(logits, top_k)
411
- threshold = topk_vals[-1]
412
- logits[logits < threshold] = float("-inf")
413
-
414
- probs = F.softmax(logits, dim=-1)
415
- next_id = torch.multinomial(probs, num_samples=1).item()
416
- return int(next_id)
417
-
418
-
419
- # ---------------------------------------------------------------------------
420
- # Model initialisation at module level
421
- # ---------------------------------------------------------------------------
422
-
423
- # Compute inner_dim to match Julia's SwiGLUFFN sizing:
424
- # raw_inner = floor(4 * n_embd * 2 / 3) = floor(4*512*2/3) = 1365
425
- # inner_dim = max(64, 64 * div(raw_inner + 32, 64))
426
- # = max(64, 64 * div(1397, 64))
427
- # = max(64, 64 * 21) = 1344
428
- _raw_inner = int(math.floor(4 * N_EMBD * 2 / 3))
429
- _INNER_DIM = max(64, 64 * ((_raw_inner + 32) // 64)) # 1344
430
-
431
- print(f"Building GPT model (n_layer={N_LAYER}, n_embd={N_EMBD}, "
432
- f"n_head={N_HEAD}, n_kv_head={N_KV_HEAD}, inner_dim={_INNER_DIM}) ...")
433
-
434
- MODEL = GPT(
435
- vocab_size=VOCAB_SIZE,
436
- n_embd=N_EMBD,
437
- n_head=N_HEAD,
438
- n_kv_head=N_KV_HEAD,
439
- n_layer=N_LAYER,
440
- block_size=BLOCK_SIZE,
441
- inner_dim=_INNER_DIM,
442
- rope_base=ROPE_BASE,
443
- ).to(DEVICE)
444
-
445
- # Download and load weights from HuggingFace Hub
446
- print(f"Downloading weights from {HF_REPO} ...")
447
- _weights_path = hf_hub_download(repo_id=HF_REPO, filename=HF_WEIGHTS)
448
- print(f"Downloading tokenizer from {HF_REPO} ...")
449
- _tokenizer_path = hf_hub_download(repo_id=HF_REPO, filename=HF_TOKENIZER)
450
-
451
- load_weights_from_jld2(_weights_path, MODEL)
452
- MODEL.eval()
453
-
454
- # Load tokenizer
455
- TOKENIZER: Tokenizer = Tokenizer.from_file(_tokenizer_path)
456
- print("Tokenizer loaded.")
457
-
458
- MODEL_CREATED_AT = int(time.time())
459
- print(f"JuliaFluxGPT ready on device={DEVICE}.")
460
-
461
-
462
- # ---------------------------------------------------------------------------
463
- # Token-by-token generator
464
- # ---------------------------------------------------------------------------
465
-
466
- @torch.inference_mode()
467
- def generate_stream(
468
- prompt: str,
469
- max_tokens: int = 200,
470
- temperature: float = 0.1,
471
- top_k: int = 8,
472
- repetition_penalty: float = 1.3,
473
- ):
474
- """
475
- Yields (token_text: str, is_last: bool) one token at a time.
476
- Uses a sliding window of BLOCK_SIZE tokens.
477
- """
478
- # Encode prompt; if empty start with a random token
479
- if prompt.strip():
480
- input_ids = TOKENIZER.encode(prompt).ids
481
- else:
482
- input_ids = [int(torch.randint(VOCAB_SIZE, (1,)).item())]
483
-
484
- context: list[int] = list(input_ids)
485
- generated: list[int] = []
486
-
487
- for step in range(max_tokens):
488
- # Sliding window
489
- window = context[-BLOCK_SIZE:]
490
- idx = torch.tensor([window], dtype=torch.long, device=DEVICE) # (1, T)
491
-
492
- logits = MODEL(idx) # (1, T, vocab_size)
493
- next_logits = logits[0, -1, :].cpu() # (vocab_size,)
494
-
495
- # Build seen window for repetition penalty (last 64 tokens)
496
- seen = context[max(0, len(context) - 64):]
497
-
498
- next_id = _sample_next_token(
499
- next_logits,
500
- temperature=temperature,
501
- top_k=top_k,
502
- seen_ids=seen,
503
- repetition_penalty=repetition_penalty,
504
- )
505
-
506
- generated.append(next_id)
507
- context.append(next_id)
508
-
509
- token_text = TOKENIZER.decode([next_id])
510
- is_last = (step == max_tokens - 1)
511
- yield token_text, is_last
512
-
513
-
514
- # ---------------------------------------------------------------------------
515
- # Pydantic request / response models
516
- # ---------------------------------------------------------------------------
517
-
518
- class Message(BaseModel):
519
- role: str
520
- content: str
521
-
522
-
523
- class ChatRequest(BaseModel):
524
- model: Optional[str] = MODEL_ID
525
- messages: List[Message]
526
- temperature: Optional[float] = 0.8
527
- max_tokens: Optional[int] = 200
528
- top_k: Optional[int] = 40
529
- repetition_penalty: Optional[float] = 1.3
530
- stream: Optional[bool] = False
531
- n: Optional[int] = 1
532
-
533
-
534
- # ---------------------------------------------------------------------------
535
- # FastAPI application
536
- # ---------------------------------------------------------------------------
537
-
538
- app = FastAPI(title="JuliaFluxGPT", version="1.0.0")
539
-
540
- app.add_middleware(
541
- CORSMiddleware,
542
- allow_origins=["*"],
543
- allow_methods=["*"],
544
- allow_headers=["*"],
545
- )
546
-
547
-
548
- def _openai_error(status: int, message: str, err_type: str = "invalid_request_error", code: str = None):
549
- body = {"error": {"message": message, "type": err_type}}
550
- if code:
551
- body["error"]["code"] = code
552
- return JSONResponse(status_code=status, content=body)
553
-
554
-
555
- @app.exception_handler(HTTPException)
556
- async def http_exception_handler(request: Request, exc: HTTPException):
557
- return _openai_error(exc.status_code, str(exc.detail))
558
-
559
-
560
- @app.exception_handler(RequestValidationError)
561
- async def validation_exception_handler(request: Request, exc: RequestValidationError):
562
- msg = "; ".join(f"{e['loc'][-1]}: {e['msg']}" for e in exc.errors())
563
- return _openai_error(422, msg, code="invalid_request_error")
564
-
565
-
566
- # ── GET / ────────────────────────────────────────────────────────────────────
567
-
568
- @app.get("/")
569
- def root():
570
- return {
571
- "name": "JuliaFluxGPT",
572
- "version": "1.0.0",
573
- "description": "LLaMA-style GPT in Flux.jl — trained on philosophy and mathematics",
574
- "architecture": "RoPE + SwiGLU + GQA + RMSNorm + weight tying",
575
- "hyperparams": {
576
- "vocab_size": VOCAB_SIZE,
577
- "n_embd": N_EMBD,
578
- "n_head": N_HEAD,
579
- "n_kv_head": N_KV_HEAD,
580
- "n_layer": N_LAYER,
581
- "block_size": BLOCK_SIZE,
582
- },
583
- "endpoints": ["/v1/models", "/v1/chat/completions"],
584
- "compatible_with": ["OpenAI API", "OpenRouter"],
585
- }
586
-
587
-
588
- # ── GET /v1/models ───────────────────────────────────────────────────────────
589
-
590
- @app.get("/v1/models")
591
- def list_models():
592
- return {
593
- "object": "list",
594
- "data": [
595
- {
596
- "id": MODEL_ID,
597
- "object": "model",
598
- "created": MODEL_CREATED_AT,
599
- "owned_by": "juliafluxgpt",
600
- }
601
- ],
602
- }
603
-
604
-
605
- # ── POST /v1/chat/completions ─────────────────────────────────────────────────
606
-
607
- def _sse(data: dict) -> str:
608
- return f"data: {json.dumps(data)}\n\n"
609
-
610
-
611
- def _stream_completion(prompt, max_tokens, temperature, top_k, rep_penalty, completion_id):
612
- """Synchronous generator that yields SSE chunks one token at a time."""
613
- token_count = 0
614
- for token_text, is_last in generate_stream(
615
- prompt=prompt,
616
- max_tokens=max_tokens,
617
- temperature=temperature,
618
- top_k=top_k,
619
- repetition_penalty=rep_penalty,
620
- ):
621
- token_count += 1
622
- finish_reason = ("length" if token_count >= max_tokens else "stop") if is_last else None
623
- yield _sse({
624
- "id": completion_id,
625
- "object": "chat.completion.chunk",
626
- "created": int(time.time()),
627
- "model": MODEL_ID,
628
- "choices": [{
629
- "index": 0,
630
- "delta": {"content": token_text},
631
- "finish_reason": finish_reason,
632
- }],
633
- })
634
-
635
- yield "data: [DONE]\n\n"
636
-
637
-
638
- @app.post("/v1/chat/completions")
639
- def chat_completions(request: ChatRequest):
640
- # Extract prompt from the last user message
641
- prompt = request.messages[-1].content.strip() if request.messages else ""
642
- if not prompt:
643
- raise HTTPException(status_code=400, detail="No content in messages")
644
-
645
- max_tokens = max(1, min(request.max_tokens or 200, BLOCK_SIZE))
646
- temperature = max(0.01, min(request.temperature or 0.8, 2.0))
647
- top_k = max(1, min(request.top_k or 40, VOCAB_SIZE))
648
- rep_penalty = max(1.0, min(request.repetition_penalty or 1.3, 3.0))
649
- n = max(1, min(request.n or 1, 4))
650
- completion_id = f"chatcmpl-{uuid.uuid4().hex[:8]}"
651
-
652
- # ── Streaming ────────────────────────────────────────────────────────────
653
- if request.stream:
654
- return StreamingResponse(
655
- _stream_completion(
656
- prompt, max_tokens, temperature,
657
- top_k, rep_penalty, completion_id,
658
- ),
659
- media_type="text/event-stream",
660
- headers={"X-Accel-Buffering": "no"},
661
- )
662
-
663
- # ── Non-streaming (generate all n completions) ────────────────────────────
664
- choices = []
665
- total_completion_tokens = 0
666
-
667
- for i in range(n):
668
- tokens = list(
669
- generate_stream(
670
- prompt=prompt,
671
- max_tokens=max_tokens,
672
- temperature=temperature,
673
- top_k=top_k,
674
- repetition_penalty=rep_penalty,
675
- )
676
- )
677
- content = "".join(t for t, _ in tokens)
678
- total_completion_tokens += len(tokens)
679
- choices.append({
680
- "index": i,
681
- "message": {"role": "assistant", "content": content},
682
- "finish_reason": "length" if len(tokens) >= max_tokens else "stop",
683
- })
684
-
685
- prompt_tokens = len(TOKENIZER.encode(prompt).ids) if prompt else 0
686
-
687
- return {
688
- "id": completion_id,
689
- "object": "chat.completion",
690
- "created": int(time.time()),
691
- "model": MODEL_ID,
692
- "system_fingerprint": "juliafluxgpt-v1",
693
- "choices": choices,
694
- "usage": {
695
- "prompt_tokens": prompt_tokens,
696
- "completion_tokens": total_completion_tokens,
697
- "total_tokens": prompt_tokens + total_completion_tokens,
698
- },
699
- }
700
-
701
-
702
- # ---------------------------------------------------------------------------
703
- # Entrypoint
704
- # ---------------------------------------------------------------------------
705
-
706
- if __name__ == "__main__":
707
- port = int(os.environ.get("PORT", 7860))
708
- uvicorn.run("server:app", host="0.0.0.0", port=port, reload=False)