How to use from SGLang
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "programasweights/paw-4b-gpt2" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "programasweights/paw-4b-gpt2",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "programasweights/paw-4b-gpt2" \
        --host 0.0.0.0 \
        --port 30000

paw-4b-gpt2: ProgramAsWeights "Compact" compiler

This is the Compact compiler from ProgramAsWeights (PAW). Given a natural-language spec, it emits a tiny per-task program, a LoRA adapter, that runs locally on a GPT-2 (124M) interpreter (small enough to run in the browser).

It is the model invoked by paw.compile(spec, compiler="paw-4b-gpt2").

  • Compiler base model: Qwen/Qwen3-4B-Instruct-2507
  • Target interpreter: a custom GPT-2 (124M) whose positional embeddings are extended from 1024 to 2048 (n_ctx=2048); the tokenizer is stock GPT-2 BPE.
  • Snapshot: 20260406 (see git tag 20260406)

Contents

  • compiler/: a finetuned Qwen3-4B-Instruct-2507 causal LM (the compiler).
  • lora_mapper.pt: the mapper head (trunk + coefficient head + learnable LoRA basis matrices) that turns the compiler's hidden states into a LoRA program.
  • meta.json: lora_rank=64, lora_alpha=16, lora_num_bases=64, prefix_steps=64, target modules [c_attn, c_proj, c_fc].
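
As a rough illustration of how the meta.json values might be consumed, the sketch below parses a JSON document carrying the numbers listed above and derives the conventional LoRA scaling factor alpha / rank. The key name "target_modules" and the assumption that PAW applies the standard alpha / rank scaling are not confirmed by this card; only the numeric values and module names come from it.

```python
import json

# Values copied from the meta.json summary above. The key name
# "target_modules" is an assumption; the real file may use another name.
META_JSON = """
{
  "lora_rank": 64,
  "lora_alpha": 16,
  "lora_num_bases": 64,
  "prefix_steps": 64,
  "target_modules": ["c_attn", "c_proj", "c_fc"]
}
"""


def lora_scaling(meta: dict) -> float:
    """Conventional LoRA scaling factor alpha / rank (assumed to apply here)."""
    return meta["lora_alpha"] / meta["lora_rank"]


meta = json.loads(META_JSON)
scaling = lora_scaling(meta)
print(f"rank={meta['lora_rank']} alpha={meta['lora_alpha']} scaling={scaling}")
```

Under that convention, the low-rank update would be multiplied by 16 / 64 before being added to the frozen interpreter weights.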

How it works

  1. The 4B compiler generates a short "pseudo-program" (a task description plus a few I/O examples) from the spec.
  2. The text chat_template(spec) + pseudo-program + 64 prefix tokens is run through the compiler; the mapper reads the 64 prefix hidden states and emits per-layer LoRA A/B matrices as a learned mixture of basis matrices.
  3. The resulting LoRA (about 5 MB) is the program. It loads onto the GPT-2 interpreter and runs locally/offline (including in-browser).
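
The "learned mixture of basis matrices" in step 2 can be pictured with a toy sketch. It is not the PAW implementation: the dimensions are tiny, the bases and coefficients are random stand-ins for what lora_mapper.pt would supply, and the helper names (mix_bases, matmul) are hypothetical. It only shows how one coefficient vector could combine shared basis matrices into a LoRA A/B pair for a single layer.

```python
import random

random.seed(0)

# Toy sizes for illustration only; the card lists rank 64 and 64 bases.
NUM_BASES, RANK, D_IN, D_OUT = 3, 2, 4, 5
ALPHA = 0.5  # hypothetical alpha, giving scale = ALPHA / RANK


def rand_matrix(rows, cols):
    """Random matrix as a list of rows (stand-in for a learned basis)."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def mix_bases(coeffs, bases):
    """Return sum_k coeffs[k] * bases[k] for equally shaped matrices."""
    rows, cols = len(bases[0]), len(bases[0][0])
    mixed = [[0.0] * cols for _ in range(rows)]
    for c, basis in zip(coeffs, bases):
        for i in range(rows):
            for j in range(cols):
                mixed[i][j] += c * basis[i][j]
    return mixed


def matmul(x, y):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


# Shared bases (in the real model these would be learned parameters).
a_bases = [rand_matrix(RANK, D_IN) for _ in range(NUM_BASES)]
b_bases = [rand_matrix(D_OUT, RANK) for _ in range(NUM_BASES)]

# Mixture coefficients for one layer (in the real model these would be
# predicted by the mapper from the prefix hidden states).
coeffs = [random.uniform(-1.0, 1.0) for _ in range(NUM_BASES)]

lora_a = mix_bases(coeffs, a_bases)  # shape (RANK, D_IN)
lora_b = mix_bases(coeffs, b_bases)  # shape (D_OUT, RANK)

# Low-rank weight update: delta_w = (ALPHA / RANK) * B @ A, shape (D_OUT, D_IN).
scale = ALPHA / RANK
delta_w = [[scale * v for v in row] for row in matmul(lora_b, lora_a)]
print(len(delta_w), len(delta_w[0]))
```

Because every layer draws on the same bases, only the coefficients have to be predicted per task, which is one plausible reason for this design; the card itself does not state the motivation.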

Status
