---
language:
- code
license: mit
tags:
- javascript
- code-generation
- fill-in-the-middle
- gpt
- pytorch
library_name: custom
---

# JSCoder: JavaScript Code Completion Model (~300M)

A GPT-style decoder-only language model trained from scratch on ~1B tokens of JavaScript source code (sourced from The Stack). It supports both plain next-token completion and **fill-in-the-middle (FIM)** autocomplete at the cursor position (StarCoder-style PSM/SPM format).

## Architecture

| Hyper-parameter | Value |
|---|---|
| Parameters | ~300M |
| Layers | 24 |
| Hidden dim | 1024 |
| Heads | 16 |
| Context window | 1024 tokens |
| Vocabulary | 8 192 (byte-level BPE, JS-tuned) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| Weight tying | Yes (embedding ↔ lm_head) |

## Files

| File | Description |
|---|---|
| `checkpoints/jscoder_300m/ckpt.pt` | PyTorch checkpoint (`model` state-dict + `config` dict) |
| `tokenizer/js_bpe.json` | Byte-level BPE tokenizer (HuggingFace `tokenizers` format) |
| `model/gpt.py` | Model definition (`GPT`, `GPTConfig`) |
| `tokenizer/tokenizer.py` | `JSCoderTokenizer` wrapper |
| `sample.py` | Inference script (plain completion + FIM) |

## Quick Start

```bash
git clone https://huggingface.co/YOUR_USERNAME/jscoder-300m
cd jscoder-300m
pip install torch tokenizers
```

### Plain completion

```bash
python sample.py \
  --ckpt checkpoints/jscoder_300m/ckpt.pt \
  --prompt "// returns the sum of all numbers in the array
const sumArray = (items) => {
  let result = 0;
  for (const item of items) {" \
  --max-new-tokens 80 --temperature 0.2
```

### Fill-in-the-middle (autocomplete at cursor)

```bash
python sample.py \
  --ckpt checkpoints/jscoder_300m/ckpt.pt \
  --fim \
  --prefix $'function sum(arr) {\n let total = 0;\n ' \
  --suffix $'\n return total;\n}' \
  --temperature 0.2
```

### Python API

```python
import torch

from model.gpt import GPT, GPTConfig
from tokenizer.tokenizer import JSCoderTokenizer

# Load the checkpoint on CPU; it holds the weights ("model") and the
# hyper-parameters ("config") needed to rebuild the architecture.
ckpt = torch.load("checkpoints/jscoder_300m/ckpt.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))
model.load_state_dict(ckpt["model"])
model.eval()  # inference mode

tok = JSCoderTokenizer.load("tokenizer/js_bpe.json")

# Encode the prompt as a batch of one sequence.
prompt = "// parses JSON safely\nfunction parseJSON(str) {\n try {"
ids = tok.encode(prompt)
idx = torch.tensor([ids], dtype=torch.long)

# Sample up to 100 new tokens without tracking gradients.
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.2, top_k=50)

print(tok.decode(out[0].tolist()))
```

## Capability Tiers

The model is most reliable on patterns that dominate its training data:

**Tier 1 (high confidence):**
- `try/catch` JSON parse / async fetch wrappers
- `for-of` accumulators
- Throttle / memoize (when scaffolded with the outer shell)

**Tier 2 (partial: right structure, minor logic error):**
- Word capitalisation, type guards, number validation

**Tier 3 (scaffold required):**
- `Array.isArray` ternaries, `Set` dedup, `Object.assign` merge, `hasOwnProperty`, deep clone

See [`inference.md`](inference.md) for detailed prompt examples and scaffolding strategies for each tier.

## Training

Trained with a custom PyTorch loop (`train.py`) on sharded `.bin` token files packed from ~1B tokens of JavaScript from [The Stack](https://huggingface.co/datasets/bigcode/the-stack).

```
Tokenizer:   byte-level BPE, 8 192 vocab, trained on the same corpus
Optimizer:   AdamW, lr=3e-4, cosine decay, warmup=500 iters
Batch size:  512 tokens × grad-accum 128 → ~65k tokens/step
Hardware:    trained on cloud GPU (A5000+)
```

## Limitations

- Trained on JavaScript only; will not generalise to other languages.
- The small vocabulary (8 192) splits uncommon identifiers into slightly more tokens.
- Recursive / divide-and-conquer patterns are weak; the model has not seen enough of them to generalise reliably.
- Not RLHF-tuned; outputs are raw language model completions.

## License

MIT
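
## Appendix: FIM prompt layout (sketch)

The introduction describes the FIM mode as StarCoder-style PSM/SPM. The sketch below shows how those two layouts are conventionally assembled from a prefix and a suffix. It is an illustration only: the sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are StarCoder's and are assumed here, the helper `build_fim_prompt` is hypothetical, and the tokens actually stored in `tokenizer/js_bpe.json` and used by `sample.py` may differ.

```python
# Sentinel strings follow StarCoder's convention (assumption); check the
# tokenizer file for the special tokens this model was actually trained with.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str, mode: str = "psm") -> str:
    """Assemble a fill-in-the-middle prompt; the model then generates the middle.

    Hypothetical helper for illustration, not part of this repository.
    """
    if mode == "psm":
        # Prefix-Suffix-Middle: show the prefix, then the suffix, then ask for the middle.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    if mode == "spm":
        # Suffix-Prefix-Middle: the suffix comes first, so the generated middle
        # directly continues the prefix text.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}"
    raise ValueError(f"unknown FIM mode: {mode!r}")


# Same prefix and suffix as the command-line FIM example.
prompt = build_fim_prompt(
    prefix="function sum(arr) {\n let total = 0;\n ",
    suffix="\n return total;\n}",
)
print(prompt)
```

In either layout, generation would normally stop at an end-of-text token, and the completed source is the prefix, the generated middle, and the suffix concatenated in that order.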