| --- |
| language: |
| - code |
| license: mit |
| tags: |
| - javascript |
| - code-generation |
| - fill-in-the-middle |
| - gpt |
| - pytorch |
| library_name: custom |
| --- |
| |
| # JSCoder: JavaScript Code Completion Model (~300M) |
|
|
| A GPT-style decoder-only language model trained from scratch on ~1B tokens of |
| JavaScript source code from The Stack. It supports both plain next-token |
| completion and **fill-in-the-middle (FIM)** autocomplete at the cursor |
| position, using the StarCoder-style PSM/SPM format. |
|
|
| ## Architecture |
|
|
| | Hyper-parameter | Value | |
| |---|---| |
| | Parameters | ~300M | |
| | Layers | 24 | |
| | Hidden dim | 1024 | |
| | Heads | 16 | |
| | Context window | 1024 tokens | |
| | Vocabulary | 8 192 (byte-level BPE, JS-tuned) | |
| | Positional encoding | RoPE | |
| | Normalization | RMSNorm | |
| | Activation | SwiGLU | |
| | Weight tying | Yes (embedding ↔ lm_head) | |
| |
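As a rough sanity check on the ~300M figure, the parameter count can be estimated from the table above. This is only a sketch: it assumes a SwiGLU feed-forward width of about 8/3 × the hidden dim (a common convention, not stated in this card), four bias-free `d_model × d_model` attention projections per layer, and it ignores the small RMSNorm weights. `estimate_params` is a hypothetical helper, not part of this repository.

```python
def estimate_params(n_layer=24, d_model=1024, vocab=8192, d_ff=None):
    """Rough parameter estimate for a GPT-style decoder with SwiGLU and tied embeddings."""
    if d_ff is None:
        # ASSUMPTION: SwiGLU hidden width of ~8/3 * d_model (not stated in the model card).
        d_ff = int(8 * d_model / 3)
    embed = vocab * d_model          # shared by the embedding and lm_head (weight tying)
    attn = 4 * d_model * d_model     # Q, K, V and output projections, no biases
    mlp = 3 * d_model * d_ff         # SwiGLU uses gate, up and down projections
    return embed + n_layer * (attn + mlp)

total = estimate_params()
print(f"{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate lands at roughly 310M, consistent with the "~300M" label. A feed-forward width of 4 × the hidden dim would instead give a little over 400M.
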
| ## Files |
| |
| | File | Description | |
| |---|---| |
| | `checkpoints/jscoder_300m/ckpt.pt` | PyTorch checkpoint (`model` state-dict + `config` dict) | |
| | `tokenizer/js_bpe.json` | Byte-level BPE tokenizer (HuggingFace `tokenizers` format) | |
| | `model/gpt.py` | Model definition (`GPT`, `GPTConfig`) | |
| | `tokenizer/tokenizer.py` | `JSCoderTokenizer` wrapper | |
| | `sample.py` | Inference script (plain completion + FIM) | |
|
|
| ## Quick Start |
|
|
| ```bash |
| git clone https://huggingface.co/YOUR_USERNAME/jscoder-300m |
| cd jscoder-300m |
| pip install torch tokenizers |
| ``` |
|
|
| ### Plain completion |
|
|
| ```bash |
| python sample.py \ |
| --ckpt checkpoints/jscoder_300m/ckpt.pt \ |
| --prompt "// returns the sum of all numbers in the array |
| const sumArray = (items) => { |
| let result = 0; |
| for (const item of items) {" \ |
| --max-new-tokens 80 --temperature 0.2 |
| ``` |
|
|
| ### Fill-in-the-middle (autocomplete at cursor) |
|
|
| ```bash |
| python sample.py \ |
| --ckpt checkpoints/jscoder_300m/ckpt.pt \ |
| --fim \ |
| --prefix $'function sum(arr) {\n let total = 0;\n ' \ |
| --suffix $'\n return total;\n}' \ |
| --temperature 0.2 |
| ``` |
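
For readers unfamiliar with the PSM (prefix-suffix-middle) layout, the sketch below shows how such a prompt is typically assembled. The sentinel strings are the ones StarCoder uses. Whether this tokenizer uses the same strings is an assumption, so check `tokenizer/js_bpe.json` for the actual special tokens. `build_psm_prompt` is a hypothetical helper, not a function from `sample.py`.

```python
# Sentinel strings as used by StarCoder.
# ASSUMPTION: this model's tokenizer may define different special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Lay out the code before and after the cursor in prefix-suffix-middle order."""
    # The model is expected to generate the missing middle after the last sentinel.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_psm_prompt(
    "function sum(arr) {\n  let total = 0;\n  ",
    "\n  return total;\n}",
)
print(prompt)
```

The text generated after the final sentinel is the candidate for the code at the cursor; generation is normally stopped at an end-of-sequence or end-of-middle token.
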
|
|
| ### Python API |
|
|
| ```python |
| import torch |
| from model.gpt import GPT, GPTConfig |
| from tokenizer.tokenizer import JSCoderTokenizer |
| |
| ckpt = torch.load("checkpoints/jscoder_300m/ckpt.pt", map_location="cpu") |
| model = GPT(GPTConfig(**ckpt["config"])) |
| model.load_state_dict(ckpt["model"]) |
| model.eval() |
| |
| tok = JSCoderTokenizer.load("tokenizer/js_bpe.json") |
| |
| prompt = "// parses JSON safely\nfunction parseJSON(str) {\n try {" |
| ids = tok.encode(prompt) |
| idx = torch.tensor([ids], dtype=torch.long) |
| |
| with torch.no_grad(): |
|     out = model.generate(idx, max_new_tokens=100, temperature=0.2, top_k=50) |
| |
| print(tok.decode(out[0].tolist())) |
| ``` |
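
The `temperature` and `top_k` arguments control how each next token is drawn. The pure-Python sketch below illustrates how these two settings are commonly applied to a vector of logits. It is not the code inside `model.generate`, which may differ in detail, and `sample_top_k` is a hypothetical name.

```python
import math
import random

def sample_top_k(logits, temperature=0.2, top_k=50, rng=random):
    """Sample one token id from raw logits using temperature scaling and top-k filtering."""
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by a temperature below 1 sharpens the distribution towards the best token.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]   # toy values for a 4-token vocabulary
rng = random.Random(0)
picks = [sample_top_k(toy_logits, temperature=0.2, top_k=2, rng=rng) for _ in range(1000)]
print(picks.count(0) / len(picks))   # fraction of draws that chose the top token
```

With a low temperature such as 0.2, almost all draws go to the highest-scoring token, which is why the examples in this card use it for code completion.
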
|
|
| ## Capability Tiers |
|
|
| The model is most reliable on patterns that dominate its training data: |
|
|
| **Tier 1 - high confidence:** |
| - `try/catch` JSON parse / async fetch wrappers |
| - `for-of` accumulators |
| - Throttle / memoize (when scaffolded with the outer shell) |
|
|
| **Tier 2 - partial (right structure, minor logic error):** |
| - Word capitalisation, type guards, number validation |
|
|
| **Tier 3 - scaffold required:** |
| - `Array.isArray` ternaries, `Set` dedup, `Object.assign` merge, |
| `hasOwnProperty`, deep clone |
|
|
| See [`inference.md`](inference.md) for detailed prompt examples and scaffolding |
| strategies for each tier. |
|
|
| ## Training |
|
|
| Trained with a custom PyTorch loop (`train.py`) on sharded `.bin` token files |
| packed from ~1B tokens of JavaScript from [The Stack](https://huggingface.co/datasets/bigcode/the-stack). |
|
|
| ``` |
| Tokenizer: byte-level BPE, 8 192 vocab, trained on the same corpus |
| Optimizer: AdamW, lr=3e-4, cosine decay, warmup=500 iters |
| Batch size: 512 tokens × grad-accum 128 → ~65k tokens/step |
| Hardware: trained on cloud GPU (A5000+) |
| ``` |
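
The settings above can be turned into a learning-rate schedule and a step budget. The sketch below is illustrative and is not taken from `train.py`: the minimum learning rate (`min_lr`) and the total step count (`max_steps`, taken here as one pass over ~1B tokens) are assumptions, and `lr_at` is a hypothetical helper.

```python
import math

tokens_per_step = 512 * 128                       # micro-batch tokens × grad-accum steps
steps_per_pass = 1_000_000_000 // tokens_per_step  # optimizer steps for ~1B tokens

def lr_at(step, max_lr=3e-4, warmup=500, max_steps=steps_per_pass, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    ASSUMPTION: min_lr and max_steps are illustrative values, not from the model card.
    """
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step >= max_steps:
        return min_lr
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(tokens_per_step, steps_per_pass)
for step in (0, 499, 5000, steps_per_pass):
    print(step, f"{lr_at(step):.2e}")
```

At ~65k tokens per optimizer step, a single pass over the corpus takes on the order of 15k steps, so a 500-iteration warmup covers only the first few percent of training.
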
|
|
| ## Limitations |
|
|
| - Trained on JavaScript only; will not generalise to other languages. |
| - Small vocabulary (8 192) causes slightly longer tokenisation of uncommon |
| identifiers. |
| - Recursive / divide-and-conquer patterns are weak; the model has not seen |
| enough of them to generalise reliably. |
| - Not RLHF-tuned; outputs are raw language model completions. |
|
|
| ## License |
|
|
| MIT |
|
|