---
language:
  - code
license: mit
tags:
  - javascript
  - code-generation
  - fill-in-the-middle
  - gpt
  - pytorch
library_name: custom
---

# JSCoder — JavaScript Code Completion Model (~300M)

A GPT-style decoder-only language model trained from scratch on ~1B tokens of
JavaScript source code from The Stack. It supports both plain next-token
completion and **fill-in-the-middle (FIM)** autocomplete at the cursor
position, using StarCoder-style PSM (prefix-suffix-middle) and SPM
(suffix-prefix-middle) prompt formats.

## Architecture

| Hyper-parameter | Value |
|---|---|
| Parameters | ~300M |
| Layers | 24 |
| Hidden dim | 1024 |
| Heads | 16 |
| Context window | 1024 tokens |
| Vocabulary | 8 192 (byte-level BPE, JS-tuned) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| Weight tying | Yes (embedding ↔ lm_head) |
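
As a rough sanity check on the ~300M figure, the parameter count can be estimated from the table above. This is a sketch under stated assumptions, not the model definition: it assumes bias-free linear layers, a SwiGLU hidden width of 8/3 × the hidden dim (a common choice, not confirmed for this model), and no learned positional parameters (RoPE has none). The authoritative definition is `model/gpt.py`.

```python
def estimate_params(n_layer=24, d_model=1024, vocab=8192, ffn_mult=8 / 3):
    """Rough parameter count for a GPT-style decoder with SwiGLU and tied embeddings.

    ffn_mult is an assumption: SwiGLU hidden width = ffn_mult * d_model.
    """
    d_ff = int(ffn_mult * d_model)
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 3 * d_model * d_ff       # gate, up and down projections
    norms = 2 * d_model            # two RMSNorm gains per block
    per_layer = attn + mlp + norms
    embed = vocab * d_model        # counted once: shared with lm_head (weight tying)
    final_norm = d_model
    return n_layer * per_layer + embed + final_norm


total = estimate_params()
print(f"{total / 1e6:.0f}M parameters")
```

Under these assumptions the estimate comes out at roughly 310M, which is consistent with the "~300M" label; a different feed-forward width would shift the number.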

## Files

| File | Description |
|---|---|
| `checkpoints/jscoder_300m/ckpt.pt` | PyTorch checkpoint (`model` state-dict + `config` dict) |
| `tokenizer/js_bpe.json` | Byte-level BPE tokenizer (HuggingFace `tokenizers` format) |
| `model/gpt.py` | Model definition (`GPT`, `GPTConfig`) |
| `tokenizer/tokenizer.py` | `JSCoderTokenizer` wrapper |
| `sample.py` | Inference script (plain completion + FIM) |

## Quick Start

```bash
git clone https://huggingface.co/YOUR_USERNAME/jscoder-300m
cd jscoder-300m
pip install torch tokenizers
```

### Plain completion

```bash
python sample.py \
  --ckpt checkpoints/jscoder_300m/ckpt.pt \
  --prompt "// returns the sum of all numbers in the array
const sumArray = (items) => {
  let result = 0;
  for (const item of items) {" \
  --max-new-tokens 80 --temperature 0.2
```

### Fill-in-the-middle (autocomplete at cursor)

```bash
python sample.py \
  --ckpt checkpoints/jscoder_300m/ckpt.pt \
  --fim \
  --prefix $'function sum(arr) {\n  let total = 0;\n  ' \
  --suffix $'\n  return total;\n}' \
  --temperature 0.2
```
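
For readers who want to see what a PSM prompt looks like before it reaches the model, here is a minimal sketch. The sentinel strings follow StarCoder's naming; whether this tokenizer registers the same strings is an assumption, so check `tokenizer/js_bpe.json` and `sample.py` for the real ones. The helper names `build_psm_prompt` and `splice` are hypothetical.

```python
# Sentinel strings follow StarCoder's naming. The tokens this model's
# tokenizer actually uses are an assumption here.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Prefix-Suffix-Middle layout: the model generates the middle after the last sentinel."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the source text once the middle has been generated."""
    return prefix + middle + suffix


prefix = "function sum(arr) {\n  let total = 0;\n  "
suffix = "\n  return total;\n}"
prompt = build_psm_prompt(prefix, suffix)

# A hand-written middle, standing in for model output, to show the splice step.
example_middle = "for (const x of arr) total += x;"
completed = splice(prefix, example_middle, suffix)
```

The generated middle is inserted between the original prefix and suffix, which is what an editor integration would write back at the cursor.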

### Python API

```python
import torch
from model.gpt import GPT, GPTConfig
from tokenizer.tokenizer import JSCoderTokenizer

ckpt = torch.load("checkpoints/jscoder_300m/ckpt.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["config"]))
model.load_state_dict(ckpt["model"])
model.eval()

tok = JSCoderTokenizer.load("tokenizer/js_bpe.json")

prompt = "// parses JSON safely\nfunction parseJSON(str) {\n  try {"
ids = tok.encode(prompt)
idx = torch.tensor([ids], dtype=torch.long)

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.2, top_k=50)

print(tok.decode(out[0].tolist()))
```
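
Because the context window is 1024 tokens, a long prompt plus `max_new_tokens` can exceed it. Whether `model.generate` crops its input internally is not stated here, so the sketch below trims on the caller's side. The helper name `fit_to_context` is hypothetical and not part of this repository.

```python
def fit_to_context(ids, max_new_tokens, context_window=1024):
    """Keep the most recent tokens so prompt + generation fits the context window."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context window")
    # Negative slicing keeps the tail; shorter prompts are returned unchanged.
    return ids[-budget:]


long_prompt_ids = list(range(2000))   # stand-in for tok.encode(...) output
trimmed = fit_to_context(long_prompt_ids, max_new_tokens=100)
```

Keeping the tail rather than the head preserves the code nearest the cursor, which usually matters most for completion.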

## Capability Tiers

The model is most reliable on patterns that dominate its training data:

**Tier 1 — high confidence:**
- `try/catch` wrappers around JSON parsing and async `fetch` calls
- `for-of` accumulators
- Throttle / memoize (when the prompt supplies the outer function shell)

**Tier 2 — partial (right structure, minor logic error):**
- Word capitalisation, type guards, number validation

**Tier 3 — scaffold required:**
- `Array.isArray` ternaries, `Set` dedup, `Object.assign` merge,
  `hasOwnProperty`, deep clone

See [`inference.md`](inference.md) for detailed prompt examples and scaffolding
strategies for each tier.
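
As an illustration of what "scaffolding" can mean in practice, the hypothetical helper below builds a prompt that supplies a descriptive comment, the function signature, and the first tokens of the body, so the model only has to finish a pattern it has already been pointed at. The helper name and the example prompt are illustrative and are not taken from `inference.md`.

```python
def scaffold_prompt(comment: str, signature: str, body_start: str = "") -> str:
    """Build a prompt from a comment, a function signature, and an optional body opener."""
    lines = [f"// {comment}", f"{signature} {{"]
    if body_start:
        # Indent the opener so it reads as the first statement of the body.
        lines.append(f"  {body_start}")
    return "\n".join(lines)


prompt = scaffold_prompt(
    "removes duplicate values from an array",
    "function unique(arr)",
    "return [...new Set(",
)
print(prompt)
```

The resulting string can be passed as `--prompt` to `sample.py` or encoded with the tokenizer in the Python API.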

## Training

Trained with a custom PyTorch loop (`train.py`) on sharded `.bin` token files
packed from ~1B tokens of JavaScript from [The Stack](https://huggingface.co/datasets/bigcode/the-stack).

```
Tokenizer:  byte-level BPE, 8 192 vocab, trained on the same corpus
Optimizer:  AdamW, lr=3e-4, cosine decay, warmup=500 iters
Batch size: 512 tokens × grad-accum 128 → ~65k tokens/step
Hardware:   trained on cloud GPU (A5000+)
```
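
The figures above imply the following step arithmetic and learning-rate shape. This is a sketch, not an excerpt from `train.py`: the single pass over ~1B tokens, the linear warmup, and the `min_lr` floor are assumptions made for illustration.

```python
import math

TOKENS_PER_STEP = 512 * 128        # micro-batch tokens x grad-accum steps = 65,536
TOTAL_TOKENS = 1_000_000_000       # ~1B tokens, treated as a single pass (assumption)
MAX_ITERS = TOTAL_TOKENS // TOKENS_PER_STEP


def lr_at(it, max_lr=3e-4, warmup=500, max_iters=MAX_ITERS, min_lr=3e-5):
    """Linear warmup followed by cosine decay; min_lr is an assumed floor."""
    if it < warmup:
        return max_lr * (it + 1) / warmup
    if it >= max_iters:
        return min_lr
    progress = (it - warmup) / (max_iters - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(MAX_ITERS, lr_at(0), lr_at(500), lr_at(MAX_ITERS))
```

With these numbers one pass over the corpus is about 15k optimizer steps, so the 500-iteration warmup covers roughly the first 3% of training.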

## Limitations

- Trained on JavaScript only; it is not expected to generalise to other languages.
- The small vocabulary (8 192) tokenises uncommon identifiers into slightly
  longer sequences, which uses up more of the 1024-token context window.
- Recursive / divide-and-conquer patterns are weak — the model has not seen
  enough of them to generalise reliably.
- Not RLHF-tuned; outputs are raw language model completions.

## License

MIT