	---
	language:
	- code
	license: mit
	tags:
	- javascript
	- code-generation
	- fill-in-the-middle
	- gpt
	- pytorch
	library_name: custom
	---

	# JSCoder — JavaScript Code Completion Model (~300M)

	A GPT-style decoder-only language model trained from scratch on ~1B tokens of
	JavaScript source code (sourced from The Stack). It supports both plain
	next-token completion and fill-in-the-middle (FIM) autocomplete at the
	cursor position (StarCoder-style PSM/SPM format).

	## Architecture

	| Hyper-parameter | Value |
	|---|---|
	| Parameters | ~300M |
	| Layers | 24 |
	| Hidden dim | 1024 |
	| Heads | 16 |
	| Context window | 1024 tokens |
	| Vocabulary | 8 192 (byte-level BPE, JS-tuned) |
	| Positional encoding | RoPE |
	| Normalization | RMSNorm |
	| Activation | SwiGLU |
	| Weight tying | Yes (embedding ↔ lm_head) |
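
As a sanity check on the ~300M figure, the dimensions in the table can be turned into a rough parameter count. This is a sketch under assumptions that this README does not state: bias-free linear layers and a SwiGLU hidden width of about 8/3 of the hidden dim (`ffn_mult` is a guess, not the checkpoint's recorded setting).

```python
def estimate_params(n_layer=24, d_model=1024, vocab_size=8192, ffn_mult=8 / 3):
    """Rough parameter count for a GPT-style decoder with RoPE, RMSNorm and SwiGLU."""
    d_ff = int(ffn_mult * d_model)    # ASSUMPTION: SwiGLU hidden width, not given in the README
    attn = 4 * d_model * d_model      # Q, K, V and output projections (ASSUMPTION: no biases)
    mlp = 3 * d_model * d_ff          # SwiGLU gate, up and down projections
    norms = 2 * d_model               # two RMSNorm gain vectors per block
    embedding = vocab_size * d_model  # counted once because it is tied to lm_head
    final_norm = d_model              # RMSNorm before the output head
    # RoPE contributes no learned parameters.
    return n_layer * (attn + mlp + norms) + embedding + final_norm


total = estimate_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands a little above 300M, which is consistent with the headline figure; a different feed-forward width would shift it.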

	## Files

	| File | Description |
	|---|---|
	| `checkpoints/jscoder_300m/ckpt.pt` | PyTorch checkpoint (`model` state-dict + `config` dict) |
	| `tokenizer/js_bpe.json` | Byte-level BPE tokenizer (HuggingFace `tokenizers` format) |
	| `model/gpt.py` | Model definition (`GPT`, `GPTConfig`) |
	| `tokenizer/tokenizer.py` | `JSCoderTokenizer` wrapper |
	| `sample.py` | Inference script (plain completion + FIM) |

	## Quick Start

	```bash
	git clone https://huggingface.co/YOUR_USERNAME/jscoder-300m
	cd jscoder-300m
	pip install torch tokenizers
	```

	### Plain completion

	```bash
	python sample.py \
	  --ckpt checkpoints/jscoder_300m/ckpt.pt \
	  --prompt "// returns the sum of all numbers in the array
	const sumArray = (items) => {
	let result = 0;
	for (const item of items) {" \
	  --max-new-tokens 80 --temperature 0.2
	```
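
The prompt and the generated tokens share the 1024-token context window, so a long prompt leaves less room for the completion. The sketch below shows one common way to budget for this: keep only the most recent prompt tokens. Whether `sample.py` or `model.generate` already crops the context is not stated here, and `fit_prompt` is a hypothetical helper, not part of this repository.

```python
def fit_prompt(ids, max_new_tokens, context_window=1024):
    """Left-truncate token ids so prompt + completion fit in the context window."""
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context window")
    # Keep the tail: for completion, the code nearest the cursor matters most.
    return ids[-budget:]


# Example: a 2000-token prompt with 80 new tokens keeps the last 944 ids.
trimmed = fit_prompt(list(range(2000)), max_new_tokens=80)
```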

	### Fill-in-the-middle (autocomplete at cursor)

	```bash
	python sample.py \
	  --ckpt checkpoints/jscoder_300m/ckpt.pt \
	  --fim \
	  --prefix $'function sum(arr) {\n let total = 0;\n ' \
	  --suffix $'\n return total;\n}' \
	  --temperature 0.2
	```
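
For readers unfamiliar with the PSM/SPM layouts, the sketch below shows how a prefix and suffix are typically arranged around sentinel tokens before the model generates the middle. The sentinel strings are StarCoder's names and are an assumption here; this tokenizer's actual special tokens may differ, and `build_fim_prompt` is a hypothetical helper rather than the code inside `sample.py`.

```python
# ASSUMPTION: StarCoder-style sentinel names; check the tokenizer for the real ones.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix, suffix, mode="psm"):
    """Arrange prefix and suffix so the model continues with the missing middle."""
    if mode == "psm":
        # Prefix-Suffix-Middle: generation starts right after the middle sentinel.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    if mode == "spm":
        # Suffix-Prefix-Middle: the prefix comes last, so the model continues it directly.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}"
    raise ValueError(f"unknown FIM mode: {mode!r}")


prompt = build_fim_prompt(
    "function sum(arr) {\n let total = 0;\n ",
    "\n return total;\n}",
)
```

In either layout, the generated text is the code that belongs at the cursor, and generation normally stops at an end-of-sequence token.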

	### Python API

	```python
	import torch
	from model.gpt import GPT, GPTConfig
	from tokenizer.tokenizer import JSCoderTokenizer

	ckpt = torch.load("checkpoints/jscoder_300m/ckpt.pt", map_location="cpu")
	model = GPT(GPTConfig(**ckpt["config"]))
	model.load_state_dict(ckpt["model"])
	model.eval()

	tok = JSCoderTokenizer.load("tokenizer/js_bpe.json")

	prompt = "// parses JSON safely\nfunction parseJSON(str) {\n try {"
	ids = tok.encode(prompt)
	idx = torch.tensor([ids], dtype=torch.long)

	with torch.no_grad():
	    out = model.generate(idx, max_new_tokens=100, temperature=0.2, top_k=50)

	print(tok.decode(out[0].tolist()))
	```
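
The `temperature` and `top_k` arguments control how the next token is drawn from the model's logits. The internals of `generate` are not shown in this README, so the following is a sketch of the usual scheme (top-k filtering followed by a temperature-scaled softmax) in plain Python, not a copy of the model's code; `sample_next_token` is a hypothetical name.

```python
import math
import random


def sample_next_token(logits, temperature=0.2, top_k=50, rng=random):
    """Draw one token id from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep only the ids of the k highest-scoring tokens.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by a temperature below 1 sharpens the distribution toward the best token.
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized, numerically stable softmax
    return rng.choices(top_ids, weights=weights, k=1)[0]


# With top_k=1 the draw is greedy: the highest logit always wins.
greedy = sample_next_token([0.1, 5.0, 0.3], top_k=1)
```

A low temperature such as 0.2 keeps completions close to the most likely continuation, which tends to suit code better than more exploratory settings.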

	## Capability Tiers

	The model is most reliable on patterns that dominate its training data:

	Tier 1 — high confidence:
	- `try/catch` JSON parse / async fetch wrappers
	- `for-of` accumulators
	- Throttle / memoize (when scaffolded with the outer shell)

	Tier 2 — partial (right structure, minor logic error):
	- Word capitalisation, type guards, number validation

	Tier 3 — scaffold required:
	- `Array.isArray` ternaries, `Set` dedup, `Object.assign` merge,
	`hasOwnProperty`, deep clone

	See [`inference.md`](inference.md) for detailed prompt examples and scaffolding
	strategies for each tier.
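
To make "scaffolding" concrete: the idea is to hand the model a descriptive comment, the function's opening line, and optionally the first statements of the body, so that it only has to continue a familiar shape. The helper below is a hypothetical illustration of that idea; the prompts actually recommended for each tier are the ones in `inference.md`, which may look different.

```python
def scaffold_prompt(description, signature, body_lines=()):
    """Build a prompt from a `//` comment, an opening line and optional body lines."""
    lines = [f"// {description}", f"{signature} {{"]
    lines += [f"  {line}" for line in body_lines]
    # End on an indented empty line so the model continues inside the function body.
    return "\n".join(lines) + "\n  "


# A Tier 3 style request (illustrative only): supply the outer shell and the first statement.
prompt = scaffold_prompt(
    "returns a shallow merge of two objects",
    "function merge(a, b)",
    ["const result = Object.assign({}, a);"],
)
```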

	## Training

	Trained with a custom PyTorch loop (`train.py`) on sharded `.bin` token files
	packed from ~1B tokens of JavaScript from [The Stack](https://huggingface.co/datasets/bigcode/the-stack).

	```
	Tokenizer: byte-level BPE, 8 192 vocab, trained on the same corpus
	Optimizer: AdamW, lr=3e-4, cosine decay, warmup=500 iters
	Batch size: 512 tokens × grad-accum 128 → ~65k tokens/step
	Hardware: trained on cloud GPU (A5000+)
	```
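
The numbers above can be checked and turned into a schedule. The sketch below reproduces the tokens-per-step arithmetic and shows a typical linear-warmup plus cosine-decay curve. The total iteration count and the final learning rate are not given in this README, so `max_iters` and `min_lr` are assumptions, and `lr_at` is a hypothetical stand-in for whatever `train.py` does.

```python
import math

tokens_per_step = 512 * 128                          # 65,536, i.e. the "~65k tokens/step" above
steps_per_pass = 1_000_000_000 // tokens_per_step    # steps for one pass over ~1B tokens


def lr_at(it, max_iters, base_lr=3e-4, warmup_iters=500, min_lr=0.0):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        return base_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# ASSUMPTION: one pass over the data; the README does not say how many passes were run.
schedule = [lr_at(it, max_iters=steps_per_pass) for it in (0, 499, 500, steps_per_pass)]
```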

	## Limitations

	- Trained on JavaScript only; will not generalise to other languages.
	- Small vocabulary (8 192) causes slightly longer tokenisation of uncommon
	identifiers.
	- Recursive / divide-and-conquer patterns are weak — the model has not seen
	enough of them to generalise reliably.
	- Not RLHF-tuned; outputs are raw language model completions.

	## License

	MIT