---
language:
- en
license: mit
library_name: flux
tags:
- julia
- flux-jl
- character-level
- philosophy
- transformer
- gpt-2
- text-generation
pipeline_tag: text-generation
datasets:
- LisaMegaWatts/philosophy-corpus
model-index:
- name: JuliaGPT-v2
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: LisaMegaWatts/philosophy-corpus
name: philosophy-corpus
metrics:
- type: loss
value: 2.91
name: Val Loss
verified: false
---
# JuliaGPT-v2
A **~10M parameter** character-level GPT trained on classical philosophy texts. Scaled-up successor to the original [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) (8K params), using the same 38-character vocabulary but with a much larger architecture.
## Model Lineage
| Model | Params | Architecture | Vocab | Val Loss |
|-------|--------|-------------|-------|----------|
| [MicroJulia](https://huggingface.co/LisaMegaWatts/MicroJulia) | 4,992 | 1L/16d/4H, block=64 | 27 chars | 2.43 |
| [JuliaGPT](https://huggingface.co/LisaMegaWatts/JuliaGPT) | 8,096 | 1L/16d/4H, block=256 | 29 chars | 2.34 |
| **JuliaGPT-v2** | **~10M** | **6L/384d/6H, block=256** | **38 chars** | **2.91** |
## Architecture
```
GPT (GPT-2 style, scaled)
+-- wte: Embedding(38 -> 384)
+-- wpe: Embedding(256 -> 384) [learned position embeddings]
+-- blocks x 6:
| +-- attn: CausalSelfAttention
| | +-- wq: Dense(384 -> 384) [6 heads, 64 dim each]
| | +-- wk: Dense(384 -> 384)
| | +-- wv: Dense(384 -> 384)
| | +-- wo: Dense(384 -> 384)
| +-- ffwd: FeedForward
| +-- Dense(384 -> 1536)
| +-- Dense(1536 -> 384)
+-- lm_head: Dense(384 -> 38)
```
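The "~10M" figure can be sanity-checked from the diagram above. The sketch below assumes biased `Dense` layers and two LayerNorms per block (standard GPT-2 style); the real model's exact total may differ slightly, e.g. by a final LayerNorm.

```julia
# Back-of-the-envelope parameter count for the architecture above.
# Assumes biased Dense layers and two LayerNorms per block.
n_embd, n_layer = 384, 6
vocab, block_size, ffn = 38, 256, 4 * 384

emb   = vocab * n_embd + block_size * n_embd           # wte + wpe
attn  = 4 * (n_embd * n_embd + n_embd)                 # wq, wk, wv, wo
ffwd  = (n_embd * ffn + ffn) + (ffn * n_embd + n_embd) # two Dense layers
norms = 2 * 2 * n_embd                                 # 2 LayerNorms (scale + bias)
head  = n_embd * vocab + vocab                         # lm_head (untied)

total = emb + n_layer * (attn + ffwd + norms) + head
println(total)  # 10,774,310 under these assumptions, i.e. ~10.8M
```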
### Model Details
| Parameter | Value |
|-----------|-------|
| Architecture | GPT-2 style Transformer |
| Parameters | ~10M |
| Embedding dim | 384 |
| Layers | 6 |
| Attention heads | 6 |
| Head dim | 64 |
| Context length | 256 characters |
| Vocabulary | 38 characters (a-z, space, punctuation) |
| Dropout | 0.1 |
| Weight tying | No (separate lm_head) |
| Framework | Julia + Flux.jl |
### Vocabulary
38 characters: `` !"'(),-.:;?abcdefghijklmnopqrstuvwxyz``
Character-level tokenization with no BPE: each character is one token.
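A character-level encoder/decoder over this vocabulary is a few lines of Julia. This is a minimal sketch using 1-based indices in the order listed above; the actual index mapping lives in `vocab.json` and may differ.

```julia
# Minimal char-level tokenizer for the 38-character vocabulary above.
# Indices are 1-based and follow the listed character order (an assumption;
# the shipped vocab.json is authoritative).
chars = collect(" !\"'(),-.:;?abcdefghijklmnopqrstuvwxyz")
stoi  = Dict(c => i for (i, c) in enumerate(chars))
itos  = Dict(i => c for (i, c) in enumerate(chars))

encode(s) = [stoi[c] for c in s]
decode(ix) = join(itos[i] for i in ix)

decode(encode("know thyself."))  # round-trips to "know thyself."
```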
## Training
| | Value |
|---|---|
| Dataset | Classical philosophy corpus |
| Training steps | 14,739 |
| Best val loss | 2.91 |
| Hardware | NVIDIA RTX 3060 12GB |
| Precision | Float32 |
## Inference Settings
| Parameter | Value |
|-----------|-------|
| vocab_size | 38 |
| context_length | 256 |
| temperature | 0.8 |
| top_k | 40 |
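The two sampling knobs above combine in the usual way: divide logits by `temperature`, keep only the `top_k` largest, renormalize, and draw. A sketch (the `logits` here are random stand-in data; the real model emits a 38-vector per step — note that with a 38-character vocabulary, `top_k = 40` keeps every character):

```julia
# Temperature + top-k sampling over a vector of character logits.
function sample_char(logits; temperature = 0.8, top_k = 40)
    k = min(top_k, length(logits))                 # top_k=40 > 38 keeps all chars
    idx = partialsortperm(logits, 1:k; rev = true) # indices of the k largest logits
    scaled = logits[idx] ./ temperature
    probs = exp.(scaled .- maximum(scaled))        # stable softmax
    probs ./= sum(probs)
    r, c = rand(), 0.0                             # inverse-CDF draw
    for (j, p) in zip(idx, probs)
        c += p
        c >= r && return j
    end
    return idx[end]
end

sample_char(randn(38))  # index into the 38-character vocabulary
```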
## Checkpoint Format
JLD2 files containing:
- `model_state` — Flux model weights
- `hyperparams` — `Dict("n_embd"=>384, "n_layer"=>6, "n_head"=>6, "vocab_size"=>38, "block_size"=>256, "dropout"=>0.1)`
- `step` — 14,739
- `best_val_loss` — 2.91
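Restoring a checkpoint means rebuilding the model from `hyperparams` and then loading the stored weights. A hedged sketch of that workflow, assuming a constructor like the one in the source repo (`build_gpt` below is hypothetical, not verbatim from the code):

```julia
# Loading a checkpoint with JLD2, using the keys documented above.
using JLD2, Flux

ckpt = JLD2.load("best_model.jld2")   # returns a Dict of the stored keys
hp   = ckpt["hyperparams"]            # Dict("n_embd"=>384, "n_layer"=>6, ...)
@show ckpt["step"] ckpt["best_val_loss"]

model = build_gpt(hp)                 # hypothetical constructor from the repo
Flux.loadmodel!(model, ckpt["model_state"])
```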
## Files
| File | Description |
|------|-------------|
| `final_model.jld2` | Final training checkpoint |
| `best_model.jld2` | Best validation loss checkpoint |
| `checkpoint_latest.jld2` | Latest periodic checkpoint |
| `vocab.json` | Character vocabulary (38 chars) |
## Provenance
- **Author**: LisaMegaWatts
- **Source code**: [DavinciDreams/JuliaGPT](https://github.com/DavinciDreams/JuliaGPT)
## License
MIT