WIGIP-1 v2
⚠️ Implementation & Training Scripts: The full source code, JAX training loops, and architecture definitions are available on my GitHub: 🔗 Click here to view the Training Scripts on GitHub
Stage 1: Text Pre-Training (ViT-Style Transformer)
WIGIP-1 v2 is an experimental research model exploring Vision Transformer (ViT) style architectures for text modeling, implemented using JAX + Flax with Fully Sharded Data Parallelism (FSDP) via pjit.
This repository currently contains ONLY Phase 1 (Text Pre-Training).
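As background for the FSDP setup mentioned above, here is a minimal sketch of a 2D (`data`, `model`) device mesh in JAX. The device layout, array shapes, and partition spec are illustrative assumptions, not the repository's exact configuration:

```python
import jax
import numpy as np
from jax.sharding import Mesh, PartitionSpec, NamedSharding

# Arrange available devices into a 2D (data, model) mesh.
# With a single CPU device this degenerates to a 1x1 mesh.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard a weight matrix along the "model" axis, replicate along "data".
sharding = NamedSharding(mesh, PartitionSpec(None, "model"))

x = jax.device_put(np.ones((8, 8), np.float32), sharding)
print(x.sharding.spec)
```

Recent JAX versions express the same idea through `jax.jit` with explicit shardings; `pjit` is the older entry point for the same mechanism.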
⚠️ Training Status (IMPORTANT)
✅ Phase 1: Text-only pre-training (DONE)
- Character-level language modeling
- Dataset: C4 (English)
- Architecture: ViT-style transformer applied to reshaped text
- ~57,000 training steps completed
- Training performed using streaming data and FSDP
❌ Phase 2: Image training (NOT DONE)
- No image data has been used
- No multimodal or vision supervision yet
- This phase is planned for future work
🚨 Model weights will be updated in the future once Phase 2 training is performed. Do NOT treat the current checkpoints as a final or multimodal-capable model.
🧠 Model Overview (Phase 1)
- Text is tokenized at character level
- Tokens are reshaped into a 2D grid
- Grid is treated like an image and processed using:
  - Patch embedding via convolution
  - Multi-head self-attention
  - Feed-forward blocks
- Final output predicts the next character token
This phase is intended to test whether ViT-style inductive biases can learn meaningful structure from text alone.
⚙️ Technical Highlights
- JAX + Flax + Optax
- `pjit` with a 2D mesh (`data`, `model`)
- Activation rematerialization (`nn.remat`)
- Gradient clipping
- Warmup + cosine learning rate schedule
- Streaming dataset (no full dataset in memory)
💾 Checkpointing
- Checkpoints are:
  - Automatically saved at regular time intervals
  - Compressed into `.zip` archives
  - Contain:
    - Model parameters (`.pkl.gz`)
    - Optimizer state
    - Training step metadata
- Training can be safely resumed from the latest zipped checkpoint
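A hypothetical sketch of this checkpoint layout, using only the standard library: gzipped-pickle state bundled into a `.zip` archive. The file names and dictionary keys are illustrative assumptions, not the repository's actual format:

```python
import gzip
import pickle
import zipfile

def save_checkpoint(prefix, params, opt_state, step):
    """Pickle + gzip the training state, then bundle it into a .zip archive."""
    pkl = prefix + ".pkl.gz"
    with gzip.open(pkl, "wb") as f:
        pickle.dump({"params": params, "opt_state": opt_state, "step": step}, f)
    with zipfile.ZipFile(prefix + ".zip", "w") as z:
        z.write(pkl, arcname="checkpoint.pkl.gz")

def load_checkpoint(zip_path):
    """Read the gzipped pickle back out of the archive to resume training."""
    with zipfile.ZipFile(zip_path) as z, z.open("checkpoint.pkl.gz") as f:
        return pickle.load(gzip.open(f))

save_checkpoint("ckpt_00057000", {"w": [1.0]}, {"m": [0.0]}, 57000)
state = load_checkpoint("ckpt_00057000.zip")
print(state["step"])  # 57000
```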
🔮 Future Work
- Phase 2: Image-based training
- Multimodal alignment (text + vision)
- Scaling beyond current step count
- Improved tokenization strategies
- Evaluation on downstream tasks
⚠️ Disclaimer
This is research code and an experimental architecture. Results are preliminary and not production-ready.