Open-Source Research on Building Large Language Models from Scratch and Fine-Tuning Them on TPUs and GPUs
AutonomousX aims to make LLM training infrastructure accessible and reproducible for researchers, students, and developers. While modern language models are widely used, complete end-to-end guides for training LLMs from scratch on TPUs remain scarce, particularly for beginners working with JAX and distributed TPU training. AutonomousX focuses on filling this gap by publishing fully reproducible open-source pipelines that demonstrate how to train language models from scratch using limited compute resources.
Compute supporting the development of this organization and its models was provided by Google's TRC Program (TPU Research Cloud).
The organization explores multiple aspects of efficient LLM training on TPUs, including:
The goal is to demonstrate that meaningful LLM research can be conducted even in compute-limited environments.
AutonomousX develops the Instinct family of language models. These models are built entirely from scratch, including the tokenizer, architecture, training pipeline, and TPU training infrastructure. Instinct models explore different configurations such as:
The models are designed to demonstrate how modern language models can be trained on small TPU pods such as TPU v4-8.
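To give a feel for what "small" means here, the following is a rough back-of-envelope sketch of how many parameters might fit on such a device. It is not taken from the AutonomousX code. The function names, the 4-bytes-per-value choice, the Adam-style two optimizer slots, and the fraction of memory reserved for activations are all illustrative assumptions. The 128 GiB total (4 chips with 32 GiB of HBM each) is likewise an assumed figure for a v4-8. The total only applies if the training state is sharded across chips. With fully replicated data parallelism, the per-chip figure is the relevant limit.

```python
# Back-of-envelope memory budget. Every constant below is an assumption
# made for illustration, not a measured or published AutonomousX figure.
GIB = 1024 ** 3


def training_state_bytes(n_params, bytes_per_value=4, optimizer_slots=2):
    """Bytes needed for weights + gradients + optimizer slots.

    An Adam-style optimizer keeps two extra values per weight, so the
    default is 1 (weights) + 1 (gradients) + 2 (optimizer) = 4 copies.
    Activations are NOT included here.
    """
    copies = 1 + 1 + optimizer_slots
    return n_params * bytes_per_value * copies


def max_params(total_hbm_gib, activation_fraction=0.5, **state_kwargs):
    """Largest parameter count whose training state fits in the memory
    left over after reserving `activation_fraction` for activations."""
    budget = total_hbm_gib * GIB * (1.0 - activation_fraction)
    per_param = training_state_bytes(1, **state_kwargs)
    return int(budget // per_param)


if __name__ == "__main__":
    hbm_gib = 4 * 32  # assumed: 4 chips x 32 GiB for a v4-8
    print("state for 1B params (GiB):", training_state_bytes(10**9) / GIB)
    print("max params, sharded state:", max_params(hbm_gib))
    print("max params, replicated state:", max_params(32))
```

The estimate ignores activation checkpointing, mixed precision, and compiler overheads, so treat the numbers as an order-of-magnitude guide only.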
One of the core goals of AutonomousX is to explore efficient training on limited compute resources. Research focuses on training models:
By optimizing training pipelines and architecture design, AutonomousX investigates how far efficient training can scale without access to massive GPU clusters.
AutonomousX publishes complete, reproducible implementations, including dataset pipelines, tokenizer training, model architectures, TPU training scripts, checkpointing systems, and inference pipelines. All repositories aim to provide transparent, educational implementations so the open-source community can learn how large language models are trained from the ground up.
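As a small illustration of one of these components, tokenizer training, here is a minimal pure-Python sketch of byte-pair encoding (BPE) merge learning. The text above does not say which tokenizer algorithm the Instinct models use, so BPE is an assumption chosen for illustration. The helper names `merge_pair`, `learn_bpe_merges`, and `encode` are hypothetical and do not come from any AutonomousX repository.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` in `symbols`
    (scanning left to right) with the concatenated symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Each distinct word starts as a tuple of characters with its frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically larger pair
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Apply the new rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    rules = learn_bpe_merges(["abab", "abab", "cd"], num_merges=2)
    print(rules)
    print(encode("abc", rules))
```

A production tokenizer would typically work on bytes, handle whitespace and special tokens, and use a much faster pair-counting scheme. This sketch only shows the core merge loop.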
Many tutorials focus only on using pretrained models, but very few resources explain:
AutonomousX aims to make these processes accessible, understandable, and reproducible.