# Quick Start
## Installation
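Install Megatron Core with pip, then clone the repository to get the example scripts:
```bash
# 1. Install Megatron Core with required dependencies
#    (the quotes keep shells such as zsh from expanding the brackets)
pip install --no-build-isolation "megatron-core[mlm,dev]"
# 2. Clone the repository for the examples and install it from source
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install --no-build-isolation ".[mlm,dev]"
```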
That's it! You're ready to start training.
## Your First Training Run
### Simple Training Example
```bash
# Distributed training example (2 GPUs, mock data)
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
```
### LLaMA-3 Training Example
```bash
# 8 GPUs, FP8 precision, mock data
./examples/llama/train_llama3_8b_fp8.sh
```
## Data Preparation
### JSONL Data Format
```json
{"text": "Your training text here..."}
{"text": "Another training sample..."}
```
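A file in this shape can be produced with a few lines of Python. The sketch below is a minimal illustration and not part of Megatron-LM: the helper names `write_jsonl` and `read_jsonl` are hypothetical, and the file name `data.jsonl` is chosen only to match the preprocessing command in the next section.

```python
import json
from pathlib import Path


def write_jsonl(samples, path):
    """Write one JSON object per line, each with a single "text" field."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in samples:
            # json.dumps escapes embedded newlines and quotes,
            # so every sample stays on exactly one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read the file back, checking that every line has a string "text" field."""
    texts = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"line {line_no}: expected an object with a string 'text' field")
            texts.append(record["text"])
    return texts


if __name__ == "__main__":
    out = write_jsonl(
        ["Your training text here...", "Another training sample..."],
        "data.jsonl",
    )
    print(f"wrote {len(read_jsonl(out))} samples to {out}")
```

Reading the file back before preprocessing is a cheap way to catch malformed lines early.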
### Basic Preprocessing
```bash
python tools/preprocess_data.py \
--input data.jsonl \
--output-prefix processed_data \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--workers 8 \
--append-eod
```
### Key Arguments
- `--input`: Path to input JSON/JSONL file
- `--output-prefix`: Prefix for output binary files (.bin and .idx)
- `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
- `--tokenizer-model`: Path to tokenizer model file
- `--workers`: Number of parallel workers for processing
- `--append-eod`: Append an end-of-document token to each document
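When preprocessing is driven from a script, the arguments above can be assembled programmatically. The following is a minimal sketch, assuming it runs from the Megatron-LM repository root. The helper `build_preprocess_command` is hypothetical and not part of Megatron-LM, and it uses only the flags listed above.

```python
import shlex
import sys


def build_preprocess_command(input_path, output_prefix, tokenizer_type,
                             tokenizer_model, workers=8, append_eod=True):
    """Assemble the tools/preprocess_data.py invocation as an argument list."""
    cmd = [
        sys.executable, "tools/preprocess_data.py",
        "--input", str(input_path),
        "--output-prefix", str(output_prefix),
        "--tokenizer-type", tokenizer_type,
        "--tokenizer-model", str(tokenizer_model),
        "--workers", str(workers),
    ]
    if append_eod:
        cmd.append("--append-eod")  # boolean flag, takes no value
    return cmd


if __name__ == "__main__":
    cmd = build_preprocess_command(
        "data.jsonl",
        "processed_data",
        "HuggingFaceTokenizer",
        "/path/to/tokenizer.model",
    )
    # Print the command in copy-pasteable form; to execute it instead,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.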
## Next Steps
- Explore [Parallelism Strategies](../user-guide/parallelism-guide.md) to scale your training
- Learn about [Data Preparation](../user-guide/data-preparation.md) best practices
- Check out [Advanced Features](../user-guide/features/index.md) for additional capabilities