Apply for a GPU community grant: Academic project
Hi Hugging Face Team,
I am requesting an L40S GPU grant for the Space turtle170/Gemma-Train to perform high-rank LoRA fine-tuning of the new unsloth/gemma-3-4b-pt-unsloth-bnb-4bit model. (The Unsloth 4-bit edition of Gemma 3 reduces VRAM usage while remaining fully compatible with Unsloth's training stack.)
Project Focus: I am researching the effectiveness of high-rank Parameter-Efficient Fine-Tuning (PEFT) on multimodal architectures to improve logical reasoning and chain-of-thought capabilities in small-parameter (4B) models. I will be using the turtle170/Gemma-3-4B-Reasoning dataset, derived from OpenR1.
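For context on how the reasoning traces are serialized (the training config below uses the ChatML chat template), here is a minimal formatting sketch. The prompt/response fields and the example content are illustrative placeholders, not the actual schema of turtle170/Gemma-3-4B-Reasoning:

```python
# Minimal ChatML formatting sketch. The field names and example text are
# illustrative assumptions, NOT the dataset's actual schema.
def to_chatml(prompt: str, response: str) -> str:
    # ChatML wraps each turn in <|im_start|>{role} ... <|im_end|> markers.
    return (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

example = to_chatml(
    "What is 17 * 24?",
    "Break it down: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
)
print(example)
```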
Technical Justification: While a 24GB card can load the model, the L40S with 48GB VRAM is critical for this project due to:
High-Rank LoRA (Rank 256): Using a rank of 256 significantly increases the number of trainable parameters and optimizer states compared to standard LoRA (rank 8/16), requiring the expanded memory headroom of the L40S.
Sequence Length: To capture complex math reasoning traces, I am using an 8192-token block size, which increases the activation memory footprint; the first row of my dataset alone contains ~4,400 tokens.
Precision & Speed: Utilizing BF16 mixed-precision (aligning with Gemma 3's native training) and Flash Attention 2 for efficient kernel execution.
Optimization: Leveraging Unsloth kernels and AdamW Torch to maximize throughput.
The 48GB VRAM ensures I can maintain a sufficient batch size to stabilize gradients without encountering OOM errors during the multimodal attention phase.
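As a rough back-of-envelope check on the rank-256 claim, the extra memory from high-rank adapters and their AdamW optimizer states can be sketched as follows. The 4096×4096 layer shape is an illustrative assumption for a single projection, not Gemma 3 4B's exact architecture:

```python
# Back-of-envelope LoRA memory estimate. The layer shape below is an
# illustrative assumption, NOT Gemma 3 4B's exact architecture.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds A (r x d_in) and B (d_out x r) to each adapted linear layer.
    return r * (d_in + d_out)

# One hypothetical 4096x4096 projection at rank 256 vs rank 16:
p_256 = lora_params(4096, 4096, 256)  # 2,097,152 trainable params
p_16 = lora_params(4096, 4096, 16)    #   131,072 trainable params

# AdamW keeps two fp32 moment tensors per parameter, plus the fp32
# weight and gradient, so each trainable param costs roughly
# 4 (weight) + 4 (grad) + 8 (moments) = 16 bytes.
bytes_per_param = 16
print(f"rank 256: {p_256 * bytes_per_param / 2**20:.1f} MiB per layer")
print(f"rank  16: {p_16 * bytes_per_param / 2**20:.1f} MiB per layer")
```

At rank 256, each adapted layer carries 16x the trainable parameters (and optimizer state) of a rank-16 setup, which compounds across every linear layer when `target_modules` is `all-linear`.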
Community Commitment: I intend to share the final model weights and training logs on the Hub to give the community a reference point for fine-tuning the Gemma 3 architecture on reasoning tasks. I will release the GPU back to the pool as soon as training is complete.
Thank you for supporting community-led open research!
configured requirements.txt:
unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git
unsloth_zoo
transformers>=4.50.0
trl>=0.13.0
peft>=0.14.0
accelerate>=1.2.0
bitsandbytes>=0.45.0
flash-attn  # note: --no-build-isolation is a pip CLI flag, not a valid requirements.txt option; install separately with: pip install flash-attn --no-build-isolation
torchao
cut-cross-entropy
sentencepiece
protobuf
psutil
Training JSON:
{
  "auto_find_batch_size": false,
  "chat_template": "chatml",
  "disable_gradient_checkpointing": false,
  "distributed_backend": "deepspeed",
  "eval_strategy": "epoch",
  "merge_adapter": true,
  "mixed_precision": "bf16",
  "optimizer": "adamw_torch",
  "peft": true,
  "padding": "right",
  "quantization": "int4",
  "scheduler": "cosine_warmup",
  "unsloth": true,
  "use_flash_attention_2": true,
  "batch_size": 4,
  "block_size": 8192,
  "epochs": 1,
  "gradient_accumulation": 8,
  "lr": 0.00005,
  "logging_steps": 1,
  "lora_alpha": 256,
  "lora_dropout": 0,
  "lora_r": 256,
  "max_grad_norm": 1,
  "model_max_length": 8192,
  "save_total_limit": 1,
  "seed": 42,
  "warmup_ratio": 0.1,
  "weight_decay": 0.01,
  "target_modules": "all-linear"
}
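A couple of quantities implied by the config above, worked out as plain arithmetic on its values (nothing model-specific is assumed):

```python
import json

# The relevant fields from the training config above.
config = json.loads("""{
  "batch_size": 4,
  "gradient_accumulation": 8,
  "lora_r": 256,
  "lora_alpha": 256
}""")

# Effective batch size = per-device batch x gradient-accumulation steps.
effective_batch = config["batch_size"] * config["gradient_accumulation"]
print(f"effective batch size: {effective_batch}")  # 32

# With lora_alpha == lora_r, the LoRA scaling factor alpha / r is 1.0,
# so adapter updates are applied at full strength.
scaling = config["lora_alpha"] / config["lora_r"]
print(f"LoRA scaling (alpha / r): {scaling}")  # 1.0
```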
(If the logs show the message "The NVIDIA Driver was not detected. GPU functionality will not be available.", that is only because I have already adjusted requirements.txt for the L40S.)
Currently waiting for GPU.
(By the way, I'm in SG, so the time for me is UTC+8, meaning I am 13 hours ahead of New York.)
@danielhanchen sorry for the Friday afternoon ping! Pushing an unsloth/gemma-3-4b-pt-unsloth-bnb-4bit reasoning research project with a rank-256 LoRA and need a small nudge on the hardware.
After a nightmare trying to run this on dual T4s (the 2018 Turing architecture just couldn't keep up with these reasoning traces), I’ve moved the entire pipeline to Unsloth. I want to document the massive speedup Unsloth provides for high-rank SFT on the new Gemma 3 architecture.
Environment is 100% prepped with unsloth[cu121-torch250]. Just waiting on an L40S grant from @hysts to start. Would love to get an Unsloth expert's eyes on this case study! 🦥