---
library_name: peft
license: bigcode-openrail-m
base_model: bigcode/starcoder2-3b
tags:
- generated_from_trainer
datasets:
- code_search_net
model-index:
- name: codex-finetune
  results: []
---

# codex-finetune

This model is a fine-tuned version of [bigcode/starcoder2-3b](https://huggingface.co/bigcode/starcoder2-3b) on the code_search_net dataset.

## Model description

This model is designed to serve as an intelligent coding copilot: it generates code, explains functions, refactors logic, and completes partial implementations.

## 🚀 Features

- **Multi-task formatting**: instruction-tuned samples covering code generation, docstring generation, function completion, and code improvement.
- **Efficient LoRA training** using `PEFT` and `transformers`.
- Token-level preprocessing with Hugging Face's tokenizer and trainer utilities.
- Training tracked via Weights & Biases (W&B).
- Dataset sampling and tokenization kept memory-efficient.
- Ready for inference integration and API deployment.

## Dataset

- **Source**: `code_search_net` (Python split)
- **Fields used**: `func_code_string`, `func_documentation_string`
- **Size after sampling**:
  - Train: 1000 samples
  - Validation: 200 samples
  - Test: 200 samples

## Format: Multi-Task Examples

Examples were formatted into prompts like:

```
Instruction: Write a function for this description: "Calculate factorial recursively."
Response:
def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)
```

## Model

- **Base**: `bigcode/starcoder2-3b`
- **PEFT config**:
  - `r=8`, `lora_alpha=16`
  - `target_modules=["q_proj", "v_proj"]`
  - `dropout=0.05`, `bias="none"`
- **Training config**:
  - `per_device_train_batch_size=4`
  - `num_train_epochs=3`
  - `learning_rate=2e-4`
  - `save_steps=100`
  - `logging_dir=./logs`

## Dependencies

```bash
pip install transformers peft datasets accelerate wandb
pip install bitsandbytes
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 3
- mixed_precision_training: Native AMP

### Training results

Training ran for 3000 steps (3 epochs) in roughly 1:11:12.

| Step | Training Loss |
|------|---------------|
| 500  | 1.700700 |
| 1000 | 1.305100 |
| 1500 | 1.234500 |
| 2000 | 1.229400 |
| 2500 | 1.185200 |
| 3000 | 1.203400 |

### Framework versions

- PEFT 0.15.2
- Transformers 4.52.4
- Pytorch 2.6.0+cu124
- Datasets 3.6.0
- Tokenizers 0.21.1

### Run Locally

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Requires `peft` to be installed so the LoRA adapter loads on top of the base model.
tokenizer = AutoTokenizer.from_pretrained("khushimalik53/codex-finetune")
model = AutoModelForCausalLM.from_pretrained("khushimalik53/codex-finetune")

prompt = (
    "### Instruction:\nExplain what this function does:\n"
    "def reverse_string(s): return s[::-1]\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
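
Since the repository stores a LoRA adapter rather than full model weights (note `library_name: peft` in the metadata), the adapter can also be attached to the base model explicitly. A minimal sketch using the `peft` API, assuming the adapter lives at `khushimalik53/codex-finetune`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the frozen base model, then attach the fine-tuned LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b", device_map="auto")
model = PeftModel.from_pretrained(base, "khushimalik53/codex-finetune")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")

# Optionally fold the low-rank updates into the base weights,
# removing adapter overhead at inference time.
model = model.merge_and_unload()
```

Merging is optional; keep the adapter separate if you plan to swap or stack adapters later.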
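
The `bitsandbytes` dependency listed above suggests quantized loading. If GPU memory is tight, the base model can be loaded in 4-bit before attaching the adapter; the settings below are assumptions, as this card does not record an exact quantization config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# NF4 with bfloat16 compute is a common default, not a recorded setting.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "khushimalik53/codex-finetune")
```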
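
For anyone reproducing the fine-tune, the PEFT configuration listed under **Model** maps onto the following `peft` setup. This is a sketch; the `task_type` value is an assumption not recorded in this card:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",  # assumed; standard for decoder-only models
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA trains
```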