Update README.md
README.md CHANGED
@@ -1,9 +1,3 @@
----
-license: mit
-tags:
-- unsloth
----
-
 # GitVac
 Don't forget to vacuum your git repo.
 
@@ -16,7 +10,7 @@ GitVac is like a vacuum cleaner for code fixes. It's a series of 3B, 8B, 14B, an
 
 
 # How were the models made?
-I distilled samples from r1
+I distilled samples from r1 through multiple rounds of trial and error. About 2.4k questions were fired off, with 1.1k making the verification cut; my rough estimate puts the pass rate at around 45%.
 
 # How is verification done?
 A lot of models are already trained on function calling syntax.
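A minimal sketch of that distill-and-verify loop, assuming the teacher call and the verifier are passed in as callables; `ask_r1` and `verify_tool_calls` are hypothetical names, not the actual script:

```python
def build_dataset(problems, ask_r1, verify_tool_calls, max_rounds=3):
    """Distill samples from the teacher and keep only the ones that verify.

    ask_r1 and verify_tool_calls are hypothetical callables standing in for
    the r1 request and the function-call verifier described above.
    """
    accepted = []
    for problem in problems:
        for _ in range(max_rounds):       # multiple rounds of trial and error
            sample = ask_r1(problem)      # reasoning + tool calls from the teacher
            if verify_tool_calls(problem, sample):
                accepted.append(sample)   # only verified samples make the cut
                break
    # roughly 1.1k of 2.4k survived this filter (~45%)
    return accepted
```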
@@ -342,13 +336,14 @@ These models create pre-made actions that are higher quality than the turbo mode
 # Benchmarks
 I started with 2,400 patches/issues.
 
-- Only 1,100 problems could be solved by
+- Only 1,100 problems could be solved by r1
 - Each problem was attempted up to 3 times. earlier scripts were doing up to 10.
 - The remaining 1,300 problems were ones that these top models failed to solve even after 9,000 total attempts
 
 To evaluate GitVac models, I randomly selected a sample size from the unfinished dataset.
 
 ## Performance Results
+Tested against the remaining 1,300 problems that r1 could not pass. These datasets were never seen in the models' training.
 
 | Model | Success Rate | Notes |
 |-------|--------------|-------|
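A rough sketch of the attempt budget those bullets describe: each problem gets up to three tries, and it counts as solved if any try verifies. `run_model` and `verify` are hypothetical stand-ins, not the actual harness:

```python
def success_rate(problems, run_model, verify, max_attempts=3):
    """Score a model as described above: up to 3 attempts per problem,
    counting the problem as solved if any attempt passes verification."""
    solved = sum(
        any(verify(p, run_model(p)) for _ in range(max_attempts))
        for p in problems
    )
    return solved / len(problems)
```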
@@ -363,7 +358,7 @@ Start by gathering your patches and extracting all the necessary components - st
 
 Combine the tool calls into a list and shuffle them randomly. This randomization turned out to be a crucial factor in improving dataset quality.
 
-Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect.
+Initially, I presented the function calls in a fixed order (writes, reads, deletes). The models would blindly follow this pattern - making changes before reading files, which makes no logical sense. Simply instructing them to do otherwise in the prompt had no effect.
 
 A breakthrough came when I randomized the function calls. This seemed to break the models out of their rigid patterns and activate more natural problem-solving behaviors. They started properly reading files before modifying them and demonstrated more realistic roleplay capabilities.
 
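A minimal sketch of that shuffling step, assuming tool calls are kept as a plain list of dicts (the actual schema isn't shown in this hunk):

```python
import random

# Hypothetical tool calls extracted from one patch, grouped by type:
# the fixed write/read/delete order the models used to copy blindly.
tool_calls = (
    [{"name": "write_file", "path": p} for p in ("fix.py", "tests/test_fix.py")]
    + [{"name": "read_file", "path": p} for p in ("fix.py", "README.md")]
    + [{"name": "delete_file", "path": "legacy.py"}]
)

random.shuffle(tool_calls)  # break the fixed ordering before building the sample
```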
@@ -529,7 +524,7 @@ Think before you respond.
 <br>
 
 # Cost & Details
-The total cost for this project was approximately $400, with the majority spent on
+The total cost for this project was approximately $400, with the majority spent on inference.
 I used an automated script to handle the full training pipeline - from finetuning through evaluation across all model sizes up to 32B parameters. The hardware setup included an A100 80GB GPU and a rented H200 140GB+ GPU from RunPod.
 Training times varied from 1.5 hours for smaller models up to 8 hours for the largest ones. All reasoning models went through 3 epochs of training.
 
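A hedged outline of what such an automated pipeline might look like; the size list and the `finetune`/`evaluate` helpers are hypothetical placeholders, not the actual script:

```python
MODEL_SIZES = ["3B", "8B", "14B", "32B"]

def run_pipeline(finetune, evaluate, epochs=3):
    """Finetune then evaluate every model size in one pass.

    finetune and evaluate are hypothetical callables; training each size took
    roughly 1.5 to 8 hours on the A100/H200 setup described above."""
    results = {}
    for size in MODEL_SIZES:
        checkpoint = finetune(size, epochs=epochs)
        results[size] = evaluate(checkpoint)
    return results
```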