Update README.md
---
license: apache-2.0
---

# Welcome to my Computer Science Capstone Project!

This is the code for the training pipeline used during my multi-year Computer Science Capstone Project. It is a fine-tune of the most recent Command R model, trained with a custom Python training pipeline written from scratch.
My ultimate goal is to understand the process of training an LLM through the creation of an administrative assistant AI agent powered by my own custom model.

I started this project around the summer of my sophomore year in high school, when I was just getting around to studying the mechanics of LLMs. My school offers a CS capstone class where you are allowed to work on a computer science related project of your choice for the year. The class can be repeated in later years if taken prior to senior year, either to build a new project or to continue a previous one.

# Technical Approach

- Multi-task Training: Curated custom dataset batches across various administrative capabilities such as tool calling, summarization, and RAG
- Iterative Fine-tuning: Progressive training runs with a small learning rate to prevent catastrophic forgetting (learned this the hard way after losing 20 credits)
- Knowledge Preservation: Mixed subsets of previous datasets into each new run
- Quantization: 8-bit loading via BitsAndBytes for efficient training on Google Colab L4 GPUs (a sketch of the full recipe follows this list)

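The bullets above are really one recipe, so here is a minimal sketch of how the pieces might fit together in code. It is not the actual pipeline: it assumes a PEFT/LoRA adapter over the frozen 8-bit base (the README does not spell out the adapter setup), and the checkpoint name, dataset files, replay ratio, and hyperparameters are all placeholders.

```python
# Rough sketch of the recipe above, NOT the project's actual pipeline.
# Checkpoint name, dataset files, LoRA setup, and hyperparameters are
# illustrative assumptions.
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "CohereForAI/c4ai-command-r-08-2024"  # assumed "most recent Command R"

# Quantization: load the base model in 8-bit via BitsAndBytes
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 8-bit weights are frozen, so train a LoRA adapter on top (assumed detail)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Multi-task data for this run plus a replay slice of an earlier run's data
# (knowledge preservation); the file names are placeholders
new_data = load_dataset("json", data_files="tool_calling_round3.jsonl", split="train")
old_data = load_dataset("json", data_files="summarization_round2.jsonl", split="train")
replay = old_data.shuffle(seed=42).select(range(len(old_data) // 10))
train_data = concatenate_datasets([new_data, replay]).shuffle(seed=42)

def tokenize(example):
    # assumes each record carries a pre-formatted "text" field
    return tokenizer(example["text"], truncation=True, max_length=2048)

train_data = train_data.map(tokenize, remove_columns=train_data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="checkpoints",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,  # iterative fine-tuning: keep the learning rate small
        num_train_epochs=1,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
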
# Some Challenges

- On the very first training run I forgot I was working with a dictionary and assigned the variables wrong, so the model was trained on the literal strings "question" and "answer" over and over (a sketch of the bug follows this list)
- Trying to train on long chain-of-thought data while heavily truncating the text resulted in barely coherent checkpoints
- CUDA dependencies were a struggle that cost a great many hours and nearly caused me to give up on quantization entirely
- Money management. I originally used expensive H100 GPUs from cloud providers before settling on Colab
- Finding tutorials. Since the subject is so new, I couldn't find many tutorials aimed at younger students. Unsloth notebooks ended up being very useful.

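To make that first bug concrete, here is a reconstruction of the dictionary mix-up; the field names and strings are made up for illustration and are not the pipeline's actual code.

```python
# Reconstruction of the bug described above (illustrative, not the real code).
example = {"question": "What's on my calendar today?",
           "answer": "You have a staff meeting at 10am."}

# Buggy version: unpacking a dict iterates over its KEYS, so every training
# pair became the literal strings "question" and "answer".
prompt, target = example  # -> ("question", "answer")

# Fixed version: index into the dict to get the VALUES.
prompt, target = example["question"], example["answer"]
```
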
# Model Rationale

- I was originally going to try Mistral Small 3 24B, but it was too large and expensive
- Qwen models felt too stiff to me in testing, despite recommendations
- Cohere models are advertised as strong at tool calling, and they seemed good in practice
- I emailed Cohere to ask whether they were okay with me using the model for things that could theoretically help me make money, and they said it was fine
- This is still a research project first and foremost, so the non-commercial license wasn't really a dealbreaker for me.

# Current Goal?

- My current goal this senior year is phase 2 of the project: a custom agent, built on the smolagents framework, for the model to use in day-to-day life (a rough sketch follows below)
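As a preview, here is a minimal sketch of what a smolagents-based agent around the fine-tuned model could look like. The checkpoint path and the calendar tool are hypothetical placeholders, not the project's actual agent code.

```python
# Minimal smolagents sketch for phase 2; the model path and the tool are
# hypothetical placeholders.
from smolagents import ToolCallingAgent, TransformersModel, tool

@tool
def add_calendar_event(title: str, time: str) -> str:
    """Add an event to the user's calendar.

    Args:
        title: Short description of the event.
        time: When the event happens, e.g. "2025-05-01 10:00".
    """
    # a real tool would call an actual calendar API here
    return f"Added '{title}' at {time}."

# Point the agent at the fine-tuned checkpoint (path is a placeholder)
model = TransformersModel(model_id="./checkpoints/command-r-capstone", device_map="auto")
agent = ToolCallingAgent(tools=[add_calendar_event], model=model)

print(agent.run("Schedule a study session tomorrow at 4pm."))
```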