---
datasets:
- nvidia/OpenCodeReasoning
- future-technologies/Universal-Transformers-Dataset
metrics:
- bleu
---
# AI FixCode: A Code Repair Model 🛠️
AI FixCode is a Transformer-based model designed to automatically identify and correct errors in source code. Built upon the powerful CodeT5 architecture, it is trained on a diverse dataset of real-world buggy and fixed code pairs to address both syntactic and semantic issues.
## 🚀 Key Features

- **Base Model:** Salesforce/codet5p-220m
- **Architecture:** Encoder-Decoder (Seq2Seq)
- **Target Languages:** Primarily Python, with future plans to expand to other languages like JavaScript and Go.
- **Task:** Code repair and error correction.
## 🔧 How to Use

Simply provide a faulty code snippet, and the model will return a corrected version. It's intended for use in code editors, IDEs, or automated pipelines to assist developers in debugging.
Example:

```python
# Input:
def add(x, y)
    return x + y

# Output:
def add(x, y):
    return x + y
```
## 🧠 Under the Hood

The model operates as a sequence-to-sequence system. During training, it learns to map a sequence of buggy code tokens to a sequence of correct code tokens. This approach allows it to "reason" about the necessary changes at a granular level, effectively predicting the patches needed to fix the code.
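For illustration, here is a minimal sketch of how one buggy/fixed pair could be tokenized for seq2seq training with the CodeT5 tokenizer. This is not the project's actual training pipeline; the maximum sequence length is an assumption:

```python
from transformers import AutoTokenizer

# Minimal sketch (not the project's actual training code): tokenize one
# buggy/fixed pair for seq2seq training. max_length=512 is an assumption.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")

buggy = "def add(x, y)\n    return x + y"
fixed = "def add(x, y):\n    return x + y"

inputs = tokenizer(buggy, truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer(fixed, truncation=True, max_length=512, return_tensors="pt").input_ids

# During training, the decoder is supervised with `labels`: the model
# minimizes cross-entropy between its predictions and the fixed tokens,
# learning the buggy -> fixed mapping token by token.
```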
## 🚀 Getting Started with Inference

You can easily use the model with the Hugging Face `transformers` library.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "khulnasoft/aifixcode-model"  # Replace with your model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A buggy snippet: missing closing parenthesis
input_code = "def foo(x):\n    print(x"
inputs = tokenizer(input_code, return_tensors="pt")

# Generate the corrected code
outputs = model.generate(**inputs, max_length=512)
corrected_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(corrected_code)
```
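Since the model card lists BLEU under `metrics`, one way to sanity-check a generated fix against a reference is the `evaluate` library. This is a hedged sketch continuing from the snippet above; the reference string is made up for illustration:

```python
import evaluate

# Sketch only: score the generated fix against a hand-written reference.
# The reference below is a hypothetical ground truth for the input above.
bleu = evaluate.load("bleu")
reference = "def foo(x):\n    print(x)"
result = bleu.compute(predictions=[corrected_code], references=[[reference]])
print(result["bleu"])
```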
## 📊 Dataset and Format

The model was trained on a custom dataset following a simple format:
```json
[
  {
    "input": "def add(x, y)\n    return x + y",
    "output": "def add(x, y):\n    return x + y"
  },
  {
    "input": "for i in range(10):\n    if i == 5\n        print(i)",
    "output": "for i in range(10):\n    if i == 5:\n        print(i)"
  }
]
```
This format allows the model to learn the direct mapping between erroneous and corrected code.
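If you keep pairs in a JSON file with this schema, they can be loaded with the Hugging Face `datasets` library. A minimal sketch, where the file name `pairs.json` is an assumption:

```python
from datasets import load_dataset

# Sketch, assuming the pairs are saved as "pairs.json" in the format above.
ds = load_dataset("json", data_files="pairs.json", split="train")
print(ds[0]["input"])   # buggy snippet
print(ds[0]["output"])  # fixed snippet
```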
## 🛡️ License and Acknowledgements

- **License:** MIT License
- **Acknowledgements:** This project was made possible by the Hugging Face team's Transformers library and Salesforce's CodeT5 model.