---
base_model: Salesforce/codegen-350M-mono
library_name: peft
license: mit
datasets:
- google/code_x_glue_ct_code_to_text
language:
- en
pipeline_tag: text-generation
---

# CodeGen-ft-python

Generates Python code from natural language prompts.

## Model Details

### Model Description

This model is a fine-tuned variant of Salesforce/codegen-350M-mono, specialized for natural-language-to-code generation in Python. It takes natural language instructions (e.g., “check MySQL database connection”) and generates the corresponding Python code snippet. The model was trained on a curated text-to-code dataset containing diverse programming instructions and function-level examples to improve semantic and syntactic accuracy.

- **Developed by:** Akshay Bharadwaj
- **Model type:** Transformer-based causal language model
- **Language(s) (NLP):** English (prompts) and Python (code outputs)
- **License:** MIT
- **Finetuned from model:** Salesforce/codegen-350M-mono

## Uses

### Direct Use

The model can be used for:

* Translating natural language prompts into functional Python code.
* Assisting in code autocompletion or boilerplate generation.
* Supporting educational and prototyping environments.

### Downstream Use

The model can be integrated into:

* Developer tools (IDE plugins or assistants).
* Chatbots for code assistance or educational coding tutors.
* LLM pipelines for multi-step reasoning or coding workflows.

### Out-of-Scope Use

* Generating production-level code without human review.
* Security-critical or real-time applications (e.g., code execution automation).
* Generation of malicious or unsafe code.

## Bias, Risks, and Limitations

* The model may produce incomplete or syntactically incorrect code for ambiguous prompts.
* It can misinterpret vague natural language queries (semantic drift).
* It has a potential bias toward common Python idioms and limited handling of rare libraries or APIs.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "akshayb/nl-code-gen-python"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
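
Because this card lists `library_name: peft`, the repository may ship a LoRA adapter rather than full model weights. If loading the repo directly with `AutoModelForCausalLM` does not pick up the adapter in your environment, a minimal sketch that attaches it explicitly with `peft.PeftModel` (assuming the repo contains an adapter config for the base checkpoint named above) is:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Salesforce/codegen-350M-mono"   # base checkpoint listed in this card
adapter_id = "akshayb/nl-code-gen-python"  # fine-tuned adapter repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the PEFT adapter weights on top of the frozen base model.
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = "write a python function to check mysql database connection"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```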

## Training Details

### Training Data

The dataset contains paired natural language descriptions and Python function implementations, collected and cleaned from public code repositories and text-to-code benchmarks (e.g., CodeXGLUE). Preprocessing involved deduplication, tokenization, and removal of incomplete code samples.
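
For reference, a minimal sketch of loading the Python split of the CodeXGLUE code-to-text dataset listed in this card's metadata (the filtering step is illustrative, not the exact preprocessing used for training):

```python
from datasets import load_dataset

# Python subset of CodeXGLUE code-to-text: pairs of code and docstrings.
ds = load_dataset("google/code_x_glue_ct_code_to_text", "python", split="train")

# Illustrative cleanup: drop examples with empty code bodies or descriptions.
ds = ds.filter(lambda ex: ex["code"].strip() and ex["docstring"].strip())

print(ds[0]["docstring"])  # natural language description
print(ds[0]["code"])       # paired Python implementation
```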

## Evaluation

### Metrics

To compare the base model with the fine-tuned model, we use the following metrics:

| Metric | Focus | Strength |
| --- | --- | --- |
| **BLEU** | Token-level similarity | Measures fluency and lexical accuracy |
| **CodeBLEU** | Lexical + syntactic + semantic | Captures holistic code quality |
| **Exact Match** | String equality | Strict correctness measure |
| **Syntax Match** | AST structure | Validates syntactic and logical integrity |
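
As an illustration (not the exact evaluation harness used for this card), BLEU and exact match can be computed with the `evaluate` library; CodeBLEU and syntax match require the CodeXGLUE evaluation scripts and are not shown here:

```python
import evaluate

# Toy prediction/reference pair; replace with model outputs and gold code.
predictions = ["def add(a, b):\n    return a + b"]
references = [["def add(a, b):\n    return a + b"]]

bleu = evaluate.load("bleu")
exact_match = evaluate.load("exact_match")

print(bleu.compute(predictions=predictions, references=references)["bleu"])
print(exact_match.compute(
    predictions=predictions,
    references=[r[0] for r in references],  # exact_match takes one reference each
)["exact_match"])
```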

## Citation

**BibTeX:**

```bibtex
@misc{akshay2025nlcodegen,
  title={Natural Language to Code Generation (Fine-tuned CodeGen-350M)},
  author={Akshay Bharadwaj},
  year={2025},
  howpublished={\url{https://huggingface.co/akshayb/nl-code-gen-python}}
}
```

### Framework versions

- PEFT 0.7.2.dev0