---
library_name: peft
base_model: google/codegemma-2b
---

Model Card for DocuMint

This model is a fine-tuned version of the CodeGemma-2B base model that generates high-quality docstrings for Python functions.

Model Details

Model Description

The DocuMint model is a fine-tuned variant of Google's CodeGemma-2B base model, which was originally trained to predict the next token on internet text without any instructions. DocuMint has been fine-tuned with supervised instruction fine-tuning on a dataset of 100,000 Python functions and their respective docstrings, extracted from the free/libre and open-source software (FLOSS) ecosystem. The fine-tuning was performed using the Low-Rank Adaptation (LoRA) technique.

The goal of the DocuMint model is to generate docstrings that are concise (brief and to the point), complete (cover functionality, parameters, return values, and exceptions), and clear (use simple language and avoid ambiguity).

  • Developed by: Bibek Poudel, Adam Cook, Sekou Traore, Shelah Ameli (University of Tennessee, Knoxville)
  • Model type: Causal language model fine-tuned for code documentation generation
  • Language(s) (NLP): English, Python
  • License: MIT
  • Finetuned from model: google/codegemma-2b

Model Sources

  • Repository: TODO
  • Paper: DocuMint: Docstring Generation for Python using Small Language Models (link TODO)

Uses

Direct Use

The DocuMint model can be used directly to generate high-quality docstrings for Python functions. Given a Python function definition, the model outputs a docstring enclosed in triple quotes (`"""..."""`).
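A minimal usage sketch with `transformers` and `peft` is shown below. The prompt template and the adapter repository id are assumptions for illustration, not confirmed by this card; substitute the actual values before running.

```python
# Sketch of direct use. The prompt template and the adapter repo id
# below are ASSUMPTIONS for illustration, not confirmed by this card.
RUN_GENERATION = False  # set True with a CUDA GPU and the adapter weights available

def build_prompt(function_source: str) -> str:
    """Wrap a Python function in a (hypothetical) instruction prompt."""
    return (
        "Generate a docstring for the following Python function:\n\n"
        f"{function_source}\n\nDocstring:\n"
    )

if RUN_GENERATION:
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/codegemma-2b")
    base = AutoModelForCausalLM.from_pretrained(
        "google/codegemma-2b", torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Hypothetical adapter repo id -- replace with the actual one.
    model = PeftModel.from_pretrained(base, "documint/documint-adapter")

    prompt = build_prompt("def add(a, b):\n    return a + b")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True))
```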

Training Details

Training Data

The training data consists of 100,000 Python functions and their docstrings extracted from popular open-source repositories in the FLOSS ecosystem. Repositories were filtered based on metrics such as number of contributors (> 50), commits (> 5k), stars (> 35k), and forks (> 10k) to focus on well-established and actively maintained projects.
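The filtering criteria above can be sketched as a simple predicate. The dictionary key names here are illustrative (e.g. as they might come back from a repository-hosting API), not the authors' actual pipeline:

```python
def is_well_established(repo: dict) -> bool:
    """Apply the card's thresholds: >50 contributors, >5k commits,
    >35k stars, >10k forks. Key names are assumptions for illustration."""
    return (
        repo.get("contributors", 0) > 50
        and repo.get("commits", 0) > 5_000
        and repo.get("stars", 0) > 35_000
        and repo.get("forks", 0) > 10_000
    )
```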

An abstract syntax tree (AST) based parser was used to extract functions and docstrings. Challenges in the data sampling process included syntactic errors, multi-language repositories, computational expense, repository size discrepancies, and ensuring diversity while avoiding repetition.
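A minimal version of such an AST-based extractor can be written with Python's standard `ast` module. This is a sketch of the approach, not the authors' actual parser; note how it sidesteps the syntactic-error challenge by skipping unparseable files:

```python
import ast

def extract_functions_with_docstrings(source: str) -> list[tuple[str, str]]:
    """Return (function_name, docstring) pairs for every documented
    function in a Python source file; skip files with syntax errors."""
    try:
        tree = ast.parse(source)
    except SyntaxError:  # one of the sampling challenges noted above
        return []
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only documented functions
                pairs.append((node.name, doc))
    return pairs
```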

Training Hyperparameters

The model was fine-tuned using Low-Rank Adaptation (LoRA) for 4 epochs with a batch size of 8 and gradient accumulation steps of 16. The initial learning rate was 2e-4. In total, there were 78,446,592 LoRA parameters and 185,040,896 training tokens. The full hyperparameter configuration is provided in Table 2 of the paper.
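With a per-device batch size of 8 and 16 gradient-accumulation steps, the effective batch size works out to 128 sequences per optimizer step. A sketch of the configuration as stated in this card (the LoRA rank and alpha are placeholders, since this card defers them to Table 2 of the paper):

```python
# Hyperparameters from this card; lora_r and lora_alpha are PLACEHOLDERS
# (not stated here -- see Table 2 of the paper for the real values).
lora_finetune_config = {
    "epochs": 4,
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "lora_r": 16,       # placeholder
    "lora_alpha": 32,   # placeholder
}

# 8 sequences per step * 16 accumulation steps = 128 per optimizer update
effective_batch_size = (
    lora_finetune_config["per_device_batch_size"]
    * lora_finetune_config["gradient_accumulation_steps"]
)
```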

Fine-tuning was performed using an Intel 12900K CPU, an Nvidia RTX-3090 GPU, and 64 GB RAM. Total fine-tuning time was 48 GPU hours.

Evaluation

Testing Data, Factors & Metrics

Metrics

  • Accuracy: Measures how completely the generated docstring covers code elements such as input/output variables. Calculated using cosine similarity between the generated and expert docstring embeddings.

  • Conciseness: Measures the ability to convey information succinctly without verbosity. Calculated as a compression ratio between the compressed and original docstring sizes.

  • Clarity: Measures readability using simple, unambiguous language. Calculated using the Flesch-Kincaid readability score.
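The accuracy and conciseness metrics above can be sketched with the standard library. The embedding vectors would come from a separate embedding model (here they are plain lists), and `zlib` is an illustrative choice of compressor, not necessarily the one used in the paper:

```python
import math
import zlib

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Accuracy proxy: cosine similarity of two docstring embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def compression_ratio(docstring: str) -> float:
    """Conciseness proxy: compressed size over original size.
    Lower values indicate more redundancy; zlib is an assumption."""
    raw = docstring.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)
```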

Hardware

Single Nvidia RTX-3090 GPU

Citation

BibTeX:

(TODO)

@misc{poudel2024documint,
      title={DocuMint: Docstring Generation for Python using Small Language Models}, 
      author={Bibek Poudel* and Adam Cook* and Sekou Traore* and Shelah Ameli*},
      year={2024},
}


Model Card Contact

  • For questions or more information, please contact: {bpoudel3,acook46,staore1,oameli}@vols.utk.edu