---
license: mit
language:
- en
pipeline_tag: summarization
widget:
- text: "test = pd.read_csv('../input/test.csv')\ntrain = pd.read_csv('../input/train.csv')\nX_train=train.iloc[:, 1:].values\ny_train=train.iloc[:, 0].values\nX_test = test.values"
---

# Model Overview

This model performs abstractive summarization of Python data science code into English natural language. It was finetuned from [google/flan-t5-small]() on a subset of [Meta Kaggle For Code]() labeled by a 43B model.

# Model Architecture

This model was finetuned from [google/flan-t5-small]() and shares its architecture and tokenizer.

# Training

Code cells were extracted from Jupyter notebooks, chunked into segments of roughly 500 tokens, and labeled by a 43B model with the prompt:

> "Think step by step and then provide a two or three sentence summary of what the code is doing for an audience who may not be familiar with machine learning. Focus on the problem the authors are trying to solve."

## Datasets

All code was extracted from .ipynb files that are part of the [Meta Kaggle for Code]() dataset.

## Tokenizer Construction

The tokenizer was not modified from the standard [google/flan-t5-small]() tokenizer.

# How to Use this Model

The model is available through the `transformers` library and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

## Generating summaries with this model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = ...  # the repo id of this model

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

ipynb_string = "import pandas as pd\nimport numpy as np"

chunk_ids = tokenizer.encode(
    "summarize: ```" + ipynb_string + "```",
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
)
output_tokens = model.generate(chunk_ids, max_length=128)
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
```

## Input

This model accepts inputs of up to 512 tokens from the associated tokenizer.
Preface input data with `summarize: ` and wrap it in a markdown code block with triple backticks (```` ``` ````).

## Output

This model produces short natural-language summaries of Python data science code.

# Limitations

The Flan-T5-Small architecture was chosen to maximize portability, but summaries may sometimes be repetitive, incomplete, or too abstract. Remember that the model was finetuned on Kaggle notebooks and will perform better for code from that distribution.
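As a recap of the input format described above, the prompt construction can be sketched as a small helper; the function name `build_model_input` is illustrative, not part of any library:

```python
def build_model_input(code: str) -> str:
    """Format raw code for this model: prefix with the
    `summarize: ` instruction and wrap the code in a
    markdown code block, as the model expects."""
    return "summarize: ```" + code + "```"

# Example: format a short snippet before tokenization.
prompt = build_model_input("import pandas as pd\nimport numpy as np")
```

The resulting string is what should be passed to the tokenizer (with `truncation=True` and `max_length=512`, since longer inputs are cut off).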