---
metrics:
- accuracy: 85%
license: apache-2.0
datasets:
- arhansd1/Csv-CommandToCode
base_model:
- Salesforce/codet5-base
---

# Model Card for Csv-AI-Cleaner-V3

**Transforms English instructions, paired with data context, into executable pandas code.**

### Model Description

Csv-AI-Cleaner converts **natural language instructions** into **pandas code** for data cleaning, filtering, grouping, sorting, merging, and more. It was fine-tuned on a mix of synthetic and real-world datasets, using CodeT5 with LoRA for parameter-efficient training.

- **Developed by:** ArhanSD1
- **Model type:** Seq2Seq Transformer (CodeT5)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Salesforce/codet5-base

### Model Sources

- **Repository:** https://huggingface.co/arhansd1/Csv-AI-Cleaner-V3

### Direct Use

- Input: context (a sample of the dataset) plus an instruction in natural language
- Output: an executable pandas code snippet

Example:

```
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
```

Output:

```python
df[(df['department'] == 'IT') & (df['salary'] > 45000)]
```

### Out-of-Scope Use

- Ambiguous or poorly defined instructions without dataset context
- Complex multi-step pipelines exceeding the ~500-token context limit

## Bias, Risks, and Limitations

- Works best with clean, descriptive column names
- May generate suboptimal code if the context is incomplete or noisy
- No awareness of business-logic correctness; the model learns only syntax and patterns

### Recommendations

Users should verify generated code for correctness and safety before executing it.
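As a lightweight first line of defense (a sketch, not part of the model's released tooling), a generated snippet can at least be syntax-checked with Python's standard `ast` module before it is executed; `is_valid_python` below is a hypothetical helper name. Note this only catches syntax errors, not semantic or safety issues.

```python
import ast

def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses as Python source code."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# A well-formed snippet passes the check...
print(is_valid_python("df[(df['department'] == 'IT') & (df['salary'] > 45000)]"))  # True
# ...while a malformed one is caught before it ever runs.
print(is_valid_python("df[(df['department'] == 'IT' &"))  # False
```

Parsing does not run the code, so this check is safe to apply even to untrusted output; deeper review (or a sandboxed run) is still needed before using the result on real data.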
## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)

def generate_code(input_text):
    # The model was trained with this task prefix.
    prefixed_input = "Generate pandas code: " + input_text
    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=128,
            num_beams=5,  # deterministic beam search (temperature has no effect without sampling)
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
input_example = """
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
"""
print(generate_code(input_example))
```

### Training Data

- Combination of synthetic data-cleaning instructions and column contexts from public datasets
- Augmented with filtered Stack Overflow code snippets for pandas tasks

#### Preprocessing

- Context section normalized to a consistent table format
- Instruction phrasing normalized to the imperative form

#### Training Hyperparameters

- LoRA fine-tuning
- Learning rate: 5e-5
- Epochs: 3
- Batch size: 8
- Precision: fp16 mixed precision

### Testing Data

- Held-out set of 500 natural language → pandas task pairs

### Metrics

| Metric          | Score |
|-----------------|-------|
| Exact Match     | 71%   |
| Partial Match   | 92%   |
| Syntax Accuracy | 100%  |

### Results Summary

High syntax accuracy and a strong partial-match rate, with a somewhat lower exact-match rate on multi-condition or chained operations.
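The evaluation scripts are not published, so the sketch below only illustrates one plausible way to compute two of the metrics above: exact match as a whitespace-normalized string comparison, and syntax accuracy as the fraction of predictions that parse with `ast.parse`. The function names and the exact-match definition are assumptions, not the card author's actual scoring code.

```python
import ast

def exact_match(pred: str, gold: str) -> bool:
    # Whitespace-insensitive string equality (one plausible definition).
    return " ".join(pred.split()) == " ".join(gold.split())

def syntax_ok(pred: str) -> bool:
    # Counts as syntactically correct if the snippet parses as Python.
    try:
        ast.parse(pred)
        return True
    except SyntaxError:
        return False

def score(pairs):
    """Return (exact-match rate, syntax-accuracy rate) over (pred, gold) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, g) for p, g in pairs) / n
    syn = sum(syntax_ok(p) for p, _ in pairs) / n
    return em, syn

pairs = [
    ("df[df['salary'] > 45000]", "df[df['salary'] > 45000]"),
    ("df.sort_values('salary')", "df.sort_values(by='salary')"),
]
print(score(pairs))  # (0.5, 1.0)
```

The second pair shows why partial-match scoring is reported separately: the prediction is functionally equivalent to the gold code but fails a strict string comparison.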
## Environmental Impact

- **Hardware Type:** Single NVIDIA A100
- **Hours used:** ~4 hours
- **Cloud Provider:** AWS
- **Compute Region:** US-East
- **Carbon Emitted:** ~1.2 kg CO2eq (estimate)

### Model Architecture and Objective

- CodeT5-base (encoder-decoder)
- Objective: Seq2Seq code generation from natural language plus data context

### Compute Infrastructure

- Training done on a single A100 GPU
- Fine-tuned with Hugging Face Transformers and PEFT (LoRA)

## Model Card Contact

- **Author:** ArhanSD1
- **Hugging Face:** https://huggingface.co/arhansd1
- **Email:** N/A