---
metrics:
- accuracy: 85%
license: apache-2.0
datasets:
- arhansd1/Csv-CommandToCode
base_model:
- Salesforce/codet5-base
---
# Model Card for Csv-AI-Cleaner-V3

**Transform English instructions with data context into executable pandas code with AI**

### Model Description

Csv-AI-Cleaner converts **natural language instructions** into **pandas code** for data cleaning, filtering, grouping, sorting, merging, and more.
It was fine-tuned on synthetic and real-world datasets using CodeT5 with LoRA for efficiency.

- **Developed by:** ArhanSD1
- **Model type:** Seq2Seq Transformer (CodeT5)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Salesforce/codet5-base

### Model Sources

- **Repository:** https://huggingface.co/arhansd1/Csv-AI-Cleaner-V3

### Direct Use
- Input: context (a sample of the dataset) plus an instruction in natural language
- Output: an executable pandas code snippet

Example:
```
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
```
Output:
```python
df[(df['department'] == 'IT') & (df['salary'] > 45000)]
```

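To sanity-check the example output, the generated snippet can be applied to a small DataFrame built from the same context (a minimal illustration; constructing the DataFrame is not part of the model's input or output):

```python
import pandas as pd

# Rebuild the sample context as a DataFrame.
df = pd.DataFrame({
    "employee_id": ["E001", "E002"],
    "name": ["Alice", "Bob"],
    "salary": [50000, 45000],
    "department": ["IT", "HR"],
})

# Apply the generated snippet: only Alice (IT, 50000) matches.
print(df[(df['department'] == 'IT') & (df['salary'] > 45000)])
```
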
### Out-of-Scope Use
- Ambiguous or poorly defined instructions without dataset context
- Complex multi-step pipelines exceeding the ~500-token context limit (a token pre-check sketch follows this list)

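One way to pre-check an input against that limit is to count tokens with the model's own tokenizer. A minimal sketch, assuming a 500-token budget (the helper name and budget value are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("arhansd1/Csv-AI-Cleaner-V3")

def fits_context(input_text: str, budget: int = 500) -> bool:
    # Count tokens without special tokens to approximate the usable context.
    n_tokens = len(tokenizer(input_text, add_special_tokens=False).input_ids)
    return n_tokens <= budget
```
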
## Bias, Risks, and Limitations
- Works best with clean, clearly named columns
- May generate suboptimal code if the context is incomplete or noisy
- Has no awareness of business-logic correctness; it learns syntax and patterns only

### Recommendations
Users should verify generated code for correctness and safety before execution.

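One hedged approach is to at least syntax-check a snippet and evaluate it against a throwaway copy of the data with a restricted namespace. The helper below is an illustrative sketch, not part of the model, and does not make untrusted code safe on its own:

```python
import ast
import pandas as pd

def check_and_run(code: str, df: pd.DataFrame):
    # Reject anything that does not parse as valid Python.
    ast.parse(code)
    # Evaluate as a single expression against a copy of the data,
    # with no builtins and only `df` and `pd` in scope. This reduces,
    # but does not eliminate, the risk of unintended side effects.
    return eval(code, {"__builtins__": {}}, {"df": df.copy(), "pd": pd})
```
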
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)

def generate_code(input_text):
    # Prepend the task prefix expected by the model.
    prefixed_input = "Generate pandas code: " + input_text
    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=128,
            num_beams=5,  # beam search; temperature has no effect without sampling
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
input_example = """
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
"""
print(generate_code(input_example))
```

### Training Data
- Combination of synthetic data-cleaning instructions and column contexts from public datasets (an illustrative record follows this list)
- Augmented with filtered Stack Overflow code snippets for pandas tasks

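For illustration, a single training record might look like the following (this exact pair is hypothetical and shown only to convey the input/target format):

```python
# Hypothetical training record; the column names and instruction are invented.
example_pair = {
    "input": (
        "Context:\n"
        "order_id | price | status\n"
        "O1 | 20.5 | shipped\n"
        "O2 | 13.0 | pending\n\n"
        "Instruction: Sort orders by price in descending order"
    ),
    "target": "df.sort_values('price', ascending=False)",
}
```
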
#### Preprocessing
- Context tables normalized to a consistent format (a rendering sketch follows this list)
- Instruction phrasing normalized to the imperative form

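As a minimal sketch of the context normalization, assuming the pipe-separated layout used in this card's examples (the helper below is illustrative, not the released preprocessing code):

```python
import pandas as pd

def format_context(df: pd.DataFrame, n_rows: int = 2) -> str:
    # Render the header and the first few rows in a pipe-separated layout.
    header = " | ".join(df.columns)
    rows = [
        " | ".join(str(value) for value in row)
        for row in df.head(n_rows).itertuples(index=False)
    ]
    return "\n".join([header, *rows])
```
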
#### Training Hyperparameters
- LoRA fine-tuning (an illustrative PEFT setup follows this list)
- Learning rate: 5e-5
- Epochs: 3
- Batch size: 8
- Precision: fp16 mixed precision

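With PEFT, the setup might look roughly like this. Only the learning rate, epochs, batch size, and fp16 setting are documented above; the LoRA rank, alpha, dropout, and target modules below are illustrative assumptions:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

base = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# Illustrative LoRA settings; r, alpha, dropout, and modules are assumptions.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5-style attention projections
)
model = get_peft_model(base, lora_config)

# Settings documented in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="csv-ai-cleaner-lora",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    fp16=True,
)
```
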
### Testing Data
- Held-out set of 500 natural language → pandas task pairs

### Metrics
| Metric          | Score |
|-----------------|-------|
| Exact Match     | 71%   |
| Partial Match   | 92%   |
| Syntax Accuracy | 100%  |

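The card does not document the exact metric definitions; as a hedged sketch, they could be computed along these lines (the partial-match proxy below is one possible choice, not necessarily the one used):

```python
import ast

def exact_match(pred: str, ref: str) -> bool:
    # Exact match after trimming surrounding whitespace.
    return pred.strip() == ref.strip()

def syntax_ok(pred: str) -> bool:
    # Syntax accuracy: does the prediction parse as Python?
    try:
        ast.parse(pred)
        return True
    except SyntaxError:
        return False

def token_overlap(pred: str, ref: str) -> float:
    # One possible partial-match proxy: fraction of reference tokens
    # that also appear in the prediction.
    pred_tokens, ref_tokens = set(pred.split()), set(ref.split())
    return len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)
```
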
### Results Summary
Syntax accuracy is high and the partial-match rate is strong; exact match is slightly lower on multi-condition or chained operations.

## Environmental Impact
- **Hardware Type:** Single NVIDIA A100
- **Hours used:** ~4
- **Cloud Provider:** AWS
- **Compute Region:** US East
- **Carbon Emitted:** ~1.2 kg CO2eq (estimate)

### Model Architecture and Objective
- CodeT5-base (encoder-decoder)
- Objective: seq2seq code generation from natural language plus data context

### Compute Infrastructure
- Training done on 1× A100 GPU
- Fine-tuning with Hugging Face Transformers + PEFT (LoRA)

## Model Card Contact
- **Author:** ArhanSD1
- **Hugging Face:** https://huggingface.co/arhansd1
- **Email:** N/A