---
metrics:
- accuracy: 85%
license: apache-2.0
datasets:
- arhansd1/Csv-CommandToCode
base_model:
- Salesforce/codet5-base
---

# Model Card for Csv-AI-Cleaner-V3

**Transforms English instructions, paired with data context, into executable pandas code.**

### Model Description

Csv-AI-Cleaner converts **natural language instructions** into **pandas code** for data cleaning, filtering, grouping, sorting, merging, and more. It was fine-tuned on a mix of synthetic and real-world datasets, using CodeT5 with LoRA for parameter-efficient training.

- **Developed by:** ArhanSD1
- **Model type:** Seq2Seq Transformer (CodeT5)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Salesforce/codet5-base

### Model Sources

- **Repository:** https://huggingface.co/arhansd1/Csv-AI-Cleaner-V3

### Direct Use

- Input: context (a sample of the dataset) plus an instruction in natural language
- Output: an executable pandas code snippet

Example:

```
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
```

Output:

```python
df[(df['department'] == 'IT') & (df['salary'] > 45000)]
```

### Out-of-Scope Use

- Ambiguous or poorly defined instructions without dataset context
- Complex multi-step pipelines exceeding the ~500-token context limit

## Bias, Risks, and Limitations

- Works best with clean, descriptive column names
- May generate suboptimal code if the context is incomplete or noisy
- No awareness of business-logic correctness; the model learns only syntax and patterns

### Recommendations

Users should verify generated code for correctness and safety before executing it.
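As a lightweight first line of defense (a sketch, not part of the model's released tooling), a generated snippet can at least be syntax-checked with Python's standard `ast` module before it is executed; `is_valid_python` below is a hypothetical helper name. Note this only catches syntax errors, not semantic or safety issues.

```python
import ast

def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses as Python source code."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# A well-formed snippet passes the check...
print(is_valid_python("df[(df['department'] == 'IT') & (df['salary'] > 45000)]"))  # True
# ...while a malformed one is caught before it ever runs.
print(is_valid_python("df[(df['department'] == 'IT' &"))  # False
```

Parsing does not run the code, so this check is safe to apply even to untrusted output; deeper review (or a sandboxed run) is still needed before using the result on real data.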
## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)

def generate_code(input_text):
    # The model was trained with this task prefix.
    prefixed_input = "Generate pandas code: " + input_text
    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=128,
            num_beams=5,  # deterministic beam search (temperature has no effect without sampling)
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
input_example = """
Context:
employee_id | name | salary | department
E001 | Alice | 50000 | IT
E002 | Bob | 45000 | HR

Instruction: Show IT department employees earning over 45000
"""
print(generate_code(input_example))
```

### Training Data

- Combination of synthetic data-cleaning instructions and column contexts from public datasets
- Augmented with filtered Stack Overflow code snippets for pandas tasks

#### Preprocessing

- Context section normalized to a consistent table format
- Instruction phrasing normalized to the imperative form

#### Training Hyperparameters

- LoRA fine-tuning
- Learning rate: 5e-5
- Epochs: 3
- Batch size: 8
- Precision: fp16 mixed precision

### Testing Data

- Held-out set of 500 natural language → pandas task pairs

### Metrics

| Metric          | Score |
|-----------------|-------|
| Exact Match     | 71%   |
| Partial Match   | 92%   |
| Syntax Accuracy | 100%  |

### Results Summary

High syntax accuracy and a strong partial-match rate, with a somewhat lower exact-match rate on multi-condition or chained operations.
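The evaluation scripts are not published, so the sketch below only illustrates one plausible way to compute two of the metrics above: exact match as a whitespace-normalized string comparison, and syntax accuracy as the fraction of predictions that parse with `ast.parse`. The function names and the exact-match definition are assumptions, not the card author's actual scoring code.

```python
import ast

def exact_match(pred: str, gold: str) -> bool:
    # Whitespace-insensitive string equality (one plausible definition).
    return " ".join(pred.split()) == " ".join(gold.split())

def syntax_ok(pred: str) -> bool:
    # Counts as syntactically correct if the snippet parses as Python.
    try:
        ast.parse(pred)
        return True
    except SyntaxError:
        return False

def score(pairs):
    """Return (exact-match rate, syntax-accuracy rate) over (pred, gold) pairs."""
    n = len(pairs)
    em = sum(exact_match(p, g) for p, g in pairs) / n
    syn = sum(syntax_ok(p) for p, _ in pairs) / n
    return em, syn

pairs = [
    ("df[df['salary'] > 45000]", "df[df['salary'] > 45000]"),
    ("df.sort_values('salary')", "df.sort_values(by='salary')"),
]
print(score(pairs))  # (0.5, 1.0)
```

The second pair shows why partial-match scoring is reported separately: the prediction is functionally equivalent to the gold code but fails a strict string comparison.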
## Environmental Impact

- **Hardware Type:** Single NVIDIA A100
- **Hours used:** ~4 hours
- **Cloud Provider:** AWS
- **Compute Region:** US-East
- **Carbon Emitted:** ~1.2 kg CO2eq (estimate)

### Model Architecture and Objective

- CodeT5-base (encoder-decoder)
- Objective: Seq2Seq code generation from natural language plus data context

### Compute Infrastructure

- Training done on a single A100 GPU
- Fine-tuned with Hugging Face Transformers and PEFT (LoRA)

## Model Card Contact

- **Author:** ArhanSD1
- **Hugging Face:** https://huggingface.co/arhansd1
- **Email:** N/A