	---
	metrics:
	- accuracy:85%
	license: apache-2.0
	datasets:
	- arhansd1/Csv-CommandToCode
	base_model:
	- Salesforce/codet5-base
	---
	# Model Card for Csv-AI-Cleaner-V3

	Transforms English instructions and a sample of the data into executable pandas code.

	### Model Description

	Csv-AI-Cleaner converts natural language instructions into pandas code for data cleaning, filtering, grouping, sorting, merges, and more.
	Fine-tuned from CodeT5 with LoRA (for efficiency) on a mix of synthetic and real-world datasets.

	- Developed by: ArhanSD1
	- Model type: Seq2Seq Transformer (CodeT5)
	- Language(s) (NLP): English
	- License: Apache 2.0
	- Fine-tuned from model: Salesforce/codet5-base

	### Model Sources

	- Repository: https://huggingface.co/arhansd1/Csv-AI-Cleaner-V3


	### Direct Use
	- Input: Context (sample dataset) + instruction in natural language
	- Output: Executable pandas code snippet

	Example:
	```
	Context:
	employee_id | name | salary | department
	E001 | Alice | 50000 | IT
	E002 | Bob | 45000 | HR

	Instruction: Show IT department employees earning over 45000
	```
	Output:
	```python
	df[(df['department'] == 'IT') & (df['salary'] > 45000)]
	```
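
The context block in the example above can be assembled from a DataFrame rather than typed by hand. This is a minimal sketch, assuming the model expects the same pipe-separated layout as the example; `build_prompt` and its `n_rows` parameter are hypothetical names, not part of the model repository.

```python
import pandas as pd


def build_prompt(df: pd.DataFrame, instruction: str, n_rows: int = 2) -> str:
    """Format the first n_rows of df plus an instruction as a Context/Instruction prompt."""
    # Header line: column names joined with " | "
    header = " | ".join(str(col) for col in df.columns)
    # One line per sample row, using the same separator
    rows = [
        " | ".join(str(value) for value in row)
        for row in df.head(n_rows).itertuples(index=False)
    ]
    context = "\n".join([header, *rows])
    return f"Context:\n{context}\n\nInstruction: {instruction}"


df = pd.DataFrame(
    {
        "employee_id": ["E001", "E002"],
        "name": ["Alice", "Bob"],
        "salary": [50000, 45000],
        "department": ["IT", "HR"],
    }
)
prompt = build_prompt(df, "Show IT department employees earning over 45000")
print(prompt)
```

Keeping `n_rows` small leaves more of the token budget for the instruction itself.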

	### Out-of-Scope Use
	- Ambiguous or poorly defined instructions without dataset context
	- Complex multi-step pipelines that exceed the ~500-token context limit

	## Bias, Risks, and Limitations
	- Works best with clean and clear column names
	- May generate suboptimal code if context is incomplete or contains noise
	- No awareness of business-logic correctness; the model learns only syntax and patterns

	### Recommendations
	Users should verify generated code for correctness and safety before execution.
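
One way to act on this recommendation is to parse the generated snippet before running it and reject anything other than a single expression over `df`. The sketch below uses only the standard-library `ast` module. `is_safe_expression` and its allow-list are illustrative assumptions, not a complete sandbox.

```python
import ast

# Names the generated expression may reference
# (assumption: generated code operates on `df`, optionally via `pd`).
ALLOWED_NAMES = {"df", "pd"}


def is_safe_expression(code: str) -> bool:
    """Return True if code parses as one expression that only uses allowed names."""
    try:
        # mode="eval" accepts a single expression and rejects statements
        tree = ast.parse(code.strip(), mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        # Reject references to anything outside the allow-list
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            return False
        # Reject private or dunder attribute access such as df.__class__
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            return False
    return True


print(is_safe_expression("df[(df['department'] == 'IT') & (df['salary'] > 45000)]"))
print(is_safe_expression("__import__('os').system('ls')"))
```

This check is deliberately conservative: it also rejects assignments and lambdas, and it does not stop side-effecting methods such as `df.to_csv(...)`, so a manual review is still needed.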

	## How to Get Started with the Model
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
	import torch

	MODEL_REPO = "arhansd1/Csv-AI-Cleaner-V3"
	tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
	model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_REPO)

	def generate_code(input_text):
	    """Generate pandas code for a 'Context: ... Instruction: ...' prompt."""
	    # Prepend the task prefix
	    prefixed_input = "Generate pandas code: " + input_text
	    inputs = tokenizer(prefixed_input, return_tensors="pt", truncation=True, max_length=512)
	    with torch.no_grad():
	        outputs = model.generate(
	            input_ids=inputs.input_ids,
	            attention_mask=inputs.attention_mask,
	            max_length=128,
	            num_beams=5,  # beam search; temperature only applies when do_sample=True
	            early_stopping=True,
	        )
	    return tokenizer.decode(outputs[0], skip_special_tokens=True)

	# Example
	input_example = """
	Context:
	employee_id | name | salary | department
	E001 | Alice | 50000 | IT
	E002 | Bob | 45000 | HR

	Instruction: Show IT department employees earning over 45000
	"""
	print(generate_code(input_example))
	```

	### Training Data
	- Combination of synthetic data cleaning instructions + public dataset column contexts
	- Augmented with filtered StackOverflow code snippets for pandas tasks

	#### Preprocessing
	- Normalized table format for context section
	- Instruction phrasing normalized to imperative form

	#### Training Hyperparameters
	- LoRA fine-tuning
	- Learning rate: 5e-5
	- Epochs: 3
	- Batch size: 8
	- Precision: fp16 mixed precision


	### Testing Data
	- Held-out set of 500 natural language → pandas task pairs

	### Metrics
	| Metric          | Score |
	|-----------------|-------|
	| Exact Match     | 71%   |
	| Partial Match   | 92%   |
	| Syntax Accuracy | 100%  |
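
The card does not state how these metrics were computed. As an illustration only, syntax accuracy could be measured by parsing each prediction with `ast.parse`, and exact match by comparing parsed ASTs so that spacing and quote style do not matter. The function names and sample pairs below are hypothetical, and partial match is left out because its definition is not given.

```python
import ast


def is_valid_python(code: str) -> bool:
    """True if the snippet parses as Python source."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


def normalize(code: str) -> str:
    """Return a canonical AST dump, or the stripped text if the code does not parse."""
    try:
        return ast.dump(ast.parse(code))
    except SyntaxError:
        return code.strip()


def score(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Compute exact-match and syntax-validity rates over paired predictions and references."""
    total = len(predictions)
    exact = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    valid = sum(is_valid_python(p) for p in predictions)
    return {"exact_match": exact / total, "syntax_accuracy": valid / total}


# Made-up examples, not taken from the held-out test set.
preds = ["df.sort_values('salary')", "df[df['salary'] > 45000]"]
refs = ["df.sort_values( 'salary' )", "df[df['salary'] >= 45000]"]
print(score(preds, refs))
```

Here the first pair differs only in spacing and counts as a match, while the second differs in the comparison operator and does not.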

	### Results Summary
	High syntax accuracy and a good partial-match rate, with slightly lower exact match on multi-condition or chained operations.

	## Environmental Impact
	- Hardware Type: Single NVIDIA A100
	- Hours used: ~4 hours
	- Cloud Provider: AWS
	- Compute Region: US-East
	- Carbon Emitted: ~1.2 kg CO2eq (estimate)

	### Model Architecture and Objective
	- CodeT5-base (encoder-decoder)
	- Objective: Seq2Seq code generation from natural language + data context

	### Compute Infrastructure
	- Training done on 1×A100 GPU
	- Fine-tuning with Hugging Face Transformers + PEFT (LoRA)

	## Model Card Contact
	- Author: ArhanSD1
	- Hugging Face: https://huggingface.co/arhansd1
	- Email: N/A