---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-14B-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
  - qlora
  - data-science
  - code-generation
  - peft
  - qwen2
  - lora
  - sft
  - unsloth
language:
  - en
---

# DataSci-Coder-14B: Qwen2.5-Coder-14B LoRA Adapter for Data Science

[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jacksonSmall/DataSci-Coder)

A QLoRA fine-tuned adapter for [Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct) optimized for data science code generation. When instructed to return code only, the model outputs clean, runnable Python with no explanatory text outside the code.

## Key Results

| Metric | DS-Tuned (FT) | Base Model | Delta |
|--------|:---:|:---:|:---:|
| Hard Eval (12 complex tasks) | 12/12 | 12/12 | Tie |
| Constraint Compliance | 93.3% | 91.4% | **+1.9 pts** |
| Code-Only Compliance | 10/10 | 6/10 | **+40 pts** |
| Code Ratio | 100% | 87.9% | **+12.1 pts** |

## What It Does

- Generates complete, runnable Python code for data science tasks
- Covers statistics, machine learning, deep learning, NLP, time series, and visualization
- Follows instructions precisely: when told "no explanations," it outputs only code (the base model broke this rule on 4 of 10 test prompts)
- Handles complex tasks: Bayesian inference, VAEs, GANs, survival analysis, stacking ensembles, SHAP, anomaly detection

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | `unsloth/Qwen2.5-Coder-14B-Instruct-bnb-4bit` |
| Method | QLoRA (4-bit quantization) |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| LoRA Targets | q/k/v/o_proj, gate/up/down_proj |
| Trainable Parameters | 68.8M / 14.8B (0.46%) |
| Training Examples | 10,795 |
| Epochs | 1 |
| Final Loss | 0.5933 |
| Training Time | 1.9 hours on NVIDIA L40S |
| Precision | bfloat16 |
| Optimizer | Paged AdamW 8-bit |
| Learning Rate | 3e-5 (cosine schedule) |
| Effective Batch Size | 16 (1 x 16 grad accum) |

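The trainable-parameter count in the table can be reproduced from the LoRA settings. The sketch below assumes the Qwen2.5-14B architecture dimensions (hidden size 5120, MLP size 13824, 48 layers, 8 key/value heads of dimension 128); these values are not stated in this card and should be checked against the base model's `config.json`. Each adapted linear layer with input size `d_in` and output size `d_out` adds `rank * (d_in + d_out)` parameters.

```python
# Assumed Qwen2.5-14B dimensions (not stated in this card; verify in config.json).
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 48
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
RANK = 16         # LoRA rank from the table above


def lora_params(d_in: int, d_out: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted layer."""
    return rank * (d_in + d_out)


per_layer = (
    lora_params(HIDDEN, HIDDEN)          # q_proj
    + lora_params(HIDDEN, KV_DIM)        # k_proj
    + lora_params(HIDDEN, KV_DIM)        # v_proj
    + lora_params(HIDDEN, HIDDEN)        # o_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # gate_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # up_proj
    + lora_params(INTERMEDIATE, HIDDEN)  # down_proj
)
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the result is 68,812,800, which rounds to the 68.8M reported above.
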
## Training Data

10,795 curated data science instruction-response pairs from:
- 6 public HuggingFace datasets (CodeAlpaca, Evol-Instruct, etc.)
- University coursework (statistics, ML, deep learning)
- Data science newsletters
- Hand-curated examples

All examples were filtered for Python code quality, data science relevance, and length. Categories: machine learning, deep learning, statistics, data wrangling, visualization, NLP, time series, numerical computing.

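The card does not describe how the three filters were implemented. As a rough illustration only, one way to express them is a parse check for code quality, a keyword check for relevance, and character bounds for length. The keyword list, the bounds, and the `keep_example` helper below are all hypothetical and are not the author's pipeline.

```python
import ast

# Hypothetical keyword list and length bounds, chosen only for illustration.
DS_KEYWORDS = ("pandas", "numpy", "sklearn", "torch", "matplotlib", "scipy", "statsmodels")
MIN_CHARS, MAX_CHARS = 40, 8000


def is_valid_python(code: str) -> bool:
    """Code-quality proxy: the response must parse as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def keep_example(instruction: str, response: str) -> bool:
    """Keep a pair only if it fits the length bounds, parses, and names a data science library."""
    text = f"{instruction}\n{response}".lower()
    return (
        MIN_CHARS <= len(response) <= MAX_CHARS
        and is_valid_python(response)
        and any(keyword in text for keyword in DS_KEYWORDS)
    )


pairs = [
    ("Compute the column means of a DataFrame.",
     "import pandas as pd\n\ndef col_means(df: pd.DataFrame) -> pd.Series:\n    return df.mean(numeric_only=True)\n"),
    ("Reverse a string.", "def rev(s):\n    return s[::-1]\n"),
    ("Fit a model.", "import sklearn\nmodel = fit(  # unbalanced parenthesis\n"),
]
kept = [pair for pair in pairs if keep_example(*pair)]
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

Only the first pair survives: the second is too short and names no data science library, and the third does not parse.
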
## Usage

```python
from unsloth import FastLanguageModel
import torch

# Load the adapter in 4-bit; dtype=None lets Unsloth auto-detect the dtype.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="jsmall12/DataSci-Coder-14B-LoRA",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,
)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

# The system prompt asks for code-only output.
messages = [
    {"role": "system", "content": "You are an expert data science coding assistant. Respond ONLY with clean, runnable Python code. Use inline comments for explanation. No text outside code blocks."},
    {"role": "user", "content": "Write a function to train a logistic regression model with sklearn and print the classification report."},
]

# Render the chat template to a prompt string, then tokenize and move it to the GPU.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.1,
        do_sample=True,
        top_p=0.9,
        repetition_penalty=1.15,
        use_cache=False,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

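The system prompt above allows code blocks, so `response` may arrive wrapped in Markdown fences. To save or execute the generated code, the fences have to be stripped first. A minimal sketch follows; `extract_code` is a hypothetical helper and is not part of Unsloth or Transformers.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built this way to avoid a literal fence in this snippet

# Opening fence with an optional language tag, then the block body up to the next fence.
BLOCK_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return fenced code joined by blank lines, or the raw text if no fence is found."""
    blocks = [block.strip("\n") for block in BLOCK_RE.findall(response)]
    return "\n\n".join(blocks) if blocks else response.strip()


sample = FENCE + "python\nimport numpy as np\nprint(np.arange(3).sum())\n" + FENCE
print(extract_code(sample))
```

Falling back to the raw text keeps the helper usable when the model returns bare code without any fence.
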
## Evaluation

### Hard Eval (12 Complex Tasks)

Both the fine-tuned adapter and the base model produced correct, complete, runnable implementations for all 12 tasks:

| Category | Tasks | Score |
|----------|-------|:---:|
| Statistics | Bayesian A/B testing, Kaplan-Meier survival analysis, time series CV + ARIMA, VIF + Ridge/Lasso/ElasticNet | 4/4 |
| Machine Learning | Stacking ensemble, SHAP importance, Isolation Forest, TF-IDF + SVM pipeline | 4/4 |
| Deep Learning | LR scheduler (warmup + cosine), BiLSTM + attention, VAE, GAN | 4/4 |

### Constraint Eval (10 Multi-Constraint Tests)

| Test | FT | Base | Delta |
|------|:---:|:---:|:---:|
| C01 Multi-step data cleaning | 8/8 | 8/8 | 0 |
| C02 Complete ML pipeline | 12/12 | 12/12 | 0 |
| C03 Statistical hypothesis test | 9/9 | 7/9 | **+2** |
| C04 PyTorch architecture | 9/9 | 7/9 | **+2** |
| C05 EDA visualizations | 11/12 | 10/12 | **+1** |
| C06 Cross-validated pipeline | 12/12 | 12/12 | 0 |
| C07 Time series ARIMA | 9/10 | 10/10 | -1 |
| C08 DL training function | 8/10 | 8/10 | 0 |
| C09 Pandas method chain | 10/10 | 10/10 | 0 |
| C10 Model evaluation | 10/13 | 12/13 | -2 |
| **Total** | **98/105** | **96/105** | **+2** |

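The totals row and the Constraint Compliance percentages in Key Results follow directly from the per-test scores. The short check below copies the scores from the table above and recomputes them.

```python
# (test, fine-tuned score, base score, max score), copied from the table above.
RESULTS = [
    ("C01", 8, 8, 8),
    ("C02", 12, 12, 12),
    ("C03", 9, 7, 9),
    ("C04", 9, 7, 9),
    ("C05", 11, 10, 12),
    ("C06", 12, 12, 12),
    ("C07", 9, 10, 10),
    ("C08", 8, 8, 10),
    ("C09", 10, 10, 10),
    ("C10", 10, 12, 13),
]

ft_total = sum(ft for _, ft, _, _ in RESULTS)
base_total = sum(base for _, _, base, _ in RESULTS)
max_total = sum(mx for _, _, _, mx in RESULTS)

ft_pct = 100 * ft_total / max_total
base_pct = 100 * base_total / max_total
print(f"FT   {ft_total}/{max_total} = {ft_pct:.1f}%")
print(f"Base {base_total}/{max_total} = {base_pct:.1f}%")
print(f"Delta {ft_pct - base_pct:+.1f} points")
```

This gives 98/105 (93.3%) for the fine-tuned adapter and 96/105 (91.4%) for the base model, a difference of 1.9 percentage points.
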
## Hardware Requirements

- **Minimum:** ~10GB VRAM (4-bit quantized)
- **Recommended:** 24GB+ VRAM (L4, A100, etc.)
- **Tested on:** NVIDIA L40S (44GB), NVIDIA T4 x2 (15GB each)

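The ~10GB minimum is consistent with a back-of-the-envelope estimate from the parameter counts in Training Details. The sketch below is an estimate rather than a measurement: it treats every base weight as stored in 4 bits and assumes the adapter is held in bfloat16, which simplifies how quantized checkpoints are actually laid out.

```python
TOTAL_PARAMS = 14.8e9    # total parameters, from Training Details
ADAPTER_PARAMS = 68.8e6  # trainable LoRA parameters, from Training Details

base_weights_gb = TOTAL_PARAMS * 4 / 8 / 1e9  # 4 bits per weight (simplifying assumption)
adapter_gb = ADAPTER_PARAMS * 2 / 1e9         # assumed 2 bytes per weight (bfloat16)

print(f"4-bit base weights: ~{base_weights_gb:.1f} GB")
print(f"LoRA adapter:       ~{adapter_gb:.2f} GB")
```

The weights come to roughly 7.4 GB plus about 0.14 GB for the adapter. The remaining headroom goes to quantization metadata, activations, and the generation cache, all of which grow with sequence length.
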
## License

Apache 2.0