---
language:
- code
license: apache-2.0
tags:
- differential-privacy
- code-generation
- continued-pretraining
- lora
- dp-sgd
- opacus
- privacy
datasets:
- melihcatal/codedp-cpt
base_model:
- ibm-granite/granite-4.0-h-tiny
- deepseek-ai/deepseek-coder-6.7b-instruct
- Qwen/Qwen3-4B-Instruct-2507
library_name: peft
pipeline_tag: text-generation
---

# CodeDP-CPT: Differentially Private Continued Pre-Training for Code Models

This repository contains LoRA adapters for code language models trained with **Continued Pre-Training (CPT)** under **Differential Privacy (DP-SGD)**. The models demonstrate that formal privacy guarantees can be applied to code generation models while preserving utility.

## Models

Nine adapter checkpoints are provided — three base models × three privacy configurations:

| Base Model | Variant | DP | Target ε | Achieved ε | Adapter Path |
|---|---|---|---|---|---|
| [ibm-granite/granite-4.0-h-tiny](https://huggingface.co/ibm-granite/granite-4.0-h-tiny) | base | No | — | — | `granite-4.0-h-tiny/base/adapter/` |
| [ibm-granite/granite-4.0-h-tiny](https://huggingface.co/ibm-granite/granite-4.0-h-tiny) | dp3 | Yes | 3.0 | 2.99 | `granite-4.0-h-tiny/dp3/adapter/` |
| [ibm-granite/granite-4.0-h-tiny](https://huggingface.co/ibm-granite/granite-4.0-h-tiny) | dp8 | Yes | 8.0 | 8.00 | `granite-4.0-h-tiny/dp8/adapter/` |
| [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) | base | No | — | — | `deepseek-coder-6.7b/base/adapter/` |
| [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) | dp3 | Yes | 3.0 | 3.00 | `deepseek-coder-6.7b/dp3/adapter/` |
| [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) | dp8 | Yes | 8.0 | 8.00 | `deepseek-coder-6.7b/dp8/adapter/` |
| [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | base | No | — | — | `qwen3-4b-instruct/base/adapter/` |
| [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | dp3 | Yes | 3.0 | 2.99 | `qwen3-4b-instruct/dp3/adapter/` |
| [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | dp8 | Yes | 8.0 | 8.00 | `qwen3-4b-instruct/dp8/adapter/` |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Pick a base model and one of the nine adapter subfolders from the table above.
base_model_name = "ibm-granite/granite-4.0-h-tiny"
adapter_path = "melihcatal/codedp-cpt-models"
subfolder = "granite-4.0-h-tiny/dp8/adapter"

# Load the frozen base model, then attach the LoRA adapter on top.
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code=True)
model = PeftModel.from_pretrained(model, adapter_path, subfolder=subfolder)
```

## Training Details

### Dataset

- **Dataset:** [melihcatal/codedp-cpt](https://huggingface.co/datasets/melihcatal/codedp-cpt) — code mined from GitHub repositories with quality filtering and decontamination (file-level, Type-1, and Type-2 clone detection against evaluation benchmarks)
- **Mode:** Causal language modeling (continued pre-training)
- **Validation split:** 5% held out

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Modules to save | lm_head |
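
With r=16 and α=32, each targeted projection receives the low-rank update W' = W + (α/r)·B·A, i.e. a scaling factor of 2. A minimal sketch of this merge in plain Python (toy matrix sizes for illustration, not the models' actual hidden dimensions):

```python
# Sketch of how a LoRA update modifies a frozen weight matrix.
# Shapes are toy-sized; real q/k/v/o projections are much larger.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_merge(W, A, B, r=16, alpha=32):
    """Return W + (alpha / r) * B @ A, the effective weight after merging."""
    scale = alpha / r                 # 32 / 16 = 2.0 for this model card
    delta = matmul(B, A)              # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Toy example with r=1 to keep the arithmetic visible.
W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight (2x2)
A = [[0.5, 0.5]]                      # LoRA down-projection (1x2)
B = [[1.0], [0.0]]                    # LoRA up-projection (2x1)
print(lora_merge(W, A, B, r=1, alpha=2))   # [[2.0, 1.0], [0.0, 1.0]]
```
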

### Training Hyperparameters

| Parameter | No-DP (base) | DP variants |
|---|---|---|
| Epochs | 2 | 2 |
| Batch size (per GPU) | 8 | 8 |
| Learning rate | 1e-4 | 2e-4 |
| Optimizer | AdamW | AdamW |
| LR scheduler | Cosine | Cosine |
| Warmup ratio | 5% | 5% |
| Grad accumulation steps | 4–8 | 16 |
| Max gradient norm | 1.0 | 1.0 |
| Sequence length | 1024 | 1024 |
| Precision | bfloat16 | bfloat16 |
| Seed | 42 | 42 |

### Differential Privacy

| Parameter | Value |
|---|---|
| Engine | Opacus PrivacyEngine |
| Mechanism | Gaussian (DP-SGD) |
| Per-sample gradients | Hook-based |
| Clipping | Flat (global) |
| Target δ | 1e-5 |
| Target ε | 3.0 or 8.0 |
| Privacy accounting | RDP (Rényi Differential Privacy) |
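
The per-step mechanism this table describes — flat (global) per-sample clipping followed by Gaussian noise on the summed gradient — can be sketched in plain Python. Note the `noise_multiplier` here is illustrative; in training it is calibrated by Opacus' RDP accountant to hit the ε target:

```python
import math
import random

def dp_sgd_step(per_sample_grads, max_grad_norm=1.0, noise_multiplier=1.0, seed=0):
    """One DP-SGD aggregation step (sketch): clip each per-sample gradient to a
    global L2 norm of max_grad_norm, sum, add Gaussian noise, and average.

    per_sample_grads: list of flat gradient vectors, one per sample.
    """
    rng = random.Random(seed)
    clipped_sum = [0.0] * len(per_sample_grads[0])
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, max_grad_norm / (norm + 1e-12))  # flat clipping
        for i, x in enumerate(g):
            clipped_sum[i] += x * scale
    sigma = noise_multiplier * max_grad_norm              # Gaussian mechanism
    noisy = [x + rng.gauss(0.0, sigma) for x in clipped_sum]
    return [x / len(per_sample_grads) for x in noisy]     # batch average
```

Clipping bounds each sample's influence on the update, which is what makes the Gaussian noise sufficient for the (ε, δ) guarantee.
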

### Infrastructure

- **Distributed strategy:** DDP (Distributed Data Parallel) with NCCL backend
- **Hardware:** NVIDIA H200 GPUs

## Evaluation Results

### Functional Correctness — CodeDP-FC (Granite-4.0-H-Tiny)

103 code generation tasks, 10 samples per task, temperature 0.8.

| Variant | pass@1 | pass@5 | pass@10 |
|---|---|---|---|
| No fine-tuning | 13.5% | 18.4% | 20.4% |
| CPT (no DP) | 10.1% | 16.6% | 18.4% |
| CPT + DP (ε=3) | 13.7% | 19.1% | 21.4% |
| CPT + DP (ε=8) | **14.5%** | **21.1%** | **23.3%** |
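
With n=10 samples per task, pass@k is typically computed with the standard unbiased estimator of Chen et al. (2021); a minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n = samples drawn per task and c = samples that pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one task, 2 of them pass.
print(round(pass_at_k(10, 2, 1), 3))   # 0.2
print(round(pass_at_k(10, 2, 5), 3))   # 0.778
```

The per-task estimates are then averaged over all 103 tasks.
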

### Validation Loss

Final loss on the held-out evaluation split.

| Model | No-DP | DP ε=3 | DP ε=8 |
|---|---|---|---|
| Granite-4.0-H-Tiny | 0.946 | 1.044 | 1.038 |
| DeepSeek-Coder-6.7B | 4.840 | 10.326 | 7.523 |
| Qwen3-4B-Instruct | 0.808 | 0.941 | 0.925 |

### Privacy Audit

New-token canary audit (500 members, 500 non-members, 49-token random prefixes). Higher AUC = more memorization; lower = better privacy.

| Model | Variant | Loss AUC | Embedding AUC | Empirical ε (p=0.01) |
|---|---|---|---|---|
| Granite-4.0-H-Tiny | base | 1.000 | 1.000 | 3.02 |
| Granite-4.0-H-Tiny | dp3 | 0.543 | 0.513 | 0.00 |
| Granite-4.0-H-Tiny | dp8 | 0.564 | 0.508 | 0.16 |
| DeepSeek-Coder-6.7B | base | 0.957 | 0.968 | 3.02 |
| DeepSeek-Coder-6.7B | dp3 | 0.522 | 0.543 | 0.00 |
| DeepSeek-Coder-6.7B | dp8 | 0.533 | 0.545 | 0.00 |
| Qwen3-4B-Instruct | base | 0.969 | 0.884 | 3.02 |
| Qwen3-4B-Instruct | dp3 | 0.505 | 0.515 | 0.00 |
| Qwen3-4B-Instruct | dp8 | 0.515 | 0.516 | 0.00 |

**Key finding:** DP training reduces canary audit AUC to near-random (0.5), with empirical ε dropping to 0 in most cases — confirming that the formal privacy guarantees hold in practice.
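
The AUC columns are the standard Mann-Whitney statistic over per-canary attack scores (loss-based or embedding-based): the probability that a randomly chosen member canary scores higher than a randomly chosen non-member. A self-contained sketch with toy scores (not the audit's real outputs):

```python
def auc(member_scores, nonmember_scores):
    """Mann-Whitney AUC: P(random member score > random non-member score),
    counting ties as half. 0.5 means the attack is no better than chance."""
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total

# Toy scores: a leaky model separates members cleanly; a DP model does not.
print(auc([0.9, 0.8, 0.7], [0.2, 0.1, 0.3]))   # 1.0
print(auc([0.5, 0.4, 0.6], [0.5, 0.6, 0.4]))   # 0.5
```

The empirical ε column is then a lower bound on the privacy loss derived from the attack's true/false positive rates at the stated significance level (p=0.01).
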

## Repository Structure

```
├── granite-4.0-h-tiny/
│   ├── base/    # No-DP baseline
│   ├── dp3/     # DP ε=3
│   └── dp8/     # DP ε=8
├── deepseek-coder-6.7b/
│   ├── base/
│   ├── dp3/
│   └── dp8/
└── qwen3-4b-instruct/
    ├── base/
    ├── dp3/
    └── dp8/
```

Each variant directory contains:
- `adapter/` — LoRA adapter weights (PEFT-compatible)
- `tokenizer/` — Tokenizer with any added audit tokens
- `resolved_config.yaml` — Full training configuration
- `summary.json` — Training and audit metrics
- `audit_results.json`, `audit_scores.npz` — Privacy audit artifacts
- `metrics.jsonl`, `scalars.csv` — Training logs
- `tensorboard/` — TensorBoard events
- `codecarbon.csv` — Carbon emissions tracking
- `epochs/` — Per-epoch checkpoints and audit results

## Limitations

- These are **LoRA adapters**, not standalone models. They require the corresponding base model for inference.
- The tokenizer includes additional tokens added during the privacy audit (canary tokens). These do not affect normal generation.
- Evaluation results are on the CodeDP-FC benchmark; performance may vary on other code generation tasks.
- DP training with tight privacy budgets (ε=3) incurs a utility cost, particularly visible in validation loss.

## Related Resources

- **Training dataset:** [melihcatal/codedp-cpt](https://huggingface.co/datasets/melihcatal/codedp-cpt)
- **MIA benchmark:** [melihcatal/codedp-bench-mia-cpt](https://huggingface.co/datasets/melihcatal/codedp-bench-mia-cpt)