mdonigian committed
Commit 4c8d68e · verified · 1 Parent(s): 78cbdf8

Upload folder using huggingface_hub

Files changed (1): README.md (+2, -40)
README.md CHANGED
@@ -25,7 +25,7 @@ Multi-task code classification model for filtering large-scale code datasets. Bu
 
  - **Base model:** microsoft/unixcoder-base (125M params)
  - **Architecture:** Shared encoder + three task-specific linear heads
- - **Training data:** 191,776 code samples from bigcode/starcoderdata, labeled by GPT-5-nano (Batch API)
  - **Languages:** Python, JavaScript, TypeScript, Java, Go, Rust, SQL, Shell (25K per language)
  - **Training:** 3 epochs, batch size 16, lr 2e-5, AMP (bf16), torch.compile
 
@@ -111,44 +111,6 @@ For this filtering use case, what matters is **rank ordering**, not exact classi
 
  The primary bottleneck is **training data volume and class balance**, not model capacity:
 
- 1. **Scale up the GPT-5-nano labeling set.** The current model was trained on 192K labeled samples (~$20 via Batch API). Doubling to 400K samples (~$40) would particularly help quality levels 2 and 5, where the model struggles most. Level 5 (excellent code) had only 2,345 training examples — far too few for the model to learn the pattern.
-
- 2. **Oversample rare classes.** Content types like tutorial (2,197 samples), data (1,413), and generated (2,975) are underrepresented. A targeted labeling run that specifically seeks out these types — e.g., filtering by filename patterns like `*_test.*`, `*.generated.*`, `tutorial*` — would improve recall on rare types without relabeling the entire dataset.
 
  3. **Increase max token length.** The current model uses 512 tokens, but code files often need more context to assess quality. Increasing to 1024 or 2048 tokens (UniXcoder supports up to 1024) would give the model more signal, particularly for quality assessment where style and documentation patterns emerge over longer spans.
-
- 4. **Add a second training round with hard examples.** After running inference on the full StarCoderData, sample files where the model is least confident (prediction near the decision boundary, e.g., quality between 2.5 and 3.5) and send those to GPT-5-nano for labeling. Training on these hard cases would sharpen the model's performance exactly where filtering decisions are made.
-
- ## Usage
-
- ```python
- from train_starcoderdata import CodeClassifierModel, load_model
- from transformers import AutoTokenizer
- import torch
-
- model_dir = "models/starcoderdata-classifier"
- tokenizer = AutoTokenizer.from_pretrained(model_dir)
- model = load_model(model_dir)
- model.eval()
-
- code = "def hello(): print('world')"
- inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
- with torch.no_grad():
-     quality, structured_data, content_type_logits = model(
-         inputs["input_ids"], inputs["attention_mask"]
-     )
-
- print(f"Quality: {quality.item():.1f}")
- print(f"Structured Data: {structured_data.item():.1f}")
- print(f"Content Type: {['library','application','test','config','tutorial','data','generated','script','other'][content_type_logits.argmax()]}")
- ```
-
- ## Files
-
- - `config.json`, `model.safetensors` — UniXcoder encoder weights (HuggingFace format)
- - `classifier_heads.pt` — Quality, structured data, and content type head weights
- - `tokenizer.json`, `tokenizer_config.json` — Tokenizer
- - `label_config.json` — Label definitions and task types
- - `test_metrics.json` — Full test set metrics
- - `training_history.csv` — Per-epoch training/validation metrics
- - `checkpoint.pt` — Full training checkpoint (for resume)
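The rare-class oversampling idea in item 2 of the removed section can be sketched with stdlib `fnmatch`. The pattern-to-type mapping and the helper name below are illustrative, not part of the repo; only the example patterns (`*_test.*`, `*.generated.*`, `tutorial*`) come from the text above.

```python
import fnmatch

# Illustrative mapping from underrepresented content types to filename
# patterns; the patterns are the examples given in the README, the rest
# is assumption.
RARE_TYPE_PATTERNS = {
    "test": ["*_test.*", "test_*.*"],
    "generated": ["*.generated.*"],
    "tutorial": ["tutorial*"],
}

def match_rare_type(filename):
    """Return the first rare content type whose pattern matches, else None."""
    for content_type, patterns in RARE_TYPE_PATTERNS.items():
        if any(fnmatch.fnmatch(filename, p) for p in patterns):
            return content_type
    return None

# Example: scan candidate filenames before sending them for labeling.
candidates = ["utils_test.py", "api.generated.ts", "tutorial_intro.md", "main.go"]
hits = {f: match_rare_type(f) for f in candidates}
# 'main.go' maps to None; the other three match a rare type.
```

A pre-filter like this lets a targeted labeling run spend its budget on the rare types instead of relabeling a uniform sample.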
 
 
  - **Base model:** microsoft/unixcoder-base (125M params)
  - **Architecture:** Shared encoder + three task-specific linear heads
+ - **Training data:** 191,776 code samples from bigcode/starcoderdata, labeled by GPT-5-nano (~$80 Batch API)
  - **Languages:** Python, JavaScript, TypeScript, Java, Go, Rust, SQL, Shell (25K per language)
  - **Training:** 3 epochs, batch size 16, lr 2e-5, AMP (bf16), torch.compile
 
 
  The primary bottleneck is **training data volume and class balance**, not model capacity:
 
+ 1. **Scale up the GPT-5-nano labeling set.** The current model was trained on 192K labeled samples. Doubling to 400K samples (~$80) would particularly help quality levels 2 and 5, where the model struggles most. Level 5 (excellent code) had only 2,345 training examples — far too few for the model to learn the pattern.
 
  3. **Increase max token length.** The current model uses 512 tokens, but code files often need more context to assess quality. Increasing to 1024 or 2048 tokens (UniXcoder supports up to 1024) would give the model more signal, particularly for quality assessment where style and documentation patterns emerge over longer spans.
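The hard-example mining step described in item 4 of the removed section (select predictions near the decision boundary, then send them for relabeling) can be sketched as follows. The `(file_id, quality)` tuples, the helper name, and the exact band are illustrative assumptions; only the 2.5-3.5 boundary range comes from the text above.

```python
# Hedged sketch of boundary-based hard-example selection. The quality
# band (2.5, 3.5) is the example range from the README; everything else
# here is an assumed interface, not the repo's actual API.
LOW, HIGH = 2.5, 3.5

def select_hard_examples(predictions, low=LOW, high=HIGH):
    """predictions: iterable of (file_id, predicted_quality) pairs.

    Returns the file_ids whose predicted quality falls strictly inside
    the band where filtering decisions are least certain.
    """
    return [fid for fid, quality in predictions if low < quality < high]

# Example: only the two mid-band predictions are kept for relabeling.
preds = [("a.py", 1.2), ("b.py", 2.9), ("c.py", 3.4), ("d.py", 4.8)]
hard = select_hard_examples(preds)  # -> ["b.py", "c.py"]
```

Labeling only this subset concentrates the second training round on exactly the region where the quality-threshold filter flips its decision.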