# GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem
# CONTENT REFERENCE: REWRITE IN YOUR OWN WORDS BEFORE SUBMITTING

---

## 1. INTRODUCTION

### What this project is about
DeepChem's HuggingFaceModel wrapper currently supports encoder-only models
(ChemBERTa, MoLFormer) through masked language modeling. There is no support
for decoder-only causal language models. This project adds OLMo-2
(Allen AI's open language model) to DeepChem, enabling:
- Continued pretraining on molecular data (SMILES)
- Fine-tuning for classification and regression
- Autoregressive molecular generation

### Why it matters
Encoder models (ChemBERTa) can classify and predict properties, but they
CANNOT generate molecules. A causal LM like OLMo opens up:
- De novo molecular generation (drug discovery)
- Text-molecule bridging (OLMo understands English AND can learn SMILES)
- In-context few-shot learning without fine-tuning
- Transfer learning from scientific literature

### Why OLMo specifically
- Fully open (weights, data, training code), unlike LLaMA/GPT
- OLMo-2 is natively supported in HuggingFace transformers (no custom code)
- 1B and 7B variants available for different compute budgets
- Trained on Dolma corpus which includes scientific papers

### What I've already done (reference your PR and experiments)
- Found and fixed a transformers 5.x compatibility bug in ChemBERTa (PR #4913)
- Filed issue #4912 documenting broader transformers 5.x compat gap
- Built a working OLMo wrapper prototype (locally) with:
  - Olmo2ForSequenceClassification (doesn't exist in transformers)
  - Causal LM pretraining on SMILES
  - All 8 unit tests passing
- Ran experiments on real MoleculeNet data:
  - BBBP classification: ROC-AUC 0.67 (random init, tiny model)
  - ESOL regression: R^2 = 0.37
  - SMILES generation: 0% validity (expected; shows pretraining is essential)

---

## 2. RELEVANT EXPERIENCE & INTEREST

### Technical background
- Parameter Golf (OpenAI competition, March 2026): Trained language models
  from scratch under 16MB constraint. Custom SentencePiece tokenizers,
  GPTQ-lite quantization, flash attention, architecture design (11L 512d
  transformer). This is directly relevant: I understand transformer
  training at a low level.
- GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Fine-tuning and
  distillation of large language models.
- wingman-AI (29 GitHub stars): Production AI assistant system.
- Open source contributions: PRs to HuggingFace transformers, Unsloth,
  Anthropic SDK, OpenAI SDK, Karpathy's nanochat.

### Why I want to work on this
[WRITE THIS YOURSELF: what genuinely interests you about molecular ML?
Why DeepChem? Be specific and honest. Don't say "I'm passionate about
open source"; say what specific thing drew you to this project.]

### Links
- GitHub: https://github.com/vivekvar-dl
- PR #4913: https://github.com/deepchem/deepchem/pull/4913
- Issue #4912: https://github.com/deepchem/deepchem/issues/4912

---

## 3. WORK PLAN

### 3.1 Design

The implementation has four components:

**Component A: Base class changes to HuggingFaceModel**
- Add `causal_lm` task support (DataCollatorForLanguageModeling with mlm=False)
- Add `AutoModelForCausalLM` branch in load_from_pretrained()
- Add `generate()` method for autoregressive text generation
- Add causal LM batch preparation in _prepare_batch()

Note: PR #4907 by another contributor adds a similar generate() method.
My work is complementary: I'm adding a full model wrapper, not just
generation plumbing.

**Component B: Olmo2ForSequenceClassification**
This class DOES NOT EXIST in HuggingFace transformers. OLMo-2 only ships
Olmo2ForCausalLM, with no classification head. I built one:
- Extends Olmo2PreTrainedModel
- Uses last-token pooling (last non-padded token's hidden state)
- Linear projection head for classification/regression
- Supports single-label, multi-label, and regression via problem_type config
- Computes CrossEntropyLoss / BCEWithLogitsLoss / MSELoss based on task

This follows the same pattern as LlamaForSequenceClassification.
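
The task dispatch listed above can be sketched without any framework. This is a minimal sketch that mirrors the `problem_type` strings used by Hugging Face sequence-classification heads (`regression`, `single_label_classification`, `multi_label_classification`). The names `infer_problem_type` and `LOSS_FOR_PROBLEM_TYPE` are hypothetical and are not part of transformers or DeepChem.

```python
from typing import Optional

# Loss used for each problem type, as listed in Component B.
LOSS_FOR_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}


def infer_problem_type(num_labels: int,
                       labels_are_integer: bool,
                       configured: Optional[str] = None) -> str:
    """Return the problem type, inferring it when the config leaves it unset."""
    if configured is not None:
        if configured not in LOSS_FOR_PROBLEM_TYPE:
            raise ValueError(f"Unknown problem_type: {configured!r}")
        return configured
    if num_labels == 1:
        return "regression"
    if labels_are_integer:
        # Integer class ids with several output units: one class per sample.
        return "single_label_classification"
    # Float targets with several output units: independent binary labels.
    return "multi_label_classification"


for n_labels, is_int in [(1, False), (2, True), (12, False)]:
    kind = infer_problem_type(n_labels, is_int)
    print(n_labels, kind, LOSS_FOR_PROBLEM_TYPE[kind])
```

One consequence of this inference rule: multi-task regression (`mtr`) has several float targets, so it would be inferred as multi-label classification. A wrapper following this sketch would need to set `problem_type` to `regression` explicitly for that task.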

**Component C: OLMo wrapper class**
```
OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    - task: causal_lm | regression | classification | mtr
    - Loads tokenizer from HuggingFace Hub
    - Sets pad_token = eos_token (decoder models don't have pad by default)
    - Syncs vocab_size between config and tokenizer
    - Creates appropriate model class based on task

  _prepare_batch(batch)
    - causal_lm: labels = input_ids (model shifts internally)
    - regression/classification: labels from dataset, proper dtype casting
    - Multi-task classification: float labels for BCEWithLogitsLoss
```

**Component D: Tokenization strategy**
Phase 1 (GSoC): Use OLMo's pretrained tokenizer as-is on SMILES.
  - OLMo's 100K BPE vocab actually tokenizes SMILES more efficiently
    than ChemBERTa's 600-token vocab (0.9x token ratio in my analysis)
  - BUT it fragments chemical semantics: `[C@@H]` -> `[C`, `@@`, `H`, `]` (4 tokens)
  - ChemBERTa learns chemistry-aware merges: (=O), ccccc, COc

Phase 2 (stretch): Extend tokenizer with SMILES-specific tokens.
  - Add special tokens for stereochemistry: [C@@H], [C@H], [nH]
  - Add aromatic ring tokens: c1ccccc1
  - Retrain BPE on mixed English + SMILES corpus
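
As a point of comparison for Phase 2, the sketch below splits SMILES with a regular expression of the kind commonly used for atom-level SMILES tokenization, so bracket atoms such as `[C@@H]` stay whole. It is not OLMo's or ChemBERTa's tokenizer. The pattern and the helpers `tokenize_smiles` and `bracket_atom_counts` are assumptions for illustration, showing one way to enumerate candidate tokens to add to the vocabulary.

```python
import re
from collections import Counter
from typing import Iterable, List

# Bracket atoms first, then two-letter halogens, then single-character tokens.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # findall silently skips unknown characters, so check nothing was dropped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Untokenizable characters in {smiles!r}")
    return tokens


def bracket_atom_counts(corpus: Iterable[str]) -> Counter:
    """Count bracket atoms across a corpus as candidate vocabulary additions."""
    counts: Counter = Counter()
    for smiles in corpus:
        counts.update(t for t in tokenize_smiles(smiles) if t.startswith("["))
    return counts


print(tokenize_smiles("C[C@@H](N)C(=O)O"))
print(bracket_atom_counts(["C[C@@H](N)C(=O)O", "c1cc[nH]c1"]))
```

Ranking bracket atoms by frequency on the pretraining corpus is one way to decide which stereochemistry tokens are worth adding before retraining any merges.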

### 3.2 Pseudocode

Olmo2ForSequenceClassification.forward():
```
  hidden_states = self.model(input_ids, attention_mask).last_hidden_state
  # Pool: use the last non-padded token (assumes right padding).
  # Count real tokens from attention_mask rather than pad_token_id:
  # pad == eos here, so a real EOS token would be miscounted as padding.
  seq_lengths = attention_mask.sum(-1) - 1
  pooled = hidden_states[batch_range, seq_lengths]
  logits = self.score(pooled)  # Linear(hidden_size, num_labels)
  if labels is not None:
    if regression: loss = MSELoss(logits, labels)
    elif single_class: loss = CrossEntropy(logits, labels)
    elif multi_label: loss = BCEWithLogits(logits, labels)
  return {loss, logits}
```
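
The pooling step above depends on locating the last real token in each row. The sketch below shows that index logic on plain Python lists standing in for tensors, so it runs without PyTorch. `last_token_index` and `pool_last_token` are hypothetical helper names. Reading the position from the attention mask is an assumption of this sketch: it keeps working when the pad token is also the EOS token and when the tokenizer pads on the left.

```python
from typing import List, Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Return the position of the last attended (non-padding) token."""
    for position in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[position] == 1:
            return position
    raise ValueError("Row contains only padding")


def pool_last_token(hidden_states: Sequence[Sequence[List[float]]],
                    attention_masks: Sequence[Sequence[int]]) -> List[List[float]]:
    """Select one hidden vector per row: the one at the last real token."""
    return [row[last_token_index(mask)]
            for row, mask in zip(hidden_states, attention_masks)]


# Two rows, four positions, hidden size 2.
# Row 0 is right-padded, row 1 is left-padded; 9.0 marks padding positions.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [9.0, 9.0], [9.0, 9.0]],
    [[9.0, 9.0], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],
]
masks = [[1, 1, 0, 0], [0, 1, 1, 1]]

pooled = pool_last_token(hidden, masks)
print(pooled)
```

In a tensor implementation the same selection would be a single indexed gather per batch; the loop here only makes the indexing rule explicit.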

OLMo._prepare_batch() for causal_lm:
```
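  tokens = tokenizer(smiles_list, padding=True, return_tensors="pt")
  input_ids = tokens.input_ids.to(device)
  attention_mask = tokens.attention_mask.to(device)
  labels = input_ids.clone()  # next-token prediction; model shifts internally
  labels[attention_mask == 0] = -100  # ignore padding positions in the loss
  return {input_ids, attention_mask, labels}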
```
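
To make the label convention concrete, the following sketch builds causal-LM labels on plain Python lists, with no PyTorch needed. It assumes the Hugging Face convention that the model shifts labels internally, and that label value -100 is skipped by the loss (the default `ignore_index` of PyTorch's cross-entropy loss). The helper names and token ids are made up for illustration.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # label value the loss skips


def make_causal_lm_labels(input_ids: Sequence[int],
                          attention_mask: Sequence[int]) -> List[int]:
    """Copy input ids as labels, masking out padding positions."""
    return [token if keep == 1 else IGNORE_INDEX
            for token, keep in zip(input_ids, attention_mask)]


def shifted_pairs(input_ids: Sequence[int],
                  labels: Sequence[int]) -> List[Tuple[int, int]]:
    """List the (current token, target token) pairs that contribute to the loss.

    Position i predicts the label at position i + 1, which is the shift the
    model applies internally.
    """
    return [(current, target)
            for current, target in zip(input_ids[:-1], labels[1:])
            if target != IGNORE_INDEX]


PAD = 0  # hypothetical pad id
input_ids = [11, 12, 13, PAD, PAD]
attention_mask = [1, 1, 1, 0, 0]

labels = make_causal_lm_labels(input_ids, attention_mask)
pairs = shifted_pairs(input_ids, labels)
print(labels)
print(pairs)
```

Masking by attention mask rather than by pad id matters when `pad_token = eos_token`: masking by id would also hide any genuine end-of-sequence token from the loss.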

HuggingFaceModel.generate():
```
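  tokenizer.padding_side = "left"  # decoder-only models need left padding for batched generation
  tokens = tokenizer(inputs, padding=True, return_tensors="pt")
  output_ids = model.generate(**tokens, max_new_tokens=N, **kwargs)
  return tokenizer.batch_decode(output_ids, skip_special_tokens=True)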
```

### 3.3 Testing Plan

8 unit tests (all passing in my prototype):

| Test | What it validates |
|------|-------------------|
| test_olmo_causal_lm_pretraining | Causal LM trains, loss > 0 |
| test_olmo_regression_finetuning | Regression trains, predictions match shape, MAE computable |
| test_olmo_classification | Classification on binary labels, loss > 0 |
| test_olmo_multitask_regression | MTR with 2 tasks, predictions shape matches |
| test_olmo_save_and_restore | Checkpoint save/load, weights match exactly |
| test_olmo_load_from_pretrained | Pretrain causal LM -> load into regression model |
| test_olmo_generate | Single and batch generation returns strings |
| test_olmo_invalid_task | ValueError on bad task name |

All tests use a tiny config (64 hidden, 2 layers, 2 heads), so no model
download is needed and the suite runs in ~27 seconds on CPU.

Integration tests (to add during GSoC):
- MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
- SMILES generation validity (RDKit validation)
- Continued pretraining convergence on ZINC/PubChem subsets
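
The validity metric for generated SMILES could be computed along these lines. This is a minimal sketch, not the planned test code: `validity_rate` and `rdkit_is_valid` are hypothetical names, and the checker is injectable so the metric can be exercised without RDKit installed. The only RDKit call assumed is `Chem.MolFromSmiles`, which returns `None` when parsing fails.

```python
from typing import Callable, Iterable


def rdkit_is_valid(smiles: str) -> bool:
    """Return True if RDKit parses the string into a molecule."""
    from rdkit import Chem  # imported lazily so this module loads without RDKit

    if not smiles:
        # An empty string parses to an empty molecule, which should not count.
        return False
    return Chem.MolFromSmiles(smiles) is not None


def validity_rate(samples: Iterable[str],
                  is_valid: Callable[[str], bool] = rdkit_is_valid) -> float:
    """Fraction of generated strings accepted by the validity checker."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(1 for smiles in samples if is_valid(smiles)) / len(samples)


def balanced_parentheses(smiles: str) -> bool:
    """Stand-in checker for the demo only; this is NOT chemical validity."""
    return smiles.count("(") == smiles.count(")")


generated = ["CCO", "C(C", "CC", "C)"]
print(validity_rate(generated, is_valid=balanced_parentheses))
```

With RDKit available, calling `validity_rate(generated)` uses the real parser, and the same function gives the number to compare against the >50% validity target.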

### 3.4 Sources of Risk

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| OLMo-7B requires ~14GB VRAM for inference | High | Use OLMo-1B for CI/demos. Test with tiny configs. Document GPU requirements. |
| SMILES generation validity low without extensive pretraining | High | This IS the core problem. Budget 3 weeks for pretraining experiments. Use ZINC-250K as training corpus. Target >50% validity. |
| Olmo2ForSequenceClassification not upstream | Medium | Our implementation follows HF patterns exactly. If HF adds it later, we swap to theirs. |
| Tokenizer fragments chemical semantics | Medium | Phase 1: works as-is (my experiments show learning happens). Phase 2: extend vocabulary. |
| transformers version compatibility | Low | Already found and fixed one issue (PR #4913). Use top-level imports throughout. |
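
The ~14GB figure in the first row is weight memory alone: 7 billion parameters at 2 bytes each in half precision. The sketch below reproduces that arithmetic. It uses nominal parameter counts (1e9 and 7e9, not exact model sizes) and ignores activations and the KV cache. The 16 bytes per parameter used for training is a common rule of thumb for mixed-precision Adam, stated here as an assumption, not a measurement.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

# Rule of thumb for mixed-precision Adam: half-precision weights (2) and
# gradients (2), plus fp32 master weights (4) and two fp32 Adam moments (8).
ADAM_MIXED_PRECISION_BYTES_PER_PARAM = 2 + 2 + 4 + 8


def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def adam_training_memory_gb(n_params: float) -> float:
    """Weights, gradients, and optimizer state, excluding activations."""
    return n_params * ADAM_MIXED_PRECISION_BYTES_PER_PARAM / 1e9


for name, n_params in [("OLMo-1B", 1e9), ("OLMo-7B", 7e9)]:
    print(f"{name}: inference ~{weight_memory_gb(n_params):.0f} GB, "
          f"full fine-tuning ~{adam_training_memory_gb(n_params):.0f} GB")
```

Under these assumptions, full fine-tuning of the 7B model needs roughly 112 GB before activations, which is why the 1B variant is the practical default on a single GPU.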

### 3.5 Milestones & Timeline

Assuming Medium size (175 hours, ~12 weeks):

**Milestone 1: Core wrapper (Weeks 1-3)**
- PR: Base class changes to HuggingFaceModel (causal_lm task, generate())
  - Coordinate with PR #4907 to avoid duplication
- PR: Olmo2ForSequenceClassification
- PR: OLMo wrapper class with all task modes
- PR: Unit tests (8 tests)
- Deliverable: `from deepchem.models import OLMo` works for all tasks

**Milestone 2: Continued pretraining (Weeks 4-6)**
- PR: Pretraining pipeline on molecular data (ZINC-250K)
- PR: Data loading utilities for SMILES corpora
- PR: Pretraining tutorial notebook
- Deliverable: Pretrained OLMo checkpoint on molecular data

**Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)**
- PR: Classification tutorial (BBBP, Tox21)
- PR: Regression tutorial (ESOL, FreeSolv, Lipophilicity)
- PR: Benchmark results table vs ChemBERTa
- Deliverable: Published benchmark comparing OLMo vs ChemBERTa on MoleculeNet

**Milestone 4: Generation & polish (Weeks 10-12)**
- PR: SMILES generation tutorial with RDKit validity checking
- PR: Documentation (numpydoc, API reference, user guide)
- PR: Tokenizer extension experiments (stretch goal)
- Deliverable: Complete documentation and tutorials

Each milestone = 1 evaluation checkpoint. PRs are <50 lines where possible,
following DeepChem's contribution guidelines.

### 3.6 Pull Request Plan

I will follow DeepChem's guidelines: small PRs (<50 lines for initial ones),
with tests and numpydoc documentation. Expected ~8-12 PRs total:

1. HuggingFaceModel causal_lm support (~40 lines)
2. generate() method (~50 lines)
3. Olmo2ForSequenceClassification (~100 lines; larger, will discuss with mentor)
4. OLMo wrapper class (~80 lines)
5. Unit tests (~180 lines)
6. Pretraining pipeline
7. Data utilities
8. Tutorial notebooks (3-4 notebooks)
9. Documentation updates
10. Benchmark scripts

---

## 4. COMMUNITY ENGAGEMENT

- Already contributing: PR #4913 (bug fix), Issue #4912 (compat report)
- Will attend office hours MWF 9am PST
- Will join Discord for async discussion
- Will write weekly progress updates
- Happy to review other contributors' HuggingFace-related PRs

---

## 5. RESOURCES REQUIRED

- GPU: I have access to 1x H100 NVL 96GB (Azure) for development
- For CI: tiny model configs, no GPU needed
- For pretraining experiments: my H100 is sufficient for OLMo-1B
- OLMo-7B experiments: may need multi-GPU setup (discuss with mentor)

---

## 6. BIBLIOGRAPHY

1. Groeneveld et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838
2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885
3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence.
4. Weininger (1988). "SMILES, a chemical language and information system." J. Chem. Inf. Comput. Sci.
5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science.

---

## KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal)

- Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M param model, 200 samples, 3 epochs)
- ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions)
- SMILES generation: 0% validity from random init (proves pretraining is the core challenge)
- Test suite: 8/8 tests pass in 27 seconds on CPU
- Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa