# GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem
# CONTENT REFERENCE: REWRITE IN YOUR OWN WORDS BEFORE SUBMITTING
---
## 1. INTRODUCTION
### What this project is about
DeepChem's HuggingFaceModel wrapper currently supports encoder-only models
(ChemBERTa, MoLFormer) through masked language modeling. There is no support
for decoder-only causal language models. This project adds OLMo-2
(Allen AI's open language model) to DeepChem, enabling:
- Continued pretraining on molecular data (SMILES)
- Fine-tuning for classification and regression
- Autoregressive molecular generation
### Why it matters
Encoder models (ChemBERTa) can classify and predict properties, but they
CANNOT generate molecules. A causal LM like OLMo opens up:
- De novo molecular generation (drug discovery)
- Text-molecule bridging (OLMo understands English AND can learn SMILES)
- In-context few-shot learning without fine-tuning
- Transfer learning from scientific literature
### Why OLMo specifically
- Fully open (weights, data, training code), unlike LLaMA/GPT
- OLMo-2 is natively supported in HuggingFace transformers (no custom code)
- 1B and 7B variants available for different compute budgets
- Pretraining data includes scientific papers (Dolma for the original OLMo; OLMo-2's pretraining mix also draws on academic-paper sources)
### What I've already done (reference your PR and experiments)
- Found and fixed a transformers 5.x compatibility bug in Chemberta (PR #4913)
- Filed issue #4912 documenting broader transformers 5.x compat gap
- Built a working OLMo wrapper prototype (locally) with:
  - Olmo2ForSequenceClassification (doesn't exist in transformers)
  - Causal LM pretraining on SMILES
  - All 8 unit tests passing
- Ran experiments on real MoleculeNet data:
  - BBBP classification: ROC-AUC 0.67 (random init, tiny model)
  - ESOL regression: R^2 = 0.37
  - SMILES generation: 0% validity (expected from a randomly initialized model; shows that pretraining is essential)
---
## 2. RELEVANT EXPERIENCE & INTEREST
### Technical background
- Parameter Golf (OpenAI competition, March 2026): Trained language models
from scratch under 16MB constraint. Custom SentencePiece tokenizers,
GPTQ-lite quantization, flash attention, architecture design (11L 512d
transformer). This is directly relevant: I understand transformer
training at a low level.
- GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Fine-tuning and
distillation of large language models.
- wingman-AI (29 GitHub stars): Production AI assistant system.
- Open source contributions: PRs to HuggingFace transformers, Unsloth,
Anthropic SDK, OpenAI SDK, Karpathy's nanochat.
### Why I want to work on this
[WRITE THIS YOURSELF: what genuinely interests you about molecular ML?
Why DeepChem? Be specific and honest. Don't say "I'm passionate about
open source" β say what specific thing drew you to this project.]
### Links
- GitHub: https://github.com/vivekvar-dl
- PR #4913: https://github.com/deepchem/deepchem/pull/4913
- Issue #4912: https://github.com/deepchem/deepchem/issues/4912
---
## 3. WORK PLAN
### 3.1 Design
The implementation has four components:
**Component A: Base class changes to HuggingFaceModel**
- Add `causal_lm` task support (DataCollatorForLanguageModeling with mlm=False)
- Add `AutoModelForCausalLM` branch in load_from_pretrained()
- Add `generate()` method for autoregressive text generation
- Add causal LM batch preparation in _prepare_batch()
Note: PR #4907 by another contributor adds a similar generate() method.
My work is complementary: I'm adding a full model wrapper, not just
generation plumbing.
**Component B: Olmo2ForSequenceClassification**
This class does not exist in HuggingFace transformers at the time of writing.
OLMo-2 ships only Olmo2ForCausalLM, with no classification head, so I built one:
- Extends Olmo2PreTrainedModel
- Uses last-token pooling (last non-padded token's hidden state)
- Linear projection head for classification/regression
- Supports single-label, multi-label, and regression via problem_type config
- Computes CrossEntropyLoss / BCEWithLogitsLoss / MSELoss based on task
This follows the same pattern as LlamaForSequenceClassification.
**Component C: OLMo wrapper class**
```
OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    - task: causal_lm | regression | classification | mtr
    - Loads tokenizer from HuggingFace Hub
    - Sets pad_token = eos_token (decoder models don't have pad by default)
    - Syncs vocab_size between config and tokenizer
    - Creates appropriate model class based on task
  _prepare_batch(batch)
    - causal_lm: labels = input_ids (model shifts internally)
    - regression/classification: labels from dataset, proper dtype casting
    - Multi-task classification: float labels for BCEWithLogitsLoss
```
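The task handling in the outline above could be sketched roughly as follows. This is a framework-free illustration, not the prototype's code: `SUPPORTED_TASKS` and `resolve_task` are hypothetical names, the class names are plain string labels, and the choice of `num_labels=2` for single-task classification is an assumption.

```python
# Hypothetical task registry. The values are string labels for illustration,
# not imports from transformers or DeepChem.
SUPPORTED_TASKS = {
    "causal_lm": "Olmo2ForCausalLM",
    "regression": "Olmo2ForSequenceClassification",
    "classification": "Olmo2ForSequenceClassification",
    "mtr": "Olmo2ForSequenceClassification",
}


def resolve_task(task, n_tasks=1):
    """Map a task name to a model class label and head settings."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(
            f"Invalid task {task!r}; expected one of {sorted(SUPPORTED_TASKS)}"
        )
    settings = {"model_class": SUPPORTED_TASKS[task]}
    if task in ("regression", "mtr"):
        # One regression output per task column.
        settings.update(problem_type="regression", num_labels=n_tasks)
    elif task == "classification":
        if n_tasks == 1:
            # Assumption: a single label column is treated as binary.
            settings.update(problem_type="single_label_classification", num_labels=2)
        else:
            settings.update(problem_type="multi_label_classification", num_labels=n_tasks)
    return settings
```

Rejecting unknown task names up front is the behavior that `test_olmo_invalid_task` checks for.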
**Component D: Tokenization strategy**
Phase 1 (GSoC): Use OLMo's pretrained tokenizer as-is on SMILES.
- OLMo's 100K BPE vocab tokenizes SMILES slightly more compactly
than ChemBERTa's 600-token vocab (0.9x token ratio in my analysis)
- BUT it fragments chemical semantics: `[C@@H]` -> `[C`, `@@`, `H`, `]` (4 tokens)
- ChemBERTa learns chemistry-aware merges: `(=O)`, `ccccc`, `COc`
Phase 2 (stretch): Extend tokenizer with SMILES-specific tokens.
- Add special tokens for stereochemistry: [C@@H], [C@H], [nH]
- Add aromatic ring tokens: c1ccccc1
- Retrain BPE on mixed English + SMILES corpus
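One way to find candidate tokens for Phase 2 is to split SMILES at the atom level and collect the bracket atoms that occur in a corpus. The sketch below uses an atom-level regular expression of the kind commonly used in SMILES-based sequence models. `tokenize_smiles` and `bracket_atoms` are hypothetical helper names, and nothing here is taken from the prototype.

```python
import re
from typing import Iterable, List

# Atom-level SMILES pattern. The bracket-atom alternative comes first, so
# tokens such as [C@@H] or [nH] are matched whole instead of being split.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def bracket_atoms(corpus: Iterable[str]) -> List[str]:
    """Collect the distinct bracket atoms seen in a corpus, sorted."""
    return sorted(
        {tok for smi in corpus for tok in tokenize_smiles(smi) if tok.startswith("[")}
    )


print(tokenize_smiles("C[C@@H](N)C(=O)O"))
print(bracket_atoms(["C[C@@H](N)C(=O)O", "c1cc[nH]c1"]))
```

The resulting list could then be passed to the tokenizer's `add_tokens` method, followed by `resize_token_embeddings` on the model. Whether that improves downstream results is an open question for the stretch goal.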
### 3.2 Pseudocode
Olmo2ForSequenceClassification.forward():
```
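hidden_states = self.model(input_ids, attention_mask=attention_mask).last_hidden_state
# Pool: index of the last non-padded token in each row. Use attention_mask
# rather than input_ids != pad_token_id, because pad_token == eos_token here.
# (Assumes right padding.)
seq_lengths = attention_mask.sum(-1) - 1
batch_range = torch.arange(input_ids.shape[0], device=input_ids.device)
pooled = hidden_states[batch_range, seq_lengths]
logits = self.score(pooled)  # Linear(hidden_size, num_labels)
loss = None
if labels is not None:
    if problem_type == "regression":
        loss = MSELoss()(logits, labels)
    elif problem_type == "single_label_classification":
        loss = CrossEntropyLoss()(logits, labels)
    elif problem_type == "multi_label_classification":
        loss = BCEWithLogitsLoss()(logits, labels.float())
return {"loss": loss, "logits": logits}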
```
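The pooling index is the part most likely to go wrong, so here is a small sketch of just that logic. It uses plain Python lists instead of tensors so the indices can be checked by hand. The function names are hypothetical. Because the wrapper sets `pad_token = eos_token`, the sketch locates the last real token from the attention mask rather than by comparing token ids against the pad id.

```python
from typing import List, Sequence


def last_token_indices(attention_mask: Sequence[Sequence[int]]) -> List[int]:
    """Return the index of the last attended position in each row.

    Scans for the rightmost 1, so it works for right- and left-padded rows.
    """
    indices = []
    for row in attention_mask:
        row = list(row)
        if 1 not in row:
            raise ValueError("Row contains no real tokens")
        indices.append(len(row) - 1 - row[::-1].index(1))
    return indices


def pool_last_token(hidden_states, attention_mask):
    """Pick one hidden vector per sequence: hidden_states[b][last_index]."""
    picks = last_token_indices(attention_mask)
    return [hidden_states[b][i] for b, i in enumerate(picks)]
```

In a tensor implementation the same selection is a single advanced-indexing step over the batch dimension.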
OLMo._prepare_batch() for causal_lm:
```
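tokens = tokenizer(smiles_list, padding=True, return_tensors="pt")
input_ids = tokens.input_ids.to(device)
attention_mask = tokens.attention_mask.to(device)
labels = input_ids.clone()  # next-token prediction; the model shifts internally
labels[attention_mask == 0] = -100  # ignore padded positions in the loss
return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}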
```
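Padded positions should not contribute to the causal LM loss. A minimal list-based sketch of the label construction follows, with `build_causal_lm_labels` as a hypothetical name. It masks by attention mask rather than by token id, since the pad id and the EOS id are the same here and a real EOS token should still be learned.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_causal_lm_labels(input_ids, attention_mask):
    """Copy input_ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Token id 2 plays the role of both EOS and pad in this toy batch.
ids = [[5, 6, 7], [8, 2, 2]]
mask = [[1, 1, 1], [1, 1, 0]]
print(build_causal_lm_labels(ids, mask))
```

In the second row the first `2` is a real end-of-sequence token and stays in the labels, while the trailing `2` is padding and is ignored.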
HuggingFaceModel.generate():
```
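# Decoder-only models should be left-padded for batched generation
# (assumes tokenizer.padding_side = "left" was set).
tokens = tokenizer(inputs, padding=True, return_tensors="pt").to(device)
output_ids = model.generate(**tokens, max_new_tokens=N, **kwargs)
return tokenizer.batch_decode(output_ids, skip_special_tokens=True)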
```
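Batched generation with a decoder-only model generally needs left padding, so that every prompt ends at the same position and new tokens are appended directly after real tokens. The list-based sketch below shows the padding and the removal of the prompt from the returned ids. `left_pad` and `strip_prompt` are hypothetical helpers, not part of any library.

```python
def left_pad(batch, pad_id):
    """Left-pad token id lists to a common width; return (input_ids, attention_mask)."""
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in batch]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask


def strip_prompt(output_ids, prompt_width):
    """Drop the (padded) prompt so only newly generated ids remain."""
    return [row[prompt_width:] for row in output_ids]
```

With left padding the prompt occupies the same number of positions in every row, which is why a single `prompt_width` is enough to separate the prompt from the continuation.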
### 3.3 Testing Plan
8 unit tests (all passing in my prototype):
| Test | What it validates |
|------|-------------------|
| test_olmo_causal_lm_pretraining | Causal LM trains, loss > 0 |
| test_olmo_regression_finetuning | Regression trains, predictions match shape, MAE computable |
| test_olmo_classification | Classification on binary labels, loss > 0 |
| test_olmo_multitask_regression | MTR with 2 tasks, predictions shape matches |
| test_olmo_save_and_restore | Checkpoint save/load, weights match exactly |
| test_olmo_load_from_pretrained | Pretrain causal LM -> load into regression model |
| test_olmo_generate | Single and batch generation returns strings |
| test_olmo_invalid_task | ValueError on bad task name |
All tests use a tiny config (64 hidden, 2 layers, 2 heads), so no model
download is needed and the suite runs in ~27 seconds on CPU.
Integration tests (to add during GSoC):
- MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
- SMILES generation validity (RDKit validation)
- Continued pretraining convergence on ZINC/PubChem subsets
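The generation validity check could look roughly like the sketch below. The metric takes the validity predicate as an argument so it can be unit-tested without RDKit. `rdkit_is_valid` and `validity_rate` are hypothetical names, and treating an empty string as invalid is an assumption made in this sketch.

```python
from typing import Callable, Iterable


def rdkit_is_valid(smiles: str) -> bool:
    """Return True if RDKit can parse the string.

    RDKit is imported lazily so the metric below has no hard dependency on it.
    """
    from rdkit import Chem

    return Chem.MolFromSmiles(smiles) is not None


def validity_rate(samples: Iterable[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of generated strings accepted by is_valid."""
    samples = list(samples)
    if not samples:
        return 0.0
    # Empty generations are counted as invalid without calling the predicate.
    n_valid = sum(1 for s in samples if s and is_valid(s))
    return n_valid / len(samples)
```

In the integration test the call would be `validity_rate(generated, rdkit_is_valid)`. A stub predicate is enough for checking the counting logic in CI.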
### 3.4 Sources of Risk
| Risk | Likelihood | Mitigation |
|------|-----------|------------|
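| OLMo-7B needs ~14GB of VRAM for the weights alone at 16-bit precision (7B params x 2 bytes), and more once activations and the KV cache are included | High | Use OLMo-1B for CI/demos. Test with tiny configs. Document GPU requirements. |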
| SMILES generation validity low without extensive pretraining | High | This IS the core problem. Budget 3 weeks for pretraining experiments. Use ZINC-250K as training corpus. Target >50% validity. |
| Olmo2ForSequenceClassification not upstream | Medium | Our implementation follows HF patterns exactly. If HF adds it later, we swap to theirs. |
| Tokenizer fragments chemical semantics | Medium | Phase 1: works as-is (my experiments show learning happens). Phase 2: extend vocabulary. |
| transformers version compatibility | Low | Already found and fixed one issue (PR #4913). Use top-level imports throughout. |
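The ~14GB figure in the first row follows from simple arithmetic over parameter count and bytes per parameter. The sketch below reproduces it, using a nominal 7e9 parameters (an assumption; the real checkpoint's count differs slightly) and counting weights only.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes).

    Excludes activations, KV cache, gradients, and optimizer state.
    """
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "bf16", "int8"):
    print(dtype, weight_memory_gb(7e9, dtype), "GB")
```

Training needs considerably more than inference, because gradients and optimizer state are stored on top of the weights.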
### 3.5 Milestones & Timeline
Assuming Medium size (175 hours, ~12 weeks):
**Milestone 1: Core wrapper (Weeks 1-3)**
- PR: Base class changes to HuggingFaceModel (causal_lm task, generate())
- Coordinate with PR #4907 to avoid duplication
- PR: Olmo2ForSequenceClassification
- PR: OLMo wrapper class with all task modes
- PR: Unit tests (8 tests)
- Deliverable: `from deepchem.models import OLMo` works for all tasks
**Milestone 2: Continued pretraining (Weeks 4-6)**
- PR: Pretraining pipeline on molecular data (ZINC-250K)
- PR: Data loading utilities for SMILES corpora
- PR: Pretraining tutorial notebook
- Deliverable: Pretrained OLMo checkpoint on molecular data
**Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)**
- PR: Classification tutorial (BBBP, Tox21)
- PR: Regression tutorial (ESOL, FreeSolv, Lipophilicity)
- PR: Benchmark results table vs ChemBERTa
- Deliverable: Published benchmark comparing OLMo vs ChemBERTa on MoleculeNet
**Milestone 4: Generation & polish (Weeks 10-12)**
- PR: SMILES generation tutorial with RDKit validity checking
- PR: Documentation (numpydoc, API reference, user guide)
- PR: Tokenizer extension experiments (stretch goal)
- Deliverable: Complete documentation and tutorials
Each milestone = 1 evaluation checkpoint. PRs are <50 lines where possible,
following DeepChem's contribution guidelines.
### 3.6 Pull Request Plan
I will follow DeepChem's guidelines: small PRs (<50 lines for initial ones),
with tests and numpydoc documentation. Expected ~8-12 PRs total:
1. HuggingFaceModel causal_lm support (~40 lines)
2. generate() method (~50 lines)
3. Olmo2ForSequenceClassification (~100 lines; larger, will discuss with mentor)
4. OLMo wrapper class (~80 lines)
5. Unit tests (~180 lines)
6. Pretraining pipeline
7. Data utilities
8. Tutorial notebooks (3-4 notebooks)
9. Documentation updates
10. Benchmark scripts
---
## 4. COMMUNITY ENGAGEMENT
- Already contributing: PR #4913 (bug fix), Issue #4912 (compat report)
- Will attend office hours MWF 9am PST
- Will join Discord for async discussion
- Will write weekly progress updates
- Happy to review other contributors' HuggingFace-related PRs
---
## 5. RESOURCES REQUIRED
- GPU: I have access to 1x H100 NVL 96GB (Azure) for development
- For CI: tiny model configs, no GPU needed
- For pretraining experiments: my H100 is sufficient for OLMo-1B
- OLMo-7B experiments: may need multi-GPU setup (discuss with mentor)
---
## 6. BIBLIOGRAPHY
1. Groeneveld et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838
2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885
3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence.
4. Weininger (1988). "SMILES, a chemical language and information system." J. Chem. Inf. Comput. Sci.
5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science.
---
## KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal)
- Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M param model, 200 samples, 3 epochs)
- ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions)
- SMILES generation: 0% validity from random init (proves pretraining is the core challenge)
- Test suite: 8/8 tests pass in 27 seconds on CPU
- Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa
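When quoting the ESOL numbers it helps to state which definitions were used. The sketch below assumes R^2 means the coefficient of determination and MAE the mean absolute error. That is an assumption about the experiments, not something the notes above confirm. DeepChem also provides `pearson_r2_score`, which is the squared Pearson correlation and a different quantity.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the targets scores 0 under this definition, so 0.37 would mean the tiny model explains some, but not most, of the variance.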