
GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem

CONTENT REFERENCE: REWRITE IN YOUR OWN WORDS BEFORE SUBMITTING


1. INTRODUCTION

What this project is about

DeepChem's HuggingFaceModel wrapper currently supports encoder-only models (ChemBERTa, MoLFormer) through masked language modeling. There is no support for decoder-only causal language models. This project adds OLMo-2 (Allen AI's open language model) to DeepChem, enabling:

  • Continued pretraining on molecular data (SMILES)
  • Fine-tuning for classification and regression
  • Autoregressive molecular generation

Why it matters

Encoder models (ChemBERTa) can classify and predict properties, but they CANNOT generate molecules. A causal LM like OLMo opens up:

  • De novo molecular generation (drug discovery)
  • Text-molecule bridging (OLMo understands English AND can learn SMILES)
  • In-context few-shot learning without fine-tuning
  • Transfer learning from scientific literature

Why OLMo specifically

  • Fully open (weights, data, and training code), unlike LLaMA/GPT
  • OLMo-2 is natively supported in HuggingFace transformers (no custom code)
  • 1B and 7B variants available for different compute budgets
  • Trained on Dolma corpus which includes scientific papers

What I've already done (reference your PR and experiments)

  • Found and fixed a transformers 5.x compatibility bug in ChemBERTa (PR #4913)
  • Filed issue #4912 documenting broader transformers 5.x compat gap
  • Built a working OLMo wrapper prototype (locally) with:
    • Olmo2ForSequenceClassification (doesn't exist in transformers)
    • Causal LM pretraining on SMILES
    • All 8 unit tests passing
  • Ran experiments on real MoleculeNet data:
    • BBBP classification: ROC-AUC 0.67 (random init, tiny model)
    • ESOL regression: R^2 = 0.37
    • SMILES generation: 0% validity (expected: it shows pretraining is essential)

2. RELEVANT EXPERIENCE & INTEREST

Technical background

  • Parameter Golf (OpenAI competition, March 2026): Trained language models from scratch under a 16 MB constraint. Custom SentencePiece tokenizers, GPTQ-lite quantization, flash attention, architecture design (an 11-layer, 512-dim transformer). This is directly relevant: I understand transformer training at a low level.
  • GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Fine-tuning and distillation of large language models.
  • wingman-AI (29 GitHub stars): Production AI assistant system.
  • Open source contributions: PRs to HuggingFace transformers, Unsloth, Anthropic SDK, OpenAI SDK, Karpathy's nanochat.

Why I want to work on this

[WRITE THIS YOURSELF β€” what genuinely interests you about molecular ML? Why DeepChem? Be specific and honest. Don't say "I'm passionate about open source" β€” say what specific thing drew you to this project.]

Links


3. WORK PLAN

3.1 Design

The implementation has four components:

Component A: Base class changes to HuggingFaceModel

  • Add causal_lm task support (DataCollatorForLanguageModeling with mlm=False)
  • Add AutoModelForCausalLM branch in load_from_pretrained()
  • Add generate() method for autoregressive text generation
  • Add causal LM batch preparation in _prepare_batch()

Note: PR #4907 by another contributor adds a similar generate() method. My work is complementary: I'm adding a full model wrapper, not just generation plumbing.
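
For concreteness, a minimal sketch of what the causal_lm branch boils down to, using only stock transformers calls. The helper name build_causal_lm is mine for illustration, not DeepChem's actual internals:

  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForLanguageModeling)

  def build_causal_lm(model_path):
      # Hypothetical helper mirroring the planned branch in HuggingFaceModel.__init__
      tokenizer = AutoTokenizer.from_pretrained(model_path)
      tokenizer.pad_token = tokenizer.eos_token  # decoder-only models ship without a pad token
      # mlm=False switches the collator from masked-LM batching to next-token batching
      collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
      model = AutoModelForCausalLM.from_pretrained(model_path)
      return tokenizer, collator, model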

Component B: Olmo2ForSequenceClassification

This class DOES NOT EXIST in HuggingFace transformers, which ships only causal LM heads for OLMo (OlmoForCausalLM / Olmo2ForCausalLM) and no classification head. I built one:

  • Extends Olmo2PreTrainedModel
  • Uses last-token pooling (last non-padded token's hidden state)
  • Linear projection head for classification/regression
  • Supports single-label, multi-label, and regression via problem_type config
  • Computes CrossEntropyLoss / BCEWithLogitsLoss / MSELoss based on task

This follows the same pattern as LlamaForSequenceClassification.

Component C: OLMo wrapper class

OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    - task: causal_lm | regression | classification | mtr
    - Loads tokenizer from HuggingFace Hub
    - Sets pad_token = eos_token (decoder models don't have pad by default)
    - Syncs vocab_size between config and tokenizer
    - Creates appropriate model class based on task

  _prepare_batch(batch)
    - causal_lm: labels = input_ids (model shifts internally)
    - regression/classification: labels from dataset, proper dtype casting
    - Multi-task classification: float labels for BCEWithLogitsLoss
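
To ground the design, here is the intended end-user flow. The import path, keyword names, and checkpoint ID follow the outline above and remain assumptions until the PRs land:

  import deepchem as dc
  from deepchem.models import OLMo  # assumed final import path

  # DummyFeaturizer keeps raw SMILES strings, which the wrapper tokenizes itself
  tasks, (train, valid, test), _ = dc.molnet.load_bbbp(featurizer=dc.feat.DummyFeaturizer())

  model = OLMo(task="classification", model_name="allenai/OLMo-2-0425-1B", n_tasks=len(tasks))
  model.fit(train, nb_epoch=3)
  print(model.evaluate(test, [dc.metrics.Metric(dc.metrics.roc_auc_score)]))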

Component D: Tokenization strategy

Phase 1 (GSoC): Use OLMo's pretrained tokenizer as-is on SMILES.

  • OLMo's 100K BPE vocab actually tokenizes SMILES more efficiently than ChemBERTa's 600-token vocab (0.9x token ratio in my analysis)
  • BUT it fragments chemical semantics: [C@@H] -> [C, @@, H, ] (a quick probe follows this list)
  • ChemBERTa learns chemistry-aware merges: (=O), ccccc, COc
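
This fragmentation is easy to verify directly; a quick probe (the checkpoint ID is one of the public OLMo-2 releases, any would do):

  from transformers import AutoTokenizer

  # Probe how a general-purpose BPE vocab splits stereochemistry markers
  tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
  print(tok.tokenize("C[C@@H](N)C(=O)O"))  # expect [C@@H] to fragment into several pieces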

Phase 2 (stretch): Extend tokenizer with SMILES-specific tokens.

  • Add special tokens for stereochemistry: [C@@H], [C@H], [nH]
  • Add aromatic ring tokens: c1ccccc1
  • Retrain BPE merges on a mixed English + SMILES corpus (the token-registration step is sketched below)
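
The token-registration half of Phase 2 is standard transformers API (the token list here is illustrative); retraining the BPE merges is the open research part:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
  model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B")

  # Register SMILES-specific tokens so they stop fragmenting, then grow
  # the embedding matrix to cover the enlarged vocabulary
  tok.add_tokens(["[C@@H]", "[C@H]", "[nH]", "c1ccccc1"])
  model.resize_token_embeddings(len(tok))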

3.2 Pseudocode

Olmo2ForSequenceClassification.forward():

  def forward(self, input_ids, attention_mask=None, labels=None):
      # Backbone returns per-token hidden states
      hidden_states = self.model(input_ids, attention_mask=attention_mask)[0]
      # Pool: last non-padded token of each sequence
      seq_lengths = (input_ids != self.config.pad_token_id).sum(-1) - 1
      batch_range = torch.arange(input_ids.shape[0], device=input_ids.device)
      pooled = hidden_states[batch_range, seq_lengths]
      logits = self.score(pooled)  # nn.Linear(hidden_size, num_labels)
      loss = None
      if labels is not None:
          if self.config.problem_type == "regression":
              loss = nn.MSELoss()(logits.squeeze(), labels.squeeze())
          elif self.config.problem_type == "single_label_classification":
              loss = nn.CrossEntropyLoss()(logits, labels)
          else:  # multi_label_classification
              loss = nn.BCEWithLogitsLoss()(logits, labels)
      return {"loss": loss, "logits": logits}

OLMo._prepare_batch() for causal_lm:

  def _prepare_batch_causal_lm(self, smiles_list):
      tokens = self.tokenizer(smiles_list, padding=True, return_tensors="pt")
      input_ids = tokens.input_ids.to(self.device)
      attention_mask = tokens.attention_mask.to(self.device)
      labels = input_ids.clone()  # next-token prediction; the model shifts labels internally
      return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

HuggingFaceModel.generate():

  def generate(self, inputs, max_new_tokens=64, **kwargs):
      tokens = self.tokenizer(inputs, padding=True, return_tensors="pt").to(self.device)
      output_ids = self.model.generate(**tokens, max_new_tokens=max_new_tokens, **kwargs)
      return self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)

3.3 Testing Plan

8 unit tests (all passing in my prototype):

  • test_olmo_causal_lm_pretraining: causal LM trains, loss > 0
  • test_olmo_regression_finetuning: regression trains, prediction shapes match, MAE is computable
  • test_olmo_classification: classification on binary labels, loss > 0
  • test_olmo_multitask_regression: MTR with 2 tasks, prediction shape matches
  • test_olmo_save_and_restore: checkpoint save/load, weights match exactly
  • test_olmo_load_from_pretrained: pretrain a causal LM, then load it into a regression model
  • test_olmo_generate: single and batch generation return strings
  • test_olmo_invalid_task: ValueError on a bad task name

All tests use a tiny config (64 hidden units, 2 layers, 2 heads), so no model download is needed; the suite runs in ~27 seconds on CPU.
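
For reference, such a config is just a scaled-down Olmo2Config (available in recent transformers releases); field values beyond the three quoted above are my assumptions, not the exact prototype settings:

  from transformers import Olmo2Config, Olmo2ForCausalLM

  # Tiny randomly initialized model for CPU unit tests: no pretrained download required
  config = Olmo2Config(vocab_size=1024, hidden_size=64, intermediate_size=128,
                       num_hidden_layers=2, num_attention_heads=2)
  model = Olmo2ForCausalLM(config)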

Integration tests (to add during GSoC):

  • MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
  • SMILES generation validity (RDKit validation; a minimal scorer is sketched after this list)
  • Continued pretraining convergence on ZINC/PubChem subsets
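
The validity metric itself is a one-call RDKit check; a minimal scorer (the function name is mine):

  from rdkit import Chem

  def smiles_validity(samples):
      # Fraction of generated strings RDKit can parse into a molecule
      valid = sum(Chem.MolFromSmiles(s) is not None for s in samples)
      return valid / len(samples)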

3.4 Sources of Risk

  • OLMo-7B needs ~14 GB VRAM just to load for inference (7B parameters x 2 bytes in bf16). Likelihood: high. Mitigation: use OLMo-1B for CI and demos, test with tiny configs, document GPU requirements.
  • SMILES generation validity is low without extensive pretraining. Likelihood: high. This IS the core problem: budget 3 weeks for pretraining experiments, use ZINC-250K as the training corpus, target >50% validity.
  • Olmo2ForSequenceClassification is not upstream. Likelihood: medium. Mitigation: the implementation follows HF patterns exactly; if HF adds it later, we swap to theirs.
  • The tokenizer fragments chemical semantics. Likelihood: medium. Mitigation: Phase 1 works as-is (my experiments show learning happens); Phase 2 extends the vocabulary.
  • transformers version compatibility. Likelihood: low. Mitigation: already found and fixed one issue (PR #4913); use top-level imports throughout.

3.5 Milestones & Timeline

Assuming Medium size (175 hours, ~12 weeks):

Milestone 1: Core wrapper (Weeks 1-3)

  • PR: Base class changes to HuggingFaceModel (causal_lm task, generate())
    • Coordinate with PR #4907 to avoid duplication
  • PR: Olmo2ForSequenceClassification
  • PR: OLMo wrapper class with all task modes
  • PR: Unit tests (8 tests)
  • Deliverable: from deepchem.models import OLMo works for all tasks

Milestone 2: Continued pretraining (Weeks 4-6)

  • PR: Pretraining pipeline on molecular data (ZINC-250K)
  • PR: Data loading utilities for SMILES corpora
  • PR: Pretraining tutorial notebook
  • Deliverable: Pretrained OLMo checkpoint on molecular data

Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)

  • PR: Classification tutorial (BBBP, Tox21)
  • PR: Regression tutorial (ESOL, FreeSolv, Lipophilicity)
  • PR: Benchmark results table vs ChemBERTa
  • Deliverable: Published benchmark comparing OLMo vs ChemBERTa on MoleculeNet

Milestone 4: Generation & polish (Weeks 10-12)

  • PR: SMILES generation tutorial with RDKit validity checking
  • PR: Documentation (numpydoc, API reference, user guide)
  • PR: Tokenizer extension experiments (stretch goal)
  • Deliverable: Complete documentation and tutorials

Each milestone = 1 evaluation checkpoint. PRs are <50 lines where possible, following DeepChem's contribution guidelines.

3.6 Pull Request Plan

I will follow DeepChem's guidelines: small PRs (<50 lines for initial ones), with tests and numpydoc documentation. Expected ~8-12 PRs total:

  1. HuggingFaceModel causal_lm support (~40 lines)
  2. generate() method (~50 lines)
  3. Olmo2ForSequenceClassification (~100 lines; larger than the guideline, will discuss with mentor)
  4. OLMo wrapper class (~80 lines)
  5. Unit tests (~180 lines)
  6. Pretraining pipeline
  7. Data utilities
  8. Tutorial notebooks (3-4 notebooks)
  9. Documentation updates
  10. Benchmark scripts

4. COMMUNITY ENGAGEMENT

  • Already contributing: PR #4913 (bug fix), Issue #4912 (compat report)
  • Will attend office hours MWF 9am PST
  • Will join Discord for async discussion
  • Will write weekly progress updates
  • Happy to review other contributors' HuggingFace-related PRs

5. RESOURCES REQUIRED

  • GPU: I have access to 1x H100 NVL 96GB (Azure) for development
  • For CI: tiny model configs, no GPU needed
  • For pretraining experiments: my H100 is sufficient for OLMo-1B
  • OLMo-7B experiments: may need multi-GPU setup (discuss with mentor)

6. BIBLIOGRAPHY

  1. Groeneveld et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838
  2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885
  3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence.
  4. Weininger (1988). "SMILES, a chemical language and information system." J. Chem. Inf. Comput. Sci.
  5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science.

KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal)

  • Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules
  • BBBP classification: ROC-AUC 0.67 (random init, 12.9M param model, 200 samples, 3 epochs)
  • ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions)
  • SMILES generation: 0% validity from random init (proves pretraining is the core challenge)
  • Test suite: 8/8 tests pass in 27 seconds on CPU
  • Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa