
GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem

CONTENT REFERENCE: REWRITE IN YOUR OWN WORDS BEFORE SUBMITTING


1. INTRODUCTION

What this project is about

DeepChem's HuggingFaceModel wrapper currently supports encoder-only models (ChemBERTa, MoLFormer) through masked language modeling. There is no support for decoder-only causal language models. This project adds OLMo-2 (Allen AI's open language model) to DeepChem, enabling:

  • Continued pretraining on molecular data (SMILES)
  • Fine-tuning for classification and regression
  • Autoregressive molecular generation

Why it matters

Encoder models (ChemBERTa) can classify and predict properties, but they CANNOT generate molecules. A causal LM like OLMo opens up:

  • De novo molecular generation (drug discovery)
  • Text-molecule bridging (OLMo understands English AND can learn SMILES)
  • In-context few-shot learning without fine-tuning
  • Transfer learning from scientific literature

Why OLMo specifically

  • Fully open (weights, data, and training code), unlike LLaMA/GPT
  • OLMo-2 is natively supported in HuggingFace transformers (no custom code)
  • 1B and 7B variants available for different compute budgets
  • Trained on Dolma corpus which includes scientific papers

What I've already done (reference your PR and experiments)

  • Found and fixed a transformers 5.x compatibility bug in ChemBERTa (PR #4913)
  • Filed issue #4912 documenting broader transformers 5.x compat gap
  • Built a working OLMo wrapper prototype (locally) with:
    • Olmo2ForSequenceClassification (doesn't exist in transformers)
    • Causal LM pretraining on SMILES
    • All 8 unit tests passing
  • Ran experiments on real MoleculeNet data:
    • BBBP classification: ROC-AUC 0.67 (random init, tiny model)
    • ESOL regression: R^2 = 0.37
    • SMILES generation: 0% validity (expected: it shows pretraining is essential)

2. RELEVANT EXPERIENCE & INTEREST

Technical background

  • Parameter Golf (OpenAI competition, March 2026): Trained language models from scratch under a 16 MB constraint. Custom SentencePiece tokenizers, GPTQ-lite quantization, flash attention, architecture design (an 11-layer, 512-dim transformer). This is directly relevant: I understand transformer training at a low level.
  • GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Fine-tuning and distillation of large language models.
  • wingman-AI (29 GitHub stars): Production AI assistant system.
  • Open source contributions: PRs to HuggingFace transformers, Unsloth, Anthropic SDK, OpenAI SDK, Karpathy's nanochat.

Why I want to work on this

[WRITE THIS YOURSELF β€” what genuinely interests you about molecular ML? Why DeepChem? Be specific and honest. Don't say "I'm passionate about open source" β€” say what specific thing drew you to this project.]

Links


3. WORK PLAN

3.1 Design

The implementation has four components:

Component A: Base class changes to HuggingFaceModel

  • Add causal_lm task support (DataCollatorForLanguageModeling with mlm=False)
  • Add AutoModelForCausalLM branch in load_from_pretrained()
  • Add generate() method for autoregressive text generation
  • Add causal LM batch preparation in _prepare_batch()

Note: PR #4907 by another contributor adds a similar generate() method. My work is complementary: I'm adding a full model wrapper, not just generation plumbing.
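
For concreteness, a minimal sketch of what the causal_lm branch boils down to, using only stock transformers calls. The helper name build_causal_lm is mine for illustration, not DeepChem's actual internals:

  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForLanguageModeling)

  def build_causal_lm(model_path):
      # Hypothetical helper mirroring the planned branch in HuggingFaceModel.__init__
      tokenizer = AutoTokenizer.from_pretrained(model_path)
      tokenizer.pad_token = tokenizer.eos_token  # decoder-only models ship without a pad token
      # mlm=False switches the collator from masked-LM batching to next-token batching
      collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
      model = AutoModelForCausalLM.from_pretrained(model_path)
      return tokenizer, collator, model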

Component B: Olmo2ForSequenceClassification

This class DOES NOT EXIST in HuggingFace transformers, which ships only causal LM heads for OLMo (OlmoForCausalLM / Olmo2ForCausalLM) and no classification head. I built one:

  • Extends Olmo2PreTrainedModel
  • Uses last-token pooling (last non-padded token's hidden state)
  • Linear projection head for classification/regression
  • Supports single-label, multi-label, and regression via problem_type config
  • Computes CrossEntropyLoss / BCEWithLogitsLoss / MSELoss based on task

This follows the same pattern as LlamaForSequenceClassification.

Component C: OLMo wrapper class

OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    - task: causal_lm | regression | classification | mtr
    - Loads tokenizer from HuggingFace Hub
    - Sets pad_token = eos_token (decoder models don't have pad by default)
    - Syncs vocab_size between config and tokenizer
    - Creates appropriate model class based on task

  _prepare_batch(batch)
    - causal_lm: labels = input_ids (model shifts internally)
    - regression/classification: labels from dataset, proper dtype casting
    - Multi-task classification: float labels for BCEWithLogitsLoss
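
To ground the design, here is the intended end-user flow. The import path, keyword names, and checkpoint ID follow the outline above and remain assumptions until the PRs land:

  import deepchem as dc
  from deepchem.models import OLMo  # assumed final import path

  # DummyFeaturizer keeps raw SMILES strings, which the wrapper tokenizes itself
  tasks, (train, valid, test), _ = dc.molnet.load_bbbp(featurizer=dc.feat.DummyFeaturizer())

  model = OLMo(task="classification", model_name="allenai/OLMo-2-0425-1B", n_tasks=len(tasks))
  model.fit(train, nb_epoch=3)
  print(model.evaluate(test, [dc.metrics.Metric(dc.metrics.roc_auc_score)]))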

Component D: Tokenization strategy

Phase 1 (GSoC): Use OLMo's pretrained tokenizer as-is on SMILES.

  • OLMo's 100K BPE vocab actually tokenizes SMILES more efficiently than ChemBERTa's 600-token vocab (0.9x token ratio in my analysis)
  • BUT it fragments chemical semantics: [C@@H] -> [C, @@, H, ] (a quick probe follows this list)
  • ChemBERTa learns chemistry-aware merges: (=O), ccccc, COc
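
This fragmentation is easy to verify directly; a quick probe (the checkpoint ID is one of the public OLMo-2 releases, any would do):

  from transformers import AutoTokenizer

  # Probe how a general-purpose BPE vocab splits stereochemistry markers
  tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
  print(tok.tokenize("C[C@@H](N)C(=O)O"))  # expect [C@@H] to fragment into several pieces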

Phase 2 (stretch): Extend tokenizer with SMILES-specific tokens.

  • Add special tokens for stereochemistry: [C@@H], [C@H], [nH]
  • Add aromatic ring tokens: c1ccccc1
  • Retrain BPE merges on a mixed English + SMILES corpus (the token-registration step is sketched below)
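
The token-registration half of Phase 2 is standard transformers API (the token list here is illustrative); retraining the BPE merges is the open research part:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")
  model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B")

  # Register SMILES-specific tokens so they stop fragmenting, then grow
  # the embedding matrix to cover the enlarged vocabulary
  tok.add_tokens(["[C@@H]", "[C@H]", "[nH]", "c1ccccc1"])
  model.resize_token_embeddings(len(tok))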

3.2 Pseudocode

Olmo2ForSequenceClassification.forward():

  def forward(self, input_ids, attention_mask=None, labels=None):
      # Backbone returns per-token hidden states
      hidden_states = self.model(input_ids, attention_mask=attention_mask)[0]
      # Pool: last non-padded token of each sequence
      seq_lengths = (input_ids != self.config.pad_token_id).sum(-1) - 1
      batch_range = torch.arange(input_ids.shape[0], device=input_ids.device)
      pooled = hidden_states[batch_range, seq_lengths]
      logits = self.score(pooled)  # nn.Linear(hidden_size, num_labels)
      loss = None
      if labels is not None:
          if self.config.problem_type == "regression":
              loss = nn.MSELoss()(logits.squeeze(), labels.squeeze())
          elif self.config.problem_type == "single_label_classification":
              loss = nn.CrossEntropyLoss()(logits, labels)
          else:  # multi_label_classification
              loss = nn.BCEWithLogitsLoss()(logits, labels)
      return {"loss": loss, "logits": logits}

OLMo._prepare_batch() for causal_lm:

  def _prepare_batch_causal_lm(self, smiles_list):
      tokens = self.tokenizer(smiles_list, padding=True, return_tensors="pt")
      input_ids = tokens.input_ids.to(self.device)
      attention_mask = tokens.attention_mask.to(self.device)
      labels = input_ids.clone()  # next-token prediction; the model shifts labels internally
      return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

HuggingFaceModel.generate():

  def generate(self, inputs, max_new_tokens=64, **kwargs):
      tokens = self.tokenizer(inputs, padding=True, return_tensors="pt").to(self.device)
      output_ids = self.model.generate(**tokens, max_new_tokens=max_new_tokens, **kwargs)
      return self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)

3.3 Testing Plan

8 unit tests (all passing in my prototype):

  • test_olmo_causal_lm_pretraining: causal LM trains, loss > 0
  • test_olmo_regression_finetuning: regression trains, prediction shapes match, MAE is computable
  • test_olmo_classification: classification on binary labels, loss > 0
  • test_olmo_multitask_regression: MTR with 2 tasks, prediction shape matches
  • test_olmo_save_and_restore: checkpoint save/load, weights match exactly
  • test_olmo_load_from_pretrained: pretrain a causal LM, then load it into a regression model
  • test_olmo_generate: single and batch generation return strings
  • test_olmo_invalid_task: ValueError on a bad task name

All tests use a tiny config (64 hidden units, 2 layers, 2 heads), so no model download is needed; the suite runs in ~27 seconds on CPU.
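
For reference, such a config is just a scaled-down Olmo2Config (available in recent transformers releases); field values beyond the three quoted above are my assumptions, not the exact prototype settings:

  from transformers import Olmo2Config, Olmo2ForCausalLM

  # Tiny randomly initialized model for CPU unit tests: no pretrained download required
  config = Olmo2Config(vocab_size=1024, hidden_size=64, intermediate_size=128,
                       num_hidden_layers=2, num_attention_heads=2)
  model = Olmo2ForCausalLM(config)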

Integration tests (to add during GSoC):

  • MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
  • SMILES generation validity (RDKit validation; a minimal scorer is sketched after this list)
  • Continued pretraining convergence on ZINC/PubChem subsets
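
The validity metric itself is a one-call RDKit check; a minimal scorer (the function name is mine):

  from rdkit import Chem

  def smiles_validity(samples):
      # Fraction of generated strings RDKit can parse into a molecule
      valid = sum(Chem.MolFromSmiles(s) is not None for s in samples)
      return valid / len(samples)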

3.4 Sources of Risk

  • OLMo-7B needs ~14 GB VRAM just to load for inference (7B parameters x 2 bytes in bf16). Likelihood: high. Mitigation: use OLMo-1B for CI and demos, test with tiny configs, document GPU requirements.
  • SMILES generation validity is low without extensive pretraining. Likelihood: high. This IS the core problem: budget 3 weeks for pretraining experiments, use ZINC-250K as the training corpus, target >50% validity.
  • Olmo2ForSequenceClassification is not upstream. Likelihood: medium. Mitigation: the implementation follows HF patterns exactly; if HF adds it later, we swap to theirs.
  • The tokenizer fragments chemical semantics. Likelihood: medium. Mitigation: Phase 1 works as-is (my experiments show learning happens); Phase 2 extends the vocabulary.
  • transformers version compatibility. Likelihood: low. Mitigation: already found and fixed one issue (PR #4913); use top-level imports throughout.

3.5 Milestones & Timeline

Assuming Medium size (175 hours, ~12 weeks):

Milestone 1: Core wrapper (Weeks 1-3)

  • PR: Base class changes to HuggingFaceModel (causal_lm task, generate())
    • Coordinate with PR #4907 to avoid duplication
  • PR: Olmo2ForSequenceClassification
  • PR: OLMo wrapper class with all task modes
  • PR: Unit tests (8 tests)
  • Deliverable: from deepchem.models import OLMo works for all tasks

Milestone 2: Continued pretraining (Weeks 4-6)

  • PR: Pretraining pipeline on molecular data (ZINC-250K)
  • PR: Data loading utilities for SMILES corpora
  • PR: Pretraining tutorial notebook
  • Deliverable: Pretrained OLMo checkpoint on molecular data

Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)

  • PR: Classification tutorial (BBBP, Tox21)
  • PR: Regression tutorial (ESOL, FreeSolv, Lipophilicity)
  • PR: Benchmark results table vs ChemBERTa
  • Deliverable: Published benchmark comparing OLMo vs ChemBERTa on MoleculeNet

Milestone 4: Generation & polish (Weeks 10-12)

  • PR: SMILES generation tutorial with RDKit validity checking
  • PR: Documentation (numpydoc, API reference, user guide)
  • PR: Tokenizer extension experiments (stretch goal)
  • Deliverable: Complete documentation and tutorials

Each milestone = 1 evaluation checkpoint. PRs are <50 lines where possible, following DeepChem's contribution guidelines.

3.6 Pull Request Plan

I will follow DeepChem's guidelines: small PRs (<50 lines for initial ones), with tests and numpydoc documentation. Expected ~8-12 PRs total:

  1. HuggingFaceModel causal_lm support (~40 lines)
  2. generate() method (~50 lines)
  3. Olmo2ForSequenceClassification (~100 lines; larger than the guideline, will discuss with mentor)
  4. OLMo wrapper class (~80 lines)
  5. Unit tests (~180 lines)
  6. Pretraining pipeline
  7. Data utilities
  8. Tutorial notebooks (3-4 notebooks)
  9. Documentation updates
  10. Benchmark scripts

4. COMMUNITY ENGAGEMENT

  • Already contributing: PR #4913 (bug fix), Issue #4912 (compat report)
  • Will attend office hours MWF 9am PST
  • Will join Discord for async discussion
  • Will write weekly progress updates
  • Happy to review other contributors' HuggingFace-related PRs

5. RESOURCES REQUIRED

  • GPU: I have access to 1x H100 NVL 96GB (Azure) for development
  • For CI: tiny model configs, no GPU needed
  • For pretraining experiments: my H100 is sufficient for OLMo-1B
  • OLMo-7B experiments: may need multi-GPU setup (discuss with mentor)

6. BIBLIOGRAPHY

  1. Groeneveld et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838
  2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885
  3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence.
  4. Weininger (1988). "SMILES, a chemical language and information system." J. Chem. Inf. Comput. Sci.
  5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science.

KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal)

  • Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules
  • BBBP classification: ROC-AUC 0.67 (random init, 12.9M param model, 200 samples, 3 epochs)
  • ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions)
  • SMILES generation: 0% validity from random init (proves pretraining is the core challenge)
  • Test suite: 8/8 tests pass in 27 seconds on CPU
  • Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa