
LLM Support for 7B Models (OLMo) in DeepChem

Vivek Varikuti github.com/vivekvar-dl

Introduction

DeepChem's HuggingFaceModel wrapper works well for encoder models like ChemBERTa and MoLFormer: you can do MLM pretraining and finetune for classification or regression. But it has no support for decoder-only models at all. You can't use any GPT-style causal LM, and there's no way to do text generation.

I want to add OLMo-2 (Allen AI's language model) to DeepChem. The idea is to make it actually useful for molecular work, not just wrap an API. That means you should be able to pretrain on SMILES data, finetune for property prediction, and generate new molecules.

The reason I think this matters is that ChemBERTa can predict properties of molecules but fundamentally cannot generate new ones, because it's an encoder. OLMo is a causal LM, so it can actually produce novel SMILES. For drug discovery that's a big deal. Also, OLMo was trained on the Dolma corpus, which includes a lot of scientific papers, so there's already some chemistry knowledge baked in before you even start finetuning.

Why OLMo and not some other model? Mainly because it's actually fully open: weights, data, and code are all public. No weird license restrictions like LLaMA. And OLMo-2 works natively in HuggingFace transformers without needing custom packages (the older OLMo-7B needs hf_olmo installed, which is annoying). It also has a 1B version, which is great for testing.

What I already did

I didn't want to just write a proposal without touching the code, so I cloned DeepChem and started building.

The first thing that happened was that ChemBERTa wouldn't even import. It turns out transformers.models.roberta.tokenization_roberta_fast got removed in transformers 5.x, and nobody had reported it. I fixed it (PR #4913) and filed a broader issue about transformers 5.x compatibility (#4912), because more is broken beyond just that import.
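
The fix is the kind of one-line import change sketched below. This is a minimal sketch, assuming the stable top-level export is the right replacement (the actual change in PR #4913 may differ in detail):

# Old import, breaks once the internal module path is removed in transformers 5.x:
# from transformers.models.roberta.tokenization_roberta_fast import RobertaTokenizerFast

# Stable public import that survives internal reorganizations:
from transformers import RobertaTokenizerFast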

After that I started building an OLMo wrapper and ran into an interesting problem pretty quickly. HuggingFace has OlmoForCausalLM, but there is no OlmoForSequenceClassification; it doesn't exist. So if you want to do regression or classification with OLMo, you have to build the classification head yourself. I wrote one using last-token pooling, basically the same thing LlamaForSequenceClassification does internally.

Got everything working and ran some quick experiments on MoleculeNet:

BBBP (blood-brain barrier penetration classification): ROC-AUC 0.67. This was with a tiny model, random init, 200 training samples, 3 epochs. Not great, but it's above 0.5, so the architecture is clearly learning something.

ESOL (solubility regression): R² 0.37. Same deal, tiny model from scratch.

SMILES generation: 0% valid molecules. Every single generated SMILES was broken.

That generation result is honestly the most useful thing I found. It tells you exactly where the hard problem is. The wrapper code works, the training loop works, all the plumbing is fine. But without real pretraining on a molecular corpus, the model just outputs garbage. That's what this GSoC project needs to solve.

I also compared how OLMo and ChemBERTa tokenize drug molecules, testing on aspirin, caffeine, penicillin, paclitaxel, etc. OLMo actually uses fewer tokens overall (with a ~100K vocab vs ~600 for ChemBERTa), but it breaks up chemistry in odd ways. For example, [C@@H] is a single concept (a stereocenter), but OLMo splits it into four tokens. ChemBERTa's tokenizer learned chemical groupings like (=O) and ccccc that make more sense.
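
For reference, the comparison is easy to reproduce; a minimal sketch, where both checkpoint names are assumptions (a public ChemBERTa checkpoint and an OLMo-2 1B checkpoint), not pinned choices for the project:

from transformers import AutoTokenizer

# Checkpoint names are illustrative.
olmo_tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B")
chemberta_tok = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(olmo_tok.tokenize(aspirin))       # general-purpose BPE pieces
print(chemberta_tok.tokenize(aspirin))  # chemistry-aware pieces like (=O)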

Wrote 8 unit tests; all pass in 27 seconds on CPU.

About Me

I'm Vivek. I build AI stuff.

The most relevant thing here is probably the Parameter Golf competition by OpenAI (happening right now, March 2026). You have to train the best language model that fits in 16 MB total. I built custom SentencePiece tokenizers, did GPTQ quantization, and designed the transformer architecture from scratch (11 layers, 512 dim, 8 heads). I also dealt with flash-attention compatibility issues across different hardware. The point is that I understand how transformers work at a low level, not just how to call .fit() on them.

Other projects: GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 stars), where I did LLM distillation, and wingman-AI (29 stars), a production AI assistant.

I've submitted PRs to a bunch of repos: HuggingFace transformers, Unsloth (aarch64 support), the Anthropic SDK (fixed a streaming bottleneck), the OpenAI Python SDK, and Karpathy's nanochat (NaN loss bug in SFT). A mix of bug fixes and features.

For DeepChem I found a real bug on day one (PR #4913), reported the broader compat issue (#4912), and built the OLMo prototype with passing tests.

I have GPU access through Azure cloud for development.

GitHub: https://github.com/vivekvar-dl
Bug fix: https://github.com/deepchem/deepchem/pull/4913
Issue: https://github.com/deepchem/deepchem/issues/4912

Work Plan

Design

Four things need to be built.

1. Base class changes (HuggingFaceModel)

Right now HuggingFaceModel doesn't know causal LMs exist. Need to add (rough sketch after the list):

  • "causal_lm" as a task type, using DataCollatorForLanguageModeling(mlm=False)
  • AutoModelForCausalLM branch in load_from_pretrained()
  • A generate() method wrapping HF's model.generate()
  • _prepare_batch() handling for causal LM where labels = input_ids
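
A rough sketch of how those pieces could hang together; the class and argument names below are illustrative, not the final HuggingFaceModel API:

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

class CausalLMSupportSketch:
    # Illustration only: the branches HuggingFaceModel would need for task="causal_lm".
    def __init__(self, model_checkpoint, task="causal_lm"):
        self.task = task
        self.tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        if self.task == "causal_lm":
            # Pads batches and copies input_ids into labels; no MLM masking.
            self.data_collator = DataCollatorForLanguageModeling(
                tokenizer=self.tokenizer, mlm=False)
            self.model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

    def generate(self, prompts, max_new_tokens=64, **kwargs):
        # Thin wrapper over HF's model.generate(), returning decoded strings.
        encoded = self.tokenizer(prompts, padding=True, return_tensors="pt")
        output_ids = self.model.generate(
            **encoded, max_new_tokens=max_new_tokens, **kwargs)
        return self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)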

There's already PR #4907 from someone else that adds generation support. My work is different: I'm building a whole model wrapper, not just generation support. But I will coordinate so we don't duplicate work.

2. Olmo2ForSequenceClassification

This class doesn't exist in HuggingFace, so I had to write it.

How it works: run the input through Olmo2Model, take the last non-padded token's hidden state, and project it through a linear layer. The loss depends on the problem type: MSELoss for regression, CrossEntropyLoss for classification, BCEWithLogitsLoss for multi-label. About 100 lines total, the same pattern as LlamaForSequenceClassification.

3. OLMo wrapper class

User-facing class extending HuggingFaceModel. Same structure as ChemBERTa/MoLFormer (usage sketch after the outline below).

OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    task: causal_lm | regression | classification | mtr
    Loads tokenizer, sets pad_token = eos_token
    Syncs vocab_size with tokenizer
    Picks model class based on task

  _prepare_batch(batch)
    causal_lm: labels = input_ids clone
    regression/classification: labels from dataset, right dtype
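
A hedged usage sketch of this proposed API, assuming the same constructor/fit/evaluate flow as ChemBERTa; the OLMo class and checkpoint name below do not exist in DeepChem yet:

import deepchem as dc
from deepchem.models import OLMo  # proposed class, not in DeepChem yet

# ESOL with raw SMILES as inputs (DummyFeaturizer leaves the strings untouched).
tasks, datasets, _ = dc.molnet.load_delaney(featurizer=dc.feat.DummyFeaturizer())
train, valid, test = datasets

model = OLMo(task="regression", model_name="allenai/OLMo-2-0425-1B", n_tasks=1)
model.fit(train, nb_epoch=3)
scores = model.evaluate(valid, [dc.metrics.Metric(dc.metrics.pearson_r2_score)])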

4. Tokenization

For now, use OLMo's tokenizer as-is on SMILES. My experiments show it works well enough to learn (0.67 AUC and 0.37 R² even from random init). Not perfect with stereocenters, but functional.

Stretch goal: extend the vocab with chemistry tokens like [C@@H], (=O), and aromatic ring fragments, or retrain the BPE on an English+SMILES mix.
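
The simplest version of that stretch goal is additive rather than a full retrain. A sketch using standard HF calls; the token list and checkpoint name are illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B")
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0425-1B")

# Chemistry-motivated additions; illustrative, not a final vocabulary.
chem_tokens = ["[C@@H]", "[C@H]", "[nH]", "(=O)"]
num_added = tokenizer.add_tokens(chem_tokens)

# New tokens get freshly initialized embedding rows and still need pretraining.
model.resize_token_embeddings(len(tokenizer))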

Pseudocode

Sequence classification forward:

# hidden: (batch, seq_len, hidden_size)
hidden = base_model(input_ids, attention_mask=attention_mask).last_hidden_state
# index of the last non-padded token in each sequence
seq_lengths = (input_ids != pad_id).sum(dim=-1) - 1
pooled = hidden[torch.arange(hidden.size(0)), seq_lengths]
logits = linear_head(pooled)
# MSELoss, CrossEntropyLoss or BCEWithLogitsLoss depending on problem_type
loss = compute_loss(logits, labels, problem_type)

Causal LM batch:

tokens = tokenizer(smiles_list, padding=True, return_tensors="pt")
# for causal LM training the labels are just a copy of the inputs
inputs = {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"],
          "labels": tokens["input_ids"].clone()}

Generate:

encoded = tokenizer(prompts, padding=True, return_tensors="pt")
output_ids = model.generate(**encoded, max_new_tokens=N)
return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

Testing

8 tests written and passing:

  1. Causal LM pretraining (loss > 0)
  2. Regression finetuning (correct prediction shape)
  3. Classification (binary labels)
  4. Multitask regression (2 targets)
  5. Save and restore checkpoint (weights match)
  6. Load pretrained into regression model
  7. Generation (single + batch)
  8. Invalid task raises error

Tiny config, no downloads, 27 sec on CPU.
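
The tiny-config pattern looks roughly like this; the sizes are illustrative, chosen only to keep CPU tests fast, and it assumes Olmo2Config/Olmo2ForCausalLM from a recent transformers release:

from transformers import Olmo2Config, Olmo2ForCausalLM

# Randomly initialized, few-hundred-KB model for CI; no downloads needed.
tiny_config = Olmo2Config(
    vocab_size=256,
    hidden_size=32,
    intermediate_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
    max_position_embeddings=128,
)
tiny_model = Olmo2ForCausalLM(tiny_config)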

Will add during GSoC:

  • MoleculeNet benchmarks (BBBP, ESOL, FreeSolv, Lipophilicity)
  • Generation validity checking with RDKit (sketch below)
  • Pretraining convergence on ZINC
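
The validity check itself is simple; a minimal sketch, assuming generated_smiles is the list of decoded strings coming out of generate():

from rdkit import Chem

def fraction_valid(generated_smiles):
    # A generated string counts as valid if RDKit can parse it into a molecule.
    valid = [s for s in generated_smiles if Chem.MolFromSmiles(s) is not None]
    return len(valid) / max(len(generated_smiles), 1)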

Risks

Generation quality is the big one. 0% validity from random init is expected, obviously, but getting to 50%+ valid SMILES needs real pretraining on a decent corpus. I'm allocating 3 weeks for this, using ZINC-250K.

OLMo-7B is ~14 GB just for inference. CI uses tiny configs, so no GPU is needed there. OLMo-1B is for demos; 7B is for real benchmarks and might need multiple GPUs, which I'll figure out with my mentor.

Olmo2ForSequenceClassification isn't upstream. If HF adds one later, we swap ours out.

Transformers compatibility: I already found one issue, and I'm using top-level imports everywhere going forward.

Timeline

12 weeks (Medium, 175 hours):

Weeks 1-3: Get the core wrapper merged. Small PRs, following DeepChem's contribution guidelines. Base class changes, Olmo2ForSequenceClassification, the OLMo wrapper, tests. By week 3 you should be able to do from deepchem.models import OLMo.

Weeks 4-6: Pretraining pipeline. Load SMILES from ZINC-250K, causal LM training, checkpointing. Tutorial notebook for pretraining on custom molecular data.

Weeks 7-9: Finetune the pretrained model on MoleculeNet. BBBP and Tox21 classification; ESOL, FreeSolv, and Lipophilicity regression. Benchmark table vs ChemBERTa.

Weeks 10-12: Generation experiments with RDKit validity checking. Tutorial notebooks. Docs. If time allows, tokenizer extension experiments.

PRs

Small PRs, especially at first. Bigger ones (Olmo2ForSequenceClassification, ~100 lines) I will discuss with my mentor before submitting. Expecting 8-12 PRs across the summer.

Community

Already in:

  • PR #4913 (bug fix)
  • Issue #4912 (compat report)

I'll attend office hours MWF 9am PST, use Discord for day-to-day questions, and post weekly updates.

Resources

GPU access through Azure cloud. Tiny configs for CI. The OLMo-7B training setup is to be discussed with my mentor, depending on what's needed.

References

  1. Groeneveld et al. (2024). OLMo: Accelerating the Science of Language Models. arXiv:2402.00838
  2. Chithrananda et al. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv:2010.09885
  3. Ross et al. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. Nature Machine Intelligence
  4. Weininger (1988). SMILES, a chemical language and information system. J Chem Inf Comput Sci
  5. Wu et al. (2018). MoleculeNet: A Benchmark for Molecular Machine Learning. Chemical Science