# GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem

---

## 1. INTRODUCTION

### What this project is about

DeepChem's `HuggingFaceModel` wrapper currently supports encoder-only models (ChemBERTa, MoLFormer) through masked language modeling. There is no support for decoder-only causal language models. This project adds OLMo-2 (Allen AI's open language model) to DeepChem, enabling:

- Continued pretraining on molecular data (SMILES)
- Fine-tuning for classification and regression
- Autoregressive molecular generation

### Why it matters

Encoder models such as ChemBERTa can classify molecules and predict their properties, but they cannot generate molecules. A causal LM like OLMo opens up:

- De novo molecular generation (drug discovery)
- Text-molecule bridging (OLMo understands English and can learn SMILES)
- In-context few-shot learning without fine-tuning
- Transfer learning from scientific literature

### Why OLMo specifically

- Fully open (weights, data, and training code), unlike LLaMA or GPT
- OLMo-2 is natively supported in HuggingFace transformers (no custom modeling code)
- 1B and 7B variants are available for different compute budgets
- Trained on the Dolma corpus, which includes scientific papers

### What I've already done

- Found and fixed a transformers 5.x compatibility bug in ChemBERTa (PR #4913)
- Filed issue #4912 documenting the broader transformers 5.x compatibility gap
- Built a working OLMo wrapper prototype (locally) with:
  - `Olmo2ForSequenceClassification` (does not exist in transformers)
  - Causal LM pretraining on SMILES
  - All 8 unit tests passing
- Ran experiments on real MoleculeNet data:
  - BBBP classification: ROC-AUC 0.67 (random init, tiny model)
  - ESOL regression: R^2 = 0.37
  - SMILES generation: 0% validity (expected; it shows pretraining is essential)

---

## 2. RELEVANT EXPERIENCE & INTEREST

### Technical background

- Parameter Golf (OpenAI competition, March 2026): trained language models from scratch under a 16 MB constraint, with custom SentencePiece tokenizers, GPTQ-lite quantization, flash attention, and architecture design (an 11-layer, 512-dim transformer). This is directly relevant: I understand transformer training at a low level.
- GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): fine-tuning and distillation of large language models.
- wingman-AI (29 GitHub stars): a production AI assistant system.
- Open source contributions: PRs to HuggingFace transformers, Unsloth, the Anthropic SDK, the OpenAI SDK, and Karpathy's nanochat.

### Why I want to work on this

[WRITE THIS YOURSELF — what genuinely interests you about molecular ML? Why DeepChem? Be specific and honest. Don't say "I'm passionate about open source" — say what specific thing drew you to this project.]

### Links

- GitHub: https://github.com/vivekvar-dl
- PR #4913: https://github.com/deepchem/deepchem/pull/4913
- Issue #4912: https://github.com/deepchem/deepchem/issues/4912

---

## 3. WORK PLAN

### 3.1 Design

The implementation has four components.

**Component A: Base class changes to HuggingFaceModel**

- Add a `causal_lm` task (`DataCollatorForLanguageModeling` with `mlm=False`)
- Add an `AutoModelForCausalLM` branch in `load_from_pretrained()`
- Add a `generate()` method for autoregressive text generation
- Add causal LM batch preparation in `_prepare_batch()`

Note: PR #4907 by another contributor adds a similar `generate()` method. My work is complementary: I am adding a full model wrapper, not just generation plumbing.
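To make Component A concrete, here is a minimal sketch of the pieces the `causal_lm` branch wires together, written as a standalone helper for illustration; the actual change lives inside `HuggingFaceModel`, and `build_causal_lm_components` is a hypothetical name:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)


def build_causal_lm_components(model_name: str):
    """Illustrative helper: the pieces the causal_lm task branch sets up."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Decoder-only models usually ship without a pad token.
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # mlm=False turns the MLM collator into a next-token-prediction collator:
    # it copies input_ids into labels and masks padded positions with -100.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    return tokenizer, model, collator
```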
**Component B: Olmo2ForSequenceClassification**

This class does not exist in HuggingFace transformers: the library ships a causal LM head for OLMo-2 but no classification head. I built one:

- Extends `Olmo2PreTrainedModel`
- Uses last-token pooling (the hidden state of the last non-padded token)
- Linear projection head for classification/regression
- Supports single-label, multi-label, and regression via the `problem_type` config
- Computes `CrossEntropyLoss` / `BCEWithLogitsLoss` / `MSELoss` based on the task

This follows the same pattern as `LlamaForSequenceClassification`.

**Component C: OLMo wrapper class**

```
OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    - task: causal_lm | regression | classification | mtr
    - loads the tokenizer from the HuggingFace Hub
    - sets pad_token = eos_token (decoder-only models have no pad token by default)
    - syncs vocab_size between the config and the tokenizer
    - creates the appropriate model class for the task
  _prepare_batch(batch)
    - causal_lm: labels = input_ids (the model shifts them internally)
    - regression/classification: labels from the dataset, with proper dtype casting
    - multi-task classification: float labels for BCEWithLogitsLoss
```

**Component D: Tokenization strategy**

Phase 1 (GSoC): use OLMo's pretrained tokenizer as-is on SMILES.

- OLMo's 100K BPE vocab tokenizes SMILES more efficiently than ChemBERTa's 600-token vocab (a 0.9x token ratio in my analysis)
- But it fragments chemical semantics: `[C@@H]` becomes `[C`, `@@`, `H`, `]`
- ChemBERTa, by contrast, learns chemistry-aware merges: `(=O)`, `ccccc`, `COc`

Phase 2 (stretch): extend the tokenizer with SMILES-specific tokens.

- Add special tokens for stereochemistry: `[C@@H]`, `[C@H]`, `[nH]`
- Add aromatic ring tokens: `c1ccccc1`
- Retrain BPE on a mixed English + SMILES corpus

### 3.2 Pseudocode

`Olmo2ForSequenceClassification.forward()`:

```python
# torch, torch.nn as nn, and SequenceClassifierOutputWithPast are imported
# at module level in the full implementation.
def forward(self, input_ids, attention_mask=None, labels=None):
    hidden_states = self.model(input_ids, attention_mask=attention_mask)[0]
    # Pool: take the hidden state of the last non-padded token.
    batch_range = torch.arange(input_ids.shape[0], device=input_ids.device)
    seq_lengths = (input_ids != self.config.pad_token_id).sum(-1) - 1
    pooled = hidden_states[batch_range, seq_lengths]
    logits = self.score(pooled)  # nn.Linear(hidden_size, num_labels)

    loss = None
    if labels is not None:
        if self.config.problem_type == 'regression':
            loss = nn.MSELoss()(logits.squeeze(), labels.squeeze())
        elif self.config.problem_type == 'single_label_classification':
            loss = nn.CrossEntropyLoss()(logits, labels)
        else:  # multi_label_classification
            loss = nn.BCEWithLogitsLoss()(logits, labels)
    return SequenceClassifierOutputWithPast(loss=loss, logits=logits)
```

`OLMo._prepare_batch()` for `causal_lm`:

```python
def _prepare_batch(self, batch):
    smiles_list, _, _ = batch
    tokens = self.tokenizer(smiles_list, padding=True, return_tensors='pt')
    inputs = {k: v.to(self.device) for k, v in tokens.items()}
    inputs['labels'] = inputs['input_ids'].clone()  # next-token prediction
    return inputs
```

`HuggingFaceModel.generate()`:

```python
def generate(self, inputs, max_new_tokens=64, **kwargs):
    tokens = self.tokenizer(inputs, padding=True, return_tensors='pt').to(self.device)
    output_ids = self.model.generate(**tokens, max_new_tokens=max_new_tokens, **kwargs)
    return self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

### 3.3 Testing Plan

8 unit tests (all passing in my prototype):

| Test | What it validates |
|------|-------------------|
| test_olmo_causal_lm_pretraining | Causal LM trains; loss > 0 |
| test_olmo_regression_finetuning | Regression trains; prediction shapes match; MAE is computable |
| test_olmo_classification | Classification on binary labels; loss > 0 |
| test_olmo_multitask_regression | MTR with 2 tasks; prediction shapes match |
| test_olmo_save_and_restore | Checkpoint save/load; weights match exactly |
| test_olmo_load_from_pretrained | Pretrain a causal LM, then load it into a regression model |
| test_olmo_generate | Single and batch generation return strings |
| test_olmo_invalid_task | ValueError on a bad task name |

All tests use a tiny config (64 hidden units, 2 layers, 2 heads), so no model download is needed; the suite runs in ~27 seconds on CPU.
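For illustration, the regression test looks roughly like this. It is a sketch: the `OLMo` constructor is the proposed API from Component C, and the config keys follow transformers' `Olmo2Config`:

```python
import numpy as np
import deepchem as dc


def test_olmo_regression_finetuning():
    # Ten SMILES strings with random scalar targets; keeping raw SMILES in X
    # lets the wrapper's tokenizer consume them directly.
    smiles = np.array(['CCO', 'c1ccccc1', 'CC(=O)O', 'CCN', 'CCC'] * 2)
    y = np.random.randn(10, 1)
    dataset = dc.data.NumpyDataset(X=smiles, y=y)

    # Tiny config (64 hidden units, 2 layers, 2 heads): no checkpoint
    # download, fast enough for CPU-only CI.
    model = OLMo(task='regression', n_tasks=1,
                 config={'hidden_size': 64, 'intermediate_size': 128,
                         'num_hidden_layers': 2, 'num_attention_heads': 2})
    loss = model.fit(dataset, nb_epoch=1)
    preds = model.predict(dataset)

    assert loss > 0
    assert preds.shape == (10, 1)
```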
Integration tests (to add during GSoC):

- MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
- SMILES generation validity (RDKit validation)
- Continued pretraining convergence on ZINC/PubChem subsets

### 3.4 Sources of Risk

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| OLMo-7B requires ~14 GB VRAM for inference | High | Use OLMo-1B for CI and demos. Test with tiny configs. Document GPU requirements. |
| SMILES generation validity is low without extensive pretraining | High | This is the core problem. Budget 3 weeks for pretraining experiments. Use ZINC-250K as the training corpus. Target >50% validity. |
| Olmo2ForSequenceClassification is not upstream | Medium | My implementation follows HF patterns exactly. If HF adds it later, we swap to theirs. |
| Tokenizer fragments chemical semantics | Medium | Phase 1: works as-is (my experiments show learning happens). Phase 2: extend the vocabulary. |
| transformers version compatibility | Low | Already found and fixed one issue (PR #4913). Use top-level imports throughout. |

### 3.5 Milestones & Timeline

Assuming Medium size (175 hours, ~12 weeks):

**Milestone 1: Core wrapper (Weeks 1-3)**

- PR: base class changes to HuggingFaceModel (causal_lm task, generate())
- Coordinate with PR #4907 to avoid duplication
- PR: Olmo2ForSequenceClassification
- PR: OLMo wrapper class with all task modes
- PR: unit tests (8 tests)
- Deliverable: `from deepchem.models import OLMo` works for all tasks

**Milestone 2: Continued pretraining (Weeks 4-6)**

- PR: pretraining pipeline on molecular data (ZINC-250K)
- PR: data loading utilities for SMILES corpora
- PR: pretraining tutorial notebook
- Deliverable: a pretrained OLMo checkpoint on molecular data

**Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)**

- PR: classification tutorial (BBBP, Tox21)
- PR: regression tutorial (ESOL, FreeSolv, Lipophilicity)
- PR: benchmark results table vs ChemBERTa
- Deliverable: a published benchmark comparing OLMo vs ChemBERTa on MoleculeNet

**Milestone 4: Generation & polish (Weeks 10-12)**

- PR: SMILES generation tutorial with RDKit validity checking
- PR: documentation (numpydoc, API reference, user guide)
- PR: tokenizer extension experiments (stretch goal)
- Deliverable: complete documentation and tutorials

Each milestone maps to one evaluation checkpoint. PRs are kept under 50 lines where possible, following DeepChem's contribution guidelines.

### 3.6 Pull Request Plan

I will follow DeepChem's guidelines: small PRs (<50 lines for the initial ones), each with tests and numpydoc documentation. I expect ~8-12 PRs in total:

1. HuggingFaceModel causal_lm support (~40 lines)
2. generate() method (~50 lines)
3. Olmo2ForSequenceClassification (~100 lines; larger, will discuss with mentor)
4. OLMo wrapper class (~80 lines)
5. Unit tests (~180 lines)
6. Pretraining pipeline
7. Data utilities
8. Tutorial notebooks (3-4 notebooks)
9. Documentation updates
10. Benchmark scripts

Taken together, these PRs build toward the end-to-end workflow sketched below.
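A minimal sketch of that workflow, under stated assumptions: the `OLMo` constructor, `load_from_pretrained()`, and `generate()` signatures are the ones proposed above; `load_zinc15` stands in for the ZINC-250K loader that the data-utilities PR would add; the checkpoint name, seed prompt, and epoch counts are illustrative.

```python
import deepchem as dc
from rdkit import Chem

# Milestone 2: continued causal-LM pretraining on a SMILES corpus.
_, (train, _, _), _ = dc.molnet.load_zinc15(
    featurizer=dc.feat.DummyFeaturizer(), splitter='random')
pretrainer = OLMo(task='causal_lm', model_name='allenai/OLMo-2-0425-1B',
                  model_dir='olmo-smiles')
pretrainer.fit(train, nb_epoch=1)

# Milestone 3: reuse the pretrained weights for ESOL regression.
_, (train_r, valid_r, test_r), _ = dc.molnet.load_delaney(
    featurizer=dc.feat.DummyFeaturizer())
finetuner = OLMo(task='regression', n_tasks=1,
                 model_name='allenai/OLMo-2-0425-1B')
finetuner.load_from_pretrained('olmo-smiles')
finetuner.fit(train_r, nb_epoch=10)

# Milestone 4: sample molecules from a one-atom seed and measure validity.
samples = pretrainer.generate(['C'] * 100, max_new_tokens=64, do_sample=True)
validity = sum(Chem.MolFromSmiles(s) is not None for s in samples) / len(samples)
print(f'SMILES validity: {validity:.1%}')
```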
"OLMo: Accelerating the Science of Language Models." arXiv:2402.00838 2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885 3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence. 4. Weininger (1988). "SMILES, a chemical language and information system." J. Chem. Inf. Comput. Sci. 5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science. --- ## KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal) - Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules - BBBP classification: ROC-AUC 0.67 (random init, 12.9M param model, 200 samples, 3 epochs) - ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions) - SMILES generation: 0% validity from random init (proves pretraining is the core challenge) - Test suite: 8/8 tests pass in 27 seconds on CPU - Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa