# GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem
# CONTENT REFERENCE - REWRITE IN YOUR OWN WORDS BEFORE SUBMITTING


---


## 1. INTRODUCTION


### What this project is about
DeepChem's HuggingFaceModel wrapper currently supports encoder-only models
(ChemBERTa, MoLFormer) through masked language modeling. There is no support
for decoder-only causal language models. This project adds OLMo-2
(Allen AI's open language model) to DeepChem, enabling:
- Continued pretraining on molecular data (SMILES)
- Fine-tuning for classification and regression
- Autoregressive molecular generation


### Why it matters
Encoder models (ChemBERTa) can classify and predict properties, but they
CANNOT generate molecules. A causal LM like OLMo opens up:
- De novo molecular generation (drug discovery)
- Text-molecule bridging (OLMo understands English AND can learn SMILES)
- In-context few-shot learning without fine-tuning
- Transfer learning from scientific literature


### Why OLMo specifically
- Fully open (weights, data, and training code), unlike LLaMA (open weights only) or GPT (closed)
- OLMo-2 is natively supported in HuggingFace transformers (no custom code)
- 1B and 7B variants available for different compute budgets
- Trained on the Dolma corpus, which includes scientific papers


### What I've already done (reference your PR and experiments)
- Found and fixed a transformers 5.x compatibility bug in ChemBERTa (PR #4913)
- Filed issue #4912 documenting the broader transformers 5.x compatibility gap
- Built a working OLMo wrapper prototype (locally) with:
  - Olmo2ForSequenceClassification (doesn't exist in transformers)
  - Causal LM pretraining on SMILES
  - All 8 unit tests passing
- Ran experiments on real MoleculeNet data:
  - BBBP classification: ROC-AUC 0.67 (random init, tiny model)
  - ESOL regression: R^2 = 0.37
  - SMILES generation: 0% validity (expected; this confirms pretraining is essential)


---

## 2. RELEVANT EXPERIENCE & INTEREST


### Technical background
- Parameter Golf (OpenAI competition, March 2026): Trained language models
  from scratch under a 16MB constraint. Custom SentencePiece tokenizers,
  GPTQ-lite quantization, flash attention, architecture design (an 11-layer,
  512-dim transformer). This is directly relevant: I understand transformer
  training at a low level.
- GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Fine-tuning and
  distillation of large language models.
- wingman-AI (29 GitHub stars): Production AI assistant system.
- Open source contributions: PRs to HuggingFace transformers, Unsloth,
  Anthropic SDK, OpenAI SDK, Karpathy's nanochat.


### Why I want to work on this
[WRITE THIS YOURSELF: what genuinely interests you about molecular ML?
Why DeepChem? Be specific and honest. Don't say "I'm passionate about
open source"; say what specific thing drew you to this project.]


### Links
- GitHub: https://github.com/vivekvar-dl
- PR #4913: https://github.com/deepchem/deepchem/pull/4913
- Issue #4912: https://github.com/deepchem/deepchem/issues/4912


---

## 3. WORK PLAN


### 3.1 Design


The implementation has four components:


**Component A: Base class changes to HuggingFaceModel**
- Add `causal_lm` task support (DataCollatorForLanguageModeling with mlm=False)
- Add an `AutoModelForCausalLM` branch in load_from_pretrained()
- Add a `generate()` method for autoregressive text generation
- Add causal LM batch preparation in _prepare_batch()
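
A minimal sketch of the collator wiring, assuming the helper and checkpoint
names from my local prototype (not DeepChem's final API):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

def make_causal_lm_collator(tokenizer):
    # mlm=False: the collator copies input_ids to labels and sets padded
    # positions to -100, exactly the next-token-prediction setup a causal
    # LM expects during training.
    return DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B")
tokenizer.pad_token = tokenizer.eos_token  # decoder tokenizers ship without pad
collator = make_causal_lm_collator(tokenizer)
```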


Note: PR #4907 by another contributor adds a similar generate() method.
My work is complementary: I'm adding a full model wrapper, not just
generation plumbing.


**Component B: Olmo2ForSequenceClassification**
This class DOES NOT EXIST in HuggingFace transformers. OLMo-2 only has
Olmo2ForCausalLM, with no classification head. I built one:
- Extends Olmo2PreTrainedModel
- Uses last-token pooling (the last non-padded token's hidden state)
- Linear projection head for classification/regression
- Supports single-label, multi-label, and regression via problem_type config
- Computes CrossEntropyLoss / BCEWithLogitsLoss / MSELoss based on task

This follows the same pattern as LlamaForSequenceClassification.
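
A condensed skeleton of the prototype class (a sketch, assuming the
transformers>=4.47 module layout; the full forward pass is shown in
section 3.2):

```python
import torch.nn as nn
from transformers.models.olmo2.modeling_olmo2 import (Olmo2Model,
                                                      Olmo2PreTrainedModel)

class Olmo2ForSequenceClassification(Olmo2PreTrainedModel):
    """Decoder backbone plus a linear head, mirroring
    LlamaForSequenceClassification."""

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Olmo2Model(config)  # backbone without the LM head
        self.score = nn.Linear(config.hidden_size, config.num_labels,
                               bias=False)
        self.post_init()  # standard HF weight-initialization hook
```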

**Component C: OLMo wrapper class**
```
OLMo(HuggingFaceModel)
  __init__(task, model_name, n_tasks, config)
    - task: causal_lm | regression | classification | mtr
    - Loads the tokenizer from the HuggingFace Hub
    - Sets pad_token = eos_token (decoder models have no pad token by default)
    - Syncs vocab_size between config and tokenizer
    - Creates the appropriate model class based on task

  _prepare_batch(batch)
    - causal_lm: labels = input_ids (model shifts internally)
    - regression/classification: labels from dataset, proper dtype casting
    - Multi-task classification: float labels for BCEWithLogitsLoss
```
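
Hypothetical usage once the wrapper lands (the constructor signature comes
from my prototype and may change in review; the checkpoint id is
illustrative):

```python
from deepchem.models import OLMo  # assumed final import path

# Fine-tune for single-task regression on a DeepChem dataset whose X
# column holds raw SMILES strings.
model = OLMo(task='regression',
             model_name='allenai/OLMo-2-0425-1B',
             n_tasks=1)
model.fit(train_dataset, nb_epoch=3)
preds = model.predict(valid_dataset)
```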

**Component D: Tokenization strategy**
Phase 1 (GSoC): Use OLMo's pretrained tokenizer as-is on SMILES.
- OLMo's 100K BPE vocab actually tokenizes SMILES more efficiently
  than ChemBERTa's 600-token vocab (0.9x token ratio in my analysis)
- BUT it fragments chemical semantics: [C@@H] -> [C, @@, H, ]
  (see the snippet below for a quick way to inspect this)
- ChemBERTa learns chemistry-aware merges: (=O), ccccc, COc

Phase 2 (stretch): Extend the tokenizer with SMILES-specific tokens.
- Add special tokens for stereochemistry: [C@@H], [C@H], [nH]
- Add aromatic ring tokens: c1ccccc1
- Retrain BPE on a mixed English + SMILES corpus
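
Checking the fragmentation claim takes a few lines (the checkpoint id is
illustrative; any OLMo-2 model on the Hub works):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-0425-1B")

# Alanine with stereochemistry: inspect how [C@@H] is split apart.
smiles = "C[C@@H](N)C(=O)O"
print(tok.tokenize(smiles), len(tok.tokenize(smiles)), "tokens")
```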

### 3.2 Pseudocode

Olmo2ForSequenceClassification.forward():
```
hidden_states = self.model(input_ids, attention_mask=attention_mask)[0]
# Pool the last non-padded token. Use attention_mask rather than
# (input_ids != pad_token_id): pad_token is aliased to eos_token here, so a
# trailing EOS would otherwise be miscounted as padding.
seq_lengths = attention_mask.sum(dim=-1) - 1
pooled = hidden_states[torch.arange(batch_size), seq_lengths]
logits = self.score(pooled)  # nn.Linear(hidden_size, num_labels)
loss = None
if labels is not None:
    if problem_type == "regression":
        loss = MSELoss()(logits.squeeze(-1), labels)
    elif problem_type == "single_label_classification":
        loss = CrossEntropyLoss()(logits, labels)
    elif problem_type == "multi_label_classification":
        loss = BCEWithLogitsLoss()(logits, labels)
return SequenceClassifierOutput(loss=loss, logits=logits)
```

OLMo._prepare_batch() for causal_lm:
```
tokens = tokenizer(smiles_list, padding=True, return_tensors="pt")
input_ids = tokens.input_ids.to(device)
attention_mask = tokens.attention_mask.to(device)
labels = input_ids.clone()           # next-token prediction; model shifts internally
labels[attention_mask == 0] = -100   # keep padding out of the loss
return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

HuggingFaceModel.generate():
```
tokens = tokenizer(inputs, padding=True, return_tensors="pt").to(device)
output_ids = model.generate(**tokens, max_new_tokens=N, **kwargs)
return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

### 3.3 Testing Plan

8 unit tests (all passing in my prototype):

| Test | What it validates |
|------|-------------------|
| test_olmo_causal_lm_pretraining | Causal LM trains, loss > 0 |
| test_olmo_regression_finetuning | Regression trains, predictions match shape, MAE computable |
| test_olmo_classification | Classification on binary labels, loss > 0 |
| test_olmo_multitask_regression | MTR with 2 tasks, prediction shape matches |
| test_olmo_save_and_restore | Checkpoint save/load, weights match exactly |
| test_olmo_load_from_pretrained | Pretrain causal LM -> load into regression model |
| test_olmo_generate | Single and batch generation return strings |
| test_olmo_invalid_task | ValueError on a bad task name |

All tests use a tiny config (64 hidden, 2 layers, 2 heads); no model
download is needed, and the suite runs in ~27 seconds on CPU.
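
For reference, the tiny config looks roughly like this. Only the sizes above
(64 hidden, 2 layers, 2 heads) come from the prototype; the remaining fields
are illustrative small values:

```python
from transformers import Olmo2Config

tiny_config = Olmo2Config(
    vocab_size=1024,          # illustrative; synced with the test tokenizer
    hidden_size=64,
    intermediate_size=128,    # illustrative
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
)
```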

Integration tests (to add during GSoC):
- MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
- SMILES generation validity (RDKit validation; see the helper sketched below)
- Continued pretraining convergence on ZINC/PubChem subsets
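
The validity metric itself is small; a minimal sketch using RDKit:

```python
from rdkit import Chem

def smiles_validity(samples):
    """Fraction of generated strings that RDKit parses into a molecule."""
    if not samples:
        return 0.0
    valid = sum(Chem.MolFromSmiles(s) is not None for s in samples)
    return valid / len(samples)
```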

### 3.4 Sources of Risk

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| OLMo-7B requires ~14GB VRAM for inference | High | Use OLMo-1B for CI/demos. Test with tiny configs. Document GPU requirements. |
| SMILES generation validity low without extensive pretraining | High | This IS the core problem. Budget 3 weeks for pretraining experiments. Use ZINC-250K as the training corpus. Target >50% validity. |
| Olmo2ForSequenceClassification not upstream | Medium | Our implementation follows HF patterns exactly. If HF adds it later, we swap to theirs. |
| Tokenizer fragments chemical semantics | Medium | Phase 1: works as-is (my experiments show learning happens). Phase 2: extend the vocabulary. |
| transformers version compatibility | Low | Already found and fixed one issue (PR #4913). Use top-level imports throughout. |

### 3.5 Milestones & Timeline

Assuming Medium size (175 hours, ~12 weeks):

**Milestone 1: Core wrapper (Weeks 1-3)**
- PR: Base class changes to HuggingFaceModel (causal_lm task, generate())
- Coordinate with PR #4907 to avoid duplication
- PR: Olmo2ForSequenceClassification
- PR: OLMo wrapper class with all task modes
- PR: Unit tests (8 tests)
- Deliverable: `from deepchem.models import OLMo` works for all tasks

**Milestone 2: Continued pretraining (Weeks 4-6)**
- PR: Pretraining pipeline on molecular data (ZINC-250K)
- PR: Data loading utilities for SMILES corpora
- PR: Pretraining tutorial notebook
- Deliverable: Pretrained OLMo checkpoint on molecular data (sketched below)
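
End to end, the Milestone 2 pipeline should reduce to a few lines for users.
A hypothetical sketch, assuming the Component C wrapper; the CSV path and
checkpoint id are illustrative:

```python
import deepchem as dc
from deepchem.models import OLMo  # assumed import path for the new wrapper

# DummyFeaturizer passes SMILES strings through untouched; the tokenizer
# inside the wrapper does the real work.
loader = dc.data.CSVLoader(tasks=[], feature_field="smiles",
                           featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset("zinc250k.csv")  # illustrative path

model = OLMo(task='causal_lm', model_name='allenai/OLMo-2-0425-1B')
model.fit(dataset, nb_epoch=1)  # continued pretraining on SMILES
```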

**Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)**
- PR: Classification tutorial (BBBP, Tox21)
- PR: Regression tutorial (ESOL, FreeSolv, Lipophilicity)
- PR: Benchmark results table vs ChemBERTa
- Deliverable: Published benchmark comparing OLMo vs ChemBERTa on MoleculeNet

**Milestone 4: Generation & polish (Weeks 10-12)**
- PR: SMILES generation tutorial with RDKit validity checking
- PR: Documentation (numpydoc, API reference, user guide)
- PR: Tokenizer extension experiments (stretch goal)
- Deliverable: Complete documentation and tutorials

Each milestone corresponds to one evaluation checkpoint. PRs are <50 lines
where possible, following DeepChem's contribution guidelines.

### 3.6 Pull Request Plan

I will follow DeepChem's guidelines: small PRs (<50 lines for the initial
ones), with tests and numpydoc documentation. Expected ~8-12 PRs total:

1. HuggingFaceModel causal_lm support (~40 lines)
2. generate() method (~50 lines)
3. Olmo2ForSequenceClassification (~100 lines; larger, will discuss with mentor)
4. OLMo wrapper class (~80 lines)
5. Unit tests (~180 lines)
6. Pretraining pipeline
7. Data utilities
8. Tutorial notebooks (3-4 notebooks)
9. Documentation updates
10. Benchmark scripts
---

## 4. COMMUNITY ENGAGEMENT

- Already contributing: PR #4913 (bug fix), Issue #4912 (compatibility report)
- Will attend office hours (MWF 9am PST)
- Will join Discord for async discussion
- Will write weekly progress updates
- Happy to review other contributors' HuggingFace-related PRs

---

## 5. RESOURCES REQUIRED

- GPU: I have access to 1x H100 NVL 96GB (Azure) for development
- For CI: tiny model configs, no GPU needed
- For pretraining experiments: my H100 is sufficient for OLMo-1B
- OLMo-7B experiments: may need a multi-GPU setup (discuss with mentor)

---

## 6. BIBLIOGRAPHY

1. Groeneveld et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838.
2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885.
3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence.
4. Weininger (1988). "SMILES, a Chemical Language and Information System." J. Chem. Inf. Comput. Sci.
5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science.

---

## KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal)

- Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M-param model, 200 samples, 3 epochs)
- ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions)
- SMILES generation: 0% validity from random init (proves pretraining is the core challenge)
- Test suite: 8/8 tests pass in 27 seconds on CPU
- Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa