# GSoC 2026 Proposal: LLM Support for 7B Models (OLMo) in DeepChem
# CONTENT REFERENCE β€” REWRITE IN YOUR OWN WORDS BEFORE SUBMITTING
---
## 1. INTRODUCTION
### What this project is about
DeepChem's HuggingFaceModel wrapper currently supports encoder-only models
(ChemBERTa, MoLFormer) through masked language modeling. There is no support
for decoder-only causal language models. This project adds OLMo-2
(Allen AI's open language model) to DeepChem, enabling:
- Continued pretraining on molecular data (SMILES)
- Fine-tuning for classification and regression
- Autoregressive molecular generation
### Why it matters
Encoder models (ChemBERTa) can classify and predict properties, but they
CANNOT generate molecules. A causal LM like OLMo opens up:
- De novo molecular generation (drug discovery)
- Text-molecule bridging (OLMo understands English AND can learn SMILES)
- In-context few-shot learning without fine-tuning
- Transfer learning from scientific literature
### Why OLMo specifically
- Fully open (weights, data, training code), unlike LLaMA/GPT
- OLMo-2 is natively supported in HuggingFace transformers (no custom code)
- 1B and 7B variants available for different compute budgets
- Trained on Dolma corpus which includes scientific papers
### What I've already done (reference your PR and experiments)
- Found and fixed a transformers 5.x compatibility bug in ChemBERTa (PR #4913)
- Filed issue #4912 documenting broader transformers 5.x compat gap
- Built a working OLMo wrapper prototype (locally) with:
- Olmo2ForSequenceClassification (doesn't exist in transformers)
- Causal LM pretraining on SMILES
- All 8 unit tests passing
- Ran experiments on real MoleculeNet data:
- BBBP classification: ROC-AUC 0.67 (random init, tiny model)
- ESOL regression: R^2 = 0.37
- SMILES generation: 0% validity (expected; proves pretraining is essential)
---
## 2. RELEVANT EXPERIENCE & INTEREST
### Technical background
- Parameter Golf (OpenAI competition, March 2026): Trained language models
from scratch under 16MB constraint. Custom SentencePiece tokenizers,
GPTQ-lite quantization, flash attention, architecture design (11L 512d
transformer). This is directly relevant β€” I understand transformer
training at a low level.
- GSPO-DeepSeek-R1-Distill-Qwen-1.5B (15 GitHub stars): Fine-tuning and
distillation of large language models.
- wingman-AI (29 GitHub stars): Production AI assistant system.
- Open source contributions: PRs to HuggingFace transformers, Unsloth,
Anthropic SDK, OpenAI SDK, Karpathy's nanochat.
### Why I want to work on this
[WRITE THIS YOURSELF β€” what genuinely interests you about molecular ML?
Why DeepChem? Be specific and honest. Don't say "I'm passionate about
open source" β€” say what specific thing drew you to this project.]
### Links
- GitHub: https://github.com/vivekvar-dl
- PR #4913: https://github.com/deepchem/deepchem/pull/4913
- Issue #4912: https://github.com/deepchem/deepchem/issues/4912
---
## 3. WORK PLAN
### 3.1 Design
The implementation has four components:
**Component A: Base class changes to HuggingFaceModel**
- Add `causal_lm` task support (DataCollatorForLanguageModeling with mlm=False)
- Add `AutoModelForCausalLM` branch in load_from_pretrained()
- Add `generate()` method for autoregressive text generation
- Add causal LM batch preparation in _prepare_batch()
Note: PR #4907 by another contributor adds a similar generate() method.
My work is complementary: I'm adding a full model wrapper, not just
generation plumbing.
**Component B: Olmo2ForSequenceClassification**
This class DOES NOT EXIST in HuggingFace transformers. For OLMo-2,
transformers ships only Olmo2ForCausalLM; there is no classification head.
I built one:
- Extends Olmo2PreTrainedModel
- Uses last-token pooling (last non-padded token's hidden state)
- Linear projection head for classification/regression
- Supports single-label, multi-label, and regression via problem_type config
- Computes CrossEntropyLoss / BCEWithLogitsLoss / MSELoss based on task
This follows the same pattern as LlamaForSequenceClassification.
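The pooling rule above can be sketched framework-free (plain Python lists stand in for tensors; right-padding and the token ids are illustrative):

```python
def last_token_pool(hidden_states, input_ids, pad_token_id):
    """Select the hidden state of the last non-padded token per sequence.

    hidden_states: [batch][seq_len][hidden_size] nested lists (tensor stand-in)
    input_ids:     [batch][seq_len] token ids, assumed right-padded
    """
    pooled = []
    for ids, states in zip(input_ids, hidden_states):
        # count real (non-pad) tokens; the last one summarizes the sequence
        length = sum(1 for t in ids if t != pad_token_id)
        pooled.append(states[length - 1])
    return pooled

# Batch of 2 sequences, hidden_size 2, pad_token_id 0 (all values made up)
ids = [[5, 7, 0, 0], [3, 4, 6, 8]]
hs = [[[1.0, 1.0], [2.0, 2.0], [9.0, 9.0], [9.0, 9.0]],
      [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]]
print(last_token_pool(hs, ids, pad_token_id=0))  # [[2.0, 2.0], [4.0, 4.0]]
```

The real implementation would do the same with a single tensor-index operation, as LlamaForSequenceClassification does.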
**Component C: OLMo wrapper class**
```
OLMo(HuggingFaceModel)
__init__(task, model_name, n_tasks, config)
- task: causal_lm | regression | classification | mtr
- Loads tokenizer from HuggingFace Hub
- Sets pad_token = eos_token (decoder models don't have pad by default)
- Syncs vocab_size between config and tokenizer
- Creates appropriate model class based on task
_prepare_batch(batch)
- causal_lm: labels = input_ids (model shifts internally)
- regression/classification: labels from dataset, proper dtype casting
- Multi-task classification: float labels for BCEWithLogitsLoss
```
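The label-handling rules in `_prepare_batch` can be summarized as a small dispatch (task names follow the wrapper's modes; the dtype/loss strings are purely illustrative labels, not a real DeepChem API):

```python
def label_spec(task, n_tasks=1):
    """Return (label_dtype, loss) for each task mode of the proposed wrapper."""
    if task == "causal_lm":
        # labels are the input ids themselves; the model shifts them internally
        return ("int64", "cross_entropy(shifted)")
    if task == "classification":
        if n_tasks > 1:
            # multi-task classification: BCEWithLogitsLoss expects float targets
            return ("float32", "bce_with_logits")
        return ("int64", "cross_entropy")
    if task in ("regression", "mtr"):
        return ("float32", "mse")
    raise ValueError(f"unknown task: {task}")

print(label_spec("classification", n_tasks=12))  # ('float32', 'bce_with_logits')
```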
**Component D: Tokenization strategy**
Phase 1 (GSoC): Use OLMo's pretrained tokenizer as-is on SMILES.
- OLMo's 100K BPE vocab actually tokenizes SMILES more efficiently
than ChemBERTa's 600-token vocab (0.9x token ratio in my analysis)
- BUT it fragments chemical semantics: [C@@H] -> `[C`, `@@`, `H`, `]`
- ChemBERTa learns chemistry-aware merges: (=O), ccccc, COc
Phase 2 (stretch): Extend tokenizer with SMILES-specific tokens.
- Add special tokens for stereochemistry: [C@@H], [C@H], [nH]
- Add aromatic ring tokens: c1ccccc1
- Retrain BPE on mixed English + SMILES corpus
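One common way to keep chemical units intact before BPE, used by several chemical language models, is regex pre-tokenization. The pattern below is the widely cited SMILES tokenization regex, not anything OLMo ships; it illustrates what Phase 2 would preserve:

```python
import re

# Widely used SMILES tokenization regex: bracket atoms like [C@@H] stay whole,
# two-letter elements (Cl, Br) are not split, ring-bond digits are separate.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def smiles_tokens(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_PATTERN.findall(smiles)

print(smiles_tokens("C[C@@H](N)C(=O)O"))  # alanine
# ['C', '[C@@H]', '(', 'N', ')', 'C', '(', '=', 'O', ')', 'O']
```

Tokens produced this way could seed the added vocabulary entries before retraining BPE merges.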
### 3.2 Pseudocode
Olmo2ForSequenceClassification.forward():
```
hidden_states = self.model(input_ids, attention_mask).last_hidden_state
# Pool: take the hidden state of the last non-padded token
seq_lengths = (input_ids != pad_token_id).sum(-1) - 1
batch_range = torch.arange(input_ids.shape[0])
pooled = hidden_states[batch_range, seq_lengths]
logits = self.score(pooled)  # Linear(hidden_size, num_labels)
loss = None
if labels is not None:
    if problem_type == "regression":   loss = MSELoss()(logits, labels)
    if problem_type == "single_label": loss = CrossEntropyLoss()(logits, labels)
    if problem_type == "multi_label":  loss = BCEWithLogitsLoss()(logits, labels)
return {loss, logits}
```
OLMo._prepare_batch() for causal_lm:
```
tokens = tokenizer(smiles_list, padding=True, return_tensors="pt")
input_ids = tokens.input_ids.to(device)
attention_mask = tokens.attention_mask.to(device)
labels = input_ids.clone()           # next-token prediction (model shifts internally)
labels[attention_mask == 0] = -100   # mask pad positions out of the loss
return {input_ids, attention_mask, labels}
```
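The padding detail matters: HuggingFace loss functions skip label positions equal to -100 (the default `ignore_index` of CrossEntropyLoss), so pad tokens should be masked out rather than trained on. A framework-free sketch (lists stand in for tensors; the token ids are made up):

```python
def causal_lm_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, masking padded positions with ignore_index.

    CrossEntropyLoss ignores positions equal to -100, so the model never
    learns to predict pad tokens.
    """
    return [
        [tok if mask == 1 else ignore_index for tok, mask in zip(ids, masks)]
        for ids, masks in zip(input_ids, attention_mask)
    ]

ids  = [[11, 12, 13, 0], [21, 22, 0, 0]]   # 0 = pad (illustrative ids)
mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(causal_lm_labels(ids, mask))
# [[11, 12, 13, -100], [21, 22, -100, -100]]
```

This mirrors what transformers' DataCollatorForLanguageModeling does with `mlm=False`.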
HuggingFaceModel.generate():
```
tokens = tokenizer(inputs, padding=True, return_tensors="pt").to(device)
output_ids = model.generate(**tokens, max_new_tokens=N, **kwargs)
return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```
### 3.3 Testing Plan
8 unit tests (all passing in my prototype):
| Test | What it validates |
|------|-------------------|
| test_olmo_causal_lm_pretraining | Causal LM trains, loss > 0 |
| test_olmo_regression_finetuning | Regression trains, predictions match shape, MAE computable |
| test_olmo_classification | Classification on binary labels, loss > 0 |
| test_olmo_multitask_regression | MTR with 2 tasks, predictions shape matches |
| test_olmo_save_and_restore | Checkpoint save/load, weights match exactly |
| test_olmo_load_from_pretrained | Pretrain causal LM -> load into regression model |
| test_olmo_generate | Single and batch generation returns strings |
| test_olmo_invalid_task | ValueError on bad task name |
All tests use a tiny config (64 hidden, 2 layers, 2 heads), so no model
download is needed; the suite runs in ~27 seconds on CPU.
Integration tests (to add during GSoC):
- MoleculeNet benchmarks: BBBP, ESOL, FreeSolv, Lipophilicity
- SMILES generation validity (RDKit validation)
- Continued pretraining convergence on ZINC/PubChem subsets
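Validity checking would use RDKit (`Chem.MolFromSmiles` returns None for invalid SMILES); the metric itself can be sketched with the parser injected as a predicate, so this sketch runs without RDKit installed:

```python
def validity_rate(smiles_list, is_valid):
    """Fraction of generated SMILES accepted by a validity predicate.

    In practice is_valid would be the RDKit check:
        lambda s: Chem.MolFromSmiles(s) is not None
    """
    if not smiles_list:
        return 0.0
    return sum(1 for s in smiles_list if is_valid(s)) / len(smiles_list)

# Stub predicate for illustration only: "valid" = balanced parentheses
def balanced(s):
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

print(round(validity_rate(["CC(=O)O", "CC(", "c1ccccc1"], balanced), 3))  # 0.667
```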
### 3.4 Sources of Risk
| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| OLMo-7B requires ~14GB VRAM for inference | High | Use OLMo-1B for CI/demos. Test with tiny configs. Document GPU requirements. |
| SMILES generation validity low without extensive pretraining | High | This IS the core problem. Budget 3 weeks for pretraining experiments. Use ZINC-250K as training corpus. Target >50% validity. |
| Olmo2ForSequenceClassification not upstream | Medium | Our implementation follows HF patterns exactly. If HF adds it later, we swap to theirs. |
| Tokenizer fragments chemical semantics | Medium | Phase 1: works as-is (my experiments show learning happens). Phase 2: extend vocabulary. |
| transformers version compatibility | Low | Already found and fixed one issue (PR #4913). Use top-level imports throughout. |
### 3.5 Milestones & Timeline
Assuming Medium size (175 hours, ~12 weeks):
**Milestone 1: Core wrapper (Weeks 1-3)**
- PR: Base class changes to HuggingFaceModel (causal_lm task, generate())
- Coordinate with PR #4907 to avoid duplication
- PR: Olmo2ForSequenceClassification
- PR: OLMo wrapper class with all task modes
- PR: Unit tests (8 tests)
- Deliverable: `from deepchem.models import OLMo` works for all tasks
**Milestone 2: Continued pretraining (Weeks 4-6)**
- PR: Pretraining pipeline on molecular data (ZINC-250K)
- PR: Data loading utilities for SMILES corpora
- PR: Pretraining tutorial notebook
- Deliverable: Pretrained OLMo checkpoint on molecular data
**Milestone 3: Fine-tuning & benchmarks (Weeks 7-9)**
- PR: Classification tutorial (BBBP, Tox21)
- PR: Regression tutorial (ESOL, FreeSolv, Lipophilicity)
- PR: Benchmark results table vs ChemBERTa
- Deliverable: Published benchmark comparing OLMo vs ChemBERTa on MoleculeNet
**Milestone 4: Generation & polish (Weeks 10-12)**
- PR: SMILES generation tutorial with RDKit validity checking
- PR: Documentation (numpydoc, API reference, user guide)
- PR: Tokenizer extension experiments (stretch goal)
- Deliverable: Complete documentation and tutorials
Each milestone = 1 evaluation checkpoint. PRs are <50 lines where possible,
following DeepChem's contribution guidelines.
### 3.6 Pull Request Plan
I will follow DeepChem's guidelines: small PRs (<50 lines for initial ones),
with tests and numpydoc documentation. Expected ~8-12 PRs total:
1. HuggingFaceModel causal_lm support (~40 lines)
2. generate() method (~50 lines)
3. Olmo2ForSequenceClassification (~100 lines; larger, will discuss with mentor)
4. OLMo wrapper class (~80 lines)
5. Unit tests (~180 lines)
6. Pretraining pipeline
7. Data utilities
8. Tutorial notebooks (3-4 notebooks)
9. Documentation updates
10. Benchmark scripts
---
## 4. COMMUNITY ENGAGEMENT
- Already contributing: PR #4913 (bug fix), Issue #4912 (compat report)
- Will attend office hours MWF 9am PST
- Will join Discord for async discussion
- Will write weekly progress updates
- Happy to review other contributors' HuggingFace-related PRs
---
## 5. RESOURCES REQUIRED
- GPU: I have access to 1x H100 NVL 96GB (Azure) for development
- For CI: tiny model configs, no GPU needed
- For pretraining experiments: my H100 is sufficient for OLMo-1B
- OLMo-7B experiments: may need multi-GPU setup (discuss with mentor)
---
## 6. BIBLIOGRAPHY
1. Groeneveld et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838
2. Chithrananda et al. (2020). "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv:2010.09885
3. Ross et al. (2022). "Large-Scale Chemical Language Representations Capture Molecular Structure and Properties." Nature Machine Intelligence.
4. Weininger (1988). "SMILES, a chemical language and information system." J. Chem. Inf. Comput. Sci.
5. Wu et al. (2018). "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science.
---
## KEY NUMBERS FROM YOUR EXPERIMENTS (reference these in proposal)
- Tokenization: OLMo uses 0.9x tokens vs ChemBERTa on drug molecules
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M param model, 200 samples, 3 epochs)
- ESOL regression: R^2 = 0.37, MAE = 1.27 (same conditions)
- SMILES generation: 0% validity from random init (proves pretraining is the core challenge)
- Test suite: 8/8 tests pass in 27 seconds on CPU
- Stereochemistry fragmentation: [C@@H] splits into 4 tokens in OLMo vs 7 in ChemBERTa