
ChemReason v1.0: 8B Chemistry Reasoning Model

ChemReason is an 8B parameter language model fine-tuned for multi-task organic chemistry reasoning. It performs forward reaction prediction, retrosynthesis, mechanism explanation, synthesis planning, and yield estimation using chain-of-thought reasoning with <think> blocks followed by structured answers with SMILES notation.

NOTE: This model is objectively not great. It makes mistakes on SMILES and can ramble. It exists to prove that you can take USPTO reaction data, clean it up, and fine-tune a language model that can "speak" chemistry with little time and money. It will still get some reactions correct.

Model Details

| Property | Value |
|---|---|
| Base Model | DeepSeek-R1-Distill-Llama-8B |
| Fine-tuning | Supervised Fine-Tuning (SFT) on 121K chemistry reasoning examples generated from 25K USPTO reactions |
| Architecture | LlamaForCausalLM, 8.03B parameters |
| Context Length | 2048 tokens |
| Precision | bfloat16 |
| License | Apache 2.0 |

Base Model vs ChemReason: Side-by-Side Comparison

Q1: Retrosynthesis of Paracetamol

Prompt: Propose a retrosynthetic strategy for synthesizing CC(=O)Nc1ccc(O)cc1.

Base DeepSeek-R1-Distill-Llama-8B

Proposes a convoluted 4-step route: nitration of toluene → reduction → Friedel-Crafts acylation → amide coupling with EDC/HOBt. Never identifies the obvious one-step amide disconnection. No SMILES output. Response truncates at 512 tokens without completing.

ChemReason v1.0

```
<think>
Retrosynthetic Analysis for N-(4-hydroxyphenyl)acetamide:
The target molecule contains an amide functional group and a phenol.
The most obvious disconnection is the amide bond.

Disconnection 1: Amide C-N bond
Breaking the C-N bond reveals two synthons:
- An aryl amine synthon corresponding to aminophenol (Nc1ccccc1O)
- An acylium ion synthon from acetic anhydride (CC(=O)OC(=O)C)
  or acetyl chloride (CC(=O)Cl)

Acetic anhydride is generally preferred due to its mildness.
</think>
The key disconnection is the amide bond, leading back to aminophenol
and acetic anhydride as starting materials via N-acetylation.
```

Correctly identifies the one-step disconnection, produces valid SMILES for all reagents, and recommends the optimal acetylating agent.

Q2: Suzuki Coupling Product Prediction

Prompt: Predict the product of the reaction with reactants: c1ccc(Br)cc1.OB(O)c1ccccc1.[Pd].

Base DeepSeek-R1-Distill-Llama-8B

Identifies the reaction type inconsistently (sometimes "Ullmann Coupling", sometimes "Suzuki-Miyaura"). Misidentifies the boronic acid as phenol in some runs. Predicts the wrong product (diphenyl ether instead of biphenyl). Cannot write product SMILES.

ChemReason v1.0

```
<think>
This is a classic Suzuki-Miyaura cross-coupling reaction.
The key transformation involves the formation of a new carbon-carbon
bond between an aryl halide and an organoboron compound.
</think>
```

Consistently identifies Suzuki-Miyaura coupling and reasons through the mechanism in structured <think> blocks.

Q3: Nucleophilic Acyl Substitution Mechanism

Prompt: Explain the mechanism of the reaction between an acyl chloride and an amine to form an amide.

Base DeepSeek-R1-Distill-Llama-8B

Correctly identifies nucleophilic acyl substitution but makes a key mechanistic error: it states "the oxygen from the acyl chloride acts as a leaving group", when it is the chloride ion that leaves, not oxygen.

ChemReason v1.0

```
<think>
The mechanism for the formation of an amide from a carboxylic acid
derivative (like an acyl chloride) and an amine is a nucleophilic
acyl substitution.

1. Nucleophilic Attack: The lone pair on the nitrogen attacks
   the electrophilic carbonyl carbon...
</think>
```

Correctly identifies the mechanism and describes each step accurately with proper terminology.

Summary

| Capability | Base DeepSeek | ChemReason v1.0 |
|---|---|---|
| Retrosynthetic strategy | Overcomplicated, wrong approach | Correct one-step disconnection |
| SMILES in output | None | Valid SMILES for reagents and intermediates |
| Reaction identification | Inconsistent, sometimes wrong | Consistent and correct |
| Mechanistic detail | Key errors (wrong leaving group) | Accurate step-by-step |
| Reasoning structure | Verbose markdown | Structured <think> blocks |

Training Data

The model was trained on ~121,000 chemistry reasoning examples derived from:

  • USPTO patent reactions (~25K reactions from the 1976–2016 USPTO grants dataset), each expanded into multiple task types by Gemini 2.5 Pro with expert-verified gold SMILES
  • Task distribution: forward prediction, retrosynthesis, mechanism explanation, synthesis planning, yield estimation
  • All examples include <think> reasoning chains demonstrating step-by-step chemical logic before providing the final answer
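To make the format concrete, a single training record might look like the sketch below. The field names and exact formatting here are illustrative assumptions, not the dataset's actual schema:

```python
# Illustrative sketch of one training record. The field names ("task",
# "prompt", "response") are assumptions, not the dataset's actual schema.
example = {
    "task": "retrosynthesis",
    "prompt": "Propose a retrosynthetic strategy for synthesizing CC(=O)Nc1ccc(O)cc1.",
    "response": (
        "<think>\n"
        "The target contains an amide and a phenol; the amide C-N bond is the\n"
        "most obvious disconnection...\n"
        "</think>\n"
        "Disconnect the amide bond to give 4-aminophenol and acetic anhydride."
    ),
}

# Every response pairs a <think> reasoning chain with a final answer.
assert example["response"].startswith("<think>")
```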

Data Quality

  • Gold SMILES targets extracted from reaction SMILES with 90% coverage (28,793/31,782)
  • RDKit-validated SMILES throughout the pipeline
  • Gemini-generated reasoning chains with chemistry-specific quality filtering
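RDKit's `Chem.MolFromSmiles` is the validator referred to above. As a dependency-free illustration of the kind of structural checks involved (not the project's actual pipeline code), the sketch below verifies that parentheses are balanced and that every ring-closure digit appears an even number of times:

```python
from collections import Counter

def smiles_sanity_check(smiles: str) -> bool:
    """Cheap structural checks on a SMILES string: balanced parentheses and
    paired ring-closure digits. Skips bracket atoms like [NH2+] and does not
    handle %nn two-digit ring closures. Not a substitute for RDKit parsing."""
    depth = 0
    in_bracket = False
    ring_digits = Counter()
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are H counts/charges, not rings
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_digits[ch] += 1
    # every ring-closure label must be opened and closed (even count)
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())
```

For example, `smiles_sanity_check("CC(=O)Nc1ccc(O)cc1")` passes, while a truncated string with an unpaired ring digit or unclosed parenthesis fails.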

Usage

Critical: Tokenizer Configuration

⚠️ You must use the Llama 3.1 tokenizer, not the DeepSeek tokenizer. The DeepSeek tokenizer has a known bug that strips spaces from SMILES strings. Both tokenizers share the same vocabulary, but the Llama tokenizer handles space encoding correctly.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "JamesBrewster/ChemReason_v1.0_8B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

# CRITICAL: Use Llama tokenizer, NOT the default DeepSeek tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

query = "Propose a retrosynthetic strategy for synthesizing CC(=O)Nc1ccc(O)cc1."

messages = [{"role": "user", "content": query}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.3,
        do_sample=True,
        eos_token_id=[128009, 128001],  # Both EOS tokens required
        repetition_penalty=1.1
    )

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

Generation Configuration

| Parameter | Recommended Value | Notes |
|---|---|---|
| `eos_token_id` | `[128009, 128001]` | Both required. 128009 = Llama EOS, 128001 = end-of-text. Without both, the model may loop. |
| `temperature` | 0.3 | Low temperature for consistent answers. Use 0.7–1.0 for diverse sampling. |
| `repetition_penalty` | 1.1 | Prevents SMILES token repetition loops. |
| `max_new_tokens` | 512 | Sufficient for most chemistry reasoning tasks. |
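These settings can also be bundled into a `transformers.GenerationConfig` so they travel with the model. This is a sketch assuming a recent `transformers` version; it is equivalent to passing the same keyword arguments to `generate()`:

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3,
    repetition_penalty=1.1,
    eos_token_id=[128009, 128001],  # both EOS tokens, or the model may loop
)
# then: output = model.generate(**inputs, generation_config=gen_config)
```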

Capabilities

1. Forward Reaction Prediction

Given reactants and conditions, predict the product(s).

2. Retrosynthesis

Given a target molecule, propose synthetic precursors with SMILES.

3. Mechanism Explanation

Explain reaction mechanisms step by step.

4. Synthesis Planning

Multi-step route design for complex targets.

5. Yield Estimation

Predict expected yields given reaction conditions.
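One illustrative prompt per task is collected below. The retrosynthesis, forward-prediction, and mechanism phrasings are taken from the examples earlier in this card; the synthesis-planning and yield-estimation phrasings are hypothetical, not drawn from the training set:

```python
# Example prompts per capability. The last two phrasings are assumptions,
# not taken from the actual training data.
EXAMPLE_PROMPTS = {
    "forward_prediction": "Predict the product of the reaction with reactants: c1ccc(Br)cc1.OB(O)c1ccccc1.[Pd].",
    "retrosynthesis": "Propose a retrosynthetic strategy for synthesizing CC(=O)Nc1ccc(O)cc1.",
    "mechanism": "Explain the mechanism of the reaction between an acyl chloride and an amine to form an amide.",
    "synthesis_planning": "Design a multi-step synthesis route for CC(=O)Nc1ccc(O)cc1 from benzene.",
    "yield_estimation": "Estimate the expected yield for the Suzuki coupling of bromobenzene and phenylboronic acid under Pd catalysis.",
}
```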

Output Format

The model produces responses in a structured format:

```
<think>
[Step-by-step chemical reasoning]
- Identify functional groups and reactive sites
- Consider selectivity and conditions
- Evaluate possible pathways
</think>

[Concise answer with SMILES where applicable]
```
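A response in this format can be split into its reasoning and answer parts with a small helper. This is a sketch, not part of the released code:

```python
import re

def split_response(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer).
    Returns an empty reasoning string if no <think> block is present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_response(
    "<think>\nAmide disconnection is most obvious.\n</think>\n"
    "Use 4-aminophenol and acetic anhydride."
)
```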

Training Procedure

  • Method: LoRA SFT (rank 16, alpha 32)
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Learning rate: 1e-5 with cosine schedule
  • Batch size: 8 effective (1 x 8 gradient accumulation)
  • Epochs: 2
  • Hardware: NVIDIA H100 SXM

Limitations

  • Positional isomerism: The model occasionally confuses ortho/meta/para substitution patterns when reading complex SMILES
  • SMILES grounding: For very complex molecules (>50 heavy atoms), the model may describe chemistry correctly in English while producing approximate SMILES
  • Stereochemistry: Limited handling of E/Z and R/S stereochemistry in SMILES notation
  • Knowledge scope: Trained primarily on USPTO patent chemistry; may be less reliable on organometallic catalysis or biochemistry outside the training distribution
  • Not a replacement for experimental verification: All proposed syntheses should be evaluated by a qualified chemist

Citation

```bibtex
@misc{chemreason2026,
  title={ChemReason: Multi-Task Chemistry Reasoning with Chain-of-Thought},
  author={James Brewster},
  year={2026},
  url={https://huggingface.co/JamesBrewster/ChemReason_v1.0_8B}
}
```

Acknowledgments

Built on DeepSeek-R1-Distill-Llama-8B. Training data generated from the USPTO patent reaction dataset with reasoning chains produced by Gemini 2.5 Pro.
