Llama-ChemLink-Parser-8B-MTYS

ChemLink is a LoRA fine-tune of tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 for extracting chemical measurement values (MW, IC50, EC50, Yield) from scientific literature, with compound-name linkage for PubChem grounding and Graph RAG integration.

Key Capability

Under a measurement-only prompt (one with no explicit compound_name instruction), ChemLink still outputs compound_name alongside each extracted value. In the evaluation reported under Condition A, every baseline model output compound_name for 0% of samples under the same prompt.

{
  "document_understanding": {},
  "chemical_entities": [
    {
      "compound_name": "linezolid",
      "measurements": [{"type": "Molecular Weight", "value": 337.35, "unit": "g/mol"}]
    }
  ]
}

This behavior is learned during fine-tuning rather than requested in the prompt, so downstream Graph RAG pipelines can link each measurement value to its chemical entity node without manual post-processing.
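
As a minimal sketch of that linking step, the nested output can be flattened into one row per (compound, measurement) pair, which is a convenient shape for creating graph edges. The function name `flatten_entities` and the row layout are illustrative assumptions, not part of the model or its tooling; only the field names come from the example output above.

```python
from typing import Any


def flatten_entities(parsed: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one flat row per (compound, measurement) pair.

    Hypothetical helper: assumes the output shape shown in the example above.
    """
    rows = []
    for entity in parsed.get("chemical_entities", []):
        name = entity.get("compound_name")
        for m in entity.get("measurements", []):
            rows.append({
                "compound_name": name,
                "type": m.get("type"),
                "value": m.get("value"),
                "unit": m.get("unit"),
            })
    return rows


example = {
    "document_understanding": {},
    "chemical_entities": [
        {
            "compound_name": "linezolid",
            "measurements": [
                {"type": "Molecular Weight", "value": 337.35, "unit": "g/mol"}
            ],
        }
    ],
}

rows = flatten_entities(example)
print(rows)
```

Each row carries the compound name next to its value, so a graph loader can use compound_name as the node key and the remaining fields as edge or property data.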

Model Overview

Item	Detail
Developer	MitzMitz
Base model	tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3
Training tool	unsloth + TRL (SFTTrainer)
Quantization	4-bit NF4 (QLoRA, load_in_4bit=True)
LoRA config	r=16, alpha=32, dropout=0, bias=none
Max seq length	2048
Supported languages	Japanese, English
License	Llama 3.1 Community License

Usage

Inference (Colab / GPU)

import torch, json, re
from unsloth import FastLanguageModel
from google.colab import userdata

HF_TOKEN = userdata.get('HF_TOKEN')

SYSTEM_PROMPT = (
    "You are a chemical data extraction assistant. "
    "Extract measurements from the given text and return a JSON array. "
    "Each element must have: type (IC50/EC50/MW/Yield), value (number), unit (string). "
    "If no target measurement is found, return []. "
    "Output only the JSON array, no explanation."
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "MitzMitz/Llama-ChemLink-Parser-8B-MTYS",
    max_seq_length = 2048,
    dtype          = None,
    load_in_4bit   = True,
    token          = HF_TOKEN,
)
FastLanguageModel.for_inference(model)

def extract(text):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": text},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True,
        add_generation_prompt=True, return_tensors="pt"
    ).to("cuda")
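    with torch.no_grad():
        # do_sample=False means greedy decoding, so temperature is not applied here;
        # it is kept only to mirror the evaluation settings.
        output = model.generate(
            input_ids, max_new_tokens=256,
            temperature=0.1, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens (skip the prompt).
    raw = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
    # Strip a surrounding Markdown code fence if the model emitted one.
    raw = re.sub(r"^```(?:json)?\s*", "", raw, flags=re.MULTILINE)
    raw = re.sub(r"\s*```\s*$",     "", raw, flags=re.MULTILINE).strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the raw string when the output is not valid JSON.
        return raw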

print(extract("The compound linezolid has a molecular weight of 337.35 g/mol."))

Local CPU inference (Ollama)

ollama create llama-chemlink-parser-8b-mtys -f Modelfile
ollama run llama-chemlink-parser-8b-mtys
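
Once the model has been created, it can also be queried programmatically. The following is a sketch only, assuming Ollama is serving on its default local port and that the model name matches the `ollama create` command above. The helper names `build_payload` and `extract_via_ollama` are hypothetical, and the options mirror the local evaluation settings reported in this card (temperature=0.0, num_predict=128).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint
MODEL_NAME = "llama-chemlink-parser-8b-mtys"    # name used with `ollama create`


def build_payload(system_prompt: str, text: str) -> dict:
    """Build a non-streaming chat request body for the Ollama chat endpoint."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "stream": False,
        "options": {"temperature": 0.0, "num_predict": 128},
    }


def extract_via_ollama(system_prompt: str, text: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's raw text reply."""
    data = json.dumps(build_payload(system_prompt, text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]


# Usage (requires a running Ollama server):
# reply = extract_via_ollama(SYSTEM_PROMPT, "The compound linezolid has a molecular weight of 337.35 g/mol.")
```

The reply is plain text, so the same fence-stripping and json.loads handling used in the Colab example would still be needed before consuming it.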

Training Data

File	Total	MW	IC50	EC50	Yield	Negative ([])	Source
phase6_train_mix	3,763	2,283	717	0	44	719	PubChem / ChEMBL / ORD
additional_ec50_yield	2,534	0	0	1,000	1,000	534	ChEMBL / ORD
additional_yield_table	621	0	0	0	500	121	ORD
additional_mw_unit_fix	120	84	16	0	0	20	PubChem
additional_phase5	740	17	115	22	425	161	ChEMBL / ORD / PubChem
Total	7,778	2,384	848	1,022	1,969	1,555

Data licenses: ORD (CC-BY-SA 4.0), ChEMBL (CC-BY-SA 3.0), PubChem (Public Domain).

Evaluation

Dataset

Indicator	n	Source	Contamination
MW	744	PubChem (PMID-verified)	0
Yield	750	ORD (PMID-verified)	0
IC50	740	ChEMBL → PubMed Abstract	0
EC50	729	ChEMBL → PubMed Abstract	0
Total	2,963		0 confirmed

Condition A — No explicit compound_name instruction

Prompt requests type / value / unit only. No compound_name requested.

MW (n = 744)

Model	Environment	MW correct	compound_name output
ChemLink NF4 (this model)	Colab GPU / unsloth NF4	736 / 744 = 98.9%	736 / 744 = 98.9%
ChemLink q5_k_m (GGUF)	Local CPU / Ollama	663 / 744 = 89.1%	663 / 744 = 89.1%
GPT-4.1-mini	OpenAI API	744 / 744 = 100.0%	0 / 744 = 0%
Swallow-base	Colab GPU / unsloth NF4	692 / 744 = 93.0%	0 / 744 = 0%
Mistral-7B	Local CPU / Ollama	706 / 744 = 94.9%	0 / 744 = 0%
Gemma-7B	Local CPU / Ollama	283 / 744 = 38.0%	0 / 744 = 0%

Yield (n = 750)

Model	Environment	Yield correct	compound_name output
ChemLink NF4 (this model)	Colab GPU / unsloth NF4	730 / 750 = 97.3%	728 / 750 = 97.1%
ChemLink q5_k_m (GGUF)	Local CPU / Ollama	610 / 750 = 81.3%	610 / 750 = 81.3%
GPT-4.1-mini	OpenAI API	750 / 750 = 100.0%	0 / 750 = 0%
Swallow-base	Colab GPU / unsloth NF4	748 / 750 = 99.7%	0 / 750 = 0%
Mistral-7B	Local CPU / Ollama	517 / 750 = 68.9%	0 / 750 = 0%
Gemma-7B	Local CPU / Ollama	442 / 750 = 58.9%	0 / 750 = 0%

Inference note: Colab models used temperature=0.1, max_new_tokens=256, and apply_chat_template. Local Ollama models used temperature=0.0, num_predict=128, and a manual Modelfile TEMPLATE. GPT-4.1-mini was evaluated via the OpenAI Chat Completions API in a separate run. Because decoding settings, output-length limits, and prompt templates differ between environments, comparisons across rows should be read with these differences in mind.

ChemLink NF4 and q5_k_m are the only models that output compound_name under this prompt. No additional instruction is required, because the behavior was learned during fine-tuning.

Condition B — Explicit compound_name instruction (Colab, same prompt for all models)

All five models received the same prompt, which explicitly requests compound_name, and were evaluated on the same 2,963-sample dataset on Colab GPU.

MW (n = 744)

Model	MW correct	compound_name output	PubChem hit	MW DB-verified
ChemLink NF4	740 / 744 = 99.5%	741 / 744 = 99.6%	0	0
ChemLink q5_k_m	744 / 744 = 100.0%	744 / 744 = 100.0%	4	4 (100% cond.)
GPT-4.1-mini	741 / 744 = 99.6%	741 / 744 = 99.6%	12	10 (83.3% cond.)
Mistral-7B	743 / 744 = 99.9%	443 / 744 = 59.5%	36	35 (97.2% cond.)
Swallow-base	734 / 744 = 98.7%	733 / 744 = 98.5%	0	0

In the MW DB-verified column, "cond." is the share of PubChem hits whose database MW matched, for example 10 / 12 = 83.3% for GPT-4.1-mini.

Yield (n = 750)

Model	Yield correct	compound_name output	PubChem hit
ChemLink NF4	685 / 750 = 91.3%	685 / 750 = 91.3%	0
ChemLink q5_k_m	672 / 750 = 89.6%	672 / 750 = 89.6%	49
GPT-4.1-mini	750 / 750 = 100.0%	727 / 750 = 96.9%	18
Mistral-7B	669 / 750 = 89.2%	363 / 750 = 48.4%	18
Swallow-base	749 / 750 = 99.9%	706 / 750 = 94.1%	0

PubChem hit rates are low across all models because real PubMed abstracts frequently use generic compound codes ("compound 3", "44") rather than IUPAC names. This is a property of the input text, not of model capability.
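
A pipeline can skip names that are unlikely to resolve before querying PubChem. The sketch below is a heuristic assumption, not part of the evaluation protocol: the pattern and the helper name `is_generic_code` are illustrative, and the prefix list would need tuning on real abstracts.

```python
import re

# Hypothetical pattern for generic compound codes such as "compound 3", "44", "2b".
GENERIC_CODE = re.compile(
    r"^(?:compound|cpd\.?|analog(?:ue)?|derivative)?\s*\(?\d+[a-z]{0,2}\)?$",
    re.IGNORECASE,
)


def is_generic_code(name: str) -> bool:
    """Return True when the name looks like a paper-local compound label."""
    return bool(GENERIC_CODE.match(name.strip()))


names = ["compound 3", "44", "2b", "linezolid"]
resolvable = [n for n in names if not is_generic_code(n)]
print(resolvable)
```

Names that pass the filter can then be sent to PubChem, while filtered names are better resolved from the paper's own compound table, if one is available.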

PubChem MW grounding — Synthetic texts (V13.1 strict protocol)

Evaluated on synthetic texts where each sentence explicitly contains an IUPAC compound name and its MW value (sourced from PubChem). ChemLink outputs compound_name without instruction; the extracted name is then searched in PubChem and matched against the source MW.

Model	compound_name output	PubChem candidate	MW match	Full success rate	Conditional match
ChemLink NF4	736 / 744	380	375	375 / 744 = 50.4%	375 / 380 = 98.7%
ChemLink q5_k_m	663 / 744	392	389	389 / 744 = 52.3%	389 / 392 = 99.2%
All baselines	0 / 744	0	0	0%	—

Full success: MW correctly extracted AND compound_name output AND PubChem candidate found AND PubChem MolecularWeight matches the extracted MW (±1%). Conditional match is MW match divided by PubChem candidate, that is, the match rate among names that PubChem could resolve. Source: chemlink_v13_1_strict_db_normalization.xlsx, V13.1 strict fixed protocol. Baselines output no compound_name in the no-instruction condition, so PubChem grounding does not apply to them.
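
The full-success criterion and the two reported rates can be sketched as follows. This is an illustration under stated assumptions, not the V13.1 protocol code: the function names are hypothetical, the ±1% tolerance is assumed to be relative to the PubChem value, and the PubChem lookup itself is left out.

```python
from typing import Optional


def full_success(
    mw_correct: bool,
    compound_name: Optional[str],
    extracted_mw: float,
    pubchem_mw: Optional[float],
    rel_tol: float = 0.01,
) -> bool:
    """Apply the four conditions of the full-success definition in order."""
    if not mw_correct or not compound_name:
        return False
    if pubchem_mw is None:  # no PubChem candidate found
        return False
    # Assumption: tolerance is measured relative to the PubChem value.
    return abs(extracted_mw - pubchem_mw) <= rel_tol * abs(pubchem_mw)


def grounding_rates(n_total: int, n_candidates: int, n_matches: int) -> tuple[float, float]:
    """Return (full success rate, conditional match rate) as fractions."""
    full = n_matches / n_total
    conditional = n_matches / n_candidates if n_candidates else 0.0
    return full, conditional


# ChemLink NF4 row from the table above: 744 samples, 380 candidates, 375 matches.
full, conditional = grounding_rates(744, 380, 375)
print(f"{full:.1%} full success, {conditional:.1%} conditional match")
```

Separating the two rates makes the bottleneck visible: the conditional rate stays near 99% while the full rate is limited by how many names PubChem can resolve.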

Limitations

IC50 / EC50: Extraction scores were below 2% across all models and conditions. This reflects a limitation of the unified-output evaluation protocol rather than model capability, so these results are not suitable for cross-model comparison on these indicators.
compound_name in real abstracts: Real PubMed abstracts often use generic codes ("compound 3", "2b") rather than IUPAC names. ChemLink outputs whatever name appears in the source text. PubChem resolution depends on how the original literature names the compound.
Quantization gap: ChemLink NF4 (Colab) and q5_k_m (local GGUF) differ in both quantization and inference backend. In the no-instruction evaluation, q5_k_m scores about 10 pp lower than NF4 on MW extraction (89.1% vs 98.9%) and 16 pp lower on Yield extraction (81.3% vs 97.3%).
Inference environment: Colab GPU evaluations used temperature=0.1 / max_new_tokens=256. Local Ollama evaluations used temperature=0.0 / num_predict=128. Cross-environment comparisons should account for these differences.

Framework Versions (Training)

Library	Version
unsloth	2026.5.2
PEFT	0.19.1
Transformers	5.5.0
PyTorch	2.10.0
TRL	0.24.0
Datasets	4.3.0
