Llama-ChemLink-Parser-8B-MTYS

ChemLink is a LoRA fine-tune of tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 for extracting chemical measurement values (MW, IC50, EC50, Yield) from scientific literature, with compound-name linkage for PubChem grounding and Graph RAG integration.


Key Capability

Under a measurement-only prompt (no explicit compound_name instruction), ChemLink is the only evaluated model that outputs compound_name alongside each extracted value. The baseline models output a compound name for 0% of samples under the same prompt.

{
  "document_understanding": {},
  "chemical_entities": [
    {
      "compound_name": "linezolid",
      "measurements": [{"type": "Molecular Weight", "value": 337.35, "unit": "g/mol"}]
    }
  ]
}

This behavior is learned during fine-tuning and stored in the model weights. It lets downstream Graph RAG pipelines link each measurement value to its chemical entity node without manual post-processing.
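
As a rough illustration of that linkage, the sketch below flattens an output of the shape shown above into (compound, type, value, unit) tuples that a graph loader could turn into edges. The function name `to_edges` and the tuple layout are assumptions made for this example, not part of the ChemLink tooling.

```python
import json

def to_edges(output):
    """Flatten ChemLink-style output into (compound, type, value, unit) tuples.

    Assumes the nested shape shown above: a "chemical_entities" list whose
    items carry "compound_name" and a "measurements" list.
    """
    edges = []
    for entity in output.get("chemical_entities", []):
        name = entity.get("compound_name")
        for m in entity.get("measurements", []):
            edges.append((name, m.get("type"), m.get("value"), m.get("unit")))
    return edges

# Sample taken from the example output above.
sample = json.loads("""
{
  "document_understanding": {},
  "chemical_entities": [
    {
      "compound_name": "linezolid",
      "measurements": [{"type": "Molecular Weight", "value": 337.35, "unit": "g/mol"}]
    }
  ]
}
""")

edges = to_edges(sample)
print(edges)
```

Each tuple keeps the compound name next to its measurement, so no separate name-matching step is needed before inserting nodes and edges.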


Model Overview

| Item | Detail |
|---|---|
| Developer | MitzMitz |
| Base model | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 |
| Training tool | unsloth + TRL (SFTTrainer) |
| Quantization | 4-bit NF4 (QLoRA, load_in_4bit=True) |
| LoRA config | r=16, alpha=32, dropout=0, bias=none |
| Max seq length | 2048 |
| Supported languages | Japanese, English |
| License | Llama 3.1 Community License |

Usage

Inference (Colab / GPU)

import torch, json, re
from unsloth import FastLanguageModel
from google.colab import userdata

HF_TOKEN = userdata.get('HF_TOKEN')

SYSTEM_PROMPT = (
    "You are a chemical data extraction assistant. "
    "Extract measurements from the given text and return a JSON array. "
    "Each element must have: type (IC50/EC50/MW/Yield), value (number), unit (string). "
    "If no target measurement is found, return []. "
    "Output only the JSON array, no explanation."
)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "MitzMitz/Llama-ChemLink-Parser-8B-MTYS",
    max_seq_length = 2048,
    dtype          = None,
    load_in_4bit   = True,
    token          = HF_TOKEN,
)
FastLanguageModel.for_inference(model)

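def extract(text):
    """Return the parsed JSON for `text`, or the raw string if parsing fails."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": text},
    ]
    # return_dict=True yields a mapping with input_ids and attention_mask,
    # regardless of the library's default return type.
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_tensors="pt", return_dict=True,
    ).to("cuda")
    prompt_len = inputs["input_ids"].shape[1]
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=256,
            # Greedy decoding; a temperature setting has no effect when do_sample=False.
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt.
    raw = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()
    # Strip Markdown code fences if the model wrapped its JSON in them.
    raw = re.sub(r"^```(?:json)?\s*", "", raw, flags=re.MULTILINE)
    raw = re.sub(r"\s*```\s*$",     "", raw, flags=re.MULTILINE).strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the raw text so the caller can inspect malformed output.
        return raw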

print(extract("The compound linezolid has a molecular weight of 337.35 g/mol."))
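
Because extract() returns either parsed JSON or the raw string, callers may want to check the result before using it. The helper below is a minimal sketch of such a check for the flat array requested by SYSTEM_PROMPT. The name `valid_measurements` and its filtering rules are assumptions for illustration only.

```python
REQUIRED_KEYS = ("type", "value", "unit")

def valid_measurements(result):
    """Keep only well-formed measurement dicts from an extract()-style result."""
    if not isinstance(result, list):
        return []  # raw-string fallback or an unexpected shape
    kept = []
    for item in result:
        if not isinstance(item, dict):
            continue
        if not all(key in item for key in REQUIRED_KEYS):
            continue
        value = item["value"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        kept.append(item)
    return kept

good = {"type": "MW", "value": 337.35, "unit": "g/mol"}
bad = {"type": "MW", "value": "n/a", "unit": "g/mol"}
print(valid_measurements([good, bad]))
```

The check does not restrict the `type` string, since the model may spell a type out (for example "Molecular Weight") rather than use the abbreviation.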

Local CPU inference (Ollama)

ollama create llama-chemlink-parser-8b-mtys -f Modelfile
ollama run llama-chemlink-parser-8b-mtys
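
Once the model has been created, it can also be queried from Python through Ollama's local REST API. The sketch below assumes the default endpoint http://localhost:11434 and reuses the temperature=0.0 / num_predict=128 settings reported for the local evaluations. The helper names `build_payload` and `query_ollama` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local endpoint
MODEL_NAME = "llama-chemlink-parser-8b-mtys"    # name passed to `ollama create`

def build_payload(text, system_prompt):
    """Build a non-streaming chat request body for the Ollama API."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "stream": False,
        "options": {"temperature": 0.0, "num_predict": 128},
    }

def query_ollama(text, system_prompt):
    """Send the request to a running Ollama server and return the reply text."""
    data = json.dumps(build_payload(text, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["message"]["content"]
```

Calling `query_ollama(text, SYSTEM_PROMPT)` requires a running Ollama server. The returned string can then be parsed with `json.loads` in the same way as in the Colab example.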

Training Data

| File | Total | MW | IC50 | EC50 | Yield | Negative ([]) | Source |
|---|---|---|---|---|---|---|---|
| phase6_train_mix | 3,763 | 2,283 | 717 | 0 | 44 | 719 | PubChem / ChEMBL / ORD |
| additional_ec50_yield | 2,534 | 0 | 0 | 1,000 | 1,000 | 534 | ChEMBL / ORD |
| additional_yield_table | 621 | 0 | 0 | 0 | 500 | 121 | ORD |
| additional_mw_unit_fix | 120 | 84 | 16 | 0 | 0 | 20 | PubChem |
| additional_phase5 | 740 | 17 | 115 | 22 | 425 | 161 | ChEMBL / ORD / PubChem |
| Total | 7,778 | 2,384 | 848 | 1,022 | 1,969 | 1,555 | |

Data licenses: ORD (CC-BY-SA 4.0), ChEMBL (CC-BY-SA 3.0), PubChem (Public Domain).


Evaluation

Dataset

| Indicator | n | Source | Contamination |
|---|---|---|---|
| MW | 744 | PubChem (PMID-verified) | 0 |
| Yield | 750 | ORD (PMID-verified) | 0 |
| IC50 | 740 | ChEMBL → PubMed Abstract | 0 |
| EC50 | 729 | ChEMBL → PubMed Abstract | 0 |
| Total | 2,963 | | 0 confirmed |

Condition A: No explicit compound_name instruction

The prompt requests type / value / unit only; compound_name is not requested.

MW (n = 744)

| Model | Environment | MW correct | compound_name output |
|---|---|---|---|
| ChemLink NF4 (this model) | Colab GPU / unsloth NF4 | 736 / 744 = 98.9% | 736 / 744 = 98.9% |
| ChemLink q5_k_m (GGUF) | Local CPU / Ollama | 663 / 744 = 89.1% | 663 / 744 = 89.1% |
| GPT-4.1-mini | OpenAI API | 744 / 744 = 100.0% | 0 / 744 = 0% |
| Swallow-base | Colab GPU / unsloth NF4 | 692 / 744 = 93.0% | 0 / 744 = 0% |
| Mistral-7B | Local CPU / Ollama | 706 / 744 = 94.9% | 0 / 744 = 0% |
| Gemma-7B | Local CPU / Ollama | 283 / 744 = 38.0% | 0 / 744 = 0% |

Yield (n = 750)

| Model | Environment | Yield correct | compound_name output |
|---|---|---|---|
| ChemLink NF4 (this model) | Colab GPU / unsloth NF4 | 730 / 750 = 97.3% | 728 / 750 = 97.1% |
| ChemLink q5_k_m (GGUF) | Local CPU / Ollama | 610 / 750 = 81.3% | 610 / 750 = 81.3% |
| GPT-4.1-mini | OpenAI API | 750 / 750 = 100.0% | 0 / 750 = 0% |
| Swallow-base | Colab GPU / unsloth NF4 | 748 / 750 = 99.7% | 0 / 750 = 0% |
| Mistral-7B | Local CPU / Ollama | 517 / 750 = 68.9% | 0 / 750 = 0% |
| Gemma-7B | Local CPU / Ollama | 442 / 750 = 58.9% | 0 / 750 = 0% |

Inference note: Colab models used temperature=0.1 / max_new_tokens=256 / apply_chat_template. Local Ollama models used temperature=0.0 / num_predict=128 / a manual Modelfile TEMPLATE. GPT-4.1-mini was evaluated via the OpenAI Chat Completions API in a separate run. Keep these environment differences in mind when comparing across rows.

ChemLink NF4 and q5_k_m are the only models that output compound_name under this prompt. The behavior comes from fine-tuning and requires no additional instruction.


Condition B: Explicit compound_name instruction (Colab, same prompt for all models)

All 5 models received the same prompt, which explicitly requests compound_name. They were evaluated on the same 2,963-sample dataset on a Colab GPU.

MW (n = 744)

| Model | MW correct | compound_name output | PubChem hit | MW DB-verified |
|---|---|---|---|---|
| ChemLink NF4 | 740 / 744 = 99.5% | 741 / 744 = 99.6% | 0 | 0 |
| ChemLink q5_k_m | 744 / 744 = 100.0% | 744 / 744 = 100.0% | 4 | 4 (100% cond.) |
| GPT-4.1-mini | 741 / 744 = 99.6% | 741 / 744 = 99.6% | 12 | 10 (83.3% cond.) |
| Mistral-7B | 743 / 744 = 99.9% | 443 / 744 = 59.5% | 36 | 35 (97.2% cond.) |
| Swallow-base | 734 / 744 = 98.7% | 733 / 744 = 98.5% | 0 | 0 |

Yield (n = 750)

| Model | Yield correct | compound_name output | PubChem hit |
|---|---|---|---|
| ChemLink NF4 | 685 / 750 = 91.3% | 685 / 750 = 91.3% | 0 |
| ChemLink q5_k_m | 672 / 750 = 89.6% | 672 / 750 = 89.6% | 49 |
| GPT-4.1-mini | 750 / 750 = 100.0% | 727 / 750 = 96.9% | 18 |
| Mistral-7B | 669 / 750 = 89.2% | 363 / 750 = 48.4% | 18 |
| Swallow-base | 749 / 750 = 99.9% | 706 / 750 = 94.1% | 0 |

PubChem hit rates are low across all models because real PubMed abstracts frequently use generic compound codes ("compound 3", "44") rather than IUPAC names. This is a property of the input text, not of model capability.
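
A pipeline could skip the PubChem lookup for such labels before spending a request on them. The heuristic below is only a sketch: the regular expression and the name `is_generic_code` are assumptions made for this example, tuned to the label styles quoted in this card ("compound 3", "44", "2b"), and real abstracts will contain other variants.

```python
import re

# Matches labels like "compound 3", "44", or "2b": an optional "compound"
# prefix, a run of digits, and at most one trailing letter.
_GENERIC_CODE = re.compile(r"^(?:compound\s+)?\d+[a-z]?$", re.IGNORECASE)

def is_generic_code(name):
    """Return True if `name` looks like a paper-local compound label."""
    return bool(_GENERIC_CODE.match(name.strip()))

for label in ["compound 3", "44", "2b", "linezolid", "2-propanol"]:
    print(label, is_generic_code(label))
```

Names that fail the check, such as "linezolid" or "2-propanol", would still go to PubChem as usual.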


PubChem MW grounding: Synthetic texts (V13.1 strict protocol)

Evaluated on synthetic texts where each sentence explicitly contains an IUPAC compound name and its MW value (sourced from PubChem). ChemLink outputs compound_name without instruction; the extracted name is then searched in PubChem and matched against the source MW.

| Model | compound_name output | PubChem candidate | MW match | Full success rate | Conditional match |
|---|---|---|---|---|---|
| ChemLink NF4 | 736 / 744 | 380 | 375 | 375 / 744 = 50.4% | 375 / 380 = 98.7% |
| ChemLink q5_k_m | 663 / 744 | 392 | 389 | 389 / 744 = 52.3% | 389 / 392 = 99.2% |
| All baselines | 0 / 744 | 0 | 0 | 0% | n/a |

Full success: MW correctly extracted AND compound_name output AND PubChem candidate found AND PubChem MolecularWeight matches extracted MW (±1%). Source: chemlink_v13_1_strict_db_normalization.xlsx, V13.1 strict fixed protocol. Baselines had 0% compound_name output in the no-instruction condition, so PubChem grounding is not applicable to them.
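
The criteria above can be sketched as a small scoring function. This is an illustrative reconstruction, not the V13.1 protocol code: the record fields (`mw_correct`, `compound_name`, `extracted_mw`, `pubchem_mw`) are hypothetical, and the ±1% tolerance is assumed to be relative to the PubChem value.

```python
def mw_matches(extracted_mw, pubchem_mw, tol=0.01):
    """True if the extracted MW lies within ±tol (relative) of the PubChem MW."""
    return abs(extracted_mw - pubchem_mw) <= tol * abs(pubchem_mw)

def has_candidate(record):
    """A record is a candidate when a name was output and PubChem returned an MW."""
    return bool(record.get("compound_name")) and record.get("pubchem_mw") is not None

def is_full_success(record):
    """All four conditions of the 'Full success' definition must hold."""
    return bool(
        record.get("mw_correct")
        and has_candidate(record)
        and mw_matches(record["extracted_mw"], record["pubchem_mw"])
    )

def summarize(records):
    """Return (full success rate, conditional match rate) for a list of records."""
    candidates = [r for r in records if has_candidate(r)]
    matched = [r for r in candidates if mw_matches(r["extracted_mw"], r["pubchem_mw"])]
    full = [r for r in records if is_full_success(r)]
    full_rate = len(full) / len(records) if records else 0.0
    conditional = len(matched) / len(candidates) if candidates else 0.0
    return full_rate, conditional

# Toy records for illustration only; these are not evaluation data.
records = [
    {"mw_correct": True, "compound_name": "linezolid", "extracted_mw": 337.35, "pubchem_mw": 337.4},
    {"mw_correct": True, "compound_name": "x", "extracted_mw": 300.0, "pubchem_mw": 337.4},
    {"mw_correct": True, "compound_name": "", "extracted_mw": 100.0, "pubchem_mw": None},
    {"mw_correct": True, "compound_name": "y", "extracted_mw": 100.0, "pubchem_mw": None},
]
full_rate, conditional = summarize(records)
print(full_rate, conditional)
```

The two rates use different denominators: full success divides by all samples, while the conditional match divides only by samples for which a PubChem candidate was found. This is why the table reports about 50% for one and about 99% for the other.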


Limitations

  • IC50 / EC50: Extraction scores were < 2% across all models and conditions. This reflects a limitation of the unified-output evaluation protocol, not model capability. These indicators are therefore not suitable for cross-model comparison.

  • compound_name in real abstracts: Real PubMed abstracts often use generic codes ("compound 3", "2b") rather than IUPAC names. ChemLink outputs whatever name appears in the source text. PubChem resolution depends on how the original literature names the compound.

  • Quantization gap: ChemLink NF4 (Colab) and q5_k_m (local GGUF) differ in quantization and inference backend. The q5_k_m variant shows ~10 pp lower MW extraction rate than NF4 in the no-instruction evaluation.

  • Inference environment: Colab GPU evaluations used temperature=0.1 / max_new_tokens=256. Local Ollama evaluations used temperature=0.0 / num_predict=128. Cross-environment comparisons should account for these differences.
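
The quantization gap quoted above follows directly from the Condition A counts. The short calculation below recomputes it as a percentage-point difference, using only numbers from the Condition A tables; the helper name `rate` is illustrative.

```python
def rate(correct, total):
    """Extraction rate in percent."""
    return 100.0 * correct / total

# (NF4 correct, q5_k_m correct, n) from the no-instruction evaluation.
counts = {
    "MW": (736, 663, 744),
    "Yield": (730, 610, 750),
}

gaps = {}
for indicator, (nf4_correct, q5_correct, n) in counts.items():
    nf4 = rate(nf4_correct, n)
    q5 = rate(q5_correct, n)
    gaps[indicator] = nf4 - q5
    print(f"{indicator}: NF4 {nf4:.1f}%  q5_k_m {q5:.1f}%  gap {gaps[indicator]:.1f} pp")
```

For MW this gives a gap of 9.8 pp, consistent with the ~10 pp figure. The same calculation for Yield gives 16.0 pp.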


Framework Versions (Training)

| Library | Version |
|---|---|
| unsloth | 2026.5.2 |
| PEFT | 0.19.1 |
| Transformers | 5.5.0 |
| PyTorch | 2.10.0 |
| TRL | 0.24.0 |
| Datasets | 4.3.0 |