base_model:
  - mistralai/Mistral-7B-Instruct-v0.3
datasets:
  - noystl/Recombination-Extraction
language:
  - en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation

This repository contains a Mistral model fine-tuned to extract recombination examples from scientific abstracts, as described in the paper CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation. The model is a LoRA adapter on top of the mistralai/Mistral-7B-Instruct-v0.3 base model.

The model can be used for the information extraction task of identifying recombination examples within scientific text.
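Because the weights are a LoRA adapter rather than a full checkpoint, the adapter can also be loaded explicitly through the PEFT library. The following is a minimal sketch, not the authors' documented loading path. It assumes that the `peft` and `accelerate` packages are installed and that this repository ships an adapter config and tokenizer files. The helper name `load_with_peft` is hypothetical.

```python
MODEL_ID = "noystl/mistral-e2e"  # adapter repository named in this card


def load_with_peft(model_id: str = MODEL_ID):
    """Load the base model with the LoRA adapter attached, plus the tokenizer.

    Assumption: the repository contains an adapter config that names the
    base model, as is usual for LoRA adapters saved with PEFT.
    """
    # Imported lazily so that defining this helper does not itself require
    # torch, transformers, or peft to be installed.
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # AutoPeftModelForCausalLM reads the adapter config, loads the base model
    # it points to, and attaches the adapter weights on top.
    model = AutoPeftModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # needs the accelerate package
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling `load_with_peft()` downloads the base model weights on first use, so expect a download of several gigabytes.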

Quick Links

🌐 Project
📃 Paper
🛠️ Code

Sample Usage

You can use this model with the Hugging Face transformers library to extract recombination instances from text. The model expects a specific prompt format for this task.

from transformers import pipeline, AutoTokenizer
import torch

model_id = "noystl/mistral-e2e" 

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Initialize the text generation pipeline
generator = pipeline(
    "text-generation", 
    model=model_id, 
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16, # Use bfloat16 for better performance on compatible GPUs
    device_map="auto", # Automatically select best device (GPU or CPU)
    trust_remote_code=True # Required for custom model components
)

# Example abstract for recombination extraction
abstract = """The multi-granular diagnostic approach of pathologists can inspire Histopathological image classification.
This suggests a novel way to improve accuracy in image classification tasks."""

# Format the input prompt as expected by the model
prompt = (
    "Extract any recombination instances (inspiration/combination) from the following abstract:\n"
    f"Abstract: {abstract}\n"
    "Recombination:"
)

# Generate the output. Use do_sample=False for deterministic extraction.
# max_new_tokens should be set appropriately for the expected JSON output.
outputs = generator(prompt, max_new_tokens=200, do_sample=False)

# Print the generated text, which should contain the extracted recombination in JSON format
print(outputs[0]["generated_text"])
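By default the text-generation pipeline returns the prompt followed by the completion, so the JSON has to be separated from the surrounding text before it can be used. The helper below is a minimal sketch of one way to do that with the standard library. The function name `extract_json_objects` and the keys in the usage example are hypothetical, and the actual output schema of the model is not specified here.

```python
import json


def extract_json_objects(generated_text: str, prompt: str = "") -> list:
    """Return every JSON object that can be parsed from the completion."""
    # Drop the echoed prompt, if present, so braces inside the abstract
    # are not mistaken for model output.
    if prompt and generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text

    decoder = json.JSONDecoder()
    found = []
    pos = completion.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and reports
            # the index just past it, ignoring any trailing text.
            obj, end = decoder.raw_decode(completion, pos)
        except json.JSONDecodeError:
            # No valid JSON starts here; move on to the next opening brace.
            pos = completion.find("{", pos + 1)
            continue
        found.append(obj)
        pos = completion.find("{", end)
    return found


# Usage with a made-up completion (keys are illustrative only):
example_prompt = "Abstract: ...\nRecombination:"
example_output = example_prompt + ' {"type": "inspiration"} extra text'
records = extract_json_objects(example_output, example_prompt)
```

With the real pipeline, the call would be `extract_json_objects(outputs[0]["generated_text"], prompt)`. Alternatively, passing `return_full_text=False` to the generator call makes the pipeline return only the completion, in which case the `prompt` argument can be omitted.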

For more advanced usage, including training and evaluation, please refer to the GitHub repository.

Bibtex

@misc{sternlicht2025chimeraknowledgebaseidea,
      title={CHIMERA: A Knowledge Base of Idea Recombination in Scientific Literature}, 
      author={Noy Sternlicht and Tom Hope},
      year={2025},
      eprint={2505.20779},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20779}, 
}