---
language:
  - en
license: cc-by-4.0
tags:
  - relation-extraction
  - knowledge-graph
  - triplet-extraction
  - REBEL
  - quantum-physics
  - seq2seq
base_model: Babelscape/rebel-large
datasets:
  - konsman/quantum-rebel-mixed-training
pipeline_tag: text2text-generation
widget:
  - text: >-
      Albert Einstein was a German-born physicist who developed the theory of
      relativity.
    example_title: Einstein Example
  - text: >-
      Quantum entanglement is a physical phenomenon that occurs when pairs of
      particles interact.
    example_title: Quantum Physics Example
---

# REBEL Quantum Physics (Mixed Training)

This is a fine-tuned REBEL model for relation extraction and knowledge graph triplet generation, specialized for quantum physics domain while maintaining general knowledge extraction capabilities.

## Model Description

REBEL (Relation Extraction By End-to-end Language generation) is a seq2seq model that performs end-to-end relation extraction for more than 200 different relation types. This model has been fine-tuned on a mixed dataset combining domain-specific quantum physics triplets with general knowledge triplets.

- **Base Model:** Babelscape/rebel-large
- **Fine-tuned on:** Mixed dataset (quantum physics + general REBEL data)
- **Training Data:** ~203k examples (191k train, 6k val, 6k test)
- **Language:** English
- **Task:** Relation Extraction / Knowledge Graph Triplet Generation

## Training Data

The model was fine-tuned on a carefully curated mixed dataset:

- **Domain-specific data:** ~48,000 quantum physics triplets
- **General data:** ~144,000 general knowledge triplets (a 1:3 domain-to-general ratio)
- **Validation:** ~6,000 domain-only quantum physics examples
- **Test:** ~6,000 domain-only quantum physics examples

This mixed training approach allows the model to:

1. Excel at quantum physics domain extraction
2. Maintain strong general knowledge extraction capabilities
3. Avoid catastrophic forgetting of general relations
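For illustration, the 1:3 domain-to-general mix can be sketched in a few lines of plain Python. The variable names and placeholder records below are hypothetical; the published konsman/quantum-rebel-mixed-training dataset already contains the mixed splits.

```python
import random

# Placeholder records standing in for real triplet examples; in practice
# these come from the konsman/quantum-rebel-mixed-training dataset.
quantum_examples = [{"domain": "quantum", "id": i} for i in range(48_000)]
general_examples = [{"domain": "general", "id": i} for i in range(144_000)]

# One domain example for every three general examples, shuffled together
train = quantum_examples + general_examples
random.seed(42)
random.shuffle(train)

print(len(train))  # 192000
```

Shuffling the concatenated pool (rather than training on the domain data last) is what spreads the general examples across every batch and helps guard against catastrophic forgetting.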

## Usage

### Direct Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "konsman/rebel-quantum-mixed"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Quantum entanglement is a physical phenomenon that occurs when pairs of particles interact."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate triplets with beam search
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=256,
    num_beams=5,
    early_stopping=True,
)

# Keep special tokens: <triplet>, <subj>, <obj> mark the triplet boundaries
triplets_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(triplets_text)
```

### Parsing Triplets

```python
import re

def extract_triplets(text):
    """Extract structured (subject, object, relation) triplets from REBEL output."""
    triplets = []
    pattern = r'<triplet> (.+?) <subj> (.+?) <obj> (.+?)(?=<triplet>|</s>|$)'

    for match in re.finditer(pattern, text):
        subject = match.group(1).strip()
        obj = match.group(2).strip()
        relation = match.group(3).strip()
        triplets.append((subject, obj, relation))

    return triplets

# Parse the decoded output from the previous snippet
triplets = extract_triplets(triplets_text)
for subj, obj, rel in triplets:
    print(f"({subj} ; {obj} ; {rel})")
```

## Output Format

The model generates triplets in the following format:

```
<triplet> SUBJECT <subj> OBJECT <obj> RELATION <triplet> ...
```

Example output:

```
<triplet> Albert Einstein <subj> German <obj> country of citizenship <triplet> theory of relativity <subj> Albert Einstein <obj> discoverer or inventor
```
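As a quick sanity check, the parser from the Parsing Triplets section recovers both triplets from this example string (reproduced here as a self-contained snippet):

```python
import re

def extract_triplets(text):
    """Extract (subject, object, relation) tuples from REBEL output."""
    pattern = r'<triplet> (.+?) <subj> (.+?) <obj> (.+?)(?=<triplet>|</s>|$)'
    return [tuple(g.strip() for g in m.groups()) for m in re.finditer(pattern, text)]

output = ("<triplet> Albert Einstein <subj> German <obj> country of citizenship "
          "<triplet> theory of relativity <subj> Albert Einstein <obj> discoverer or inventor")

print(extract_triplets(output))
# [('Albert Einstein', 'German', 'country of citizenship'),
#  ('theory of relativity', 'Albert Einstein', 'discoverer or inventor')]
```

Note that the lookahead `(?=<triplet>|</s>|$)` ends each relation at the next triplet marker, the end-of-sequence token, or the end of the string, so trailing `</s>` tokens in real decoder output are handled.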

## Evaluation

The model achieves strong performance on both domain-specific and general relation extraction tasks due to the mixed training approach.

## Intended Use

- Knowledge graph construction from scientific texts
- Relation extraction from quantum physics literature
- General-purpose triplet extraction
- Domain adaptation for information extraction

## Limitations

- Primarily trained on English text
- May perform worse on domains far removed from quantum physics and general Wikipedia-style text
- Triplet extraction quality depends on the clarity and quality of the input text

## Training Details

- **Base Model:** Babelscape/rebel-large
- **Training Framework:** PyTorch Lightning
- **Training Hardware:** NVIDIA H200 GPU
- **Batch Size:** 1 (with gradient accumulation)
- **Optimizer:** AdamW
- **Learning Rate:** 3e-5
- **Epochs:** 3
- **Mixed Precision:** bf16
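For reference, the settings above correspond to a hyperparameter configuration like the following. The key names are illustrative, not the actual training script's; the gradient-accumulation step count is not stated on this card and is left unspecified.

```python
# Hyperparameters from the list above; key names are illustrative only.
hparams = {
    "base_model": "Babelscape/rebel-large",
    "batch_size": 1,            # per-device, with gradient accumulation
    "optimizer": "AdamW",
    "learning_rate": 3e-5,
    "epochs": 3,
    "precision": "bf16-mixed",  # PyTorch Lightning's name for bf16 mixed precision
}
print(hparams["learning_rate"], hparams["epochs"])
```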

## Citation

If you use this model, please cite:

```bibtex
@misc{rebel_quantum_mixed,
  author = {Konsman},
  title = {REBEL Quantum Physics (Mixed Training)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/konsman/rebel-quantum-mixed}
}
```

Also cite the original REBEL paper:

```bibtex
@inproceedings{huguet-cabot-navigli-2021-rebel-relation,
    title = "{REBEL}: Relation Extraction By End-to-end Language generation",
    author = "Huguet Cabot, Pere-Lluis and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.204",
    pages = "2370--2381",
}
```

## License

CC-BY-4.0

## Contact

For questions or issues, please open an issue on the model repository.