Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B

A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve generic resource type classification for ~25 million works currently classified only as "Text" in DataCite metadata.

Model Details

Model Description

This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model for this specialized classification task.

  • Developed by: COMET Metadata Team
  • Model type: Text Classification (Resource Type)
  • Language(s): English
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen2.5-7B-Instruct
  • Fine-tuning method: LoRA (Low-Rank Adaptation)

Performance

The model achieves strong overall performance on the held-out test set: 96% accuracy, 98% weighted precision, 96% weighted recall, and 97% weighted F1-score across all 32 categories. Macro-averaged scores are lower due to class imbalance; see the Evaluation section.

High-Performance Categories

The following categories show excellent precision and recall, making them suitable for production use:

  • StudyRegistration: 97% precision, 100% recall
  • Software: 88% precision, 92% recall
  • Preprint: 100% precision, 95% recall
  • PhysicalObject: 100% precision, 100% recall
  • InteractiveResource: 91% precision, 98% recall
  • Image: 92% precision, 99% recall
  • Dataset: 100% precision, 96% recall
  • Collection: 93% precision, 97% recall
  • Audiovisual: 89% precision, 97% recall

Resource Type Categories (32 total)

The model classifies into these categories:

Audiovisual, Award, Book, BookChapter, Collection, ComputationalNotebook, ConferencePaper, ConferenceProceeding, DataPaper, Dataset, Dissertation, Event, Image, Instrument, InteractiveResource, Journal, JournalArticle, Model, OutputManagementPlan, PeerReview, PhysicalObject, Preprint, Project, Report, Service, Software, Sound, Standard, StudyRegistration, Text, Workflow, Other
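
For downstream validation it is convenient to keep this label set as a constant. The list below simply transcribes the 32 categories above; the variable and helper names are illustrative, not part of the model's API.

```python
# The 32 resource type categories this model predicts, transcribed from the list above.
RESOURCE_TYPES = [
    "Audiovisual", "Award", "Book", "BookChapter", "Collection",
    "ComputationalNotebook", "ConferencePaper", "ConferenceProceeding",
    "DataPaper", "Dataset", "Dissertation", "Event", "Image", "Instrument",
    "InteractiveResource", "Journal", "JournalArticle", "Model",
    "OutputManagementPlan", "PeerReview", "PhysicalObject", "Preprint",
    "Project", "Report", "Service", "Software", "Sound", "Standard",
    "StudyRegistration", "Text", "Workflow", "Other",
]

def is_valid_label(prediction: str) -> bool:
    """Check that a raw model completion maps exactly onto one of the 32 categories."""
    return prediction.strip() in RESOURCE_TYPES
```

A simple membership check like this catches truncated or malformed completions before they reach a metadata pipeline.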

Uses

Direct Use

This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target resourceTypeGeneral field).
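
DataCite records arrive as nested JSON, so they must be flattened into the dotted/bracketed key-value style the model expects. The sketch below is a generic flattener under the assumptions stated in this card (the `resourceTypeGeneral` field is excluded); the exact training-time serialization may differ.

```python
def flatten_metadata(obj, prefix="attributes",
                     exclude=("attributes.types.resourceTypeGeneral",)):
    """Flatten a nested DataCite record into 'key: value' lines,
    skipping the target resourceTypeGeneral field."""
    lines = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")
        elif path not in exclude:
            lines.append(f"{path}: {node}")

    walk(obj, prefix)
    return "\n".join(lines)

record = {
    "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
    "publicationYear": 2024,
    "types": {"resourceTypeGeneral": "Text", "resourceType": "research article"},
}
text = flatten_metadata(record)
```

The output matches the key style used in the usage example below (e.g. `attributes.titles[0].title: ...`).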

Downstream Use

  • Metadata Enhancement: Automatically assign granular resource types to improve searchability and discoverability
  • Data Curation: Support large-scale metadata enrichment workflows
  • Repository Management: Improve content organization in digital repositories

Out-of-Scope Use

  • General text classification beyond academic metadata
  • Classification of non-English metadata (model trained primarily on English)
  • Real-time applications requiring sub-second response times

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b")

# Example metadata record
metadata = """
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
attributes.publisher: Nature Publishing Group
attributes.creators[0].name: Smith, Jane
attributes.publicationYear: 2024
attributes.types.resourceType: research article
"""

# Format as chat (you'll need the full SYSTEM_PROMPT from the training data)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": metadata},
]

# Generate classification (greedy decoding; allow enough tokens for the longest label)
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()

Training Details

Training Data

The model was trained on the cometadata/generic-resource-type-training-data dataset, which contains balanced samples of DataCite metadata records across all 32 resource type categories.
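
Balancing across 32 classes can be sketched as stratified downsampling to the size of the rarest class. This is a generic illustration of the idea, not the dataset's actual construction script.

```python
import random
from collections import defaultdict

def balance(records, label_key="label", seed=0):
    """Downsample every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# Toy example: 100 Dataset records vs. 10 Preprint records
data = [{"label": "Dataset"}] * 100 + [{"label": "Preprint"}] * 10
out = balance(data)
```

After balancing, each class contributes the same number of training examples, which is what keeps rare categories like Instrument or Standard from being ignored during fine-tuning.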

Training Procedure

Training Hyperparameters

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • LoRA Rank (r): 8
  • LoRA Alpha: 16
  • LoRA Dropout: 0.1
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Learning Rate: 1e-4
  • Training Epochs: 0.27 (partial epoch)
  • Tokens Processed: 350M tokens
  • Max Sequence Length: 60,000 tokens
  • Batch Size: Auto-detected
  • LR Scheduler: Cosine
  • Warmup Steps: 100
  • Training Regime: Mixed precision (bfloat16)
  • Packing: Enabled for efficiency
  • Loss: Completion-only loss (only classification token)
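
Completion-only loss means the prompt tokens are masked out of the cross-entropy target (label -100 in the Hugging Face convention), so the loss is computed only on the classification tokens. A minimal sketch of the masking, with illustrative token IDs:

```python
def completion_only_labels(prompt_ids, completion_ids, ignore_index=-100):
    """Build labels that skip loss on the prompt and keep it on the completion.

    ignore_index=-100 is the value PyTorch's cross-entropy loss ignores,
    so only the completion (classification) tokens contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [ignore_index] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Illustrative token IDs: a 4-token prompt followed by a 2-token label
input_ids, labels = completion_only_labels([101, 7, 8, 9], [42, 43])
```

In practice this masking is handled by the training framework; the sketch only shows why the prompt contributes nothing to the gradient.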

Infrastructure

  • Hardware: 8x H100 GPUs
  • Training Framework: TRL (Transformer Reinforcement Learning)
  • Attention Implementation: Flash Attention 2
  • Memory Optimization: LoRA + gradient checkpointing

Evaluation

Testing Data & Metrics

The model was evaluated on a held-out test set with the same distribution as training data. Evaluation metrics include:

  • Overall Accuracy: 96%
  • Macro-averaged Precision: 49% (affected by low-support categories)
  • Weighted Precision: 98%
  • Macro-averaged Recall: 80%
  • Weighted Recall: 96%
  • Macro-averaged F1: 54%
  • Weighted F1: 97%
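
The gap between macro and weighted averages comes from how per-class scores are combined: macro averages all classes equally, while weighted averages by support, so a few low-support classes with near-zero precision drag the macro number down while barely moving the weighted one. A hypothetical two-class illustration:

```python
def macro_and_weighted(scores, supports):
    """Combine per-class scores with equal weight (macro) vs. support weight (weighted)."""
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted

# Hypothetical: one huge class scoring 0.98, one 6-sample class scoring 0.0
macro, weighted = macro_and_weighted([0.98, 0.0], [1_000_000, 6])
```

Here macro precision collapses to 0.49 while weighted precision stays near 0.98, mirroring the pattern in the metrics above.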

Detailed Performance by Category

Category            | Precision | Recall | F1-Score | Support
--------------------|-----------|--------|----------|--------
High-Performance Categories
StudyRegistration   | 0.97      | 1.00   | 0.98     | 1,663
Software            | 0.88      | 0.92   | 0.90     | 46,328
Preprint            | 1.00      | 0.95   | 0.97     | 20,604
PhysicalObject      | 1.00      | 1.00   | 1.00     | 790,408
InteractiveResource | 0.91      | 0.98   | 0.94     | 53,954
Image               | 0.92      | 0.99   | 0.95     | 197,524
Dataset             | 1.00      | 0.96   | 0.98     | 1,619,320
Collection          | 0.93      | 0.97   | 0.95     | 35,039
Audiovisual         | 0.89      | 0.97   | 0.92     | 51,977
Medium-Performance Categories
Dissertation        | 0.80      | 0.88   | 0.84     | 5,203
Event               | 0.66      | 0.96   | 0.78     | 3,415
JournalArticle      | 0.88      | 0.61   | 0.72     | 79,182
Sound               | 0.67      | 0.89   | 0.76     | 1,437

Results Summary

The model shows excellent performance on high-volume categories like Dataset, Image, and Audiovisual, with some challenges on rare categories like Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata.

Limitations

  • Class Imbalance: Some categories have very few examples, leading to lower macro-averaged scores
  • Language Bias: Primarily trained on English metadata
  • Domain Specificity: Optimized for DataCite-style academic metadata
  • Pattern Memorization: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject)

Bias, Risks, and Limitations

Technical Limitations

  • Performance varies significantly across categories due to training data imbalance
  • May not generalize well to metadata formats different from DataCite
  • Requires careful prompt formatting for optimal performance

Recommendations

  • Use primarily for the high-performance categories identified above
  • Validate predictions on categories with lower precision/recall
  • Consider ensemble approaches for critical applications
  • Monitor for domain shift when applying to new metadata sources
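
The first two recommendations can be sketched as a simple routing gate: auto-accept predictions in the high-performance categories and flag everything else for human review. The set below transcribes the high-performance list from this card; the routing policy itself is illustrative.

```python
# High-performance categories identified in the Performance section above
HIGH_PERFORMANCE = {
    "StudyRegistration", "Software", "Preprint", "PhysicalObject",
    "InteractiveResource", "Image", "Dataset", "Collection", "Audiovisual",
}

def route(prediction: str) -> str:
    """Auto-accept high-performance categories; flag the rest for manual review."""
    label = prediction.strip()
    return "accept" if label in HIGH_PERFORMANCE else "review"
```

For example, a Dataset prediction would be accepted automatically, while a JournalArticle prediction (61% recall) would be queued for validation.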

Citation

If you use this model in your research, please cite:

@misc{comet-resource-type-classifier-2025,
  title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification},
  author={COMET Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b}
}

Model Card Contact

For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team.
