Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B
A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve generic resource type classification for ~25 million works currently classified only as "Text" in DataCite metadata.
Model Details
Model Description
This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model for this specialized classification task.
- Developed by: COMET Metadata Team
- Model type: Text Classification (Resource Type)
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-7B-Instruct
- Fine-tuning method: LoRA (Low-Rank Adaptation)
Model Sources
- Repository: cometadata/generic-resource-type-lora-qwen2.5-7b
- Training Dataset: cometadata/generic-resource-type-training-data
Performance
The model achieves 96% overall accuracy, with weighted precision of 98%, weighted recall of 96%, and weighted F1-score of 97% across all 32 categories. Macro-averaged scores are substantially lower because several categories have very few examples (see Evaluation below).
High-Performance Categories
The following categories show strong precision and recall (at least 0.88 on both), making them suitable for production use:
- StudyRegistration: 97% precision, 100% recall
- Software: 88% precision, 92% recall
- Preprint: 100% precision, 95% recall
- PhysicalObject: 100% precision, 100% recall
- InteractiveResource: 91% precision, 98% recall
- Image: 92% precision, 99% recall
- Dataset: 100% precision, 96% recall
- Collection: 93% precision, 97% recall
- Audiovisual: 89% precision, 97% recall
Resource Type Categories (32 total)
The model classifies into these categories:
1. Audiovisual
2. Award
3. Book
4. BookChapter
5. Collection
6. ComputationalNotebook
7. ConferencePaper
8. ConferenceProceeding
9. DataPaper
10. Dataset
11. Dissertation
12. Event
13. Image
14. Instrument
15. InteractiveResource
16. Journal
17. JournalArticle
18. Model
19. OutputManagementPlan
20. PeerReview
21. PhysicalObject
22. Preprint
23. Project
24. Report
25. Service
26. Software
27. Sound
28. Standard
29. StudyRegistration
30. Text
31. Workflow
32. Other
Uses
Direct Use
This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target resourceTypeGeneral field).
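The dotted key-value lines the model expects (e.g. `attributes.titles[0].title`) can be produced by flattening a nested DataCite-style JSON record. A minimal sketch; the helper name and exact formatting are illustrative assumptions, not part of the model's tooling:

```python
# Illustrative helper (not part of the model's tooling): flatten a nested
# DataCite-style record into dotted key-value lines for the model input.
def flatten(record, prefix=""):
    lines = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(value, path))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {record}")
    return lines

record = {
    "attributes": {
        "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
        "publisher": "Nature Publishing Group",
        "publicationYear": 2024,
    }
}
print("\n".join(flatten(record)))
```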
Downstream Use
- Metadata Enhancement: Automatically assign granular resource types to improve searchability and discoverability
- Data Curation: Support large-scale metadata enrichment workflows
- Repository Management: Improve content organization in digital repositories
Out-of-Scope Use
- General text classification beyond academic metadata
- Classification of non-English metadata (model trained primarily on English)
- Real-time applications requiring sub-second response times
How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b")

# Example metadata record
metadata = """
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
attributes.publisher: Nature Publishing Group
attributes.creators[0].name: Smith, Jane
attributes.publicationYear: 2024
attributes.types.resourceType: research article
"""

# Format as chat (you'll need the full SYSTEM_PROMPT from the training dataset)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": metadata},
]

# Generate the classification with greedy decoding
# (do_sample=False; transformers rejects temperature=0)
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```
Training Details
Training Data
The model was trained on the cometadata/generic-resource-type-training-data dataset, which contains balanced samples of DataCite metadata records across all 32 resource type categories.
Training Procedure
Training Hyperparameters
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Rank (r): 8
- LoRA Alpha: 16
- LoRA Dropout: 0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Learning Rate: 1e-4
- Training Epochs: 0.27
- Tokens Processed: 350M tokens
- Max Sequence Length: 60,000 tokens
- Batch Size: Auto-detected
- LR Scheduler: Cosine
- Warmup Steps: 100
- Training Regime: Mixed precision (bfloat16)
- Packing: Enabled for efficiency
- Loss: Completion-only (computed only on the generated classification label, not the prompt)
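The LoRA hyperparameters above can be read as follows: instead of updating a full weight matrix W, training learns two low-rank factors A (r × d_in) and B (d_out × r) and applies W' = W + (alpha / r) · B · A. A minimal numeric sketch using the listed r=8, alpha=16 (the matrix dimensions are arbitrary illustrations):

```python
import numpy as np

# Sketch of the LoRA update with the hyperparameters listed above:
# rank r=8, alpha=16, so the update is scaled by alpha / r = 2.0.
d_out, d_in, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (initialized to zero)

# Because B starts at zero, the adapted weight equals the base weight at init:
W_adapted = W + (alpha / r) * (B @ A)
assert np.allclose(W_adapted, W)

# Only r*(d_in + d_out) = 1024 parameters are trained instead of d_out*d_in = 4096.
print(A.size + B.size, W.size)
```

This is why LoRA fits on modest hardware even for a 7B model: only the small A and B factors for the listed target modules receive gradients.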
Infrastructure
- Hardware: 8x H100 GPUs
- Training Framework: TRL (Transformer Reinforcement Learning)
- Attention Implementation: Flash Attention 2
- Memory Optimization: LoRA + gradient checkpointing
Evaluation
Testing Data & Metrics
The model was evaluated on a held-out test set with the same distribution as training data. Evaluation metrics include:
- Overall Accuracy: 96%
- Macro-averaged Precision: 49% (affected by low-support categories)
- Weighted Precision: 98%
- Macro-averaged Recall: 80%
- Weighted Recall: 96%
- Macro-averaged F1: 54%
- Weighted F1: 97%
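The gap between the macro and weighted averages comes from how per-category scores are combined: macro averaging weights all 32 categories equally, so a rare category with a handful of samples counts as much as Dataset with 1.6M, while weighted averaging scales each category by its support. A minimal illustration with hypothetical per-class precisions:

```python
# Illustrative only: two classes with very different support.
# Macro averaging treats them equally; weighted averaging follows support.
precisions = [1.00, 0.10]  # e.g. a huge, well-learned class and a tiny, poorly learned one
supports   = [990, 10]

macro = sum(precisions) / len(precisions)
weighted = sum(p * s for p, s in zip(precisions, supports)) / sum(supports)

print(round(macro, 3), round(weighted, 3))  # macro is dragged down by the rare class
```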
Detailed Performance by Category
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| High-Performance Categories | | | | |
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 |
| Software | 0.88 | 0.92 | 0.90 | 46,328 |
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 |
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 |
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 |
| Image | 0.92 | 0.99 | 0.95 | 197,524 |
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 |
| Collection | 0.93 | 0.97 | 0.95 | 35,039 |
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 |
| Medium-Performance Categories | | | | |
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 |
| Event | 0.66 | 0.96 | 0.78 | 3,415 |
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 |
| Sound | 0.67 | 0.89 | 0.76 | 1,437 |
Results Summary
The model shows excellent performance on high-volume categories like Dataset, Image, and Audiovisual, with some challenges on rare categories like Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata.
Limitations
- Class Imbalance: Some categories have very few examples, leading to lower macro-averaged scores
- Language Bias: Primarily trained on English metadata
- Domain Specificity: Optimized for DataCite-style academic metadata
- Pattern Memorization: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject)
Bias, Risks, and Limitations
Technical Limitations
- Performance varies significantly across categories due to training data imbalance
- May not generalize well to metadata formats different from DataCite
- Requires careful prompt formatting for optimal performance
Recommendations
- Use primarily for the high-performance categories identified above
- Validate predictions on categories with lower precision/recall
- Consider ensemble approaches for critical applications
- Monitor for domain shift when applying to new metadata sources
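One way to operationalize the first two recommendations is to gate predictions on the category's measured reliability: auto-accept only labels from the high-performance set and route the rest to human review. A sketch; the category set is taken from the evaluation table above, and the function name and threshold policy are illustrative assumptions:

```python
# Categories the evaluation above showed to be reliable (precision and recall >= 0.88).
HIGH_CONFIDENCE = {
    "StudyRegistration", "Software", "Preprint", "PhysicalObject",
    "InteractiveResource", "Image", "Dataset", "Collection", "Audiovisual",
}

def accept_prediction(label: str) -> bool:
    """Auto-accept only high-performance categories; route others to review."""
    return label in HIGH_CONFIDENCE

assert accept_prediction("Dataset")
assert not accept_prediction("JournalArticle")  # 0.61 recall -> send to human review
```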
Citation
If you use this model in your research, please cite:
```bibtex
@misc{comet-resource-type-classifier-2025,
  title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification},
  author={COMET Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b}
}
```
Model Card Contact
For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team.