Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B
A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve generic resource type classification for ~25 million works currently classified only as "Text" in DataCite metadata.
Model Details
Model Description
This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model for this specialized classification task.
- Developed by: COMET Metadata Team
- Model type: Text Classification (Resource Type)
- Language(s): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-7B-Instruct
- Fine-tuning method: LoRA (Low-Rank Adaptation)
Model Sources
- Repository: cometadata/generic-resource-type-lora-qwen2.5-7b
- Training Dataset: cometadata/generic-resource-type-training-data
Performance
The model achieves 96% overall accuracy, with weighted precision of 98%, weighted recall of 96%, and weighted F1-score of 97% across all 32 categories. Macro-averaged scores are substantially lower because several categories have very few examples (see Evaluation below).
High-Performance Categories
The following categories show strong precision and recall (at least 0.88 on both), making them suitable for production use:
- StudyRegistration: 97% precision, 100% recall
- Software: 88% precision, 92% recall
- Preprint: 100% precision, 95% recall
- PhysicalObject: 100% precision, 100% recall
- InteractiveResource: 91% precision, 98% recall
- Image: 92% precision, 99% recall
- Dataset: 100% precision, 96% recall
- Collection: 93% precision, 97% recall
- Audiovisual: 89% precision, 97% recall
Resource Type Categories (32 total)
The model classifies into these categories:
1. Audiovisual
2. Award
3. Book
4. BookChapter
5. Collection
6. ComputationalNotebook
7. ConferencePaper
8. ConferenceProceeding
9. DataPaper
10. Dataset
11. Dissertation
12. Event
13. Image
14. Instrument
15. InteractiveResource
16. Journal
17. JournalArticle
18. Model
19. OutputManagementPlan
20. PeerReview
21. PhysicalObject
22. Preprint
23. Project
24. Report
25. Service
26. Software
27. Sound
28. Standard
29. StudyRegistration
30. Text
31. Workflow
32. Other
Uses
Direct Use
This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target resourceTypeGeneral field).
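The dotted key-value lines the model expects (e.g. `attributes.titles[0].title`) can be produced by flattening a nested DataCite-style JSON record. A minimal sketch; the helper name and exact formatting are illustrative assumptions, not part of the model's tooling:

```python
# Illustrative helper (not part of the model's tooling): flatten a nested
# DataCite-style record into dotted key-value lines for the model input.
def flatten(record, prefix=""):
    lines = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(value, path))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {record}")
    return lines

record = {
    "attributes": {
        "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
        "publisher": "Nature Publishing Group",
        "publicationYear": 2024,
    }
}
print("\n".join(flatten(record)))
```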
Downstream Use
- Metadata Enhancement: Automatically assign granular resource types to improve searchability and discoverability
- Data Curation: Support large-scale metadata enrichment workflows
- Repository Management: Improve content organization in digital repositories
Out-of-Scope Use
- General text classification beyond academic metadata
- Classification of non-English metadata (model trained primarily on English)
- Real-time applications requiring sub-second response times
How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b")

# Example metadata record
metadata = """
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
attributes.publisher: Nature Publishing Group
attributes.creators[0].name: Smith, Jane
attributes.publicationYear: 2024
attributes.types.resourceType: research article
"""

# Format as chat (you'll need the full SYSTEM_PROMPT from the training dataset)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": metadata},
]

# Generate the classification with greedy decoding
# (do_sample=False; transformers rejects temperature=0)
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```
Training Details
Training Data
The model was trained on the cometadata/generic-resource-type-training-data dataset, which contains balanced samples of DataCite metadata records across all 32 resource type categories.
Training Procedure
Training Hyperparameters
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Rank (r): 8
- LoRA Alpha: 16
- LoRA Dropout: 0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Learning Rate: 1e-4
- Training Epochs: 0.27
- Tokens Processed: 350M tokens
- Max Sequence Length: 60,000 tokens
- Batch Size: Auto-detected
- LR Scheduler: Cosine
- Warmup Steps: 100
- Training Regime: Mixed precision (bfloat16)
- Packing: Enabled for efficiency
- Loss: Completion-only (computed only on the generated classification label, not the prompt)
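The LoRA hyperparameters above can be read as follows: instead of updating a full weight matrix W, training learns two low-rank factors A (r × d_in) and B (d_out × r) and applies W' = W + (alpha / r) · B · A. A minimal numeric sketch using the listed r=8, alpha=16 (the matrix dimensions are arbitrary illustrations):

```python
import numpy as np

# Sketch of the LoRA update with the hyperparameters listed above:
# rank r=8, alpha=16, so the update is scaled by alpha / r = 2.0.
d_out, d_in, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection (initialized to zero)

# Because B starts at zero, the adapted weight equals the base weight at init:
W_adapted = W + (alpha / r) * (B @ A)
assert np.allclose(W_adapted, W)

# Only r*(d_in + d_out) = 1024 parameters are trained instead of d_out*d_in = 4096.
print(A.size + B.size, W.size)
```

This is why LoRA fits on modest hardware even for a 7B model: only the small A and B factors for the listed target modules receive gradients.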
Infrastructure
- Hardware: 8x H100 GPUs
- Training Framework: TRL (Transformer Reinforcement Learning)
- Attention Implementation: Flash Attention 2
- Memory Optimization: LoRA + gradient checkpointing
Evaluation
Testing Data & Metrics
The model was evaluated on a held-out test set with the same distribution as training data. Evaluation metrics include:
- Overall Accuracy: 96%
- Macro-averaged Precision: 49% (affected by low-support categories)
- Weighted Precision: 98%
- Macro-averaged Recall: 80%
- Weighted Recall: 96%
- Macro-averaged F1: 54%
- Weighted F1: 97%
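The gap between the macro and weighted averages comes from how per-category scores are combined: macro averaging weights all 32 categories equally, so a rare category with a handful of samples counts as much as Dataset with 1.6M, while weighted averaging scales each category by its support. A minimal illustration with hypothetical per-class precisions:

```python
# Illustrative only: two classes with very different support.
# Macro averaging treats them equally; weighted averaging follows support.
precisions = [1.00, 0.10]  # e.g. a huge, well-learned class and a tiny, poorly learned one
supports   = [990, 10]

macro = sum(precisions) / len(precisions)
weighted = sum(p * s for p, s in zip(precisions, supports)) / sum(supports)

print(round(macro, 3), round(weighted, 3))  # macro is dragged down by the rare class
```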
Detailed Performance by Category
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| High-Performance Categories | | | | |
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 |
| Software | 0.88 | 0.92 | 0.90 | 46,328 |
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 |
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 |
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 |
| Image | 0.92 | 0.99 | 0.95 | 197,524 |
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 |
| Collection | 0.93 | 0.97 | 0.95 | 35,039 |
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 |
| Medium-Performance Categories | | | | |
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 |
| Event | 0.66 | 0.96 | 0.78 | 3,415 |
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 |
| Sound | 0.67 | 0.89 | 0.76 | 1,437 |
Results Summary
The model shows excellent performance on high-volume categories like Dataset, Image, and Audiovisual, with some challenges on rare categories like Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata.
Limitations
- Class Imbalance: Some categories have very few examples, leading to lower macro-averaged scores
- Language Bias: Primarily trained on English metadata
- Domain Specificity: Optimized for DataCite-style academic metadata
- Pattern Memorization: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject)
Bias, Risks, and Limitations
Technical Limitations
- Performance varies significantly across categories due to training data imbalance
- May not generalize well to metadata formats different from DataCite
- Requires careful prompt formatting for optimal performance
Recommendations
- Use primarily for the high-performance categories identified above
- Validate predictions on categories with lower precision/recall
- Consider ensemble approaches for critical applications
- Monitor for domain shift when applying to new metadata sources
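One way to operationalize the first two recommendations is to gate predictions on the category's measured reliability: auto-accept only labels from the high-performance set and route the rest to human review. A sketch; the category set is taken from the evaluation table above, and the function name and threshold policy are illustrative assumptions:

```python
# Categories the evaluation above showed to be reliable (precision and recall >= 0.88).
HIGH_CONFIDENCE = {
    "StudyRegistration", "Software", "Preprint", "PhysicalObject",
    "InteractiveResource", "Image", "Dataset", "Collection", "Audiovisual",
}

def accept_prediction(label: str) -> bool:
    """Auto-accept only high-performance categories; route others to review."""
    return label in HIGH_CONFIDENCE

assert accept_prediction("Dataset")
assert not accept_prediction("JournalArticle")  # 0.61 recall -> send to human review
```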
Citation
If you use this model in your research, please cite:
```bibtex
@misc{comet-resource-type-classifier-2025,
  title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification},
  author={COMET Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b}
}
```
Model Card Contact
For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team.