|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-classification |
|
|
- metadata-classification |
|
|
- datacite |
|
|
- lora |
|
|
- qwen2.5 |
|
|
- resource-type |
|
|
license: apache-2.0 |
|
|
base_model: Qwen/Qwen2.5-7B-Instruct |
|
|
datasets: |
|
|
- cometadata/generic-resource-type-training-data |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
|
|
|
# Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B |
|
|
|
|
|
A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve generic resource type classification for ~25 million works currently classified only as "Text" in DataCite metadata. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model for this specialized classification task. |
|
|
|
|
|
- **Developed by:** COMET Metadata Team |
|
|
- **Model type:** Text Classification (Resource Type) |
|
|
- **Language(s):** English |
|
|
- **License:** Apache 2.0 |
|
|
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct |
|
|
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [cometadata/generic-resource-type-lora-qwen2.5-7b](https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b) |
|
|
- **Training Dataset:** [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) |
|
|
|
|
|
## Performance |
|
|
|
|
|
The model achieves strong overall performance with **96% accuracy**; weighted across all 32 categories, precision is **98%**, recall **96%**, and F1-score **97%**. Macro-averaged scores are lower due to class imbalance (see Evaluation).
|
|
|
|
|
### High-Performance Categories |
|
|
The following categories show excellent precision and recall, making them suitable for production use: |
|
|
- **StudyRegistration**: 97% precision, 100% recall |
|
|
- **Software**: 88% precision, 92% recall |
|
|
- **Preprint**: 100% precision, 95% recall |
|
|
- **PhysicalObject**: 100% precision, 100% recall
|
|
- **InteractiveResource**: 91% precision, 98% recall |
|
|
- **Image**: 92% precision, 99% recall |
|
|
- **Dataset**: 100% precision, 96% recall |
|
|
- **Collection**: 93% precision, 97% recall |
|
|
- **Audiovisual**: 89% precision, 97% recall |
|
|
|
|
|
### Resource Type Categories (32 total) |
|
|
|
|
|
The model classifies into these categories: |
|
|
Audiovisual, Award, Book, BookChapter, Collection, ComputationalNotebook, ConferencePaper, ConferenceProceeding, DataPaper, Dataset, Dissertation, Event, Image, Instrument, InteractiveResource, Journal, JournalArticle, Model, OutputManagementPlan, PeerReview, PhysicalObject, Preprint, Project, Report, Service, Software, Sound, Standard, StudyRegistration, Text, Workflow, Other
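For downstream validation, the label set can be kept as a constant and raw generations normalized against it. A minimal sketch (the label names come from the list above; the `normalize_label` helper and its fallback-to-`Other` behavior are illustrative, not part of the released model):

```python
RESOURCE_TYPES = [
    "Audiovisual", "Award", "Book", "BookChapter", "Collection",
    "ComputationalNotebook", "ConferencePaper", "ConferenceProceeding",
    "DataPaper", "Dataset", "Dissertation", "Event", "Image", "Instrument",
    "InteractiveResource", "Journal", "JournalArticle", "Model",
    "OutputManagementPlan", "PeerReview", "PhysicalObject", "Preprint",
    "Project", "Report", "Service", "Software", "Sound", "Standard",
    "StudyRegistration", "Text", "Workflow", "Other",
]

# Case-insensitive lookup from cleaned generation text to canonical label.
_LOOKUP = {name.lower(): name for name in RESOURCE_TYPES}

def normalize_label(raw: str) -> str:
    """Map a raw model generation to one of the 32 canonical labels.

    Strips whitespace and stray punctuation, matches case-insensitively,
    and falls back to "Other" for anything unrecognized.
    """
    cleaned = raw.strip().strip(".,'\"").lower()
    return _LOOKUP.get(cleaned, "Other")
```

This keeps malformed generations from leaking non-canonical strings into downstream metadata.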
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target `resourceTypeGeneral` field). |
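A nested DataCite record can be flattened into the expected dotted key-value lines as follows. This flattener is an illustrative sketch; the exact field selection and ordering used in training are defined by the training dataset:

```python
def flatten(record, prefix=""):
    """Flatten a nested DataCite-style dict into dotted key/value lines,
    e.g. "attributes.titles[0].title: ..."."""
    lines = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(value, path))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {record}")
    return lines

record = {
    "attributes": {
        "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
        "publisher": "Nature Publishing Group",
        "publicationYear": 2024,
    }
}

# Exclude the target field so the model cannot trivially read the answer.
metadata = "\n".join(
    line for line in flatten(record)
    if not line.startswith("attributes.types.resourceTypeGeneral")
)
```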
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
- **Metadata Enhancement**: Automatically assign granular resource types to improve searchability and discoverability |
|
|
- **Data Curation**: Support large-scale metadata enrichment workflows |
|
|
- **Repository Management**: Improve content organization in digital repositories |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- General text classification beyond academic metadata |
|
|
- Classification of non-English metadata (model trained primarily on English) |
|
|
- Real-time applications requiring sub-second response times |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
from peft import PeftModel |
|
|
|
|
|
# Load base model and tokenizer |
|
|
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
|
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct") |
|
|
|
|
|
# Load LoRA adapter |
|
|
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b") |
|
|
|
|
|
# Example metadata record |
|
|
metadata = """ |
|
|
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling |
|
|
attributes.publisher: Nature Publishing Group |
|
|
attributes.creators[0].name: Smith, Jane |
|
|
attributes.publicationYear: 2024 |
|
|
attributes.types.resourceType: research article |
|
|
""" |
|
|
|
|
|
# Format as chat (you'll need the full SYSTEM_PROMPT from the training data) |
|
|
messages = [ |
|
|
{"role": "system", "content": SYSTEM_PROMPT}, |
|
|
{"role": "user", "content": metadata} |
|
|
] |
|
|
|
|
|
# Generate classification |
|
|
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
|
|
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)  # greedy decoding


prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on the [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) dataset, which contains balanced samples of DataCite metadata records across all 32 resource type categories. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Base Model:** Qwen/Qwen2.5-7B-Instruct |
|
|
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation) |
|
|
- **LoRA Rank (r):** 8 |
|
|
- **LoRA Alpha:** 16 |
|
|
- **LoRA Dropout:** 0.1 |
|
|
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
|
|
- **Learning Rate:** 1e-4 |
|
|
- **Training Epochs:** 0.27
|
|
- **Tokens Processed:** 350M tokens |
|
|
- **Max Sequence Length:** 60,000 tokens |
|
|
- **Batch Size:** Auto-detected |
|
|
- **LR Scheduler:** Cosine |
|
|
- **Warmup Steps:** 100 |
|
|
- **Training Regime:** Mixed precision (bfloat16) |
|
|
- **Packing:** Enabled for efficiency |
|
|
- **Loss:** Completion-only loss (only classification token) |
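With rank 8 and the seven target modules above, the number of trainable adapter parameters can be estimated from the base model's layer shapes. The Qwen2.5-7B dimensions below (hidden size 3584, 28 layers, grouped-query KV dimension 512, MLP size 18944) are assumptions taken from the public model config, so treat the result as a rough estimate:

```python
# Assumed Qwen2.5-7B-Instruct shapes (from the public model config).
HIDDEN, KV_DIM, MLP, LAYERS, RANK = 3584, 512, 18944, 28, 8

# Each adapted nn.Linear of shape (out, in) adds r*(in + out) LoRA params:
# an (r x in) "A" matrix plus an (out x r) "B" matrix.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (KV_DIM, HIDDEN),
    "v_proj": (KV_DIM, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (MLP, HIDDEN),
    "up_proj": (MLP, HIDDEN),
    "down_proj": (HIDDEN, MLP),
}

per_layer = sum(RANK * (fan_in + fan_out) for fan_out, fan_in in module_shapes.values())
total = per_layer * LAYERS
print(f"~{total / 1e6:.1f}M trainable parameters")
```

Under these assumptions the adapter holds roughly 20M trainable parameters, a small fraction of the 7B base model.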
|
|
|
|
|
#### Infrastructure |
|
|
|
|
|
- **Hardware:** 8x H100 GPUs |
|
|
- **Training Framework:** TRL (Transformer Reinforcement Learning)
|
|
- **Attention Implementation:** Flash Attention 2 |
|
|
- **Memory Optimization:** LoRA + gradient checkpointing |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data & Metrics |
|
|
|
|
|
The model was evaluated on a held-out test set with the same distribution as training data. Evaluation metrics include: |
|
|
|
|
|
- **Overall Accuracy:** 96% |
|
|
- **Macro-averaged Precision:** 49% (affected by low-support categories) |
|
|
- **Weighted Precision:** 98% |
|
|
- **Macro-averaged Recall:** 80% |
|
|
- **Weighted Recall:** 96% |
|
|
- **Macro-averaged F1:** 54% |
|
|
- **Weighted F1:** 97% |
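The gap between macro and weighted averages follows directly from class imbalance: macro averaging gives every category equal weight regardless of support, while weighted averaging scales each category's score by its support. A toy illustration with hypothetical numbers (not the evaluation data):

```python
# Hypothetical per-class F1 scores and supports, mimicking the imbalance
# above: two large well-served classes and one tiny poorly-served class.
f1_scores = [0.98, 0.95, 0.10]
supports  = [1_000_000, 200_000, 10]

# Macro: plain mean, every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted: support-scaled mean, large classes dominate.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro:    {macro_f1:.2f}")     # dragged down by the rare class
print(f"weighted: {weighted_f1:.2f}")  # dominated by the large classes
```

The single rare class pulls the macro score far below the weighted one, which is exactly the pattern in the metrics above.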
|
|
|
|
|
### Detailed Performance by Category |
|
|
|
|
|
| Category | Precision | Recall | F1-Score | Support | |
|
|
|----------|-----------|---------|----------|---------| |
|
|
| **High-Performance Categories** | | | | | |
|
|
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 | |
|
|
| Software | 0.88 | 0.92 | 0.90 | 46,328 | |
|
|
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 | |
|
|
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 | |
|
|
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 | |
|
|
| Image | 0.92 | 0.99 | 0.95 | 197,524 | |
|
|
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 | |
|
|
| Collection | 0.93 | 0.97 | 0.95 | 35,039 | |
|
|
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 | |
|
|
| **Medium-Performance Categories** | | | | | |
|
|
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 | |
|
|
| Event | 0.66 | 0.96 | 0.78 | 3,415 | |
|
|
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 | |
|
|
| Sound | 0.67 | 0.89 | 0.76 | 1,437 | |
|
|
|
|
|
### Results Summary |
|
|
|
|
|
The model shows excellent performance on high-volume categories like Dataset, Image, and Audiovisual, with some challenges on rare categories like Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata. |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Class Imbalance**: Some categories have very few examples, leading to lower macro-averaged scores |
|
|
- **Language Bias**: Primarily trained on English metadata |
|
|
- **Domain Specificity**: Optimized for DataCite-style academic metadata |
|
|
- **Pattern Memorization**: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject) |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
### Technical Limitations |
|
|
- Performance varies significantly across categories due to training data imbalance |
|
|
- May not generalize well to metadata formats different from DataCite |
|
|
- Requires careful prompt formatting for optimal performance |
|
|
|
|
|
### Recommendations |
|
|
- Use primarily for the high-performance categories identified above |
|
|
- Validate predictions on categories with lower precision/recall |
|
|
- Consider ensemble approaches for critical applications |
|
|
- Monitor for domain shift when applying to new metadata sources |
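One lightweight ensemble option is self-consistency voting: sample several generations per record and keep the majority label. A sketch (the list of votes stands in for repeated `model.generate()` calls with sampling enabled):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among several sampled predictions.

    Ties resolve to the label that first reached the winning count.
    """
    return Counter(predictions).most_common(1)[0][0]

# e.g. five sampled generations for one metadata record
votes = ["Dataset", "Dataset", "Collection", "Dataset", "Image"]
label = majority_vote(votes)
```

Agreement across samples can also serve as a cheap confidence signal for routing low-agreement records to manual review.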
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{comet-resource-type-classifier-2025, 
|
|
title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification}, |
|
|
author={COMET Team}, |
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team. |