---
library_name: transformers
tags:
- text-classification
- metadata-classification
- datacite
- lora
- qwen2.5
- resource-type
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- cometadata/generic-resource-type-training-data
language:
- en
---
# Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B
A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve generic resource type classification for ~25 million works currently classified only as "Text" in DataCite metadata.
## Model Details
### Model Description
This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model for this specialized classification task.
- **Developed by:** COMET Metadata Team
- **Model type:** Text Classification (Resource Type)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
### Model Sources
- **Repository:** [cometadata/generic-resource-type-lora-qwen2.5-7b](https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b)
- **Training Dataset:** [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data)
## Performance
The model achieves strong overall performance with **96% accuracy** and, weighted by support, **98% precision**, **96% recall**, and **97% F1-score** across all 32 categories. (Macro-averaged scores are lower due to class imbalance; see the Evaluation section.)
### High-Performance Categories
The following categories show excellent precision and recall, making them suitable for production use:
- **StudyRegistration**: 97% precision, 100% recall
- **Software**: 88% precision, 92% recall
- **Preprint**: 100% precision, 95% recall
- **PhysicalObject**: 100% precision, 100% recall
- **InteractiveResource**: 91% precision, 98% recall
- **Image**: 92% precision, 99% recall
- **Dataset**: 100% precision, 96% recall
- **Collection**: 93% precision, 97% recall
- **Audiovisual**: 89% precision, 97% recall
### Resource Type Categories (32 total)
The model classifies into these categories:
1. Audiovisual, 2. Award, 3. Book, 4. BookChapter, 5. Collection, 6. ComputationalNotebook, 7. ConferencePaper, 8. ConferenceProceeding, 9. DataPaper, 10. Dataset, 11. Dissertation, 12. Event, 13. Image, 14. Instrument, 15. InteractiveResource, 16. Journal, 17. JournalArticle, 18. Model, 19. OutputManagementPlan, 20. PeerReview, 21. PhysicalObject, 22. Preprint, 23. Project, 24. Report, 25. Service, 26. Software, 27. Sound, 28. Standard, 29. StudyRegistration, 30. Text, 31. Workflow, 32. Other
## Uses
### Direct Use
This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target `resourceTypeGeneral` field).
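As a sketch of that formatting, a nested DataCite-style record can be flattened into dotted key-value lines (the field names follow the example elsewhere in this card; a real record will have more fields, and `resourceTypeGeneral` should be dropped before formatting):

```python
def flatten(record, prefix=""):
    """Flatten a nested metadata record into dotted key-value lines."""
    lines = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(value, path))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {record}")
    return lines

record = {
    "attributes": {
        "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
        "publisher": "Nature Publishing Group",
        "publicationYear": 2024,
    }
}
prompt = "\n".join(flatten(record))
# prompt begins: attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
```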
### Downstream Use
- **Metadata Enhancement**: Automatically assign granular resource types to improve searchability and discoverability
- **Data Curation**: Support large-scale metadata enrichment workflows
- **Repository Management**: Improve content organization in digital repositories
### Out-of-Scope Use
- General text classification beyond academic metadata
- Classification of non-English metadata (model trained primarily on English)
- Real-time applications requiring sub-second response times
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b")

# Example metadata record (the target resourceTypeGeneral field is excluded)
metadata = """
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
attributes.publisher: Nature Publishing Group
attributes.creators[0].name: Smith, Jane
attributes.publicationYear: 2024
attributes.types.resourceType: research article
"""

# Format as chat (use the full SYSTEM_PROMPT from the training dataset)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": metadata},
]

# Generate the classification with greedy (deterministic) decoding
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```
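The generation is free text, so it is worth normalizing `prediction` against the 32 valid labels before storing it. A minimal sketch (the fallback to `Other` for unrecognized output is our assumption, not part of the model's contract):

```python
# The 32 valid resource type labels listed in this card
VALID_TYPES = {
    "Audiovisual", "Award", "Book", "BookChapter", "Collection",
    "ComputationalNotebook", "ConferencePaper", "ConferenceProceeding",
    "DataPaper", "Dataset", "Dissertation", "Event", "Image", "Instrument",
    "InteractiveResource", "Journal", "JournalArticle", "Model",
    "OutputManagementPlan", "PeerReview", "PhysicalObject", "Preprint",
    "Project", "Report", "Service", "Software", "Sound", "Standard",
    "StudyRegistration", "Text", "Workflow", "Other",
}

def normalize_prediction(raw: str) -> str:
    """Map the model's raw output to one of the valid labels."""
    stripped = raw.strip()
    candidate = stripped.split()[0].rstrip(".,") if stripped else ""
    return candidate if candidate in VALID_TYPES else "Other"
```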
## Training Details
### Training Data
The model was trained on the [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) dataset, which contains balanced samples of DataCite metadata records across all 32 resource type categories.
### Training Procedure
#### Training Hyperparameters
- **Base Model:** Qwen/Qwen2.5-7B-Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank (r):** 8
- **LoRA Alpha:** 16
- **LoRA Dropout:** 0.1
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Learning Rate:** 1e-4
- **Training Epochs:** 0.27
- **Tokens Processed:** 350M tokens
- **Max Sequence Length:** 60,000 tokens
- **Batch Size:** Auto-detected
- **LR Scheduler:** Cosine
- **Warmup Steps:** 100
- **Training Regime:** Mixed precision (bfloat16)
- **Packing:** Enabled for efficiency
- **Loss:** Completion-only (loss computed only on the classification tokens)
#### Infrastructure
- **Hardware:** 8x H100 GPUs
- **Training Framework:** TRL (Transformer Reinforcement Learning)
- **Attention Implementation:** Flash Attention 2
- **Memory Optimization:** LoRA + gradient checkpointing
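Under the hyperparameters above, the PEFT/TRL setup would look roughly like this. This is a sketch, not the actual training script; argument names follow recent `peft`/`trl` releases and may differ slightly across versions:

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter configuration matching the hyperparameters above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Trainer configuration: bf16 mixed precision, cosine schedule with warmup,
# sequence packing, and gradient checkpointing; sequences capped at 60,000 tokens
training_args = SFTConfig(
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    bf16=True,
    packing=True,
    max_seq_length=60_000,
    gradient_checkpointing=True,
    output_dir="resource-type-lora",
)
```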
## Evaluation
### Testing Data & Metrics
The model was evaluated on a held-out test set with the same distribution as training data. Evaluation metrics include:
- **Overall Accuracy:** 96%
- **Macro-averaged Precision:** 49% (affected by low-support categories)
- **Weighted Precision:** 98%
- **Macro-averaged Recall:** 80%
- **Weighted Recall:** 96%
- **Macro-averaged F1:** 54%
- **Weighted F1:** 97%
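To see why the macro and weighted averages diverge so sharply, here is a toy calculation using three classes. The supports come from the table below, but the Instrument F1 of 0.0 is an assumed value for illustration, not a reported result:

```python
# Per-class F1 scores and supports (Instrument F1 is assumed for illustration)
per_class_f1 = {"Dataset": 0.98, "Image": 0.95, "Instrument": 0.0}
support      = {"Dataset": 1_619_320, "Image": 197_524, "Instrument": 6}

# Macro average: every class counts equally, so one poor rare class drags it down
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: each class is weighted by support, so frequent classes dominate
total = sum(support.values())
weighted_f1 = sum(per_class_f1[c] * support[c] for c in per_class_f1) / total

print(f"macro F1:    {macro_f1:.2f}")     # 0.64
print(f"weighted F1: {weighted_f1:.2f}")  # 0.98
```

The same arithmetic, scaled to 32 classes, explains how the model can report a 54% macro F1 alongside a 97% weighted F1.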
### Detailed Performance by Category
| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|---------|----------|---------|
| **High-Performance Categories** | | | | |
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 |
| Software | 0.88 | 0.92 | 0.90 | 46,328 |
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 |
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 |
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 |
| Image | 0.92 | 0.99 | 0.95 | 197,524 |
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 |
| Collection | 0.93 | 0.97 | 0.95 | 35,039 |
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 |
| **Medium-Performance Categories** | | | | |
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 |
| Event | 0.66 | 0.96 | 0.78 | 3,415 |
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 |
| Sound | 0.67 | 0.89 | 0.76 | 1,437 |
### Results Summary
The model shows excellent performance on high-volume categories like Dataset, Image, and Audiovisual, with some challenges on rare categories like Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata.
## Limitations
- **Class Imbalance**: Some categories have very few examples, leading to lower macro-averaged scores
- **Language Bias**: Primarily trained on English metadata
- **Domain Specificity**: Optimized for DataCite-style academic metadata
- **Pattern Memorization**: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject)
## Bias, Risks, and Limitations
### Technical Limitations
- Performance varies significantly across categories due to training data imbalance
- May not generalize well to metadata formats different from DataCite
- Requires careful prompt formatting for optimal performance
### Recommendations
- Use primarily for the high-performance categories identified above
- Validate predictions on categories with lower precision/recall
- Consider ensemble approaches for critical applications
- Monitor for domain shift when applying to new metadata sources
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{comet-resource-type-classifier-2025,
  title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification},
  author={COMET Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b}
}
```
## Model Card Contact
For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team.