---
library_name: transformers
tags:
- text-classification
- metadata-classification
- datacite
- lora
- qwen2.5
- resource-type
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- cometadata/generic-resource-type-training-data
language:
- en
---

# Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B

A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve resource type classification for the ~25 million works currently classified only as "Text" in DataCite metadata.

## Model Details

### Model Description

This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model to this specialized classification task.

- **Developed by:** COMET Metadata Team
- **Model type:** Text Classification (Resource Type)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)

### Model Sources

- **Repository:** [cometadata/generic-resource-type-lora-qwen2.5-7b](https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b)
- **Training Dataset:** [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data)

## Performance

The model achieves strong overall performance: **96% accuracy**, **98% weighted precision**, **96% weighted recall**, and **97% weighted F1-score** across all 32 categories.
### High-Performance Categories

The following categories show excellent precision and recall, making them suitable for production use:

- **StudyRegistration**: 97% precision, 100% recall
- **Software**: 88% precision, 92% recall
- **Preprint**: 100% precision, 95% recall
- **PhysicalObject**: 100% precision, 100% recall
- **InteractiveResource**: 91% precision, 98% recall
- **Image**: 92% precision, 99% recall
- **Dataset**: 100% precision, 96% recall
- **Collection**: 93% precision, 97% recall
- **Audiovisual**: 89% precision, 97% recall

### Resource Type Categories (32 total)

The model classifies records into these categories:

1. Audiovisual
2. Award
3. Book
4. BookChapter
5. Collection
6. ComputationalNotebook
7. ConferencePaper
8. ConferenceProceeding
9. DataPaper
10. Dataset
11. Dissertation
12. Event
13. Image
14. Instrument
15. InteractiveResource
16. Journal
17. JournalArticle
18. Model
19. OutputManagementPlan
20. PeerReview
21. PhysicalObject
22. Preprint
23. Project
24. Report
25. Service
26. Software
27. Sound
28. Standard
29. StudyRegistration
30. Text
31. Workflow
32. Other

## Uses

### Direct Use

This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target `resourceTypeGeneral` field).
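The exact serialization used in training is defined by the training dataset; as a rough sketch (the `flatten` helper and the sample record below are illustrative, not the official preprocessing), a DataCite JSON record can be flattened into the dotted key-value lines the model expects:

```python
def flatten(obj, prefix="attributes"):
    """Recursively flatten a nested dict/list into dotted key-value lines,
    e.g. 'attributes.titles[0].title: ...'."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.extend(flatten(value, f"{prefix}.{key}"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {obj}")
    return lines

# Hypothetical record; the target field is removed before serialization.
record = {
    "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
    "publisher": "Nature Publishing Group",
    "types": {"resourceType": "research article", "resourceTypeGeneral": "Text"},
}
record["types"].pop("resourceTypeGeneral", None)

print("\n".join(flatten(record)))
```

The resulting text is what gets passed as the user message in the chat template shown below under "How to Get Started".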
### Downstream Use

- **Metadata Enhancement**: Automatically assign granular resource types to improve searchability and discoverability
- **Data Curation**: Support large-scale metadata enrichment workflows
- **Repository Management**: Improve content organization in digital repositories

### Out-of-Scope Use

- General text classification beyond academic metadata
- Classification of non-English metadata (the model was trained primarily on English)
- Real-time applications requiring sub-second response times

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b")

# Example metadata record
metadata = """
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
attributes.publisher: Nature Publishing Group
attributes.creators[0].name: Smith, Jane
attributes.publicationYear: 2024
attributes.types.resourceType: research article
"""

# Format as chat (you'll need the full SYSTEM_PROMPT from the training data)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": metadata}
]

# Generate the classification label with greedy (deterministic) decoding
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```

## Training Details

### Training Data

The model was trained on the [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) dataset, which contains balanced samples of DataCite
metadata records across all 32 resource type categories.

### Training Procedure

#### Training Hyperparameters

- **Base Model:** Qwen/Qwen2.5-7B-Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank (r):** 8
- **LoRA Alpha:** 16
- **LoRA Dropout:** 0.1
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Learning Rate:** 1e-4
- **Training Epochs:** 0.27
- **Tokens Processed:** 350M
- **Max Sequence Length:** 60,000 tokens
- **Batch Size:** Auto-detected
- **LR Scheduler:** Cosine
- **Warmup Steps:** 100
- **Training Regime:** Mixed precision (bfloat16)
- **Packing:** Enabled for efficiency
- **Loss:** Completion-only loss (only the classification tokens contribute)

#### Infrastructure

- **Hardware:** 8x H100 GPUs
- **Training Framework:** TRL (Transformer Reinforcement Learning)
- **Attention Implementation:** Flash Attention 2
- **Memory Optimization:** LoRA + gradient checkpointing

## Evaluation

### Testing Data & Metrics

The model was evaluated on a held-out test set with the same distribution as the training data.
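The adapter hyperparameters listed above map directly onto a `peft` `LoraConfig`. A minimal sketch, reconstructed from the card rather than taken from the original training script; the TRL `SFTTrainer` setup (packing, bfloat16, cosine schedule, completion-only loss) is omitted:

```python
from peft import LoraConfig

# Adapter configuration reconstructed from the hyperparameters above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

Targeting all attention and MLP projection matrices (rather than only `q_proj`/`v_proj`) is a common choice when adapting instruction-tuned models to a new output distribution.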
Evaluation metrics include:

- **Overall Accuracy:** 96%
- **Macro-averaged Precision:** 49% (affected by low-support categories)
- **Weighted Precision:** 98%
- **Macro-averaged Recall:** 80%
- **Weighted Recall:** 96%
- **Macro-averaged F1:** 54%
- **Weighted F1:** 97%

### Detailed Performance by Category

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| **High-Performance Categories** | | | | |
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 |
| Software | 0.88 | 0.92 | 0.90 | 46,328 |
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 |
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 |
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 |
| Image | 0.92 | 0.99 | 0.95 | 197,524 |
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 |
| Collection | 0.93 | 0.97 | 0.95 | 35,039 |
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 |
| **Medium-Performance Categories** | | | | |
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 |
| Event | 0.66 | 0.96 | 0.78 | 3,415 |
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 |
| Sound | 0.67 | 0.89 | 0.76 | 1,437 |

### Results Summary

The model shows excellent performance on high-volume categories such as Dataset, Image, and Audiovisual, with some challenges on rare categories such as Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata.
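The gap between the macro- and weighted-averaged scores follows directly from class imbalance: macro averaging weights every category equally, while weighted averaging scales each category by its support. A minimal illustration (the Instrument precision of 0.05 below is a hypothetical low value to show the effect of a poorly predicted rare class, not the measured score):

```python
# Per-class precision and support; Dataset and Image values come from the
# table above, the Instrument precision (0.05) is hypothetical.
precision = {"Dataset": 1.00, "Image": 0.92, "Instrument": 0.05}
support = {"Dataset": 1_619_320, "Image": 197_524, "Instrument": 6}

# Macro: unweighted mean over classes
macro = sum(precision.values()) / len(precision)

# Weighted: mean over classes scaled by class support
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro:    {macro:.2f}")     # 0.66 -- dragged down by the rare class
print(f"weighted: {weighted:.2f}")  # 0.99 -- dominated by high-support classes
```

A single badly predicted six-sample class barely moves the weighted score but costs a third of a point of macro precision here, which is exactly the pattern in the 49% macro vs. 98% weighted precision reported above.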
## Limitations

- **Class Imbalance**: Some categories have very few examples, leading to lower macro-averaged scores
- **Language Bias**: Primarily trained on English metadata
- **Domain Specificity**: Optimized for DataCite-style academic metadata
- **Pattern Memorization**: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject)

## Bias, Risks, and Limitations

### Technical Limitations

- Performance varies significantly across categories due to training data imbalance
- May not generalize well to metadata formats that differ from DataCite
- Requires careful prompt formatting for optimal performance

### Recommendations

- Use primarily for the high-performance categories identified above
- Validate predictions for categories with lower precision/recall
- Consider ensemble approaches for critical applications
- Monitor for domain shift when applying to new metadata sources

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{comet-resource-type-classifier-2025,
  title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification},
  author={COMET Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b}
}
```

## Model Card Contact

For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team.