|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-classification |
|
|
- metadata-classification |
|
|
- datacite |
|
|
- lora |
|
|
- qwen2.5 |
|
|
- resource-type |
|
|
license: apache-2.0 |
|
|
base_model: Qwen/Qwen2.5-7B-Instruct |
|
|
datasets: |
|
|
- cometadata/generic-resource-type-training-data |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
|
|
|
# Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B |
|
|
|
|
|
A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve generic resource type classification for ~25 million works currently classified only as "Text" in DataCite metadata. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model for this specialized classification task. |
|
|
|
|
|
- **Developed by:** COMET Metadata Team |
|
|
- **Model type:** Text Classification (Resource Type) |
|
|
- **Language(s):** English |
|
|
- **License:** Apache 2.0 |
|
|
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct |
|
|
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [cometadata/generic-resource-type-lora-qwen2.5-7b](https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b) |
|
|
- **Training Dataset:** [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) |
|
|
|
|
|
## Performance |
|
|
|
|
|
The model achieves strong overall performance with **96% accuracy**; weighted across all 32 categories, precision is **98%**, recall **96%**, and F1-score **97%**. Macro-averaged scores are lower due to class imbalance (see Evaluation).
|
|
|
|
|
### High-Performance Categories |
|
|
The following categories show excellent precision and recall, making them suitable for production use: |
|
|
- **StudyRegistration**: 97% precision, 100% recall |
|
|
- **Software**: 88% precision, 92% recall |
|
|
- **Preprint**: 100% precision, 95% recall |
|
|
- **PhysicalObject**: 100% precision, 100% recall
|
|
- **InteractiveResource**: 91% precision, 98% recall |
|
|
- **Image**: 92% precision, 99% recall |
|
|
- **Dataset**: 100% precision, 96% recall |
|
|
- **Collection**: 93% precision, 97% recall |
|
|
- **Audiovisual**: 89% precision, 97% recall |
|
|
|
|
|
### Resource Type Categories (32 total) |
|
|
|
|
|
The model classifies into these categories: |
|
|
Audiovisual, Award, Book, BookChapter, Collection, ComputationalNotebook, ConferencePaper, ConferenceProceeding, DataPaper, Dataset, Dissertation, Event, Image, Instrument, InteractiveResource, Journal, JournalArticle, Model, OutputManagementPlan, PeerReview, PhysicalObject, Preprint, Project, Report, Service, Software, Sound, Standard, StudyRegistration, Text, Workflow, Other
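For downstream validation, the label set can be kept as a constant and raw generations normalized against it. A minimal sketch (the label names come from the list above; the `normalize_label` helper and its fallback-to-`Other` behavior are illustrative, not part of the released model):

```python
RESOURCE_TYPES = [
    "Audiovisual", "Award", "Book", "BookChapter", "Collection",
    "ComputationalNotebook", "ConferencePaper", "ConferenceProceeding",
    "DataPaper", "Dataset", "Dissertation", "Event", "Image", "Instrument",
    "InteractiveResource", "Journal", "JournalArticle", "Model",
    "OutputManagementPlan", "PeerReview", "PhysicalObject", "Preprint",
    "Project", "Report", "Service", "Software", "Sound", "Standard",
    "StudyRegistration", "Text", "Workflow", "Other",
]

# Case-insensitive lookup from cleaned generation text to canonical label.
_LOOKUP = {name.lower(): name for name in RESOURCE_TYPES}

def normalize_label(raw: str) -> str:
    """Map a raw model generation to one of the 32 canonical labels.

    Strips whitespace and stray punctuation, matches case-insensitively,
    and falls back to "Other" for anything unrecognized.
    """
    cleaned = raw.strip().strip(".,'\"").lower()
    return _LOOKUP.get(cleaned, "Other")
```

This keeps malformed generations from leaking non-canonical strings into downstream metadata.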
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target `resourceTypeGeneral` field). |
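A nested DataCite record can be flattened into the expected dotted key-value lines as follows. This flattener is an illustrative sketch; the exact field selection and ordering used in training are defined by the training dataset:

```python
def flatten(record, prefix=""):
    """Flatten a nested DataCite-style dict into dotted key/value lines,
    e.g. "attributes.titles[0].title: ..."."""
    lines = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(value, path))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {record}")
    return lines

record = {
    "attributes": {
        "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
        "publisher": "Nature Publishing Group",
        "publicationYear": 2024,
    }
}

# Exclude the target field so the model cannot trivially read the answer.
metadata = "\n".join(
    line for line in flatten(record)
    if not line.startswith("attributes.types.resourceTypeGeneral")
)
```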
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
- **Metadata Enhancement**: Automatically assign granular resource types to improve searchability and discoverability |
|
|
- **Data Curation**: Support large-scale metadata enrichment workflows |
|
|
- **Repository Management**: Improve content organization in digital repositories |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- General text classification beyond academic metadata |
|
|
- Classification of non-English metadata (model trained primarily on English) |
|
|
- Real-time applications requiring sub-second response times |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
from peft import PeftModel |
|
|
|
|
|
# Load base model and tokenizer |
|
|
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="auto")
|
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct") |
|
|
|
|
|
# Load LoRA adapter |
|
|
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b") |
|
|
|
|
|
# Example metadata record |
|
|
metadata = """ |
|
|
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling |
|
|
attributes.publisher: Nature Publishing Group |
|
|
attributes.creators[0].name: Smith, Jane |
|
|
attributes.publicationYear: 2024 |
|
|
attributes.types.resourceType: research article |
|
|
""" |
|
|
|
|
|
# Format as chat (you'll need the full SYSTEM_PROMPT from the training data) |
|
|
messages = [ |
|
|
{"role": "system", "content": SYSTEM_PROMPT}, |
|
|
{"role": "user", "content": metadata} |
|
|
] |
|
|
|
|
|
# Generate classification |
|
|
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
|
|
outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)  # greedy decoding


prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on the [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) dataset, which contains balanced samples of DataCite metadata records across all 32 resource type categories. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Base Model:** Qwen/Qwen2.5-7B-Instruct |
|
|
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation) |
|
|
- **LoRA Rank (r):** 8 |
|
|
- **LoRA Alpha:** 16 |
|
|
- **LoRA Dropout:** 0.1 |
|
|
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
|
|
- **Learning Rate:** 1e-4 |
|
|
- **Training Epochs:** 0.27
|
|
- **Tokens Processed:** 350M tokens |
|
|
- **Max Sequence Length:** 60,000 tokens |
|
|
- **Batch Size:** Auto-detected |
|
|
- **LR Scheduler:** Cosine |
|
|
- **Warmup Steps:** 100 |
|
|
- **Training Regime:** Mixed precision (bfloat16) |
|
|
- **Packing:** Enabled for efficiency |
|
|
- **Loss:** Completion-only loss (only classification token) |
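With rank 8 and the seven target modules above, the number of trainable adapter parameters can be estimated from the base model's layer shapes. The Qwen2.5-7B dimensions below (hidden size 3584, 28 layers, grouped-query KV dimension 512, MLP size 18944) are assumptions taken from the public model config, so treat the result as a rough estimate:

```python
# Assumed Qwen2.5-7B-Instruct shapes (from the public model config).
HIDDEN, KV_DIM, MLP, LAYERS, RANK = 3584, 512, 18944, 28, 8

# Each adapted nn.Linear of shape (out, in) adds r*(in + out) LoRA params:
# an (r x in) "A" matrix plus an (out x r) "B" matrix.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (KV_DIM, HIDDEN),
    "v_proj": (KV_DIM, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (MLP, HIDDEN),
    "up_proj": (MLP, HIDDEN),
    "down_proj": (HIDDEN, MLP),
}

per_layer = sum(RANK * (fan_in + fan_out) for fan_out, fan_in in module_shapes.values())
total = per_layer * LAYERS
print(f"~{total / 1e6:.1f}M trainable parameters")
```

Under these assumptions the adapter holds roughly 20M trainable parameters, a small fraction of the 7B base model.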
|
|
|
|
|
#### Infrastructure |
|
|
|
|
|
- **Hardware:** 8x H100 GPUs |
|
|
- **Training Framework:** TRL (Transformer Reinforcement Learning)
|
|
- **Attention Implementation:** Flash Attention 2 |
|
|
- **Memory Optimization:** LoRA + gradient checkpointing |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data & Metrics |
|
|
|
|
|
The model was evaluated on a held-out test set with the same distribution as training data. Evaluation metrics include: |
|
|
|
|
|
- **Overall Accuracy:** 96% |
|
|
- **Macro-averaged Precision:** 49% (affected by low-support categories) |
|
|
- **Weighted Precision:** 98% |
|
|
- **Macro-averaged Recall:** 80% |
|
|
- **Weighted Recall:** 96% |
|
|
- **Macro-averaged F1:** 54% |
|
|
- **Weighted F1:** 97% |
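The gap between macro and weighted averages follows directly from class imbalance: macro averaging gives every category equal weight regardless of support, while weighted averaging scales each category's score by its support. A toy illustration with hypothetical numbers (not the evaluation data):

```python
# Hypothetical per-class F1 scores and supports, mimicking the imbalance
# above: two large well-served classes and one tiny poorly-served class.
f1_scores = [0.98, 0.95, 0.10]
supports  = [1_000_000, 200_000, 10]

# Macro: plain mean, every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted: support-scaled mean, large classes dominate.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro:    {macro_f1:.2f}")     # dragged down by the rare class
print(f"weighted: {weighted_f1:.2f}")  # dominated by the large classes
```

The single rare class pulls the macro score far below the weighted one, which is exactly the pattern in the metrics above.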
|
|
|
|
|
### Detailed Performance by Category |
|
|
|
|
|
| Category | Precision | Recall | F1-Score | Support | |
|
|
|----------|-----------|---------|----------|---------| |
|
|
| **High-Performance Categories** | | | | | |
|
|
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 | |
|
|
| Software | 0.88 | 0.92 | 0.90 | 46,328 | |
|
|
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 | |
|
|
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 | |
|
|
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 | |
|
|
| Image | 0.92 | 0.99 | 0.95 | 197,524 | |
|
|
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 | |
|
|
| Collection | 0.93 | 0.97 | 0.95 | 35,039 | |
|
|
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 | |
|
|
| **Medium-Performance Categories** | | | | | |
|
|
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 | |
|
|
| Event | 0.66 | 0.96 | 0.78 | 3,415 | |
|
|
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 | |
|
|
| Sound | 0.67 | 0.89 | 0.76 | 1,437 | |
|
|
|
|
|
### Results Summary |
|
|
|
|
|
The model shows excellent performance on high-volume categories like Dataset, Image, and Audiovisual, with some challenges on rare categories like Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata. |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Class Imbalance**: Some categories have very few examples, leading to lower macro-averaged scores |
|
|
- **Language Bias**: Primarily trained on English metadata |
|
|
- **Domain Specificity**: Optimized for DataCite-style academic metadata |
|
|
- **Pattern Memorization**: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject) |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
### Technical Limitations |
|
|
- Performance varies significantly across categories due to training data imbalance |
|
|
- May not generalize well to metadata formats different from DataCite |
|
|
- Requires careful prompt formatting for optimal performance |
|
|
|
|
|
### Recommendations |
|
|
- Use primarily for the high-performance categories identified above |
|
|
- Validate predictions on categories with lower precision/recall |
|
|
- Consider ensemble approaches for critical applications |
|
|
- Monitor for domain shift when applying to new metadata sources |
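One lightweight ensemble option is self-consistency voting: sample several generations per record and keep the majority label. A sketch (the list of votes stands in for repeated `model.generate()` calls with sampling enabled):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among several sampled predictions.

    Ties resolve to the label that first reached the winning count.
    """
    return Counter(predictions).most_common(1)[0][0]

# e.g. five sampled generations for one metadata record
votes = ["Dataset", "Dataset", "Collection", "Dataset", "Image"]
label = majority_vote(votes)
```

Agreement across samples can also serve as a cheap confidence signal for routing low-agreement records to manual review.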
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{comet-resource-type-classifier-2025, 
|
|
title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification}, |
|
|
author={COMET Team}, |
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team. |