---
library_name: transformers
tags:
- text-classification
- metadata-classification
- datacite
- lora
- qwen2.5
- resource-type
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- cometadata/generic-resource-type-training-data
language:
- en
---

# Generic Resource Type Classifier - LoRA Fine-tuned Qwen2.5-7B

A LoRA fine-tuned version of Qwen2.5-7B-Instruct for classifying academic metadata into 32 specific resource types. This model was developed as part of the COMET enrichment and curation workflow to improve resource type classification for the ~25 million works currently classified only as "Text" in DataCite metadata.

## Model Details

### Model Description

This model classifies DataCite metadata records into granular resource types (e.g., JournalArticle, Preprint, Report, BookChapter, Dissertation) rather than the generic "Text" classification. It uses LoRA (Low-Rank Adaptation) fine-tuning on Qwen2.5-7B-Instruct to efficiently adapt the model to this specialized classification task.

- **Developed by:** COMET Metadata Team
- **Model type:** Text Classification (Resource Type)
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)

### Model Sources

- **Repository:** [cometadata/generic-resource-type-lora-qwen2.5-7b](https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b)
- **Training Dataset:** [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data)

## Performance

The model achieves strong overall performance: **96% accuracy**, **98% weighted precision**, **96% weighted recall**, and **97% weighted F1-score** across all 32 categories.
### High-Performance Categories

The following categories show excellent precision and recall, making them suitable for production use:

- **StudyRegistration**: 97% precision, 100% recall
- **Software**: 88% precision, 92% recall
- **Preprint**: 100% precision, 95% recall
- **PhysicalObject**: 100% precision, 100% recall
- **InteractiveResource**: 91% precision, 98% recall
- **Image**: 92% precision, 99% recall
- **Dataset**: 100% precision, 96% recall
- **Collection**: 93% precision, 97% recall
- **Audiovisual**: 89% precision, 97% recall

### Resource Type Categories (32 total)

The model classifies records into these categories:

1. Audiovisual
2. Award
3. Book
4. BookChapter
5. Collection
6. ComputationalNotebook
7. ConferencePaper
8. ConferenceProceeding
9. DataPaper
10. Dataset
11. Dissertation
12. Event
13. Image
14. Instrument
15. InteractiveResource
16. Journal
17. JournalArticle
18. Model
19. OutputManagementPlan
20. PeerReview
21. PhysicalObject
22. Preprint
23. Project
24. Report
25. Service
26. Software
27. Sound
28. Standard
29. StudyRegistration
30. Text
31. Workflow
32. Other

## Uses

### Direct Use

This model is designed to classify DataCite metadata records into specific resource types. Input should be formatted as key-value pairs of metadata fields (excluding the target `resourceTypeGeneral` field).
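The exact serialization used in training is defined by the training dataset; as a rough sketch (the `flatten` helper and the sample record below are illustrative, not the official preprocessing), a DataCite JSON record can be flattened into the dotted key-value lines the model expects:

```python
def flatten(obj, prefix="attributes"):
    """Recursively flatten a nested dict/list into dotted key-value lines,
    e.g. 'attributes.titles[0].title: ...'."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.extend(flatten(value, f"{prefix}.{key}"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            lines.extend(flatten(value, f"{prefix}[{i}]"))
    else:
        lines.append(f"{prefix}: {obj}")
    return lines

# Hypothetical record; the target field is removed before serialization.
record = {
    "titles": [{"title": "Machine Learning Approaches to Climate Modeling"}],
    "publisher": "Nature Publishing Group",
    "types": {"resourceType": "research article", "resourceTypeGeneral": "Text"},
}
record["types"].pop("resourceTypeGeneral", None)

print("\n".join(flatten(record)))
```

The resulting text is what gets passed as the user message in the chat template shown below under "How to Get Started".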
### Downstream Use

- **Metadata Enhancement**: Automatically assign granular resource types to improve searchability and discoverability
- **Data Curation**: Support large-scale metadata enrichment workflows
- **Repository Management**: Improve content organization in digital repositories

### Out-of-Scope Use

- General text classification beyond academic metadata
- Classification of non-English metadata (the model was trained primarily on English)
- Real-time applications requiring sub-second response times

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "cometadata/generic-resource-type-lora-qwen2.5-7b")

# Example metadata record
metadata = """
attributes.titles[0].title: Machine Learning Approaches to Climate Modeling
attributes.publisher: Nature Publishing Group
attributes.creators[0].name: Smith, Jane
attributes.publicationYear: 2024
attributes.types.resourceType: research article
"""

# Format as chat (you'll need the full SYSTEM_PROMPT from the training data)
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": metadata}
]

# Generate the classification label with greedy (deterministic) decoding
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
prediction = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```

## Training Details

### Training Data

The model was trained on the [cometadata/generic-resource-type-training-data](https://huggingface.co/datasets/cometadata/generic-resource-type-training-data) dataset, which contains balanced samples of DataCite
metadata records across all 32 resource type categories.

### Training Procedure

#### Training Hyperparameters

- **Base Model:** Qwen/Qwen2.5-7B-Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank (r):** 8
- **LoRA Alpha:** 16
- **LoRA Dropout:** 0.1
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Learning Rate:** 1e-4
- **Training Epochs:** 0.27
- **Tokens Processed:** 350M
- **Max Sequence Length:** 60,000 tokens
- **Batch Size:** Auto-detected
- **LR Scheduler:** Cosine
- **Warmup Steps:** 100
- **Training Regime:** Mixed precision (bfloat16)
- **Packing:** Enabled for efficiency
- **Loss:** Completion-only loss (only the classification tokens contribute)

#### Infrastructure

- **Hardware:** 8x H100 GPUs
- **Training Framework:** TRL (Transformer Reinforcement Learning)
- **Attention Implementation:** Flash Attention 2
- **Memory Optimization:** LoRA + gradient checkpointing

## Evaluation

### Testing Data & Metrics

The model was evaluated on a held-out test set with the same distribution as the training data.
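The adapter hyperparameters listed above map directly onto a `peft` `LoraConfig`. A minimal sketch, reconstructed from the card rather than taken from the original training script; the TRL `SFTTrainer` setup (packing, bfloat16, cosine schedule, completion-only loss) is omitted:

```python
from peft import LoraConfig

# Adapter configuration reconstructed from the hyperparameters above
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

Targeting all attention and MLP projection matrices (rather than only `q_proj`/`v_proj`) is a common choice when adapting instruction-tuned models to a new output distribution.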
Evaluation metrics include:

- **Overall Accuracy:** 96%
- **Macro-averaged Precision:** 49% (affected by low-support categories)
- **Weighted Precision:** 98%
- **Macro-averaged Recall:** 80%
- **Weighted Recall:** 96%
- **Macro-averaged F1:** 54%
- **Weighted F1:** 97%

### Detailed Performance by Category

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| **High-Performance Categories** | | | | |
| StudyRegistration | 0.97 | 1.00 | 0.98 | 1,663 |
| Software | 0.88 | 0.92 | 0.90 | 46,328 |
| Preprint | 1.00 | 0.95 | 0.97 | 20,604 |
| PhysicalObject | 1.00 | 1.00 | 1.00 | 790,408 |
| InteractiveResource | 0.91 | 0.98 | 0.94 | 53,954 |
| Image | 0.92 | 0.99 | 0.95 | 197,524 |
| Dataset | 1.00 | 0.96 | 0.98 | 1,619,320 |
| Collection | 0.93 | 0.97 | 0.95 | 35,039 |
| Audiovisual | 0.89 | 0.97 | 0.92 | 51,977 |
| **Medium-Performance Categories** | | | | |
| Dissertation | 0.80 | 0.88 | 0.84 | 5,203 |
| Event | 0.66 | 0.96 | 0.78 | 3,415 |
| JournalArticle | 0.88 | 0.61 | 0.72 | 79,182 |
| Sound | 0.67 | 0.89 | 0.76 | 1,437 |

### Results Summary

The model shows excellent performance on high-volume categories such as Dataset, Image, and Audiovisual, with some challenges on rare categories such as Instrument (6 samples) and Standard (131 samples). The weighted metrics better represent real-world performance given the natural class imbalance in academic metadata.
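The gap between the macro- and weighted-averaged scores follows directly from class imbalance: macro averaging weights every category equally, while weighted averaging scales each category by its support. A minimal illustration (the Instrument precision of 0.05 below is a hypothetical low value to show the effect of a poorly predicted rare class, not the measured score):

```python
# Per-class precision and support; Dataset and Image values come from the
# table above, the Instrument precision (0.05) is hypothetical.
precision = {"Dataset": 1.00, "Image": 0.92, "Instrument": 0.05}
support = {"Dataset": 1_619_320, "Image": 197_524, "Instrument": 6}

# Macro: unweighted mean over classes
macro = sum(precision.values()) / len(precision)

# Weighted: mean over classes scaled by class support
total = sum(support.values())
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro:    {macro:.2f}")     # 0.66 -- dragged down by the rare class
print(f"weighted: {weighted:.2f}")  # 0.99 -- dominated by high-support classes
```

A single badly predicted six-sample class barely moves the weighted score but costs a third of a point of macro precision here, which is exactly the pattern in the 49% macro vs. 98% weighted precision reported above.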
## Limitations

- **Class Imbalance**: Some categories have very few examples, leading to lower macro-averaged scores
- **Language Bias**: Primarily trained on English metadata
- **Domain Specificity**: Optimized for DataCite-style academic metadata
- **Pattern Memorization**: May have memorized some specific patterns (e.g., "PGRFA Material" → PhysicalObject)

## Bias, Risks, and Limitations

### Technical Limitations

- Performance varies significantly across categories due to training data imbalance
- May not generalize well to metadata formats that differ from DataCite
- Requires careful prompt formatting for optimal performance

### Recommendations

- Use primarily for the high-performance categories identified above
- Validate predictions for categories with lower precision/recall
- Consider ensemble approaches for critical applications
- Monitor for domain shift when applying to new metadata sources

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{comet-resource-type-classifier-2025,
  title={Generic Resource Type Classifier: LoRA Fine-tuned Qwen2.5-7B for DataCite Metadata Classification},
  author={COMET Team},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/cometadata/generic-resource-type-lora-qwen2.5-7b}
}
```

## Model Card Contact

For questions about this model, please open an issue in the COMET project repository or contact the COMET metadata team.