# t5_base_autotagging
This model is a fine-tuned version of google-t5/t5-base for automatic tagging. It has been trained to generate relevant tags for text inputs, which is useful for categorizing documents, articles, and other textual data into predefined tags or labels.
It achieves the following results on the evaluation set:
- Loss: 0.5004
## Model description
The t5_base_autotagging model is based on the T5 (Text-to-Text Transfer Transformer) architecture, a powerful pre-trained model designed for text-to-text tasks. This model has been fine-tuned to predict multiple tags for a given input text, which is particularly useful for automatic tagging in tasks like document classification, content labeling, and content-based recommendations. The fine-tuning process was carried out on a specialized dataset tailored for generating tags in natural language.
The model takes text input and outputs a sequence of tags relevant to the input content. It works by leveraging the encoder-decoder architecture of T5, which allows it to process the input and generate text in the form of tags, making it suitable for various downstream applications such as:
- Document categorization
- Tagging content for metadata
- Topic identification
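As a minimal sketch, the model can be loaded and queried through the `transformers` auto classes; the input text and the generation parameters (`max_new_tokens`, `num_beams`) below are illustrative choices, not values stated by this card:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "KeerthiKeswaran/t5_base_ft_autotagging"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = (
    "Researchers release an open-source library for training "
    "large language models on consumer GPUs."
)

# Encode the input, generate a tag sequence, and decode it back to text.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
tags = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(tags)
```

The decoded string contains the predicted tags; how individual tags are delimited (commas, spaces, etc.) depends on the formatting of the training data.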
## Intended uses & limitations
Intended uses:
- Automatic Tagging: This model can be used to automatically tag text documents based on their content.
- Content Categorization: It can categorize articles, blog posts, and other types of content into relevant tags or categories.
- Metadata Generation: It can be employed to generate metadata tags for content management systems, blogs, or websites.
Limitations:
- Tag Prediction Accuracy: The model might not always generate the most accurate or relevant tags depending on the diversity and complexity of the input text.
- Generalization: While it performs well on the specific dataset it was trained on, it may need further fine-tuning or additional training on other datasets to generalize across a wide range of domains or languages.
- Dataset Dependency: The quality of the tags predicted is strongly dependent on the dataset used for training. If the training data is not representative of a wide range of content, the model's performance may degrade in some cases.
## Training and evaluation data
The model was trained on a dataset specifically created for automatic tagging tasks. The dataset consists of pairs of text and associated tags, where the tags represent categories or keywords relevant to the text. The data was preprocessed to include clean, structured text inputs, and each document or passage was associated with multiple tags that were used during the fine-tuning process.
- Training Data: The training data was sourced from a combination of publicly available tagged datasets and synthetic examples to ensure a diverse set of inputs.
- Evaluation Data: The evaluation dataset was held out from the training set and consists of text documents along with their corresponding tags to evaluate the model's performance in terms of generalization to unseen data.
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW with betas=(0.9, 0.999), epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 30
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.6489 | 1.0 | 1250 | 0.5840 |
| 0.5754 | 2.0 | 2500 | 0.5296 |
| 0.5182 | 3.0 | 3750 | 0.5059 |
| 0.4823 | 4.0 | 5000 | 0.4930 |
| 0.4643 | 5.0 | 6250 | 0.4826 |
| 0.4418 | 6.0 | 7500 | 0.4763 |
| 0.4379 | 7.0 | 8750 | 0.4739 |
| 0.4106 | 8.0 | 10000 | 0.4728 |
| 0.4045 | 9.0 | 11250 | 0.4729 |
| 0.3846 | 10.0 | 12500 | 0.4727 |
| 0.3825 | 11.0 | 13750 | 0.4719 |
| 0.3747 | 12.0 | 15000 | 0.4734 |
| 0.3621 | 13.0 | 16250 | 0.4744 |
| 0.3524 | 14.0 | 17500 | 0.4770 |
| 0.3446 | 15.0 | 18750 | 0.4785 |
| 0.3440 | 16.0 | 20000 | 0.4811 |
| 0.3379 | 17.0 | 21250 | 0.4836 |
| 0.3342 | 18.0 | 22500 | 0.4838 |
| 0.3294 | 19.0 | 23750 | 0.4866 |
| 0.3159 | 20.0 | 25000 | 0.4867 |
| 0.3171 | 21.0 | 26250 | 0.4899 |
| 0.3120 | 22.0 | 27500 | 0.4925 |
| 0.3007 | 23.0 | 28750 | 0.4943 |
| 0.3114 | 24.0 | 30000 | 0.4962 |
| 0.2950 | 25.0 | 31250 | 0.4978 |
| 0.2956 | 26.0 | 32500 | 0.4981 |
| 0.2890 | 27.0 | 33750 | 0.4981 |
| 0.2934 | 28.0 | 35000 | 0.4992 |
| 0.2932 | 29.0 | 36250 | 0.5006 |
| 0.2941 | 30.0 | 37500 | 0.5004 |
### Framework versions
- Transformers: 4.47.1
- Pytorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0
## Evaluation metrics
The model was evaluated on the auto-tagging task using the following metrics:
| Metric | Score |
|---|---|
| ROUGE-1 | 0.6923 |
| ROUGE-2 | 0.3731 |
| ROUGE-L | 0.6226 |
| BLEU | 0.2578 |
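ROUGE-1 measures unigram overlap between the generated and reference tag sequences. As a dependency-free sketch of the underlying computation (real evaluations typically use a library such as `rouge-score` or `evaluate`, which also apply stemming and other normalization):

```python
from collections import Counter

def rouge_1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate string."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Two of three unigrams match, so precision = recall = F1 = 2/3.
print(rouge_1_f("machine learning tutorial", "machine learning guide"))
```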
## Additional information
This model is intended for use in automatic tagging systems, where it can categorize content into predefined tags for classification purposes. The training data used represents a wide variety of text content with associated tags to improve generalization.
To fine-tune this model on other datasets or tagging tasks, prepare a dataset of text-tag pairs, and adjust hyperparameters such as the learning rate, batch size, and number of epochs to match the complexity of your task and data.
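A sketch of how such text-tag pairs might be tokenized for seq2seq fine-tuning; the `text` and `tags` column names are assumptions (the card does not document the dataset schema), and the tokenizer is passed in so any T5-compatible tokenizer can be used:

```python
def preprocess(examples, tokenizer, max_input_len=512, max_target_len=64):
    """Tokenize text-tag pairs for seq2seq fine-tuning.

    `examples` is a dict with "text" and "tags" lists (assumed column names).
    """
    model_inputs = tokenizer(
        examples["text"], max_length=max_input_len, truncation=True
    )
    # The target tag string is tokenized separately and attached as labels.
    labels = tokenizer(
        text_target=examples["tags"], max_length=max_target_len, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With a `datasets.Dataset`, this would typically be applied via `dataset.map(lambda ex: preprocess(ex, tokenizer), batched=True)` before handing the result to the `Trainer`.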
## Model tree for KeerthiKeswaran/t5_base_ft_autotagging
- Base model: google-t5/t5-base