---
library_name: transformers
tags:
- topic
- multi-sentiment
license: mit
datasets:
- valurank/Topic_Classification
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
---

# Model Card for Topic Classification Model

A fine-tuned DistilBERT model for multi-class topic classification. Given an input text, the model predicts the most relevant topic label from a predefined set. It was trained with 🤗 Transformers and PyTorch on a custom dataset derived from academic and news-style corpora.

## Model Details

### Model Description

This model was developed by Daniel (@AfroLogicInsect) to classify text into one of several predefined topics. It builds on the `distilbert-base-uncased` architecture and was fine-tuned for multi-class classification using a softmax output layer.

- **Developed by:** Daniel 🇳🇬 (@AfroLogicInsect)
- **Model type:** DistilBERT-based multi-class sequence classifier
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** distilbert-base-uncased

### Model Sources

- **Repository:** [AfroLogicInsect/topic-model-analysis-model](https://huggingface.co/AfroLogicInsect/topic-model-analysis-model)
- **Paper:** arXiv:1910.01108 (DistilBERT)
- **Demo:** [Coming soon]

## Uses

### Direct Use

- Classify academic or news-style text into topics such as AI, finance, sports, climate, etc.
- Embed in dashboards or content moderation tools for automatic tagging

### Downstream Use

- Can be extended to hierarchical topic classification
- Useful for building recommendation engines or content filters

### Out-of-Scope Use

- Not suitable for sentiment or emotion classification
- May not generalize well to informal or slang-heavy text

## Bias, Risks, and Limitations

- Trained on curated corpora, so predictions may reflect biases in the source material
- Topics are predefined and static, so emerging topics may be misclassified
- Confidence scores are probabilistic, not definitive

### Recommendations

- Retrieve several topic predictions per input (for example, the top 5) using `top_k=5`, or `return_all_scores=True`, which newer 🤗 Transformers releases deprecate in favor of `top_k`
- Consider fine-tuning on domain-specific data for improved accuracy

## How to Get Started

```python
from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hub.
# return_all_scores=True returns a score for every topic label;
# newer Transformers releases deprecate it in favor of top_k.
classifier = pipeline(
    "text-classification",
    model="AfroLogicInsect/topic-model-analysis-model",
    tokenizer="AfroLogicInsect/topic-model-analysis-model",
    return_all_scores=True
)

text = "New AI breakthrough in natural language processing"
results = classifier(text)

# results[0] holds the per-label scores for the single input text;
# sort by score and keep the five most likely topics.
top_5 = sorted(results[0], key=lambda x: x['score'], reverse=True)[:5]
for i, res in enumerate(top_5):
    print(f"Top {i+1}: {res['label']} ({res['score']:.3f})")
```

## Training Details

### Dataset

- Custom multi-class topic dataset based on arXiv abstracts and news articles
- Labels include domains like AI, finance, sports, climate, etc.

### Hyperparameters

- Epochs: 3
- Batch size: 16
- Learning rate: 2e-5
- Evaluation every 200 steps
- Metric: F1 score

### Trainer Setup

Training used the Hugging Face `Trainer` API, with `TrainingArguments` configured for early stopping and best model selection.

## Evaluation

The model achieved strong performance across multiple topic categories.
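The card does not say how precision, recall, and F1 were averaged across topics. As a rough illustration only, the sketch below computes accuracy and macro-averaged precision, recall, and F1 in plain Python. Macro averaging, the `classification_metrics` helper, and the toy labels are all assumptions made for this example, not the author's evaluation code.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this label
        else:
            fp[pred] += 1    # predicted label was wrong
            fn[truth] += 1   # true label was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,  # macro average (assumed)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical toy labels, not drawn from the model's dataset.
y_true = ["ai", "ai", "finance", "sports", "sports", "climate"]
y_pred = ["ai", "finance", "finance", "sports", "sports", "climate"]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

On these toy labels, five of six predictions are correct, so accuracy is 5/6, and macro precision and recall both come out to 0.875.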
Evaluation metrics include:

- **Accuracy:** ~90.8%
- **F1 Score:** ~0.91
- **Precision:** ~0.89
- **Recall:** ~0.93

## Environmental Impact

- **Hardware:** Google Colab (NVIDIA T4 GPU)
- **Training Time:** ~2.5 hours
- **Carbon Emitted:** ~0.3 kg CO₂eq (estimated via [ML Impact Calculator](https://mlco2.github.io/impact#compute))

## Citation

```bibtex
@misc{afrologicinsect2025topicmodel,
  title = {AfroLogicInsect Topic Classification Model},
  author = {Akan Daniel},
  year = {2025},
  howpublished = {\url{https://huggingface.co/AfroLogicInsect/topic-model-analysis-model}},
}
```

## Contact

- Name: Daniel (@AfroLogicInsect)
- Location: Lagos, Nigeria
- Contact: GitHub / Hugging Face / email (danielamahtoday@gmail.com)