# Chinese Email Classification Model

## Model Overview

This is a MobileBERT-based Chinese email classification model that sorts email content into six categories. It is optimized for mobile devices and resource-constrained environments.

## Model Information

- **Architecture**: MobileBertForSequenceClassification
- **Base model**: MobileBERT (a BERT variant optimized for mobile devices)
- **Language**: Chinese
- **Task**: Text classification
- **Number of classes**: 6 email categories

## Classification Labels

```
0: Work Email (工作邮件)
1: Personal Email (个人邮件)
2: Promotional Email (促销邮件)
3: Spam Email (垃圾邮件)
4: Notification Email (通知邮件)
5: Other Email (其他邮件)
```

## Model Performance

Evaluation results based on 3,000 training samples and 500 test samples:

- **Accuracy**: 98.2%
- **F1 score**: 98.2%
- **Precision**: 98.2%
- **Recall**: 98.2%

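The scores above are standard multi-class classification metrics. A minimal sketch of how such scores are computed with scikit-learn, shown on dummy labels (not the actual evaluation data; weighted averaging is an assumption about how the reported scores were aggregated):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Dummy labels over the 6 classes -- illustration only, not the real test set.
y_true = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3]
y_pred = [0, 1, 2, 3, 4, 5, 0, 1, 2, 5]  # one misclassification

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted")

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")  # accuracy=0.900 on this dummy data
```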
## Model Parameters

- **Parameter count**: 24,584,966 (~24.6M)
- **Model size**: ~94 MB
- **Maximum sequence length**: 128 tokens
- **Vocabulary size**: 30,522

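The parameter count above can be reproduced by summing the element counts of a model's parameter tensors. A minimal sketch on a tiny stand-in module (for the real figure, apply the same sum to the loaded MobileBERT model):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the element count of every parameter tensor in the model.
    return sum(p.numel() for p in model.parameters())

# Tiny stand-in: a Linear layer has in_features*out_features weights
# plus out_features biases -> 10*5 + 5 = 55 parameters.
layer = nn.Linear(10, 5)
print(count_parameters(layer))  # 55
```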
## Usage

### Quick Start

```python
import torch
from transformers import MobileBertForSequenceClassification, MobileBertTokenizer

# Load the model and tokenizer
model = MobileBertForSequenceClassification.from_pretrained('your-username/email-classifier-chinese')
tokenizer = MobileBertTokenizer.from_pretrained('your-username/email-classifier-chinese')

# Prediction helper
def predict_email_category(text):
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        padding='max_length',
        max_length=128
    )

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()

    labels = {
        0: "工作邮件",  # Work Email
        1: "个人邮件",  # Personal Email
        2: "促销邮件",  # Promotional Email
        3: "垃圾邮件",  # Spam Email
        4: "通知邮件",  # Notification Email
        5: "其他邮件",  # Other Email
    }

    confidence = predictions[0][predicted_class].item()
    return labels[predicted_class], confidence

# Example
email_text = "恭喜您中奖了!点击链接领取奖品。"  # "Congratulations, you've won! Click the link to claim your prize."
category, confidence = predict_email_category(email_text)
print(f"Category: {category}, confidence: {confidence:.3f}")
# Example output: Category: 促销邮件, confidence: 0.920
```

### Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/email-classifier-chinese",
    tokenizer="your-username/email-classifier-chinese"
)

result = classifier("您好,请查收今天的工作报告。")  # "Hello, please find today's work report."
print(result)
```

## Training Details

- **Hardware**: CPU
- **Epochs**: 2
- **Batch size**: 4
- **Gradient accumulation steps**: 4 (effective batch size 16)
- **Learning rate**: 3e-5
- **Optimizer**: AdamW
- **Training time**: ~10 minutes

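The hyperparameters above can be collected into a configuration; a minimal sketch with illustrative key names (not necessarily the exact ones used in training), showing how gradient accumulation determines the effective batch size:

```python
# Hypothetical configuration dict mirroring the hyperparameters above.
train_config = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 3e-5,
    "optimizer": "adamw",
}

# Gradients are accumulated over 4 mini-batches before each optimizer step,
# so each weight update sees 4 * 4 = 16 examples.
effective_batch_size = (
    train_config["per_device_train_batch_size"]
    * train_config["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 16
```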
## Mobile Deployment

The model is well suited to mobile deployment:

1. **Android**: supported via PyTorch Mobile
2. **iOS**: supported via Core ML conversion
3. **Edge computing**: can run on edge devices
4. **Quantization**: supports INT8 quantization to reduce model size

See the documentation in the model repository for a detailed mobile integration guide.

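As an illustration of point 4, PyTorch's dynamic INT8 quantization can be applied to a model's linear layers. A minimal sketch on a small stand-in module (with the real model, pass the `MobileBertForSequenceClassification` instance loaded via `from_pretrained` instead):

```python
import torch
import torch.nn as nn

# Stand-in for the classifier; illustrative only.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 6))

# Replace Linear layers with versions that store weights as INT8
# and quantize activations dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

logits = quantized(torch.randn(1, 128))
print(logits.shape)  # torch.Size([1, 6])
```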
## Use Cases

- Automatic classification in email clients
- Spam filtering
- Email management systems
- Automated enterprise email processing
- Mobile email apps

## Limitations and Caveats

1. **Language**: optimized primarily for Chinese email
2. **Domain adaptation**: may require fine-tuning for specific domains
3. **Context length**: supports at most 128 tokens; longer emails are truncated
4. **Data privacy**: sensitive email content should be processed on-device

## Citation

If you use this model, please consider citing:

```bibtex
@misc{chinese-email-classifier-2024,
  title={Chinese Email Classification Model Based on MobileBERT},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/your-username/email-classifier-chinese}}
}
```

## License

This model is released under the Apache 2.0 license.

## Contact

For questions or suggestions, please reach out via:

- GitHub Issues: [project link]
- Email: [your email]

---

**Disclaimer**: This model is intended for research and non-commercial use only. Test and validate it thoroughly before deploying it in a production environment.