# Chinese Email Classification Model

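## Model Overview

A Chinese email classifier based on MobileBERT that assigns email content to one of six categories. The model is optimized for mobile devices and other resource-constrained environments.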

## Model Information

- Architecture: MobileBertForSequenceClassification
- Base model: MobileBERT (a BERT variant optimized for mobile devices)
- Language: Chinese
- Task: Text classification
- Number of classes: 6 email categories

## Classification Labels

```
0: 工作邮件 (Work Email)
1: 个人邮件 (Personal Email)
2: 促销邮件 (Promotional Email)
3: 垃圾邮件 (Spam Email)
4: 通知邮件 (Notification Email)
5: 其他邮件 (Other Email)
```

## Model Performance

Evaluation results, using 3,000 training samples and 500 test samples:

- Accuracy: 98.2%
- F1 score: 98.2%
- Precision: 98.2%
- Recall: 98.2%
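
The card does not say how precision, recall and F1 were averaged across the six classes. As a minimal sketch, assuming single-label predictions and support-weighted averaging (an assumption, not the author's stated method), the four metrics could be computed as below. The function name `classification_metrics` and the toy labels are illustrative only.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1.

    Weighted averaging is an assumption; the model card does not state
    which averaging method produced its reported numbers.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    total = len(y_true)
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / n_true
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n_true / total    # weight each class by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example with class ids 0-5 (not the model's real predictions).
y_true = [0, 1, 2, 3, 4, 5, 3, 2]
y_pred = [0, 1, 2, 3, 4, 5, 2, 2]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

With support-weighted averaging, recall always equals accuracy, which would be consistent with those two figures matching in the list above.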

## Model Parameters

- Parameter count: 24,584,966 (~24.6 million)
- Model size: ~94 MB
- Maximum sequence length: 128 tokens
- Vocabulary size: 30,522
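
The ~94 MB figure is consistent with storing every parameter as a 32-bit float, if MB is read as MiB. A quick check, assuming float32 weights (the card does not state the dtype):

```python
PARAM_COUNT = 24_584_966   # from the model card
BYTES_PER_PARAM = 4        # assumption: float32 weights

size_bytes = PARAM_COUNT * BYTES_PER_PARAM
size_mib = size_bytes / (1024 ** 2)   # binary megabytes
size_mb = size_bytes / 1_000_000      # decimal megabytes

print(f"{size_bytes:,} bytes = {size_mib:.1f} MiB = {size_mb:.1f} MB")
```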

## Usage

### Quick Start

```python
from transformers import MobileBertForSequenceClassification, MobileBertTokenizer
import torch

# Load the model and tokenizer
model = MobileBertForSequenceClassification.from_pretrained('your-username/email-classifier-chinese')
tokenizer = MobileBertTokenizer.from_pretrained('your-username/email-classifier-chinese')

# Prediction helper
def predict_email_category(text):
    """Return (category, confidence) for a single email text."""
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        padding='max_length',
        max_length=128
    )

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(predictions, dim=-1).item()

    # Map class ids to category names
    labels = {
        0: "Work Email",
        1: "Personal Email",
        2: "Promotional Email",
        3: "Spam Email",
        4: "Notification Email",
        5: "Other Email"
    }

    confidence = predictions[0][predicted_class].item()
    return labels[predicted_class], confidence

# Usage example
email_text = "恭喜您中奖了！点击链接领取奖品。"  # "Congratulations, you won! Click the link to claim your prize."
category, confidence = predict_email_category(email_text)
print(f"Email type: {category}, confidence: {confidence:.3f}")
# Example output: Email type: Promotional Email, confidence: 0.920
```
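
The function above always returns the top class, even when the model is unsure. One possible post-processing step, sketched here in plain Python, falls back to "Other Email" when the top probability is below a threshold. The helper name `pick_label`, the 0.6 threshold, and the sample logits are illustrative assumptions, not part of the model.

```python
import math

LABELS = {
    0: "Work Email",
    1: "Personal Email",
    2: "Promotional Email",
    3: "Spam Email",
    4: "Notification Email",
    5: "Other Email",
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits, threshold=0.6, fallback_id=5):
    """Return (label, confidence); use the fallback class below the threshold."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best_id]
    if confidence < threshold:
        return LABELS[fallback_id], confidence
    return LABELS[best_id], confidence


# Hypothetical logits, e.g. outputs.logits[0].tolist() from the model.
print(pick_label([0.1, 0.2, 4.0, 1.5, 0.3, 0.0]))   # one class clearly dominates
print(pick_label([1.0, 1.1, 1.2, 1.0, 0.9, 1.0]))   # near-uniform, falls back
```

The threshold trades coverage for precision and would need tuning on held-out data.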

### Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-username/email-classifier-chinese",
    tokenizer="your-username/email-classifier-chinese"
)

result = classifier("您好，请查收今天的工作报告。")  # "Hello, please find today's work report."
print(result)
```
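
`pipeline` returns a list of dicts with `label` and `score` keys. If the model's `config.json` does not define `id2label`, Transformers reports generic names such as `LABEL_2`. A small helper, sketched under that assumption, maps them back to the categories listed above; the name `readable` and the sample result are illustrative.

```python
ID2LABEL = {
    0: "Work Email",
    1: "Personal Email",
    2: "Promotional Email",
    3: "Spam Email",
    4: "Notification Email",
    5: "Other Email",
}


def readable(results):
    """Replace generic LABEL_<n> names with category names; leave others untouched."""
    mapped = []
    for item in results:
        label = item["label"]
        if label.startswith("LABEL_") and label[6:].isdigit():
            label = ID2LABEL.get(int(label[6:]), label)
        mapped.append({"label": label, "score": item["score"]})
    return mapped


# Hypothetical pipeline output, not a real prediction.
sample = [{"label": "LABEL_0", "score": 0.97}]
print(readable(sample))
```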

## Training Details

- Training device: CPU
- Epochs: 2
- Batch size: 4
- Gradient accumulation steps: 4
- Learning rate: 3e-05
- Optimizer: AdamW
- Training time: ~10 minutes
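
With a batch size of 4 and 4 gradient accumulation steps, each optimizer update sees an effective batch of 16 samples. The sketch below works out the resulting step counts for the 3,000 training samples mentioned in the performance section. It assumes the last partial accumulation group still triggers an update, which depends on the training framework.

```python
import math

train_samples = 3000      # from the performance section
batch_size = 4
grad_accum_steps = 4
epochs = 2

effective_batch = batch_size * grad_accum_steps
micro_batches_per_epoch = math.ceil(train_samples / batch_size)
# Assumption: a trailing partial accumulation group still performs an update.
updates_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
total_updates = updates_per_epoch * epochs

print(f"effective batch size: {effective_batch}")
print(f"micro-batches per epoch: {micro_batches_per_epoch}")
print(f"optimizer updates per epoch: {updates_per_epoch}")
print(f"optimizer updates in total: {total_updates}")
```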

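## Mobile Deployment

The model is well suited to on-device deployment:

1. Android integration: supports PyTorch Mobile
2. iOS integration: can be converted to Core ML
3. Edge computing: runs on edge devices
4. Quantization: supports INT8 quantization to reduce model size

For a detailed mobile integration guide, see the documentation in the model repository.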

## Use Cases

- Automatic sorting in email clients
- Spam filtering
- Email management systems
- Automated processing of enterprise email
- Mobile email apps

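## Limitations and Notes

1. Language: optimized primarily for Chinese email
2. Domain adaptation: may need fine-tuning for specific domains
3. Context length: inputs are limited to 128 tokens
4. Data privacy: process sensitive email content on the local device where possible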
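
Emails longer than 128 tokens are truncated by the tokenizer settings used in the quick start. One common workaround, sketched here in plain Python, splits the token ids into overlapping windows that each fit the limit, so every window can be classified separately and the class probabilities averaged. The function names, the stride of 64, and the two positions reserved for `[CLS]` and `[SEP]` are assumptions for illustration.

```python
def split_into_windows(token_ids, max_length=128, stride=64, num_special=2):
    """Split token ids into overlapping windows of at most max_length - num_special."""
    window = max_length - num_special
    if window <= 0 or stride <= 0:
        raise ValueError("max_length must exceed num_special and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]

    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window reached the end of the input
        start += stride
    return windows


def average_probabilities(per_window_probs):
    """Average class probabilities across windows (simple unweighted mean)."""
    count = len(per_window_probs)
    return [sum(column) / count for column in zip(*per_window_probs)]


# 300 dummy token ids stand in for a long tokenized email.
chunks = split_into_windows(list(range(300)))
print([len(chunk) for chunk in chunks])
```

Keeping the stride smaller than the window makes consecutive windows overlap, so a phrase near a boundary appears whole in at least one of them.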

## Citation

If you use this model, please consider citing it:

```bibtex
@misc{chinese-email-classifier-2024,
  title={Chinese Email Classification Model Based on MobileBERT},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/your-username/email-classifier-chinese}}
}
```

## License

This model is released under the Apache 2.0 license.

## Contact

For questions or suggestions, get in touch via:
- GitHub Issues: [project link]
- Email: [your email]

---

Disclaimer: This model is intended for research and non-commercial use only. Test and validate it thoroughly before using it in production.