---
language:
- zh
license: apache-2.0
tags:
- finance
- cryptocurrency
- chinese
- news-scoring
- text-classification
- text-regression
pipeline_tag: text-classification
library_name: transformers
base_model: LocalOptimum/chinese-crypto-sentiment
metrics:
- mae
- accuracy
- pearsonr
model-index:
- name: chinese-crypto-importance (v1.1)
  results:
  - task:
      type: text-classification
      name: News Importance Binning
    metrics:
    - type: mae
      value: 6.87
      name: MAE
    - type: accuracy
      value: 61.8%
      name: Bin Accuracy
    - type: pearsonr
      value: 0.532
      name: Pearson r
---

# Chinese Crypto News Importance Scoring Model (v1.1)

## Model Description

This model is LoRA fine-tuned from [LocalOptimum/chinese-crypto-sentiment](https://huggingface.co/LocalOptimum/chinese-crypto-sentiment) to score the market importance of Chinese cryptocurrency news, rather than its sentiment polarity.

It uses a dual-head architecture and outputs both:

- `importance_score`: a continuous score from 0 to 100 that measures the item's potential market impact
- `importance_bin`: a 4-way class, one of `noise` / `low` / `medium` / `high`

The question it answers is whether an item deserves priority attention from traders, researchers, or automated news feeds, not merely whether the text is bullish or bearish. The two outputs are intended for ranking and filtering workflows.

## Training Data

- Size: 20286 Chinese crypto news samples
- Source: 19729 news articles + 557 tweets collected via EventAlpha / WatchTower
- Labeling: 4-axis automatic scoring pipeline with rule-based cleanup
- Split: random split, 17243 training / 3043 validation samples
- Average Score: 41.7

### Scoring Axes

| Axis | Range | Description |
|---|---:|---|
| Market Reaction | 0-40 | Post-news price move, volume expansion, and volatility reaction |
| Novelty | 0-30 | Whether the item is first-hand, repeated, or part of a digest |
| Content Quality | 0-20 | Information density, numeric detail, token relevance, and noise penalties |
| Source Authority | 0-10 | Credibility of the outlet, platform, and whether it is official |
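
The four ranges add up to 100, which matches the 0-100 scale of `importance_score`. As a minimal sketch of how the axes could combine, assuming the total is a plain sum of per-axis scores clamped to their ranges (the card does not spell out the actual pipeline, and `AXIS_MAX` and `total_importance` are illustrative names):

```python
# Upper bound of each scoring axis, taken from the Scoring Axes table.
AXIS_MAX = {
    "market_reaction": 40,
    "novelty": 30,
    "content_quality": 20,
    "source_authority": 10,
}


def total_importance(axis_scores: dict) -> float:
    """Sum per-axis scores after clamping each one to [0, axis maximum].

    Assumption: the final score is a plain sum of the four axes.
    """
    total = 0.0
    for axis, upper in AXIS_MAX.items():
        raw = float(axis_scores.get(axis, 0.0))
        total += min(max(raw, 0.0), upper)
    return total


# Hypothetical axis scores for a single news item.
example = {"market_reaction": 22, "novelty": 12, "content_quality": 9, "source_authority": 4}
print(total_importance(example))  # 47.0
```

Under this assumption, an item cannot reach the `high` range on source authority and content quality alone; it also needs a strong market reaction or novelty score.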

### Label Distribution

| Bin | Score Range | Count | Share | Interpretation |
|---|---:|---:|---:|---|
| `noise` | 0-25 | 1626 | 8.0% | Low-signal, duplicate, digest, or weakly relevant content |
| `low` | 25-50 | 14773 | 72.8% | Routine updates that rarely move the market on their own |
| `medium` | 50-75 | 3840 | 18.9% | Tradeable developments with meaningful but limited impact |
| `high` | 75-100 | 47 | 0.2% | Major events that may materially change price or risk appetite |
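
The shares follow directly from the counts. The short check below recomputes them, together with the ratio between the largest and smallest bin, which gives a sense of how imbalanced the labels are:

```python
# Bin counts copied from the Label Distribution table.
counts = {"noise": 1626, "low": 14773, "medium": 3840, "high": 47}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)   # 20286
print(shares)  # {'noise': 8.0, 'low': 72.8, 'medium': 18.9, 'high': 0.2}

# Ratio between the most and least frequent bins.
imbalance = counts["low"] / counts["high"]
print(round(imbalance))  # 314
```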

## Performance Metrics

The current public version performs as follows on the validation set:

| Metric | Value |
|---|---:|
| MAE | 6.87 |
| Bin Accuracy | 61.8% |
| Pearson r | 0.532 |
| Best Epoch | 4 |
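
A minimal sketch of how these three metrics can be computed from validation predictions, in plain Python. The sample values are made up for illustration and are not the model's outputs; `mae`, `bin_accuracy`, and `pearson_r` are hypothetical helper names, not functions from the released code.

```python
import math


def mae(pred, true):
    """Mean absolute error between predicted and reference scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def bin_accuracy(pred_bins, true_bins):
    """Fraction of items whose predicted bin matches the reference bin."""
    return sum(p == t for p, t in zip(pred_bins, true_bins)) / len(true_bins)


def pearson_r(pred, true):
    """Pearson correlation coefficient between two score lists."""
    mp, mt = sum(pred) / len(pred), sum(true) / len(true)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


# Made-up scores on the 0-100 scale, for illustration only.
pred_scores = [30.0, 45.0, 60.0, 20.0]
true_scores = [35.0, 40.0, 70.0, 20.0]
pred_bins = ["low", "low", "medium", "noise"]
true_bins = ["low", "low", "medium", "low"]

print(mae(pred_scores, true_scores))      # 5.0
print(bin_accuracy(pred_bins, true_bins)) # 0.75
print(pearson_r(pred_scores, true_scores))
```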

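## Score Interpretation

| Bin | Score Range | Typical Meaning |
|---|---:|---|
| `noise` | 0-25 | Digests, weakly relevant items, repeated newsflashes, low-signal content |
| `low` | 25-50 | Routine updates, ordinary operational actions, subjective commentary, limited catalysts |
| `medium` | 50-75 | Significant developments with trading relevance, though not necessarily enough to change the broader trend |
| `high` | 75-100 | Hacks, ETF approvals, major regulatory changes, systemic risk events |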
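
The ranges share their boundary values (25, 50, 75), and the card does not say which bin a boundary score belongs to. The sketch below assumes half-open intervals with an inclusive lower bound, so a score of exactly 25 maps to `low`; `score_to_bin` is an illustrative helper, not part of the released code.

```python
from bisect import bisect_right

BIN_EDGES = [25, 50, 75]  # boundaries between the four bins
BIN_LABELS = ["noise", "low", "medium", "high"]


def score_to_bin(score: float) -> str:
    """Map a 0-100 importance score to its bin label.

    Assumption: intervals are [0, 25), [25, 50), [50, 75), [75, 100].
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    return BIN_LABELS[bisect_right(BIN_EDGES, score)]


for s in (12.0, 25.0, 41.7, 74.9, 88.0):
    print(s, score_to_bin(s))
```

A mapping like this is useful when only the continuous score is stored and the bin has to be derived later, for example in a backtest.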

## Usage

### Option 1: load the full dual-head model (recommended)

This option returns both `importance_score` and `importance_bin`.

```python
import __main__
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

repo_id = "LocalOptimum/chinese-crypto-importance"
local_dir = snapshot_download(repo_id)
# model.py ships with the repo; put the snapshot on the import path.
sys.path.insert(0, local_dir)

from model import NewsImportanceModel

# Register the class on __main__ so that unpickling model.pt can resolve it.
__main__.NewsImportanceModel = NewsImportanceModel

tokenizer = AutoTokenizer.from_pretrained(local_dir)
# weights_only=False unpickles arbitrary Python objects; only load files you trust.
model = torch.load(f"{local_dir}/model.pt", map_location="cpu", weights_only=False)
model.eval()

# The model expects Chinese input. English gloss: "US spot Ethereum ETF approved"
text = "美国现货以太坊 ETF 获批"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits, score = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        token_type_ids=inputs.get("token_type_ids"),
    )

probs = torch.softmax(logits, dim=-1)[0]
labels = ["noise", "low", "medium", "high"]
importance_bin = labels[probs.argmax().item()]
# Multiply the regression output by 100 to get the 0-100 score.
importance_score = score.item() * 100

print(importance_bin)
print(round(importance_score, 1))
```

### Option 2: use the HuggingFace classification head only

This option is compatible with `pipeline("text-classification")`, but it returns only the 4-way bin, not the continuous score.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

repo_id = "LocalOptimum/chinese-crypto-importance"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
# English gloss: "Bitcoin breaks through key resistance and sets a new local high"
print(pipe("比特币突破关键阻力位并创下阶段新高"))
```

## Training Configuration

- Base Model: `LocalOptimum/chinese-crypto-sentiment`
- Architecture: BERT backbone + classification head + regression head
- Max Length: 256
- Epochs: 10 (early stopping with patience=3; best epoch=4)
- Batch Size: 16
- Learning Rate: 2e-5
- LoRA: `r=16`, `alpha=32`, `dropout=0.05`
- Loss: `0.6 * cross_entropy + 0.4 * mse`
- Mixed Precision: FP16
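
The combined loss in the list above can be sketched in PyTorch as follows. This is an illustration rather than the training code: it assumes the regression target is the score rescaled to [0, 1] (consistent with the usage example multiplying the model output by 100) and that both terms use mean reduction. `dual_head_loss` and its argument names are hypothetical.

```python
import torch
import torch.nn.functional as F


def dual_head_loss(logits, score_pred, bin_target, score_target,
                   ce_weight=0.6, mse_weight=0.4):
    """Weighted sum of the classification and regression losses.

    logits:       (batch, 4) raw outputs of the classification head
    score_pred:   (batch,) regression output, assumed to lie in [0, 1]
    bin_target:   (batch,) integer bin indices in 0..3
    score_target: (batch,) reference scores, assumed rescaled to [0, 1]
    """
    ce = F.cross_entropy(logits, bin_target)
    mse = F.mse_loss(score_pred, score_target)
    return ce_weight * ce + mse_weight * mse


# Toy batch: uniform logits and a perfect regression output.
logits = torch.zeros(2, 4)
bin_target = torch.tensor([1, 3])
score = torch.tensor([0.42, 0.80])
loss = dual_head_loss(logits, score, bin_target, score)
print(loss.item())  # 0.6 * ln(4), roughly 0.8318
```

With uniform logits the cross-entropy term equals ln(4) and the regression term is zero, so the printed value isolates the 0.6 weight on the classification head.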

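## Use Cases

- Priority ranking of cryptocurrency news
- Filtering real-time newsflashes and reducing alert noise
- Pre-screening news feeds for researchers and traders
- Building event-weight features for backtesting and research
- Retrospective analysis of major market events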
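
For the ranking and filtering use cases, one possible post-processing step is sketched below. It assumes each news item has already been scored by the model; the `ScoredItem` container, the default threshold of 50 (the lower edge of `medium`), and the sample scores are illustrative choices, not values produced by the model.

```python
from dataclasses import dataclass


@dataclass
class ScoredItem:
    title: str
    importance_score: float  # 0-100, as returned by the dual-head model
    importance_bin: str      # noise / low / medium / high


def prioritize(items, min_score=50.0):
    """Drop items below min_score and sort the rest by descending score."""
    kept = [item for item in items if item.importance_score >= min_score]
    return sorted(kept, key=lambda item: item.importance_score, reverse=True)


# Made-up scores for illustration only.
feed = [
    ScoredItem("Daily market digest", 18.0, "noise"),
    ScoredItem("Exchange lists a new trading pair", 38.5, "low"),
    ScoredItem("Spot ETF approved", 81.2, "high"),
    ScoredItem("Protocol announces mainnet upgrade date", 56.0, "medium"),
]

for item in prioritize(feed):
    print(f"{item.importance_score:5.1f}  {item.importance_bin:<6}  {item.title}")
```

Filtering on the continuous score rather than on the bin label lets the threshold be tuned per workflow, for example a stricter cutoff for alerts than for a research digest.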

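## Annotation Principles

- Importance is not sentiment: both bullish and bearish news can be highly important
- Market reaction comes first, followed by novelty, content quality, and source credibility
- Repeated newsflashes, digests, and weakly relevant macro noise are systematically down-scored
- Official announcements, major security incidents, and ETF / regulatory breakthroughs usually score higher
- Subjective opinions and routine operational updates usually fall into `low` or `noise`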

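## Limitations

- The label distribution is heavily skewed toward `low` (72.8% of samples), so the current version remains conservative on high-importance events
- `high` samples are scarce (47 of 20286), so the model's ability to distinguish extreme high-score events still has room to improve
- The model mainly targets Chinese cryptocurrency news; cross-domain generalization is limited
- The native HuggingFace `pipeline` exposes only the classification head; the continuous score requires loading `model.pt`
- Labels come from an automatic scoring pipeline with rule-based cleanup and are not equivalent to large-scale human financial annotation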

## License

Apache-2.0

## Citation

If you use this model in research or in a product, you can cite:

```bibtex
@misc{onefly_crypto_importance_2026,
  title={Chinese Crypto News Importance Scoring Model},
  author={Onefly},
  year={2026},
  howpublished={\url{https://huggingface.co/LocalOptimum/chinese-crypto-importance}},
  note={LoRA fine-tuned from LocalOptimum/chinese-crypto-sentiment, 20286 samples, MAE=6.87, BinAcc=61.8\%}
}
```

## Base Model

This model continues training from:

- [LocalOptimum/chinese-crypto-sentiment](https://huggingface.co/LocalOptimum/chinese-crypto-sentiment)

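## Changelog

### Current Public Version

- First public release of the importance scoring model
- Dual-head output: continuous importance score + 4-way importance bin
- Trained on 20286 Chinese cryptocurrency news samples
- Current validation metrics: MAE=6.87, Bin Accuracy=61.8%, Pearson r=0.532

Questions and suggestions are welcome via issue or PR.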