This repository contains both code and model weights. Access is manually reviewed by the author, and commercial use is not permitted unless prior written authorization has been granted.

EcomBert-DC-V1

EcomBert-DC-V1 is a 50-class text classification model for cross-border e-commerce seller questions. It uses jhu-clsp/mmBERT-small as the backbone, with a custom mean-pooling classifier head inspired by ModernBERT and an auxiliary primary-category head.

This repository is organized both as a Hugging Face model repository and as a lightweight business inference project:

.
|-- infer.py
|-- ecombert_dc/
|   |-- inference.py
|   |-- model.py
|   `-- config.py
`-- models/
    `-- ecombert-dc-v1/
        |-- model.safetensors
        |-- backbone_config.json
        |-- tokenizer.json
        |-- label2id.json
        `-- ...

models/ecombert-dc-v1/model.safetensors already contains the fused mmBERT-small backbone and classification-head weights. Default inference does not require users to download the mmBERT-small weights separately.

Architecture

Backbone: jhu-clsp/mmBERT-small
Pooling: mean pooling
Classification head: ModernBERT-style dense + GELU + LayerNorm + dropout
Dropout: 0.1
Class weighting: none
Max length: 768
Labels: 10 primary categories and 50 secondary categories

This is a custom PyTorch classifier, and its weights are not in the native AutoModelForSequenceClassification format. Use the root-level infer.py script or ecombert_dc.EcomBertDocumentClassifier for inference.
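
The architecture listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the code in ecombert_dc/model.py: the names masked_mean_pool and SketchClassifierHead, the exact layer order inside the head, the choice to feed both output layers from the same features, and the hidden size of 64 used in the demo are all assumptions.

```python
import torch
from torch import nn


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts


class SketchClassifierHead(nn.Module):
    """Hypothetical head: dense -> GELU -> LayerNorm -> dropout -> two linear outputs."""

    def __init__(self, hidden_size: int, num_secondary: int = 50,
                 num_primary: int = 10, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)
        # Main 50-way output and auxiliary 10-way primary-category output.
        self.secondary_out = nn.Linear(hidden_size, num_secondary)
        self.primary_out = nn.Linear(hidden_size, num_primary)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        pooled = masked_mean_pool(hidden_states, attention_mask)
        features = self.dropout(self.norm(self.act(self.dense(pooled))))
        return self.secondary_out(features), self.primary_out(features)


# Demo with random tensors standing in for backbone output (hidden size 64 is a placeholder).
demo_head = SketchClassifierHead(hidden_size=64).eval()
demo_hidden = torch.randn(2, 7, 64)
demo_mask = torch.ones(2, 7, dtype=torch.long)
secondary_logits, primary_logits = demo_head(demo_hidden, demo_mask)
```

In a real pipeline the hidden states would come from the mmBERT-small backbone, and hidden_size would be read from backbone_config.json rather than hard-coded.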

Performance

The test set comes from the fixed split used by this project and contains 1,199 records.

Metric	Value
Primary accuracy	83.74%
Secondary accuracy / Accuracy	72.31%
Conditional accuracy	86.35%
Macro F1	66.36%
Weighted F1	72.07%
Cross-primary error rate	16.26%
Share of errors that cross primary categories	58.73%
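
The relationships between the metrics in the table can be made concrete with a short sketch. It assumes that the primary prediction is the parent of the predicted secondary label, that conditional accuracy means secondary accuracy among records whose primary category is correct, and that the label names in the usage note are made up; the repository's evaluation code may define these differently.

```python
def hierarchical_metrics(true_labels, pred_labels, secondary_to_primary):
    """Compute the table's metrics from secondary-label predictions.

    Assumption: the primary prediction is derived from the predicted
    secondary label via secondary_to_primary.
    """
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("need two non-empty label lists of equal length")

    total = len(true_labels)
    primary_hits = 0
    secondary_hits = 0
    for true, pred in zip(true_labels, pred_labels):
        if secondary_to_primary[true] == secondary_to_primary[pred]:
            primary_hits += 1
        if true == pred:
            secondary_hits += 1

    errors = total - secondary_hits            # any wrong secondary label
    cross_primary_errors = total - primary_hits  # wrong primary category as well
    return {
        "primary_accuracy": primary_hits / total,
        "secondary_accuracy": secondary_hits / total,
        "conditional_accuracy": secondary_hits / primary_hits if primary_hits else 0.0,
        "cross_primary_error_rate": cross_primary_errors / total,
        "cross_primary_share_of_errors": cross_primary_errors / errors if errors else 0.0,
    }
```

Under these definitions, conditional accuracy equals secondary accuracy divided by primary accuracy, and the cross-primary error rate equals one minus primary accuracy. The reported figures are consistent with that reading: 72.31 / 83.74 is about 86.35%, and 100 - 83.74 = 16.26%.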

Installation

pip install -r requirements.txt

CLI Inference

Run from the repository root. The default model directory is models/ecombert-dc-v1:

python infer.py --text "广告花费突然上涨，关键词点击很多但是没有转化，应该怎么优化？"

You can also specify the model directory explicitly. Both the project root and the model asset directory are supported:

python infer.py --model-dir . --text "新品刚上架，Vine和Coupon应该怎么配合启动？"
python infer.py --model-dir models/ecombert-dc-v1 --text "新品刚上架，Vine和Coupon应该怎么配合启动？"

For long documents, enable chunk averaging, which splits each document into chunks and averages the per-chunk predictions:

python infer.py --input samples.jsonl --max-chunks-per-doc 3 --chunk-stride 128 --batch-size 4
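
One way chunk averaging could work is sketched below. The helper names chunk_token_ids and average_probabilities are hypothetical, and the sketch treats --chunk-stride as the overlap between consecutive windows (the Hugging Face tokenizer convention); the source does not say which convention infer.py follows, nor whether it averages probabilities or logits. Special tokens are ignored for simplicity.

```python
def chunk_token_ids(token_ids, max_length=768, overlap=128, max_chunks=3):
    """Split a token sequence into overlapping windows, keeping at most max_chunks."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    start = 0
    while start < len(token_ids) and len(chunks) < max_chunks:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return chunks


def average_probabilities(chunk_probs):
    """Average per-chunk class probabilities into one document-level distribution."""
    if not chunk_probs:
        raise ValueError("need at least one chunk")
    num_classes = len(chunk_probs[0])
    return [
        sum(probs[i] for probs in chunk_probs) / len(chunk_probs)
        for i in range(num_classes)
    ]
```

Capping the number of chunks bounds inference cost per document, at the price of ignoring text beyond the last window.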

Python Inference

from ecombert_dc import EcomBertDocumentClassifier

clf = EcomBertDocumentClassifier("models/ecombert-dc-v1")
print(clf.predict("新品刚上架，Vine和Coupon应该怎么配合启动？", top_k=3))

Files

infer.py: command-line inference entrypoint
ecombert_dc/: custom model and inference pipeline
models/ecombert-dc-v1/model.safetensors: fused backbone and classification-head weights
models/ecombert-dc-v1/backbone_config.json: mmBERT-small backbone structure configuration
models/ecombert-dc-v1/model_config.json: classifier structure configuration
models/ecombert-dc-v1/train_config.json: training and inference defaults
models/ecombert-dc-v1/label2id.json / id2label.json: secondary-category mappings
models/ecombert-dc-v1/category2id.json / id2category.json: primary-category mappings
models/ecombert-dc-v1/tokenizer.json: mmBERT tokenizer
models/ecombert-dc-v1/metrics.json: validation metrics saved with the best checkpoint
models/ecombert-dc-v1/test_metrics.json: metrics on the fixed test set
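
The paired mapping files can be sanity-checked after download with a small script. This sketch assumes the common convention that label2id.json maps label strings to integer ids and id2label.json maps ids (stored as JSON string keys) back to labels; the actual file layout is not documented here, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a UTF-8 JSON file into a Python object."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def maps_are_inverse(label_to_id, id_to_label):
    """Return True if id_to_label maps every id back to its original label."""
    if len(label_to_id) != len(id_to_label):
        return False
    normalized = {int(key): value for key, value in id_to_label.items()}
    return all(normalized.get(idx) == label for label, idx in label_to_id.items())


# Example usage against the repository files (paths taken from the list above):
# model_dir = Path("models/ecombert-dc-v1")
# ok = maps_are_inverse(load_json(model_dir / "label2id.json"),
#                       load_json(model_dir / "id2label.json"))
```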

License

This project is released under a custom non-commercial license. See LICENSE for the full terms.

Unless you have obtained prior written authorization from the author, you may not directly or indirectly use this repository, model, weights, code, outputs, or derivative works for commercial activities or any profit-making activities.

Limitations

This model is designed for business classification over cross-border e-commerce text. Generalization to other domains should be evaluated separately. Some category boundaries naturally overlap, so high-risk workflows should combine the model with human review or confidence thresholds.
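
A confidence threshold for routing uncertain predictions to human review might look like the sketch below. The input format (a list of label and probability pairs sorted by descending probability), the function name route_prediction, and the threshold values are all assumptions; the return format of clf.predict is not documented here, and suitable thresholds would need to be tuned on held-out data.

```python
def route_prediction(ranked, min_confidence=0.6, min_margin=0.15):
    """Decide whether to accept the top prediction or send it to review.

    ranked: list of (label, probability) pairs, highest probability first.
    """
    if not ranked:
        raise ValueError("ranked must contain at least one prediction")
    top_label, top_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    # Require both a confident top score and a clear gap to the runner-up,
    # since overlapping categories tend to produce close top-2 scores.
    confident = top_prob >= min_confidence and (top_prob - runner_up_prob) >= min_margin
    return {"label": top_label, "decision": "auto" if confident else "review"}
```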
