
training data licenses

#3
by weizhong1 - opened

Hi Team,

Thank you for releasing the pai-bert-tiny-zh model and the EasyNLP toolkit. It has been very helpful for downstream Chinese NLP tasks.

I would like to inquire about the training corpus used for pretraining pai-bert-tiny-zh, specifically regarding:

  1. The source(s) of the training corpus (e.g., public datasets, proprietary datasets, web-scraped corpora, or internal datasets).
  2. Whether the training corpus has identifiable dataset names that can be cited or referenced.
  3. The licensing terms associated with the corpus, especially in relation to commercial usage scenarios.
  4. Whether Alibaba PAI can provide formal documentation or clarification on dataset licensing if required for compliance purposes.

I noticed that the model itself is released under the Apache-2.0 license on Hugging Face, but the licenses of the training data are not explicitly described. For users considering deployment in commercial or regulated environments, clarity on data provenance and licensing is important.

Any additional documentation, links, dataset descriptions, or clarification would be greatly appreciated.

Thank you in advance.