
training data licenses

#3
by weizhong1 - opened

Hi Team,

Thank you for releasing the pai-bert-tiny-zh model and the EasyNLP toolkit. It has been very helpful for downstream Chinese NLP tasks.

I would like to inquire about the training corpus used for pretraining pai-bert-tiny-zh, specifically regarding:

  1. The source(s) of the training corpus (e.g., public datasets, proprietary datasets, web-scraped corpora, or internal datasets).
  2. Whether the training corpus has identifiable dataset names that can be cited or referenced.
  3. The licensing terms associated with the corpus, especially in relation to commercial usage scenarios.
  4. Whether Alibaba PAI can provide formal documentation or clarification on dataset licensing if required for compliance purposes.

I noticed that the model itself is released under the Apache-2.0 license on Hugging Face, but the licenses of the training data are not explicitly described. For users considering deployment in commercial or regulated environments, clarity on data provenance and licensing is important.

Any additional documentation, links, dataset descriptions, or clarification would be greatly appreciated.

Thank you in advance.