Youngja Park committed on
Update README.md

README.md CHANGED
```diff
@@ -9,23 +9,28 @@ model-index:
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->

-# 
+# CTI-BERT

-
+CTI-BERT is a pre-trained language model for the cybersecurity domain.
+The model was trained on a large corpus of security-related text, comprising approximately 1.2 billion tokens drawn from
+a diverse range of sources, including security news articles, vulnerability descriptions, books, academic publications, and security-related Wikipedia pages.

-
+For additional technical details and the model's performance metrics, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).

-More information needed

-## 
+## Model description

-
+This model has a vocabulary of 50,000 tokens and a sequence length of 256.
+Both the tokenizer and the BERT model were trained from scratch using the [run_mlm script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py)
+with the masked language modeling (MLM) objective.

-## Training and evaluation data

-
+## Intended uses & limitations

-
+You can use the model for masked language modeling or token embedding generation, but it is primarily intended to be fine-tuned on a downstream task, such as
+sequence classification, text classification, or question answering.
+
+The model has shown improved performance on various cybersecurity text-classification tasks. However, it is not designed to serve as the main model for general-domain text.

 ### Training hyperparameters

```
```diff
@@ -41,9 +46,6 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_steps: 10000
 - training_steps: 200000

-### Training results
-
-

 ### Framework versions

```
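The "Intended uses" section added above says the model supports masked language modeling and token-embedding generation out of the box. Below is a minimal sketch of both modes with the `transformers` library; the `ibm/CTI-BERT` model id is a placeholder assumption rather than something stated in the commit, so substitute this repository's actual id.

```python
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

model_id = "ibm/CTI-BERT"  # placeholder id; replace with this repository's actual model id

# Masked language modeling: predict the masked token in a security-domain sentence.
fill = pipeline("fill-mask", model=model_id)
for pred in fill("The attacker used a phishing [MASK] to deliver the payload."):
    print(pred["token_str"], round(pred["score"], 3))

# Token embedding generation: use the last hidden states as contextual embeddings,
# staying within the 256-token sequence length noted in the card.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
inputs = tokenizer("CVE-2021-44228 is a remote code execution flaw.",
                   return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
print(embeddings.shape)
```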
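The card also directs users toward fine-tuning on downstream tasks such as sequence classification. A hedged sketch with the standard `Trainer` API follows; the CSV files, label count, and training arguments are illustrative assumptions, not part of the model card.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "ibm/CTI-BERT"  # placeholder id; replace with this repository's actual model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Any text-classification dataset with "text" and "label" columns works here;
# the file names are hypothetical.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    # Truncate to the model's 256-token sequence length noted in the card.
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cti-bert-finetuned",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```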