| --- |
| license: apache-2.0 |
| language: |
| - ru |
| base_model: |
| - ai-forever/ruBert-large |
| tags: |
| - difficulty |
| - cefr |
| - regression |
| --- |
| # Model Card for a Russian Text Difficulty Regression Model |
|
|
| A regression model that predicts a difficulty score for an input Russian text. The predicted scores can be mapped to CEFR levels (A1 to C2). |
|
|
|
|
| ## Model Details |
|
|
| The model consists of frozen BERT-large layers (base model: `ai-forever/ruBert-large`) with a regression head on top. It was trained on a mix of manually annotated datasets (more details on the data will follow). |
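
The frozen-encoder setup described above can be sketched as follows. This is a minimal illustration, not the training code: `TinyRegressor` and `freeze_encoder` are hypothetical names, and a small linear layer stands in for the BERT encoder so that the snippet runs without downloading any weights.

```python
import torch
from torch import nn


class TinyRegressor(nn.Module):
    """Hypothetical stand-in: `encoder` plays the role of the frozen BERT layers."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, hidden_size)  # placeholder for BERT
        self.pre_classifier = nn.Linear(hidden_size, 8)
        self.classifier = nn.Linear(8, 1)

    def forward(self, x):
        hidden = torch.relu(self.pre_classifier(self.encoder(x)))
        return self.classifier(hidden)


def freeze_encoder(model: nn.Module, prefix: str = "encoder.") -> int:
    """Disable gradients for parameters under `prefix`; return the trainable count."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = TinyRegressor()
trainable = freeze_encoder(model)
print("Trainable parameters after freezing:", trainable)
```

For the `CustomModel` class shown in this card, the corresponding prefix would be `bert.`, since the encoder is stored as `self.bert`.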
|
|
|
|
|
|
| ## How to Get Started with the Model |
|
|
| Use the code below to get started with the model. |
|
|
| ``` |
| import torch |
| from torch import nn |
| from transformers import AutoConfig, AutoTokenizer, BertModel, BertPreTrainedModel |
| |
| |
| class CustomModel(BertPreTrainedModel): |
|     """BERT encoder with a two-layer regression head on the [CLS] token.""" |
| |
|     def __init__(self, config, load_path=None, use_auth_token: str = None): |
|         # load_path and use_auth_token are accepted but not used in this snippet |
|         super().__init__(config) |
|         self.bert = BertModel(config) |
|         self.pre_classifier = nn.Linear(config.hidden_size, 128) |
|         self.dropout = nn.Dropout(0.2) |
|         self.classifier = nn.Linear(128, 1) |
| |
|         # Apply Xavier initialization to the head; biases start at zero |
|         nn.init.xavier_uniform_(self.pre_classifier.weight) |
|         nn.init.xavier_uniform_(self.classifier.weight) |
|         if self.pre_classifier.bias is not None: |
|             nn.init.constant_(self.pre_classifier.bias, 0) |
|         if self.classifier.bias is not None: |
|             nn.init.constant_(self.classifier.bias, 0) |
| |
|     def forward( |
|         self, |
|         input_ids, |
|         labels=None, |
|         attention_mask=None, |
|         token_type_ids=None, |
|         position_ids=None, |
|     ): |
|         outputs = self.bert( |
|             input_ids, |
|             attention_mask=attention_mask, |
|             token_type_ids=token_type_ids, |
|             position_ids=position_ids, |
|         ) |
| |
|         # outputs[0] is the last hidden state; position 0 is the [CLS] token |
|         pooled_output = outputs[0][:, 0] |
|         pooled_output = self.pre_classifier(pooled_output) |
|         pooled_output = torch.relu(pooled_output) |
|         pooled_output = self.dropout(pooled_output) |
|         logits = self.classifier(pooled_output) |
| |
|         # Returns (loss, logits); loss is None when no labels are given |
|         if labels is not None: |
|             loss_fn = nn.MSELoss() |
|             loss = loss_fn(logits.view(-1), labels.view(-1)) |
|             return loss, logits |
|         return None, logits |
| |
| |
| model_path = "path/to/model"  # placeholder: local directory containing the model files |
| text = "Это простой пример текста."  # placeholder input text |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| |
| tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
| config = AutoConfig.from_pretrained(model_path, trust_remote_code=True) |
| config.num_labels = 1 |
| |
| model = CustomModel(config) |
| model.load_state_dict(torch.load(f"{model_path}/pytorch_model.bin", map_location=device)) |
| model.to(device) |
| model.eval()  # disable dropout for inference |
| |
| inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True) |
| inputs = {key: value.to(device) for key, value in inputs.items()} |
| |
| with torch.no_grad(): |
|     _, logits = model( |
|         input_ids=inputs["input_ids"], |
|         attention_mask=inputs["attention_mask"], |
|         token_type_ids=inputs["token_type_ids"], |
|     ) |
| ``` |
|
|
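| To map the predicted score to a CEFR level, use: |
| ``` |
| reg2cl2 = {'1.0': 'A1', '1.5': 'A12', '2.0': 'A2', '2.5': 'A2', '3.0': 'B1', '3.5': 'B12', '4.0': 'B2', '4.5': 'B2', '5.0': 'C1', '5.5': 'C12', '6.0': 'C2', '0.0': 'A1'} |
| score = logits.item() |
| # round() returns a whole number, so only the integer keys ('0.0' to '6.0') are looked up; scores outside that range raise KeyError |
| print("Predicted output (logits):", score, reg2cl2[str(float(round(score)))]) |
| ``` |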
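
If the half-step labels such as `A12` are wanted, one option is to clamp the score to the range covered by the lookup table and round it to the nearest 0.5 before the lookup. The sketch below assumes that this is the intended use of the half-step keys; `score_to_cefr` is a hypothetical helper and not part of the released code. Scores below 1.0 are treated as `A1`, in line with the `'0.0': 'A1'` entry.

```python
import math

# Same lookup table as in the snippet above (string keys kept as-is).
reg2cl2 = {'1.0': 'A1', '1.5': 'A12', '2.0': 'A2', '2.5': 'A2',
           '3.0': 'B1', '3.5': 'B12', '4.0': 'B2', '4.5': 'B2',
           '5.0': 'C1', '5.5': 'C12', '6.0': 'C2', '0.0': 'A1'}


def score_to_cefr(score: float) -> str:
    """Map a raw regression score to a CEFR label (hypothetical helper)."""
    # Clamp to [1.0, 6.0] so that every rounded value has a key in the table.
    clamped = min(max(score, 1.0), 6.0)
    # floor(2x + 0.5) / 2 rounds to the nearest 0.5, with ties rounded up.
    nearest_half = math.floor(clamped * 2 + 0.5) / 2
    return reg2cl2[str(nearest_half)]


for example_score in (-0.7, 1.4, 3.76, 7.2):
    print(example_score, score_to_cefr(example_score))
```

Clamping avoids the `KeyError` that an out-of-range prediction would otherwise cause.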
|
|
|
|
|
|
| ## Training Details |
|
|
|
|
| ### Training Hyperparameters |
|
|
| + learning_rate: 3e-4 |
| + num_train_epochs: 15.0 |
| + batch_size: 32 |
| + weight_decay: 0.1 |
| + adam_beta1: 0.9 |
| + adam_beta2: 0.99 |
| + adam_epsilon: 1e-8 |
| + max_grad_norm: 1.0 |
| + fp16: True |
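
The card does not state which training loop or optimizer was used; the parameter names resemble those of the Hugging Face `Trainer`. As a rough illustration only, the listed values could map onto a plain PyTorch setup as shown below. AdamW is an assumption, the small `nn.Linear` head and the random batch are placeholders, and `fp16` is left out because mixed precision requires a GPU.

```python
import torch
from torch import nn

torch.manual_seed(0)
head = nn.Linear(16, 1)  # placeholder for the trainable regression head

# Optimizer settings taken from the hyperparameter list (AdamW itself is assumed).
optimizer = torch.optim.AdamW(
    head.parameters(),
    lr=3e-4,
    betas=(0.9, 0.99),
    eps=1e-8,
    weight_decay=0.1,
)

features = torch.randn(32, 16)  # batch_size: 32, random stand-in features
targets = torch.rand(32) * 6    # random stand-in difficulty scores

# One training step with MSE loss and gradient clipping (max_grad_norm: 1.0).
loss = nn.MSELoss()(head(features).view(-1), targets)
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm_(head.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

In a real run, only the parameters that are not frozen would be passed to the optimizer, and the step would be repeated over 15 epochs.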
|
|
|
|
|
|
| ## Evaluation on test set |
|
|
|
|
|  |
|
|
| ## Citation |
|
|
| Please cite this repository when using the model. |