---
license: mit
base_model: answerdotai/ModernBERT-base
tags:
- modernbert
- entity-infilling
- text-summarization
- masked-modeling
- pytorch
library_name: transformers
datasets:
- cnn_dailymail
model-index:
- name: Glazkov/sum-entity-infilling
  results:
  - task:
      type: entity-infilling
      name: Entity Infilling
    dataset:
      name: cnn_dailymail
      type: cnn_dailymail
    metrics:
    - name: Entity Recall
      type: entity_recall
      value: TBD
---

# Glazkov/sum-entity-infilling

This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) trained on the [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail) dataset for entity infilling tasks.

## Model Description

The model reconstructs masked entities in text using the accompanying summary as context. It was trained with a masked-modeling objective: entities in the source text are replaced with `<mask>` tokens, and the model learns to predict the original entities conditioned on both the masked text and the summary.
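
To illustrate the data format, entity masking during preprocessing might look like the sketch below. This is a hypothetical simplification (the published card does not include the preprocessing code); a real pipeline would mask by character spans from an NER step rather than by plain string replacement.

```python
def mask_entities(text: str, entities: list[str], mask_token: str = "<mask>") -> str:
    """Replace each occurrence of the given entity strings with the mask token.

    Simplified sketch: string replacement stands in for span-based masking.
    """
    masked = text
    for entity in entities:
        masked = masked.replace(entity, mask_token)
    return masked

text = "Palestine officially became the 123rd member of the International Criminal Court."
print(mask_entities(text, ["Palestine", "International Criminal Court"]))
# -> <mask> officially became the 123rd member of the <mask>.
```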

## Intended Uses & Limitations

**Intended Uses:**
- Entity reconstruction in summarization
- Text completion and infilling
- Research in masked language modeling
- Educational purposes

**Limitations:**
- Trained primarily on news article data
- May not perform well on highly technical or domain-specific content
- Performance varies with entity length and context

## Training Details

### Training Procedure

The model was fine-tuned from ModernBERT-base on CNN/DailyMail; see the [Training Configuration](#training-configuration) section below for the setup.

### Evaluation Results
The model was evaluated using entity recall metrics on a validation set from the CNN/DailyMail dataset.

**Metrics:**
- Entity Recall: Percentage of correctly reconstructed entities
- Token Accuracy: Token-level prediction accuracy
- Exact Match: Full sequence reconstruction accuracy
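
A minimal sketch of how entity recall and exact match could be computed from predicted and gold entity lists. The helper below is hypothetical, not the project's actual evaluation code, and assumes entities are compared by exact string match in mask order.

```python
def entity_metrics(predictions: list[list[str]], references: list[list[str]]) -> dict:
    """Compute entity recall and exact match over a batch of examples.

    predictions[i] / references[i] hold the entity strings for example i,
    in mask order. Entity recall is the fraction of gold entities predicted
    exactly; exact match requires every entity in an example to be correct.
    """
    total_entities = correct_entities = exact_matches = 0
    for pred, ref in zip(predictions, references):
        hits = sum(p == r for p, r in zip(pred, ref))
        total_entities += len(ref)
        correct_entities += hits
        exact_matches += int(len(pred) == len(ref) and hits == len(ref))
    return {
        "entity_recall": correct_entities / max(total_entities, 1),
        "exact_match": exact_matches / max(len(references), 1),
    }

m = entity_metrics([["ICC", "Palestine"]], [["ICC", "Gaza"]])
# entity_recall = 0.5, exact_match = 0.0
```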

## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# EntityInfillingInference ships with the training repository, not with
# transformers; make sure the repo's `src` package is on your PYTHONPATH.
from src.train.inference import EntityInfillingInference

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Glazkov/sum-entity-infilling")
model = AutoModelForMaskedLM.from_pretrained("Glazkov/sum-entity-infilling")

# Initialize the inference helper
inference = EntityInfillingInference(
    model_path="Glazkov/sum-entity-infilling",
    device="cuda",  # or "cpu"
)

# Example inference: the summary provides context for the masked entity
summary = "Membership gives the ICC jurisdiction over alleged crimes..."
masked_text = "<mask> officially became the 123rd member of the International Criminal Court..."

predictions = inference.predict_masked_entities(
    summary=summary,
    masked_text=masked_text,
)
```
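
Assuming the predictions come back as a list of entity strings in mask order (the return type of `predict_masked_entities` is not documented here, so this is an assumption), stitching them back into the masked text could look like:

```python
def fill_masks(masked_text: str, entities: list[str], mask_token: str = "<mask>") -> str:
    """Replace each mask token, left to right, with the corresponding entity."""
    parts = masked_text.split(mask_token)
    if len(parts) - 1 != len(entities):
        raise ValueError("number of entities must match number of mask tokens")
    out = [parts[0]]
    for entity, tail in zip(entities, parts[1:]):
        out.append(entity)
        out.append(tail)
    return "".join(out)

print(fill_masks("<mask> joined the <mask>.", ["Palestine", "ICC"]))
# -> Palestine joined the ICC.
```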

## Training Configuration

This model was trained using the following configuration:
- Base Model: answerdotai/ModernBERT-base
- Dataset: cnn_dailymail
- Task: Entity Infilling
- Framework: PyTorch with Accelerate
- Training Date: 2025-10-17

For more details about the training process, see the [training configuration](training_config.txt) file.

## Model Architecture

The model uses the ModernBERT-base architecture with:
- 22 transformer layers
- Hidden size: 768
- Vocabulary: extended with `<mask>` token support
- Maximum sequence length during training: 512 tokens

## Acknowledgments

- [Hugging Face Transformers](https://github.com/huggingface/transformers) for the model architecture
- [CNN/DailyMail dataset](https://huggingface.co/datasets/cnn_dailymail) for training data
- [Answer.AI](https://huggingface.co/answerdotai) for the ModernBERT base model

## License

This model is licensed under the MIT License.