# tokenizers-training/src/tokenizers_analysis.py
def calculate_oov(text, vocabulary):
    """Return the fraction of whitespace-separated words not found in `vocabulary`."""
    # split() (no argument) handles repeated spaces, tabs, and newlines without
    # producing empty-string "words", unlike split(' ')
    words = text.split()
    if not words:
        return 0.0  # avoid ZeroDivisionError on empty or whitespace-only text
    oov_count = sum(1 for word in words if word not in vocabulary)
    return oov_count / len(words)
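A quick sanity check of the function above (the sample sentence and vocabulary are illustrative, not from the repo; a `set` is used for O(1) membership tests):

```python
def calculate_oov(text, vocabulary):
    """Return the fraction of whitespace-separated words not found in `vocabulary`."""
    words = text.split()
    if not words:
        return 0.0
    return sum(1 for word in words if word not in vocabulary) / len(words)

vocab = {"the", "cat", "sat"}
rate = calculate_oov("the cat sat on the mat", vocab)
# 2 of 6 words ("on", "mat") are out of vocabulary -> 2/6
print(round(rate, 3))  # 0.333
```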