|
|
---
language:
- is
tags:
- nlp
- pos
library_name: transformers
paper: https://arxiv.org/abs/2201.05601
---
|
|
## Prediction Methods |
|
|
|
|
|
The model provides four prediction methods:
|
|
|
|
|
- **`prepare_inputs(words, tokenizer, truncate=False)`**: Prepares inputs for a single list of words, returning tensors without batch dimension. |
|
|
- **`predict_labels(input_ids, attention_mask, word_mask)`**: Low-level prediction from prepared tensors with batch dimension. |
|
|
- **`predict_labels_from_text(sentences, tokenizer, truncate=False)`**: Returns structured predictions as (category, [attributes]) tuples from word lists. This structured format is more readable than compact tag strings and easier to process programmatically.
|
|
- **`predict_ifd_labels_from_text(sentences, tokenizer, truncate=False)`**: Returns predictions in IFD (Icelandic Frequency Dictionary) format from word lists, e.g. `("fp", ["1", "sing", "nom"])` becomes the single tag string `fp1en`. Use this for evaluation against MIM-GOLD datasets or when you need compatibility with traditional Icelandic POS taggers.
|
|
|
|
|
All methods accept pre-tokenized word lists rather than raw sentences for better control over tokenization. |
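
In the examples below, `str.split()` is enough because punctuation in the example sentence is already separated by spaces. For raw text you need to produce the word list yourself; the following is only a naive regex sketch of that step, not a substitute for a real Icelandic tokenizer (which would also handle abbreviations, numbers, and other cases this ignores):

```python
import re

def naive_word_split(text: str) -> list[str]:
    # Toy pre-tokenizer: runs of word characters become words, and every
    # remaining non-space character becomes its own token. Illustration only.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_word_split("Ég veit að þú kemur í kvöld til mín."))
# ['Ég', 'veit', 'að', 'þú', 'kemur', 'í', 'kvöld', 'til', 'mín', '.']
```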
|
|
|
|
|
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("mideind/IceBERT-PoS", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("mideind/IceBERT-PoS")

# Example sentence (punctuation is already separated by spaces)
sentence = "Ég veit að þú kemur í kvöld til mín ."
sentence_words = sentence.split()

# Get predictions in (category, [attributes]) format
result = model.predict_labels_from_text([sentence_words], tokenizer)
expected = [
    [
        ("fp", ["1", "sing", "nom"]),
        ("sf", ["sing", "act", "1", "pres"]),
        ("c", []),
        ("fp", ["2", "sing", "nom"]),
        ("sf", ["sing", "act", "2", "pres"]),
        ("af", []),
        ("n", ["neut", "sing", "acc"]),
        ("af", []),
        ("fp", ["1", "sing", "gen"]),
        ("pl", []),
    ]
]
assert result == expected, f"Expected {expected}, but got {result}"
print("Test passed successfully!")

# Get predictions in IFD format (for MIM-GOLD evaluation)
ifd_result = model.predict_ifd_labels_from_text([sentence_words], tokenizer)
ifd_expected = [
    ["fp1en", "sfg1en", "c", "fp2en", "sfg2en", "af", "nheo", "af", "fp1ee", "pl"]
]
assert ifd_result == ifd_expected, f"Expected {ifd_expected}, but got {ifd_result}"
print("IFD conversion test passed successfully!")

# Alternative: use prepare_inputs for single-sentence prediction.
# prepare_inputs returns tensors without a batch dimension, so add one with unsqueeze(0).
input_ids, attention_mask, word_mask = model.prepare_inputs(sentence_words, tokenizer)
single_result = model.predict_labels(input_ids.unsqueeze(0), attention_mask.unsqueeze(0), word_mask.unsqueeze(0))
assert single_result == expected, f"Expected {expected}, but got {single_result}"
print("Single sentence prediction test passed successfully!")
```
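
Because `sentences` is a list of word lists, several sentences can be tagged in one call, yielding one result list per sentence. A short sketch, continuing with the `model` and `tokenizer` loaded above and assuming the batched call behaves like the single-sentence calls (the second sentence is illustrative; its labels are not shown):

```python
batch = [
    "Ég veit að þú kemur í kvöld til mín .".split(),
    "Hún les bókina .".split(),
]
batch_results = model.predict_labels_from_text(batch, tokenizer)

# One result list per input sentence, one (category, [attributes]) tuple per word
assert len(batch_results) == len(batch)
for sent, labels in zip(batch, batch_results):
    assert len(labels) == len(sent)
```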
|
|
|
|
|
## Handling Long Sequences with Truncation |
|
|
|
|
|
By default, `truncate=False`, so inputs that exceed the model's maximum sequence length raise an error instead of being silently truncated; silent truncation causes hard-to-debug mismatches between input words and output predictions. For very long sequences you must opt in to truncation:
|
|
|
|
|
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("mideind/IceBERT-PoS", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("mideind/IceBERT-PoS")

# Create a very long sentence that exceeds the model's maximum sequence length
words = ["Þetta", "er", "mjög", "löng", "setning"] * 200  # 1000 words
print(f"Input length: {len(words)} words")

# With truncate=False this raises an error instead of silently truncating
try:
    result = model.predict_labels_from_text([words], tokenizer, truncate=False)
    print("This shouldn't print - sequence was too long!")
except Exception as e:
    print(f"Error as expected: {type(e).__name__}")

# Use truncate=True for long sequences
result_truncated = model.predict_labels_from_text([words], tokenizer, truncate=True)
print(f"Truncated result length: {len(result_truncated[0])} predictions")
print("Warning: Output length differs from input length due to truncation!")

# When using truncation, you must handle the length mismatch carefully:
# the output will have fewer predictions than input words
assert len(result_truncated[0]) < len(words), "Truncation should reduce length"
print("Truncation example completed successfully!")
```
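
If you need a prediction for every word of a long input, truncation discards information. One workaround is to split the input into chunks that each fit within the model's limit and tag them separately. The sketch below continues with `model`, `tokenizer`, and `words` from the example above; the chunk size of 128 words is an assumption chosen to stay under the subword limit, and context is lost across chunk boundaries, which can reduce accuracy near them:

```python
CHUNK_SIZE = 128  # assumed small enough to stay under the model's subword limit

chunks = [words[i:i + CHUNK_SIZE] for i in range(0, len(words), CHUNK_SIZE)]
chunk_results = model.predict_labels_from_text(chunks, tokenizer, truncate=False)

# Concatenate the per-chunk predictions back into one sequence
full_result = [label for chunk in chunk_results for label in chunk]
assert len(full_result) == len(words), "Chunking preserves one prediction per word"
print(f"Chunked predictions: {len(full_result)} for {len(words)} words")
```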
|
|
|