---
library_name: transformers
language:
- en
base_model:
- huawei-noah/TinyBERT_General_4L_312D
pipeline_tag: token-classification
---
# Model Description
Keyphrase extraction is a text-analysis technique that identifies the most important phrases in a passage of text.
The **tinyBert-keyword** model is a fine-tuned version of the huawei-noah/TinyBERT_General_4L_312D model, tailored specifically for keyphrase extraction.
**huawei-noah/TinyBERT_General_4L_312D** is a distilled version of BERT, designed to be smaller and faster for general NLP tasks.
- **Finetuned from:** huawei-noah/TinyBERT_General_4L_312D
## How to use
```python
import torch

# Use a GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import difflib
tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword")
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device)
```
```python
text = """
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information.
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language.
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner.
Types of Vision Language Models:
Visual-Bert: Visual-BERT (Bilinear Pooling for Visual Question Answering) is a popular VLM that uses a combination of visual feature extractors and language models.
LXMERT: LXMERT (Large Scale Instance and Instance-Specific Multimodal Representation Learning) is a VLM designed for visual reasoning and question answering tasks.
VL-BERT: VL-BERT (Visual Large Language Bert) is a VLM that uses a transformer-based architecture to model visual and textual representations.
"""
```
```python
id2label = model.config.id2label

tokenized = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_offsets_mapping=True,
    return_tensors="pt"
)

input_ids = tokenized["input_ids"].to(device)
attention_mask = tokenized["attention_mask"].to(device)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
token_predictions = [id2label[pred.item()] for pred in predictions[0]]
```
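As an aside, the `argmax` over `dim=2` simply picks the highest-scoring label per token. A minimal sketch with made-up logits (the tensor values and label count below are illustrative, not taken from the model):

```python
import torch

# Toy (batch=1, seq_len=3, num_labels=2) logit tensor; values are made up
logits = torch.tensor([[[2.0, 0.1],
                        [0.3, 1.5],
                        [1.2, 0.4]]])

# argmax over the label dimension yields one label id per token
preds = torch.argmax(logits, dim=2)
print(preds.tolist())  # [[0, 1, 0]]
```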
```python
entities = []
current_entity = None

# Group B-/I- tagged tokens into contiguous keyphrase spans
for idx, (token, pred) in enumerate(zip(tokens, token_predictions)):
    if pred.startswith("B-"):
        if current_entity:
            entities.append(current_entity)
        current_entity = {"type": pred[2:], "start": idx, "text": token}
    elif pred.startswith("I-") and current_entity:
        current_entity["text"] += f" {token}"
    elif current_entity:
        entities.append(current_entity)
        current_entity = None

if current_entity:
    entities.append(current_entity)
```
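The grouping loop can be checked on mock predictions. The tokens and BIO labels below are illustrative, not actual model output (the real label names come from `model.config.id2label`):

```python
# Mock token/label pairs to exercise the BIO grouping logic
tokens_demo = ["[CLS]", "computer", "vision", "and", "deep", "neural", "networks", "[SEP]"]
labels_demo = ["O", "B-KEY", "I-KEY", "O", "B-KEY", "I-KEY", "I-KEY", "O"]

entities_demo = []
current = None
for idx, (tok, lab) in enumerate(zip(tokens_demo, labels_demo)):
    if lab.startswith("B-"):
        if current:
            entities_demo.append(current)
        current = {"type": lab[2:], "start": idx, "text": tok}
    elif lab.startswith("I-") and current:
        current["text"] += f" {tok}"
    elif current:
        entities_demo.append(current)
        current = None
if current:
    entities_demo.append(current)

print([e["text"] for e in entities_demo])  # ['computer vision', 'deep neural networks']
```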
```python
keywords = [entity["text"] for entity in entities]
```
```python
def clean_keyword(keyword):
    # Merge WordPiece continuation tokens ("key ##phrase" -> "keyphrase")
    return keyword.replace(" ##", "")

def find_closest_word(keyword, word_positions):
    keyword_cleaned = clean_keyword(keyword)
    best_match = None
    best_score = 0.0  # similarity ratio in [0, 1]; higher is better
    for pos, word in word_positions.items():
        score = difflib.SequenceMatcher(None, keyword_cleaned, word).ratio()
        if score > 0.8 and score > best_score:
            best_match = word
            best_score = score
    return best_match or keyword_cleaned
```
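To see what these helpers do: `clean_keyword` merges WordPiece continuation tokens, and `SequenceMatcher.ratio()` provides the similarity score compared against the 0.8 cutoff. A small self-contained check (the example strings are made up):

```python
import difflib

# Merging WordPiece continuations, as clean_keyword does
keyword = "key ##phrase extraction".replace(" ##", "")
print(keyword)  # keyphrase extraction

# Similarity ratio used against the 0.8 threshold
score = difflib.SequenceMatcher(None, "keyphrase", "Keyphrase").ratio()
print(round(score, 2))  # 0.89
```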
```python
words = text.split()
word_positions = {i: word.strip(".,") for i, word in enumerate(words)}

# Map each predicted keyphrase back to its closest surface form in the text
cleaned_keywords = []
for keyword in keywords:
    closest_word = find_closest_word(keyword, word_positions)
    cleaned_keywords.append({'text': closest_word})
```
```python
# Deduplicate case-insensitively, preserving order
unique_keywords = {}
for item in cleaned_keywords:
    key = item['text'].lower()
    if key not in unique_keywords:
        unique_keywords[key] = item

cleaned_keywords_unique = list(unique_keywords.values())

# Keep at most the first five keyphrases
final_keywords = cleaned_keywords_unique[:5]

text_values = [item['text'] for item in final_keywords]
text_values
```