---
library_name: transformers
language:
- en
base_model:
- huawei-noah/TinyBERT_General_4L_312D
pipeline_tag: token-classification
---

# Model Description
Keyphrase extraction is a text-analysis task that identifies the words and phrases that best summarize a passage.

The **tinyBert-keyword** model is a fine-tuned version of huawei-noah/TinyBERT_General_4L_312D, tailored specifically for keyphrase extraction framed as token classification.

**huawei-noah/TinyBERT_General_4L_312D** is a distilled version of BERT, designed to be smaller and faster for general NLP tasks.

- **Fine-tuned from:** huawei-noah/TinyBERT_General_4L_312D


## How to use
```python
import torch

# Run on GPU if one is available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import difflib  # used later to map WordPiece fragments back to source words

tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword")
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device)
```
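
As a side note, if you only need merged keyphrase spans, the generic `pipeline` API can wrap the objects loaded above. This is a minimal sketch, not part of the original recipe; it assumes the model's labels follow the B-/I- scheme that the decoding loop below relies on.

```python
from transformers import pipeline

# Hedged shortcut: aggregation_strategy="simple" merges B-/I- tagged tokens
# into spans. The model object was already moved to `device` above.
kp = pipeline("token-classification", model=model, tokenizer=tokenizer,
              aggregation_strategy="simple")
kp("Keyphrase extraction identifies the key terms in a paragraph.")
```

The step-by-step decoding below gives finer control over the same process.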

```python
text = """
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information.
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language.
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner.
Types of Vision Language Models:

Visual-Bert: Visual-BERT (Bilinear Pooling for Visual Question Answering) is a popular VLM that uses a combination of visual feature extractors and language models.
LXMERT: LXMERT (Large Scale Instance and Instance-Specific Multimodal Representation Learning) is a VLM designed for visual reasoning and question answering tasks.
VL-BERT: VL-BERT (Visual Large Language Bert) is a VLM that uses a transformer-based architecture to model visual and textual representations.
"""
```

```python
id2label = model.config.id2label

tokenized = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_offsets_mapping=True,
    return_tensors="pt",
)

input_ids = tokenized["input_ids"].to(device)
attention_mask = tokenized["attention_mask"].to(device)

# Inference only: no gradients needed
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs.logits, dim=2)

# Map each token id back to its string and each prediction to its label
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
token_predictions = [id2label[pred.item()] for pred in predictions[0]]
```
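
The `return_offsets_mapping=True` flag requested above is not actually used in the decoding below. As an optional aside (a sketch, not the original method), the offsets can recover exact character spans from the source text, which sidesteps the WordPiece cleanup performed later:

```python
# Sketch: recover character-level spans via the offset mapping
offsets = tokenized["offset_mapping"][0].tolist()
spans, open_span = [], False
for (start, end), label in zip(offsets, token_predictions):
    if start == end:
        continue  # special tokens ([CLS], [SEP]) map to empty spans
    if label.startswith("B-"):
        spans.append([start, end])
        open_span = True
    elif label.startswith("I-") and open_span:
        spans[-1][1] = end  # extend the current span to this token's end
    else:
        open_span = False
print([text[s:e] for s, e in spans])
```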

```python
entities = []
current_entity = None

# Group B-/I- tagged tokens into entities; any other tag closes the current one
for idx, (token, pred) in enumerate(zip(tokens, token_predictions)):
    if pred.startswith("B-"):
        if current_entity:
            entities.append(current_entity)
        current_entity = {"type": pred[2:], "start": idx, "text": token}
    elif pred.startswith("I-") and current_entity:
        current_entity["text"] += f" {token}"
    elif current_entity:
        entities.append(current_entity)
        current_entity = None

# Flush the last open entity, if any
if current_entity:
    entities.append(current_entity)
```
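
Each grouped entity carries its label type, the index of its first token, and the space-joined WordPiece text (still containing `##` markers, cleaned up below):

```python
# Inspect the first few grouped entities
for ent in entities[:3]:
    print(ent["type"], ent["start"], ent["text"])
```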

```python
keywords = [entity["text"] for entity in entities]
```

```python
def clean_keyword(keyword):
    # Re-join WordPiece fragments: "key ##ph ##rase" -> "keyphrase"
    return keyword.replace(" ##", "")

def find_closest_word(keyword, word_positions):
    # Fuzzy-match the cleaned keyword against words from the source text
    keyword_cleaned = clean_keyword(keyword)
    best_match = None
    best_score = 0.0

    for word in word_positions.values():
        score = difflib.SequenceMatcher(None, keyword_cleaned, word).ratio()
        if score > 0.8 and score > best_score:
            best_match = word
            best_score = score

    return best_match or keyword_cleaned
```
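
For illustration, here is how the two helpers behave on hypothetical inputs (the token string and candidate word below are made up for this example):

```python
print(clean_keyword("key ##ph ##rase"))   # -> keyphrase
# SequenceMatcher ratio of "keyphrase" vs "keyphrases" is ~0.95, above the 0.8 cutoff
print(find_closest_word("key ##ph ##rase", {0: "keyphrases"}))  # -> keyphrases
```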

```python
# Index every whitespace-separated word of the source text, stripping edge punctuation
words = text.split()
word_positions = {i: word.strip(".,") for i, word in enumerate(words)}

cleaned_keywords = []
for keyword in keywords:
    closest_word = find_closest_word(keyword, word_positions)
    cleaned_keywords.append({'text': closest_word})
```

```python
# Deduplicate case-insensitively, keeping the first occurrence.
# Note: use a separate variable here, not `text`, which still holds the source passage.
unique_keywords = {}
for item in cleaned_keywords:
    key = item['text'].lower()
    if key not in unique_keywords:
        unique_keywords[key] = item

cleaned_keywords_unique = list(unique_keywords.values())

# Keep at most the first five keyphrases (slicing handles shorter lists)
final_keywords = cleaned_keywords_unique[:5]

text_values = [item['text'] for item in final_keywords]
text_values
```