|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- huawei-noah/TinyBERT_General_4L_312D |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# Model Description |
|
|
Keyphrase extraction is a text-analysis technique for identifying the phrases that best summarize a piece of text.
|
|
|
|
|
The **tinyBert-keyword** model is a fine-tuned version of the huawei-noah/TinyBERT_General_4L_312D model, tailored specifically for keyphrase extraction.
|
|
|
|
|
**huawei-noah/TinyBERT_General_4L_312D** is a distilled version of BERT, specifically designed to be smaller and faster for general NLP tasks. |
|
|
|
|
|
- **Finetuned from:** huawei-noah/TinyBERT_General_4L_312D |
|
|
|
|
|
|
|
|
## How to use |
|
|
```python |
|
|
import torch |
|
|
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the GPU when available
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForTokenClassification |
|
|
import difflib |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword") |
|
|
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device) |
|
|
``` |
|
|
|
|
|
```python |
|
|
text = """ |
|
|
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information. |
|
|
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language. |
|
|
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner. |
|
|
Types of Vision Language Models: |
|
|
|
|
|
VisualBERT: VisualBERT is a popular VLM that uses a combination of visual feature extractors and language models.
|
|
LXMERT: LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a VLM designed for visual reasoning and question answering tasks.
|
|
VL-BERT: VL-BERT (Visual-Linguistic BERT) is a VLM that uses a transformer-based architecture to model visual and textual representations.
|
|
""" |
|
|
``` |
|
|
|
|
|
```python |
|
|
id2label = model.config.id2label |
|
|
|
|
|
tokenized = tokenizer( |
|
|
text, |
|
|
padding=True, |
|
|
truncation=True, |
|
|
return_offsets_mapping=True, |
|
|
return_tensors="pt" |
|
|
) |
|
|
|
|
|
input_ids = tokenized["input_ids"].to(device) |
|
|
attention_mask = tokenized["attention_mask"].to(device) |
|
|
with torch.no_grad():  # inference only; no gradients needed


    outputs = model(input_ids=input_ids, attention_mask=attention_mask)


predictions = torch.argmax(outputs.logits, dim=2)
|
|
|
|
|
tokens = tokenizer.convert_ids_to_tokens(input_ids[0]) |
|
|
token_predictions = [id2label[pred.item()] for pred in predictions[0]] |
|
|
``` |
|
|
|
|
|
```python |
|
|
entities = [] |
|
|
current_entity = None |
|
|
|
|
|
for idx, (token, pred) in enumerate(zip(tokens, token_predictions)): |
|
|
if pred.startswith("B-"): |
|
|
if current_entity: |
|
|
entities.append(current_entity) |
|
|
current_entity = {"type": pred[2:], "start": idx, "text": token} |
|
|
elif pred.startswith("I-") and current_entity: |
|
|
current_entity["text"] += f" {token}" |
|
|
elif current_entity: |
|
|
entities.append(current_entity) |
|
|
current_entity = None |
|
|
|
|
|
if current_entity: |
|
|
entities.append(current_entity) |
|
|
``` |
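The grouping loop above can be sanity-checked on a toy token/label sequence. The helper and the `B-KEY`/`I-KEY` labels below are hypothetical, for illustration only; the model's actual label names come from `id2label`:

```python
def group_bio(tokens, labels):
    """Group B-/I- tagged tokens into entity spans (same logic as the loop above)."""
    entities, current = [], None
    for idx, (token, label) in enumerate(zip(tokens, labels)):
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": label[2:], "start": idx, "text": token}
        elif label.startswith("I-") and current:
            current["text"] += f" {token}"
        elif current:
            entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


print(group_bio(["computer", "vision", "is", "fun"],
                ["B-KEY", "I-KEY", "O", "O"]))
# → [{'type': 'KEY', 'start': 0, 'text': 'computer vision'}]
```

Note that an `I-` tag without a preceding `B-` is ignored, and an `O` tag closes the current span.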
|
|
|
|
|
```python |
|
|
keywords = [entity['text'] for entity in entities]
|
|
``` |
|
|
|
|
|
```python |
|
|
def clean_keyword(keyword): |
|
|
return keyword.replace(" ##", "") |
|
|
|
|
|
def find_closest_word(keyword, word_positions): |
|
|
keyword_cleaned = clean_keyword(keyword) |
|
|
best_match = None |
|
|
    best_score = 0.0  # ratio() is higher for a better match, so start at the minimum


    for pos, word in word_positions.items():


        score = difflib.SequenceMatcher(None, keyword_cleaned, word).ratio()


        if score > 0.8 and score > best_score:


            best_match = word


            best_score = score
|
|
|
|
|
return best_match or keyword_cleaned |
|
|
``` |
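Because WordPiece splits rare words into subword pieces joined by spaces in the entity text, a raw keyword can look like `trans ##formers`. A quick illustration of the cleanup and of `SequenceMatcher.ratio()` (the strings here are toy examples, not model output):

```python
import difflib

raw = "trans ##formers"
cleaned = raw.replace(" ##", "")  # same rule as clean_keyword
print(cleaned)  # → transformers

# ratio() returns a similarity in [0, 1]; the 0.8 threshold above
# discards weak matches before the best one is kept
print(round(difflib.SequenceMatcher(None, "transformers", "transformer").ratio(), 2))
# → 0.96
```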
|
|
|
|
|
```python |
|
|
words = text.split() |
|
|
word_positions = {i: word.strip(".,") for i, word in enumerate(words)} |
|
|
|
|
|
cleaned_keywords = [] |
|
|
for keyword in keywords: |
|
|
closest_word = find_closest_word(keyword, word_positions) |
|
|
cleaned_keywords.append({'text': closest_word}) |
|
|
``` |
|
|
|
|
|
```python |
|
|
unique_keywords = {} |
|
|
for item in cleaned_keywords: |
|
|
    kw = item['text'].lower()  # use a new name so the input `text` isn't shadowed


    if kw not in unique_keywords:


        unique_keywords[kw] = item
|
|
|
|
|
cleaned_keywords_unique = list(unique_keywords.values()) |
|
|
|
|
|
final_keywords = cleaned_keywords_unique[:5]  # keep at most the first five keywords
|
|
|
|
|
text_values = [item['text'] for item in final_keywords] |
|
|
text_values |
|
|
``` |
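End to end, the de-duplication and truncation steps above behave like this on toy data (the keywords below are hypothetical, not model output):

```python
def dedupe_and_cap(keyword_items, limit=5):
    """Case-insensitive de-duplication keeping the first occurrence, capped at `limit`."""
    seen = {}
    for item in keyword_items:
        key = item['text'].lower()
        if key not in seen:  # dicts preserve insertion order (Python 3.7+)
            seen[key] = item
    return [item['text'] for item in seen.values()][:limit]


toy = [{'text': 'NLP'}, {'text': 'nlp'}, {'text': 'computer vision'}]
print(dedupe_and_cap(toy))  # → ['NLP', 'computer vision']
```

Keeping the first occurrence means the original casing from the text survives de-duplication.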