---
library_name: transformers
language:
- en
base_model:
- huawei-noah/TinyBERT_General_4L_312D
pipeline_tag: token-classification
---

# Model Description

Keyphrase extraction is a text-analysis technique that identifies the most important phrases in a passage of text.

The **tinyBert-keyword** model is a fine-tuned version of huawei-noah/TinyBERT_General_4L_312D, tailored for keyphrase extraction. **huawei-noah/TinyBERT_General_4L_312D** is a distilled version of BERT, designed to be smaller and faster for general NLP tasks.

- **Finetuned from:** huawei-noah/TinyBERT_General_4L_312D

## How to use

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import difflib

tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword")
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device)
```

```python
text = """
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information.
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language.
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner.
Types of Vision Language Models:
Visual-Bert: Visual-BERT (Bilinear Pooling for Visual Question Answering) is a popular VLM that uses a combination of visual feature extractors and language models.
LXMERT: LXMERT (Large Scale Instance and Instance-Specific Multimodal Representation Learning) is a VLM designed for visual reasoning and question answering tasks.
VL-BERT: VL-BERT (Visual Large Language Bert) is a VLM that uses a transformer-based architecture to model visual and textual representations.
"""
```

```python
id2label = model.config.id2label

tokenized = tokenizer(
    text,
    padding=True,
    truncation=True,
    return_offsets_mapping=True,
    return_tensors="pt"
)

input_ids = tokenized["input_ids"].to(device)
attention_mask = tokenized["attention_mask"].to(device)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
token_predictions = [id2label[pred.item()] for pred in predictions[0]]
```

```python
# Group B-/I- tagged tokens into entities
entities = []
current_entity = None

for idx, (token, pred) in enumerate(zip(tokens, token_predictions)):
    if pred.startswith("B-"):
        # A new entity starts; close the previous one if it is still open
        if current_entity:
            entities.append(current_entity)
        current_entity = {"type": pred[2:], "start": idx, "text": token}
    elif pred.startswith("I-") and current_entity:
        current_entity["text"] += f" {token}"
    elif current_entity:
        entities.append(current_entity)
        current_entity = None

if current_entity:
    entities.append(current_entity)
```

```python
keywords = [entity['text'] for entity in entities]
```

```python
def clean_keyword(keyword):
    # Merge WordPiece continuation tokens, e.g. "key ##word" -> "keyword"
    return keyword.replace(" ##", "")

def find_closest_word(keyword, word_positions):
    keyword_cleaned = clean_keyword(keyword)
    best_match = None
    best_score = 0.0

    for word in word_positions.values():
        score = difflib.SequenceMatcher(None, keyword_cleaned, word).ratio()
        # Keep the most similar word above the 0.8 threshold
        if score > 0.8 and score > best_score:
            best_match = word
            best_score = score

    # Fall back to the cleaned keyword when no word is similar enough
    return best_match or keyword_cleaned
```

```python
words = text.split()
word_positions = {i: word.strip(".,") for i, word in enumerate(words)}

cleaned_keywords = []
for keyword in keywords:
    closest_word = find_closest_word(keyword, word_positions)
    cleaned_keywords.append({'text': closest_word})
```

```python
# Deduplicate case-insensitively, keeping the first occurrence
unique_keywords = {}
for item in cleaned_keywords:
    key = item['text'].lower()
    if key not in unique_keywords:
        unique_keywords[key] = item

cleaned_keywords_unique = list(unique_keywords.values())

# Keep at most the first five keywords
final_keywords = cleaned_keywords_unique[:5]

text_values = [item['text'] for item in final_keywords]
print(text_values)
```
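
The tokenizer call above requests `return_offsets_mapping=True`, but the offsets are not used afterwards. As a possible alternative to the `difflib` matching, the sketch below recovers each keyphrase by slicing the input string with those character offsets. It is only an illustration under stated assumptions: the function name `spans_from_offsets` is hypothetical, and the `B-KEY` / `I-KEY` label names and the hand-written offsets are made up for the example (check `model.config.id2label` for the labels this model actually uses).

```python
def spans_from_offsets(text, offsets, labels):
    """Group B-/I- tagged tokens into phrases by slicing `text` with character offsets."""
    spans = []
    current = None  # [start_char, end_char] of the phrase being built

    for (tok_start, tok_end), label in zip(offsets, labels):
        if tok_start == tok_end:
            # Special tokens such as [CLS] and [SEP] map to an empty span; treat them as outside
            label = "O"
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = [tok_start, tok_end]
        elif label.startswith("I-") and current:
            # Extend the open phrase to the end of this token
            current[1] = tok_end
        else:
            if current:
                spans.append(current)
            current = None

    if current:
        spans.append(current)
    return [text[start:end] for start, end in spans]


# Hand-written example; the offsets and labels are illustrative, not model output
sample = "Vision language models process images."
offsets = [(0, 0), (0, 6), (7, 15), (16, 22), (23, 30), (31, 37), (37, 38), (0, 0)]
labels = ["O", "B-KEY", "I-KEY", "I-KEY", "O", "B-KEY", "O", "O"]

phrases = spans_from_offsets(sample, offsets, labels)
print(phrases)
```

With the real model, the inputs would presumably be `tokenized["offset_mapping"][0].tolist()` and `token_predictions`, together with the string that was passed to the tokenizer. Slicing the input string keeps the original casing and spacing, so no WordPiece cleanup is needed.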