|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- huawei-noah/TinyBERT_General_4L_312D |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# Model Description |
|
|
Keyphrase extraction is a text-analysis technique for identifying the phrases that best summarize a piece of text.
|
|
|
|
|
The **tinyBert-keyword** model is a fine-tuned version of the huawei-noah/TinyBERT_General_4L_312D model, tailored specifically for keyphrase extraction.
|
|
|
|
|
**huawei-noah/TinyBERT_General_4L_312D** is a distilled version of BERT, specifically designed to be smaller and faster for general NLP tasks. |
|
|
|
|
|
- **Finetuned from:** huawei-noah/TinyBERT_General_4L_312D |
|
|
|
|
|
|
|
|
## How to use |
|
|
```python |
|
|
import torch |
|
|
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the GPU when available
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForTokenClassification |
|
|
import difflib |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("nirusanan/tinyBert-keyword") |
|
|
model = AutoModelForTokenClassification.from_pretrained("nirusanan/tinyBert-keyword").to(device) |
|
|
``` |
|
|
|
|
|
```python |
|
|
text = """ |
|
|
Computer Vision: VLMs are trained on large datasets of images, videos, or other visual data. They use deep neural networks to extract features and represent the visual information. |
|
|
Natural Language Processing (NLP): VLMs are also trained on large datasets of text, which enables them to understand and generate natural language. |
|
|
Cross-modal Interaction: The combination of computer vision and NLP allows the VLM to interact and process both visual and textual data in a unified manner. |
|
|
Types of Vision Language Models: |
|
|
|
|
|
VisualBERT: VisualBERT is a popular VLM that uses a combination of visual feature extractors and language models.
|
|
LXMERT: LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a VLM designed for visual reasoning and question answering tasks.
|
|
VL-BERT: VL-BERT (Visual-Linguistic BERT) is a VLM that uses a transformer-based architecture to model visual and textual representations.
|
|
""" |
|
|
``` |
|
|
|
|
|
```python |
|
|
id2label = model.config.id2label |
|
|
|
|
|
tokenized = tokenizer( |
|
|
text, |
|
|
padding=True, |
|
|
truncation=True, |
|
|
return_offsets_mapping=True, |
|
|
return_tensors="pt" |
|
|
) |
|
|
|
|
|
input_ids = tokenized["input_ids"].to(device) |
|
|
attention_mask = tokenized["attention_mask"].to(device) |
|
|
with torch.no_grad():  # inference only; no gradients needed


    outputs = model(input_ids=input_ids, attention_mask=attention_mask)


predictions = torch.argmax(outputs.logits, dim=2)
|
|
|
|
|
tokens = tokenizer.convert_ids_to_tokens(input_ids[0]) |
|
|
token_predictions = [id2label[pred.item()] for pred in predictions[0]] |
|
|
``` |
|
|
|
|
|
```python |
|
|
entities = [] |
|
|
current_entity = None |
|
|
|
|
|
for idx, (token, pred) in enumerate(zip(tokens, token_predictions)): |
|
|
if pred.startswith("B-"): |
|
|
if current_entity: |
|
|
entities.append(current_entity) |
|
|
current_entity = {"type": pred[2:], "start": idx, "text": token} |
|
|
elif pred.startswith("I-") and current_entity: |
|
|
current_entity["text"] += f" {token}" |
|
|
elif current_entity: |
|
|
entities.append(current_entity) |
|
|
current_entity = None |
|
|
|
|
|
if current_entity: |
|
|
entities.append(current_entity) |
|
|
``` |
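The grouping loop above can be sanity-checked on a toy token/label sequence. The helper and the `B-KEY`/`I-KEY` labels below are hypothetical, for illustration only; the model's actual label names come from `id2label`:

```python
def group_bio(tokens, labels):
    """Group B-/I- tagged tokens into entity spans (same logic as the loop above)."""
    entities, current = [], None
    for idx, (token, label) in enumerate(zip(tokens, labels)):
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": label[2:], "start": idx, "text": token}
        elif label.startswith("I-") and current:
            current["text"] += f" {token}"
        elif current:
            entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


print(group_bio(["computer", "vision", "is", "fun"],
                ["B-KEY", "I-KEY", "O", "O"]))
# → [{'type': 'KEY', 'start': 0, 'text': 'computer vision'}]
```

Note that an `I-` tag without a preceding `B-` is ignored, and an `O` tag closes the current span.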
|
|
|
|
|
```python |
|
|
keywords = [entity['text'] for entity in entities]
|
|
``` |
|
|
|
|
|
```python |
|
|
def clean_keyword(keyword): |
|
|
return keyword.replace(" ##", "") |
|
|
|
|
|
def find_closest_word(keyword, word_positions): |
|
|
keyword_cleaned = clean_keyword(keyword) |
|
|
best_match = None |
|
|
    best_score = 0.0  # ratio() is higher for a better match, so start at the minimum


    for pos, word in word_positions.items():


        score = difflib.SequenceMatcher(None, keyword_cleaned, word).ratio()


        if score > 0.8 and score > best_score:


            best_match = word


            best_score = score
|
|
|
|
|
return best_match or keyword_cleaned |
|
|
``` |
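Because WordPiece splits rare words into subword pieces joined by spaces in the entity text, a raw keyword can look like `trans ##formers`. A quick illustration of the cleanup and of `SequenceMatcher.ratio()` (the strings here are toy examples, not model output):

```python
import difflib

raw = "trans ##formers"
cleaned = raw.replace(" ##", "")  # same rule as clean_keyword
print(cleaned)  # → transformers

# ratio() returns a similarity in [0, 1]; the 0.8 threshold above
# discards weak matches before the best one is kept
print(round(difflib.SequenceMatcher(None, "transformers", "transformer").ratio(), 2))
# → 0.96
```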
|
|
|
|
|
```python |
|
|
words = text.split() |
|
|
word_positions = {i: word.strip(".,") for i, word in enumerate(words)} |
|
|
|
|
|
cleaned_keywords = [] |
|
|
for keyword in keywords: |
|
|
closest_word = find_closest_word(keyword, word_positions) |
|
|
cleaned_keywords.append({'text': closest_word}) |
|
|
``` |
|
|
|
|
|
```python |
|
|
unique_keywords = {} |
|
|
for item in cleaned_keywords: |
|
|
    kw = item['text'].lower()  # use a new name so the input `text` isn't shadowed


    if kw not in unique_keywords:


        unique_keywords[kw] = item
|
|
|
|
|
cleaned_keywords_unique = list(unique_keywords.values()) |
|
|
|
|
|
final_keywords = cleaned_keywords_unique[:5]  # keep at most the first five keywords
|
|
|
|
|
text_values = [item['text'] for item in final_keywords] |
|
|
text_values |
|
|
``` |
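End to end, the de-duplication and truncation steps above behave like this on toy data (the keywords below are hypothetical, not model output):

```python
def dedupe_and_cap(keyword_items, limit=5):
    """Case-insensitive de-duplication keeping the first occurrence, capped at `limit`."""
    seen = {}
    for item in keyword_items:
        key = item['text'].lower()
        if key not in seen:  # dicts preserve insertion order (Python 3.7+)
            seen[key] = item
    return [item['text'] for item in seen.values()][:limit]


toy = [{'text': 'NLP'}, {'text': 'nlp'}, {'text': 'computer vision'}]
print(dedupe_and_cap(toy))  # → ['NLP', 'computer vision']
```

Keeping the first occurrence means the original casing from the text survives de-duplication.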