---
license: mit
datasets:
- wikipedia
language:
- it
widget:
- text: "milano è una <mask> dell'italia"
  example_title: "Example 1"
- text: "giacomo leopardi è stato uno dei più grandi <mask> del classicismo italiano"
  example_title: "Example 2"
- text: "la pizza è un noto simbolo della <mask> gastronomica italiana"
  example_title: "Example 3"
---
--------------------------------------------------------------------------------------------------

<body>
<span class="vertical-text" style="background-color:lightgreen;border-radius: 3px;padding: 3px;"> </span>
<br>
<span class="vertical-text" style="background-color:orange;border-radius: 3px;padding: 3px;"> </span>
<br>
<span class="vertical-text" style="background-color:lightblue;border-radius: 3px;padding: 3px;"> Model: FLARE 🔥</span>
<br>
<span class="vertical-text" style="background-color:tomato;border-radius: 3px;padding: 3px;"> Lang: IT</span>
<br>
<span class="vertical-text" style="background-color:lightgrey;border-radius: 3px;padding: 3px;"> </span>
<br>
<span class="vertical-text" style="background-color:#CF9FFF;border-radius: 3px;padding: 3px;"> </span>
</body>

--------------------------------------------------------------------------------------------------

<h3>Introduction</h3>

This model is a <b>lightweight</b> and uncased version of <b>MiniLM</b> <b>[1]</b> for the <b>Italian</b> language. Its <b>17M parameters</b> and <b>67MB</b> size make it
<b>85% lighter</b> than a typical mono-lingual BERT model. It is ideal when memory consumption and execution speed are critical but high-quality results are still required.

<h3>AILC CLiC-IT 2023 Proceedings</h3>

Flare-IT is part of the publication "Blaze-IT: a lightweight BERT model for the Italian language", which has been accepted at AILC CLiC-IT 2023 and published in the conference proceedings.
<br>
You can find the proceedings here: https://clic2023.ilc.cnr.it/proceedings/
<br>
And the published paper here: https://ceur-ws.org/Vol-3596/paper43.pdf

<h3>Model description</h3>

The model builds on <b>mMiniLMv2</b> <b>[1]</b> (from Microsoft: [L6xH384 mMiniLMv2](https://github.com/microsoft/unilm/tree/master/minilm)) as a starting point,
focusing it on the Italian language while at the same time turning it into an uncased model by modifying the embedding layer
(as in <b>[2]</b>, but computing document-level frequencies over the <b>Wikipedia</b> dataset and setting a frequency threshold of 0.1%), which brings a considerable
reduction in the number of parameters.

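To make the vocabulary-reduction step concrete, the snippet below is a minimal sketch of how such a document-level frequency filter could be computed. It is an illustration rather than the authors' exact procedure: the XLM-R tokenizer is assumed as the starting multilingual vocabulary, and the Wikipedia dump version and subsampling are arbitrary.

```python
from collections import Counter

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: the base multilingual model shares the XLM-R sentencepiece vocabulary
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Small slice of Italian Wikipedia, only to keep the sketch fast (dump version is arbitrary)
wiki = load_dataset("wikipedia", "20220301.it", split="train[:1%]")

doc_freq = Counter()
for article in wiki:
    # uncased, document-level: each token counts at most once per document
    doc_freq.update(set(tokenizer.tokenize(article["text"].lower())))

threshold = 0.001  # keep tokens appearing in at least 0.1% of documents
kept = {tok for tok, count in doc_freq.items() if count / len(wiki) >= threshold}
print(f"keeping {len(kept)} of {tokenizer.vocab_size} tokens")
```

The tokens that survive the filter would then be used to slice the embedding matrix of the base model, which is where most of the parameter reduction comes from.
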

To compensate for the deletion of cased tokens, which forces the model to rely on the lowercase representations of previously capitalized words,
the model has been further pre-trained on the Italian split of the [Wikipedia](https://huggingface.co/datasets/wikipedia) dataset, using the <b>whole word masking [3]</b> technique to make it more robust
to the new uncased representations.

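As a conceptual illustration of whole word masking (not the actual training code): all sub-word pieces belonging to a selected word are masked together, so the model cannot recover a word from its own unmasked fragments. The grouping below assumes the tokenizer follows the standard sentencepiece convention of marking word starts with "▁".

```python
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("osiria/flare-it")
pieces = tokenizer.tokenize("la pizza è un noto simbolo della gastronomia italiana")

# group sentencepiece pieces into words using the "▁" word-start marker (assumed convention)
words, current = [], []
for piece in pieces:
    if piece.startswith("▁") and current:
        words.append(current)
        current = []
    current.append(piece)
if current:
    words.append(current)

# mask roughly 15% of the words, replacing every piece of a selected word
masked = [[tokenizer.mask_token] * len(w) if random.random() < 0.15 else w for w in words]
print([p for word in masked for p in word])
```
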

The resulting model has 17M parameters, a vocabulary of 14,610 tokens, and a size of 67MB, which makes it <b>85% lighter</b> than a typical mono-lingual BERT model and
75% lighter than a standard mono-lingual DistilBERT model.

<h3>Training procedure</h3>

The model has been trained for <b>masked language modeling</b> on the Italian <b>Wikipedia</b> (~3GB) dataset for 10K steps, using the AdamW optimizer, with a batch size of 512
(obtained through 128 gradient accumulation steps), a sequence length of 512, and a linearly decaying learning rate starting from 5e-5.
The training has been performed using <b>dynamic masking</b> between epochs and exploiting the <b>whole word masking</b> technique.

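A minimal sketch of an equivalent setup with the 🤗 Trainer is shown below. The exact training scripts are not published here, so the dataset preparation, dump version, and per-device batch size are assumptions; the released checkpoint is loaded only to keep the snippet self-contained, and the standard MLM collator (which already re-draws masked positions every time a batch is built, i.e. dynamic masking) stands in for a whole-word-masking collator for brevity.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    XLMRobertaForMaskedLM,
)

tokenizer = AutoTokenizer.from_pretrained("osiria/flare-it")
model = XLMRobertaForMaskedLM.from_pretrained("osiria/flare-it")

# Assumed dataset preparation: tokenize Italian Wikipedia into 512-token sequences
wiki = load_dataset("wikipedia", "20220301.it", split="train")
tokenized = wiki.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=wiki.column_names,
)

args = TrainingArguments(
    output_dir="flare-it-mlm",
    max_steps=10_000,                 # 10K training steps
    per_device_train_batch_size=4,    # assumed; 4 x 128 accumulation = 512 effective batch size
    gradient_accumulation_steps=128,
    learning_rate=5e-5,
    lr_scheduler_type="linear",       # linearly decaying learning rate
    optim="adamw_torch",              # AdamW optimizer
)

# masked positions are re-sampled at every pass over the data (dynamic masking)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
# trainer.train()
```
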

<h3>Performances</h3>

The following metrics have been computed on the Part of Speech Tagging and Named Entity Recognition tasks, using the <b>UD Italian ISDT</b> and <b>WikiNER</b> datasets, respectively.
The PoS tagging model has been trained for 5 epochs, and the NER model for 3 epochs, both with a constant learning rate fixed at 1e-5. For Part of Speech Tagging, the metrics have been computed on the default test set
provided with the dataset, while for Named Entity Recognition the metrics have been computed with 5-fold cross-validation.

| Task | Recall | Precision | F1 |
| ------ | ------ | ------ | ------ |
| Part of Speech Tagging | 95.64 | 95.32 | 95.45 |
| Named Entity Recognition | 82.27 | 80.64 | 81.29 |

The metrics have been computed at the token level and macro-averaged over the classes.

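For reference, the snippet below sketches a comparable fine-tuning and evaluation setup: a token classification head, a constant 1e-5 learning rate, and token-level macro-averaged metrics. The label alignment between words and sub-word tokens and the dataset splits are omitted and assumed to be prepared upstream; the 17 labels correspond to the UPOS tag set used for PoS tagging.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    XLMRobertaForTokenClassification,
)

tokenizer = AutoTokenizer.from_pretrained("osiria/flare-it")
# 17 labels for UD UPOS tagging; for WikiNER, use the number of entity tags instead
model = XLMRobertaForTokenClassification.from_pretrained("osiria/flare-it", num_labels=17)

args = TrainingArguments(
    output_dir="flare-it-pos",
    num_train_epochs=5,            # 5 epochs for PoS tagging, 3 for NER
    learning_rate=1e-5,
    lr_scheduler_type="constant",  # constant learning rate
)

def compute_metrics(eval_pred):
    """Token-level precision/recall/F1, macro-averaged over the classes."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100  # ignore padding and special tokens
    p, r, f1, _ = precision_recall_fscore_support(
        labels[mask], preds[mask], average="macro", zero_division=0
    )
    return {"precision": p, "recall": r, "f1": f1}

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...,
#                   compute_metrics=compute_metrics)
# trainer.train(); trainer.evaluate()
```
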

<h3>Demo</h3>

You can try the model online (fine-tuned on named entity recognition) using this web app: https://huggingface.co/spaces/osiria/flare-it-demo

| | <h3>Quick usage</h3> |
| |
|
| | ```python |
| | from transformers import AutoTokenizer, XLMRobertaForMaskedLM |
| | from transformers import pipeline |
| | |
| | tokenizer = AutoTokenizer.from_pretrained("osiria/flare-it") |
| | model = XLMRobertaForMaskedLM.from_pretrained("osiria/flare-it") |
| | pipeline_mlm = pipeline(task="fill-mask", model=model, tokenizer=tokenizer) |
| | ``` |
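The pipeline can then be queried with a sentence containing the `<mask>` token, for example one of the widget examples above:

```python
predictions = pipeline_mlm("milano è una <mask> dell'italia")
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 3))
```
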

<h3>Limitations</h3>

This lightweight model has been further pre-trained on Wikipedia, so it is particularly suitable as an agile analyzer for large volumes of natively digital text
from the world wide web, written in a correct and fluent form (like wikis, web pages, news, etc.). However, it may show limitations when it comes to chaotic text containing errors and slang expressions
(like social media posts), or to domain-specific text (like medical, financial, or legal content).

<h3>References</h3>

[1] https://arxiv.org/abs/2012.15828

[2] https://arxiv.org/abs/2010.05609

[3] https://arxiv.org/abs/1906.08101

<h3>License</h3>

The model is released under the <b>MIT</b> license.