Fill-Mask
Transformers
Safetensors
English
roberta
law
legal
australia
Generated from Trainer
feature-extraction
Eval Results (legacy)
How to use isaacus/emubert with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="isaacus/emubert")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("isaacus/emubert")
model = AutoModelForMaskedLM.from_pretrained("isaacus/emubert")
```
Noted that the quality of EmuBert's embeddings may not be
README.md
CHANGED
````diff
@@ -9,7 +9,6 @@ tags:
 - legal
 - australia
 - generated_from_trainer
-- sentence-similarity
 - feature-extraction
 - fill-mask
 datasets:
@@ -63,14 +62,14 @@ co2_eq_emissions:
 
 EmuBert is the **largest** and **most accurate** open-source masked language model for Australian law.
 
-Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**
+Trained on 180,000 laws, regulations and decisions across six Australian jurisdictions, totalling 1.4 billion tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), EmuBert is well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including **text classification**, **named entity recognition**, **semantic similarity** and **question answering**.
 
 To ensure its accessibility to as wide an audience as possible, EmuBert is issued under the [MIT Licence](https://huggingface.co/umarbutler/emubert/blob/main/LICENCE.md).
 
 ## Usage 👩💻
 Those interested in finetuning EmuBert can check out Hugging Face's documentation for [Roberta](https://huggingface.co/roberta-base)-like models [here](https://huggingface.co/docs/transformers/en/model_doc/roberta) which very helpfully provides tutorials, scripts and other resources for the most common natural language processing tasks.
 
-It is also possible to generate embeddings from the model which can be
+It is also possible to generate embeddings directly from the model which can be used for tasks like semantic similarity and clustering, although they are unlikely to perform as well as those generated by specially trained sentence embedding models **unless** EmuBert has been finetuned. Embeddings may be generated either through [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) (ie, `m = SentenceTransformer('umarbutler/emubert'); m.encode(...)`) or via the below code snippet which, although more complicated, is also orders of magnitude faster:
 ```python
 import math
 import torch
````
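The README's own embedding snippet is cut off in this diff after its imports, so the pooling method it uses is not visible here. As a rough illustration of how per-token outputs are commonly reduced to a single embedding and then compared, the sketch below applies masked mean pooling followed by cosine similarity. It is written in plain Python on toy vectors; the function names `mean_pool` and `cosine_similarity` are hypothetical, and this is an assumption about a typical approach, not necessarily what the README's snippet does.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for per-token hidden states (assumed values, not model output).
tokens_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row plays the role of padding
mask_a = [1, 1, 0]
tokens_b = [[2.0, 2.0], [4.0, 4.0]]
mask_b = [1, 1]

emb_a = mean_pool(tokens_a, mask_a)
emb_b = mean_pool(tokens_b, mask_b)
similarity = cosine_similarity(emb_a, emb_b)
```

With a real model, the token vectors would come from the model's last hidden state and the mask from the tokenizer's attention mask. The masking step matters because padded positions would otherwise skew the average.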