Update README.md
README.md
The model is optimized to analyze texts containing up to 512 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 512 tokens. Each chunk can then be processed separately.
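
As a minimal sketch of such chunking (assuming the model's own tokenizer; the `split_into_chunks` helper below is illustrative and not part of the model card), the text can be tokenized once and sliced into fixed-size windows:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")

# Illustrative helper: tokenize the full text, slice the ids into windows of at
# most `max_tokens` tokens, and decode each window back into a string.
def split_into_chunks(text, max_tokens=512):
    token_ids = tokenizer(text, add_special_tokens=False).input_ids
    return [
        tokenizer.decode(token_ids[i:i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

# chunks = split_into_chunks(long_text)  # then process each chunk separately
```

This simple fixed-size split may cut mid-sentence; splitting on sentence boundaries before tokenizing can give more natural chunks.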
## Supported Languages
Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish
## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model in bfloat16, then move the model to the GPU
# (the inputs below are placed on CUDA, so the model must be there as well)
tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained(
    "metricspace/DataPrivacyComplianceCheck-3B-V0.9", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Check for sensitive information: John, our patient, felt a throbbing headache and dizziness for two weeks. He was immediately..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Keep prompt plus generated text within the 512-token window the model is optimized for
max_length = 512
inputs_length = inputs.input_ids.shape[1]
max_new_tokens_value = max_length - inputs_length

# Greedy decoding; top_k/top_p only take effect when do_sample=True
outputs = model.generate(inputs.input_ids, max_new_tokens=max_new_tokens_value, do_sample=False, top_k=50, top_p=0.98)

result = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
print(result)
```
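
The `max_new_tokens_value` calculation simply caps prompt plus completion at the 512-token window mentioned above; if your prompt is already close to 512 tokens, shorten it or split it into chunks first.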
# Use Cases
## Data Privacy and Compliance