### Overview
`pii-phi` is a fine-tuned version of `Phi-3.5-mini-instruct` designed to extract Personally Identifiable Information (PII) from unstructured text. The model returns the extracted entities as a JSON object whose `result` list holds one `entity`/`value` pair per item, following the schema given in the prompt.
### Training Prompt Format
```text
# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.
# EXPECTED OUTPUT
- The json output must be in the format below:
{
  "result": [
    {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
    ...
  ]
}
```
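Because the prompt fixes the response shape, a caller can check a response before using it. The following is a minimal sketch, not part of the model's tooling: `parse_pii_response` is a hypothetical helper name, and it assumes the model returns the JSON object alone, with no surrounding text.

```python
import json

# Entity types listed in the training prompt.
SUPPORTED_ENTITIES = {
    "PERSON_NAME", "BUSINESS_NAME", "API_KEY", "USERNAME", "API_ENDPOINT",
    "WEBSITE_ADDRESS", "PHONE_NUMBER", "EMAIL_ADDRESS", "ID", "PASSWORD",
    "ADDRESS",
}


def parse_pii_response(raw: str) -> list[dict]:
    """Parse a response string and keep only well-formed items.

    Hypothetical helper: assumes `raw` is the JSON object alone.
    Raises ValueError if the text is not JSON or lacks a "result" list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or not isinstance(data.get("result"), list):
        raise ValueError('response must be an object with a "result" list')

    items = []
    for item in data["result"]:
        # Skip items that are malformed or use an unsupported entity type.
        if (
            isinstance(item, dict)
            and isinstance(item.get("entity"), str)
            and item["entity"] in SUPPORTED_ENTITIES
            and isinstance(item.get("value"), str)
        ):
            items.append({"entity": item["entity"], "value": item["value"]})
    return items
```

Silently skipping malformed items is a design choice for this sketch; a stricter pipeline might raise an error instead.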
### Supported Entities
* PERSON\_NAME
* BUSINESS\_NAME
* API\_KEY
* USERNAME
* API\_ENDPOINT
* WEBSITE\_ADDRESS
* PHONE\_NUMBER
* EMAIL\_ADDRESS
* ID
* PASSWORD
* ADDRESS
### Intended Use
The model is intended for PII detection in text documents to support tasks such as data anonymization, compliance, and security auditing.
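For the anonymization use case, the extracted values can be substituted back into the source text. The sketch below is one simple way to do that, using plain string replacement; the function name `redact` and the `[ENTITY_TYPE]` tag format are assumptions made for illustration, and the `entities` list is hand-written in the shape the prompt describes rather than recorded model output.

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each extracted value in `text` with a `[ENTITY_TYPE]` tag."""
    # Replace longer values first so that a short value which is a substring
    # of a longer one does not break the longer match.
    ordered = sorted(entities, key=lambda e: len(e["value"]), reverse=True)
    for item in ordered:
        if item["value"]:  # an empty value would match everywhere
            text = text.replace(item["value"], f"[{item['entity']}]")
    return text


text = "I am James Jake and my employee number is 123123123"
# Hand-written example entities, not actual model output.
entities = [
    {"entity": "PERSON_NAME", "value": "James Jake"},
    {"entity": "ID", "value": "123123123"},
]
print(redact(text, entities))
# prints: I am [PERSON_NAME] and my employee number is [ID]
```

Plain replacement substitutes every occurrence of a value, which is usually what anonymization wants but can over-redact very short values.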
### Limitations
* Not guaranteed to detect all forms of PII in every context.
* May return false positives or omit contextually relevant information.
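Some false positives can be filtered in post-processing. As a minimal sketch (the helper name `keep_grounded` is hypothetical, not part of the model or vLLM), the function below drops any entity whose value does not occur verbatim in the input text. It only catches values the model made up or altered; it does not recover missed PII or correct a wrong entity type.

```python
def keep_grounded(text: str, entities: list[dict]) -> list[dict]:
    """Keep only entities whose value occurs verbatim in `text`."""
    return [e for e in entities if e["value"] and e["value"] in text]


text = "I am James Jake and my employee number is 123123123"
# Hand-written example entities; the second one is deliberately not in `text`.
entities = [
    {"entity": "PERSON_NAME", "value": "James Jake"},
    {"entity": "EMAIL_ADDRESS", "value": "james@example.com"},
]
print(keep_grounded(text, entities))
# prints: [{'entity': 'PERSON_NAME', 'value': 'James Jake'}]
```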
---
### Installation
Install the `vllm` package to run the model efficiently:
```bash
pip install vllm
```
---
### Example
```python
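from vllm import LLM, SamplingParams

# Load the fine-tuned model from the Hugging Face Hub.
llm = LLM("Fsoft-AIC/pii-phi")

# System prompt in the same format as the training prompt.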
system_prompt = """
# GUIDELINES
- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.
# EXPECTED OUTPUT
- The json output must be in the format below:
{
  "result": [
    {"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
    ...
  ]
}
"""
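
# Text to scan for PII.
pii_message = "I am James Jake and my employee number is 123123123"

# temperature=0 selects greedy decoding, so repeated runs give the same output.
sampling_params = SamplingParams(temperature=0, max_tokens=1000)

outputs = llm.chat(
    [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": pii_message},
    ],
    sampling_params,
)

# Each request output holds a list of completions; take the first one.
for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)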
```