	### Overview

	`pii-phi` is a fine-tuned version of `Phi-3.5-mini-instruct` that extracts Personally Identifiable Information (PII) from unstructured text. It returns the detected entities as JSON: a `result` list in which each item has an `entity` type and the extracted `value`.

	### Training Prompt Format

	```text
	# GUIDELINES
	- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
	- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
	- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

	# EXPECTED OUTPUT
	- The json output must be in the format below:
	{
	"result": [
	{"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
	...
	]
	}
	```
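
Downstream code usually needs to parse the response before using it. The following is a minimal sketch of that step, assuming the model returns a bare JSON object in the format above. `parse_pii_output` is a hypothetical helper, not part of the model or of `vllm`, and the sample string is hand-written rather than real model output.

```python
import json

# Entity types listed in the prompt guidelines.
SUPPORTED_ENTITIES = {
    "PERSON_NAME", "BUSINESS_NAME", "API_KEY", "USERNAME", "API_ENDPOINT",
    "WEBSITE_ADDRESS", "PHONE_NUMBER", "EMAIL_ADDRESS", "ID", "PASSWORD",
    "ADDRESS",
}


def parse_pii_output(raw: str) -> list[dict[str, str]]:
    """Parse a response and keep only well-formed items with a supported type.

    Hypothetical helper for illustration.
    Raises ValueError if the text is not JSON or has no "result" list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or not isinstance(data.get("result"), list):
        raise ValueError('response must be an object with a "result" list')

    entities = []
    for item in data["result"]:
        if not isinstance(item, dict):
            continue  # skip malformed items instead of failing the whole parse
        entity, value = item.get("entity"), item.get("value")
        if (
            isinstance(entity, str)
            and entity in SUPPORTED_ENTITIES
            and isinstance(value, str)
        ):
            entities.append({"entity": entity, "value": value})
    return entities


# Hand-written sample response, not actual model output.
sample = (
    '{"result": ['
    '{"entity": "PERSON_NAME", "value": "James Jake"}, '
    '{"entity": "UNKNOWN", "value": "x"}]}'
)
print(parse_pii_output(sample))
```

Dropping items with an unsupported type is a design choice for this sketch. A stricter pipeline might raise an error instead, so that schema drift is noticed early.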

	### Supported Entities

	* PERSON\_NAME
	* BUSINESS\_NAME
	* API\_KEY
	* USERNAME
	* API\_ENDPOINT
	* WEBSITE\_ADDRESS
	* PHONE\_NUMBER
	* EMAIL\_ADDRESS
	* ID
	* PASSWORD
	* ADDRESS

	### Intended Use

	The model is intended for PII detection in text documents to support tasks such as data anonymization, compliance, and security auditing.
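
As an illustration of the anonymization use case, the sketch below replaces each extracted value with a placeholder naming its entity type. `redact` is a hypothetical helper, and the entity list is hand-written for the example rather than taken from a real model run. It assumes the extracted values appear verbatim in the input text.

```python
def redact(text: str, entities: list[dict[str, str]]) -> str:
    """Replace each extracted value in `text` with an [ENTITY_TYPE] placeholder."""
    # Replace longer values first so that a short value contained in a longer
    # one does not break the longer match.
    for item in sorted(entities, key=lambda e: len(e["value"]), reverse=True):
        if item["value"]:  # an empty string would match at every position
            text = text.replace(item["value"], f"[{item['entity']}]")
    return text


# Hand-written entities for illustration, not actual model output.
entities = [
    {"entity": "PERSON_NAME", "value": "James Jake"},
    {"entity": "ID", "value": "123123123"},
]
print(redact("I am James Jake and my employee number is 123123123", entities))
# prints: I am [PERSON_NAME] and my employee number is [ID]
```

Plain string replacement is the simplest option, but it can also rewrite unrelated occurrences of a short value. A production pipeline would more likely work with character offsets.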

	### Limitations

	* Not guaranteed to detect all forms of PII in every context.
	* May return false positives or omit contextually relevant information.
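
One inexpensive guard against a single class of false positive is to check that every returned value actually occurs in the input text. The sketch below shows that check. `split_by_grounding` is a hypothetical helper, and the entities are hand-written for illustration. It does nothing about PII the model fails to detect.

```python
def split_by_grounding(text, entities):
    """Separate entities whose value appears verbatim in `text` from the rest."""
    grounded, ungrounded = [], []
    for item in entities:
        value = item.get("value", "")
        if isinstance(value, str) and value and value in text:
            grounded.append(item)
        else:
            ungrounded.append(item)
    return grounded, ungrounded


text = "Contact Jane at jane@example.com"
# Hand-written entities for illustration, not actual model output.
entities = [
    {"entity": "EMAIL_ADDRESS", "value": "jane@example.com"},
    {"entity": "PHONE_NUMBER", "value": "555-0100"},
]
grounded, ungrounded = split_by_grounding(text, entities)
print(grounded)    # entities found verbatim in the text
print(ungrounded)  # entities to review or discard
```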

	---

	### Installation

	Install the `vllm` package to run the model efficiently:

	```bash
	pip install vllm
	```

	---

	### Example

	```python
	from vllm import LLM, SamplingParams

	llm = LLM("Fsoft-AIC/pii-phi")

	system_prompt = """
	# GUIDELINES
	- Extract all instances of the following Personally Identifiable Information (PII) entities from the provided text and return them in JSON format.
	- Each item in the JSON list should include an 'entity' key specifying the type of PII and a 'value' key containing the extracted information.
	- The supported entities are: PERSON_NAME, BUSINESS_NAME, API_KEY, USERNAME, API_ENDPOINT, WEBSITE_ADDRESS, PHONE_NUMBER, EMAIL_ADDRESS, ID, PASSWORD, ADDRESS.

	# EXPECTED OUTPUT
	- The json output must be in the format below:
	{
	"result": [
	{"entity": "ENTITY_TYPE", "value": "EXTRACTED_VALUE"},
	...
	]
	}
	"""
	pii_message = "I am James Jake and my employee number is 123123123"

	# temperature=0 selects greedy decoding.
	sampling_params = SamplingParams(temperature=0, max_tokens=1000)
	outputs = llm.chat(
	    [
	        {"role": "system", "content": system_prompt},
	        {"role": "user", "content": pii_message},
	    ],
	    sampling_params,
	)

	# Print the generated text, which is expected to be JSON in the format above.
	for output in outputs:
	    generated_text = output.outputs[0].text
	    print(generated_text)
	```