CAPID

This model performs context-aware PII detection in user-provided text and additionally classifies the relevance of each PII attribute to a given question. It outputs a structured JSON mapping spans → {type, relevance}. The model is fine-tuned from Llama-3.1-8B using the CAPID dataset generation pipeline.
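For illustration, the mapping can be thought of as span → {type, relevance}. The field names and values below are a hypothetical sketch of that shape, not the model's guaranteed schema:

```python
import json

# Hypothetical example of the output shape: each detected PII span maps to
# its PII type and a binary relevance flag (1 = high, 0 = low).
example_output = json.loads("""
{
  "Hyderabad": {"type": "location", "relevance": 1},
  "20-year-old": {"type": "age", "relevance": 0}
}
""")

for span, info in example_output.items():
    print(f"{span}: type={info['type']}, relevance={info['relevance']}")
```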


Model Details

Model Description

This is a fine-tuned Llama-3.1-8B model trained for PII span detection, PII type classification, and binary relevance estimation (1 = high relevance, 0 = low relevance).
The model takes both a text and a question as input and predicts which pieces of personal information matter for answering that question.

This version of the model is intended for research on privacy-preserving NLP, contextual anonymization, and relevance-aware redaction.

  • Model type: Causal Language Model (Llama-3.1-8B, fine-tuned)
  • Languages: English
  • Finetuned from: meta-llama/Meta-Llama-3.1-8B
  • Intended use: PII detection, PII relevance classification, privacy-aware text preprocessing
  • Training framework: Unsloth + PEFT (4-bit quantized training)

Uses

Intended Use

This model is designed for:

  • PII detection in unstructured text
  • Classifying PII relevance to downstream tasks (question answering)
  • Relevance-aware masking / privacy-preserving context reduction
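The last use case can be sketched as a post-processing step over the model's output: replace low-relevance spans with a type placeholder while keeping high-relevance spans intact. The `mask_low_relevance` helper and the sample detections below are hypothetical, for illustration only:

```python
def mask_low_relevance(text: str, detections: dict) -> str:
    """Replace spans the model marked low-relevance (0) with a [TYPE]
    placeholder, keeping high-relevance (1) spans intact."""
    for span, info in detections.items():
        if info["relevance"] == 0:
            text = text.replace(span, f"[{info['type'].upper()}]")
    return text

# Hypothetical detections for a toy input.
detections = {
    "Hyderabad": {"type": "location", "relevance": 1},
    "Fine Arts": {"type": "education", "relevance": 0},
}
print(mask_low_relevance("A Fine Arts student from Hyderabad.", detections))
# → A [EDUCATION] student from Hyderabad.
```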

How to Run

You can run the model with Unsloth or standard transformers.

Using Unsloth (recommended)

from unsloth import FastLanguageModel
from transformers import AutoTokenizer

MODEL_ID = "ponoma16/context-pii-detection-binary-relevance-LLama-8B"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_ID,
    max_seq_length = 2048,
    load_in_4bit = True,
)

tokenizer.pad_token = tokenizer.eos_token
FastLanguageModel.for_inference(model)

instruction = """You are given the text and the question.
Find all PIIs (Personally Identifiable Information) in the text and output them separated by commas.
Classify them into the following types: health, location, sexual orientation, occupation, age, belief,
relationship, name, education, appearance, code, organization, finance, datetime, demographic.
Classify their relevance to the question: 1 (high), 0 (low).
Output result in the JSON format.
"""

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
Text: {}
Question: {}

### Response:
{}"""

context = "I’m a 20-year-old Fine Arts student from Hyderabad. After my dad passed away, life became unbearable. My relatives made me do all the housework, insulted my art, and even tried to marry me off for money. I faced harassment from people who were supposed to protect me, and I finally left home to start over. For a while, I was managing with the help of friends and my art. But a few days ago, one of the people who hurt me before found where I stay and attacked me again. I didn’t know what to do — I felt completely broken, but I gathered the courage to protect myself and reach out for help."
question = "What local organizations or services can protect someone like me?"

input_str = alpaca_prompt.format(
    instruction,
    context,
    question,
    "",
)

inputs = tokenizer([input_str], return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,  # greedy decoding; temperature/top_p are ignored when sampling is off
)
output_tokens = outputs[0][inputs['input_ids'].shape[1]:]
decoded = tokenizer.decode(output_tokens, skip_special_tokens=True).strip()
print(decoded)
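The generated text should be JSON, but model output can occasionally include stray text around the object, so it is safer to extract the first JSON object before parsing. The helper below is an illustrative sketch, not part of the released code:

```python
import json
import re

def parse_model_output(decoded: str) -> dict:
    """Extract and parse the first JSON object in the generated text.
    Returns an empty dict if no valid JSON is found."""
    match = re.search(r"\{.*\}", decoded, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

# Toy example with a hypothetical model response.
result = parse_model_output('Sure: {"Hyderabad": {"type": "location", "relevance": 1}}')
```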