akhatre committed on
Commit 29ae185 · 1 Parent(s): fbc42ee

add reference to gliner2 zero shot capabilities

Files changed (2)
  1. README.md +37 -0
  2. anonymise.py +21 -1
README.md CHANGED
@@ -84,6 +84,8 @@ The synthetic data approach effectively distils the "knowledge" of a large LLM i
 | `TECHNICAL_ID_NUMBERS` | IP/MAC addresses, serial numbers |
 | `VEHICLE_ID_NUMBERS` | License plates, VINs |
 
+Since NERPA is built on GLiNER2 (a zero-shot bi-encoder), it is **not limited** to the entities above. You can pass any custom entity types alongside the built-in ones — the fine-tuning does not reduce the model's ability to detect arbitrary categories. See [Custom entities](#custom-entities) below.
+
 ## Quick Start
 
 ### Install dependencies
@@ -158,6 +160,41 @@ entities = detect_entities(model, text, entities={
 })
 ```
 
+### Custom entities
+
+You can detect additional entity types beyond the built-in PII set. The model's zero-shot capability means any label + description pair will work — your custom entities are detected and anonymised alongside the fine-tuned ones.
+
+**CLI** — use `--extra-entities` / `-e`:
+
+```bash
+python anonymise.py -e PRODUCT="Product name" -e SKILL="Professional skill" \
+    "John Smith is a senior Python developer who bought a MacBook Pro."
+```
+
+Output:
+
+```
+[PERSON_NAME] is a senior [SKILL] developer who bought a [PRODUCT].
+```
+
+**Python:**
+
+```python
+from anonymise import load_model, detect_entities, anonymise, PII_ENTITIES
+
+model = load_model(".")
+
+custom_entities = {
+    **PII_ENTITIES,
+    "PRODUCT": "Product name",
+    "SKILL": "Professional skill",
+}
+
+text = "John Smith is a senior Python developer who bought a MacBook Pro."
+entities = detect_entities(model, text, entities=custom_entities)
+print(anonymise(text, entities))
+```
+
 ## How It Works
 
 The inference pipeline in `anonymise.py`:
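Note on the "Custom entities" hunk: the Python example builds its label map by unpacking `PII_ENTITIES` and then adding custom labels. The sketch below illustrates only that merge step. The two-entry `PII_ENTITIES` here is a hypothetical stand-in (the real dict lives in `anonymise.py` and its contents are not shown in this diff), and `merge_entities` is an illustrative helper name, not part of the repository. It shows that a custom label reusing a built-in key replaces that key's description rather than adding a second entry.

```python
# Hypothetical stand-in for anonymise.PII_ENTITIES; the real contents
# are not part of this diff.
PII_ENTITIES = {
    "PERSON_NAME": "Name of a person",
    "VEHICLE_ID_NUMBERS": "License plates, VINs",
}


def merge_entities(builtin: dict[str, str], extra: dict[str, str]) -> dict[str, str]:
    """Return a new label -> description map; keys from `extra` win on collision."""
    return {**builtin, **extra}


merged = merge_entities(
    PII_ENTITIES,
    {"PRODUCT": "Product name", "PERSON_NAME": "Full legal name"},
)

# Built-in labels keep their position; the colliding description is replaced.
print(list(merged))           # ['PERSON_NAME', 'VEHICLE_ID_NUMBERS', 'PRODUCT']
print(merged["PERSON_NAME"])  # Full legal name
```

Because the merge builds a new dict, the built-in map itself is left unchanged, so later calls that rely on the defaults are unaffected.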
anonymise.py CHANGED
@@ -212,6 +212,14 @@ def main() -> None:
         "--show-entities", action="store_true",
         help="Print detected entities before anonymised text",
     )
+    parser.add_argument(
+        "--extra-entities", "-e", action="append", metavar="LABEL=DESCRIPTION",
+        help=(
+            "Additional custom entity types to detect alongside the built-in "
+            "PII entities. Repeat for each type. Format: LABEL=\"Description\". "
+            "Example: -e PRODUCT=\"Product name\" -e SKILL=\"Professional skill\""
+        ),
+    )
     args = parser.parse_args()
 
     if args.file:
@@ -225,8 +233,20 @@ def main() -> None:
     else:
         parser.error("Provide text as an argument or use --file")
 
+    extra: dict[str, str] = {}
+    if args.extra_entities:
+        for item in args.extra_entities:
+            if "=" not in item:
+                parser.error(
+                    f"Invalid --extra-entities value '{item}'. "
+                    "Expected format: LABEL=\"Description\""
+                )
+            label, description = item.split("=", 1)
+            extra[label.strip()] = description.strip()
+
     model = load_model(args.model)
-    detected = detect_entities(model, text, threshold=args.threshold)
+    all_entities = {**PII_ENTITIES, **extra} if extra else None
+    detected = detect_entities(model, text, entities=all_entities, threshold=args.threshold)
 
     if args.show_entities:
         for entity in detected: