HSDS Deduplication LightGBM Models (v1.2)

This repository contains binary classification models designed to determine whether two HSDS (Human Services Data Specification) records represent the same real-world entity.

For each pair of records, the pipeline outputs a duplication probability between 0.0 and 1.0.

Model Overview

We provide two specialized LightGBM models:

Organization Model: Determines if two organizations (service providers) are duplicates.
Service Model: Determines if two services are duplicates.

Purpose

The primary goal is to identify duplicates with high precision to clean human services databases.

Input: A pair of records (Entity A and Entity B) + Computed Features.
Output: Probability score.
Decision: Score > Threshold implies Duplicate.

Architecture & Data Flow

The training and inference pipeline relies on several key components found in this repository:

tf_idf_models/: Directory containing pre-trained TF-IDF vectorizers (tfidf_vectorizer_organization.joblib, tfidf_vectorizer_service.joblib) used to generate advanced text similarity features.
lgbm_model_*.txt: The trained LightGBM model files.
threshold_*.json: Verified decision thresholds optimized for F1-score.
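
How the stored thresholds were derived is not documented here beyond "optimized for F1-score". As a minimal sketch, assuming a validation set of labelled pairs with model scores, an F1-optimal cut-off could be found by sweeping candidate thresholds. The function name `best_f1_threshold` and the sample scores are hypothetical and only illustrate the idea.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 for the rule `score > threshold`."""
    best_t, best_f1 = 0.0, 0.0
    # Candidates: every distinct score, plus 0.0 so that all pairs can be flagged.
    for t in sorted(set(scores) | {0.0}):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        if tp == 0:
            continue  # F1 is zero or undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up validation data: model scores and ground-truth labels (1 = duplicate).
scores = [0.05, 0.20, 0.40, 0.55, 0.70, 0.90, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]
threshold, f1 = best_f1_threshold(scores, labels)
```

The comparison uses `score > threshold`, matching the decision rule stated in the Purpose section.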

Feature Engineering (v1.2)

Our models combine string, token, geographic, contact, and semantic features to capture similarity from multiple angles. Features are categorized into those common to all entities and those specific to services or organizations.

A. Common Features (All Models)

These features are calculated for both Organization and Service pairs.

1. String Similarity & Name Matching

Compares the primary NAME fields.

name_jaro_winkler: Jaro-Winkler similarity (0-1), well suited to typos in short strings.
name_levenshtein: Levenshtein ratio (0-1), a normalized similarity derived from edit distance.
name_token_sort: Sorts tokens alphabetically before comparison (e.g., "The Red Cross" == "Red Cross, The").
fuzzy_name: Boolean flag indicating very high similarity (>0.85).
max_distinctive_term_ratio: Measures agreement on rare/distinctive terms (TF-IDF).
tfidf_weighted_similarity: Cosine similarity of the TF-IDF vectors (Name + Description).
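
To make two of the features above concrete, here is a pure-Python sketch of a Levenshtein ratio and a token-sort comparison. It is an illustration under assumptions: the repository's FeatureExtractor may rely on a string-matching library whose normalization differs, and the function names below are hypothetical.

```python
import re


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - distance / length of the longer string (one common normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def token_sort_key(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens alphabetically."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", name.lower())))


def token_sort_ratio(a: str, b: str) -> float:
    """Levenshtein ratio of the token-sorted names, so word order is ignored."""
    return levenshtein_ratio(token_sort_key(a), token_sort_key(b))


same_after_sorting = token_sort_ratio("The Red Cross", "Red Cross, The")
```

With this normalization, the two spellings of "The Red Cross" from the example above reduce to the same sorted token string and compare as identical.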

2. Token Completeness

token_jaccard: Overlap of name tokens (intersection / union).
token_overlap_count: Count of shared tokens.
total_token_count: Total unique tokens in both names.
name_length_diff_in_tokens: Difference in word count.
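
The token features can be sketched with plain set operations. This assumes a simple lowercase alphanumeric tokenizer and reads total_token_count as the size of the token union; the actual tokenization in the repository may differ, and `token_features` is a hypothetical helper.

```python
import re


def name_tokens(name: str) -> list:
    """Split a name into lowercase alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", name.lower())


def token_features(name_a: str, name_b: str) -> dict:
    tokens_a, tokens_b = name_tokens(name_a), name_tokens(name_b)
    set_a, set_b = set(tokens_a), set(tokens_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "token_jaccard": len(shared) / len(union) if union else 0.0,
        "token_overlap_count": len(shared),
        "total_token_count": len(union),
        "name_length_diff_in_tokens": abs(len(tokens_a) - len(tokens_b)),
    }


features = token_features("Hope Food Pantry", "Hope Community Food Pantry")
```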

3. Geographic Context

shared_address: Boolean, true if cleaned street addresses match.
same_city: Boolean, true if normalized city names match.
same_state: Boolean, true if state codes match.
same_zipcode: Boolean, true if 5-digit zip codes match.

4. Contact & Metadata

shared_phone: Boolean, true if any phone number matches.
shared_email: Boolean, true if email addresses match.
shared_website: Boolean, true if domains match.
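
Contact matching depends on normalization, which this document does not specify. The sketch below assumes US-style phone numbers and records shaped as dictionaries with hypothetical `phones` and `website` keys; it illustrates the idea and is not the repository's implementation.

```python
import re
from urllib.parse import urlparse


def normalize_phone(phone: str) -> str:
    """Keep digits only and drop a leading US country code (assumption)."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def website_domain(url: str) -> str:
    """Return the lowercase host without a leading 'www.'."""
    url = url.strip()
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host when a scheme is present
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def contact_features(a: dict, b: dict) -> dict:
    phones_a = {normalize_phone(p) for p in a.get("phones", [])} - {""}
    phones_b = {normalize_phone(p) for p in b.get("phones", [])} - {""}
    domain_a = website_domain(a.get("website", ""))
    domain_b = website_domain(b.get("website", ""))
    return {
        "shared_phone": bool(phones_a & phones_b),
        "shared_website": bool(domain_a) and domain_a == domain_b,
    }


record_a = {"phones": ["(555) 010-2000"], "website": "https://www.example.org/about"}
record_b = {"phones": ["+1 555-010-2000"], "website": "example.org"}
result = contact_features(record_a, record_b)
```

Empty values are excluded before comparison so that two records with missing phone numbers or websites are not treated as sharing them.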

B. Service-Specific Features

These features are exclusive to the Service Model and focus on taxonomy (categorization) and delivery method.

1. Taxonomy & Semantics

Services are categorized using a standard taxonomy (e.g., AIRS/211). We go beyond exact code matching to understand semantic relationships.

shared_taxonomy_count: Number of exact taxonomy codes shared.
percent_shared_taxonomies: Ratio of shared codes to the total number of unique codes involved.
taxonomy_hierarchy_match_score: Measures hierarchical alignment.
- Logic: Checks if codes belong to the same branch by strictly matching prefixes (e.g., BD-1800 is a match for BD-1800.2000).
taxonomy_pairs_sim_gt_*_ratio (e.g., _gt_080_ratio):
- Purpose: Detects services that are semantically close but coded differently (e.g., "Soup Kitchens" vs "Food Pantries").
- Logic: We use Taxonomy Embeddings to calculate the cosine similarity between every pair of taxonomy codes. This feature counts the fraction of pairs that exceed a similarity threshold (0.70, 0.80, etc.).
embedding_similarity: Cosine similarity of the full service text embedding (Name + Description + Taxonomy definitions).
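
The taxonomy features above can be illustrated with a short sketch. The embedding vectors and the codes `X-1`, `X-2`, and `Y-1` are made-up placeholders, the helper names are hypothetical, and how the per-pair prefix check is aggregated into taxonomy_hierarchy_match_score is not specified here, so only the pairwise check is shown.

```python
import math
from itertools import product


def percent_shared(codes_a: set, codes_b: set) -> float:
    """Shared codes divided by all unique codes across both services."""
    union = codes_a | codes_b
    return len(codes_a & codes_b) / len(union) if union else 0.0


def same_branch(code_a: str, code_b: str) -> bool:
    """True if one code is a prefix of the other (plain prefix check)."""
    shorter, longer = sorted((code_a, code_b), key=len)
    return longer.startswith(shorter)


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pairs_sim_ratio(codes_a, codes_b, embeddings, threshold: float) -> float:
    """Fraction of cross-service code pairs whose cosine similarity exceeds the threshold."""
    pairs = list(product(codes_a, codes_b))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if cosine(embeddings[a], embeddings[b]) > threshold)
    return hits / len(pairs)


# Toy two-dimensional embeddings; real taxonomy embeddings would be much larger.
toy_embeddings = {"X-1": [1.0, 0.0], "X-2": [0.9, 0.1], "Y-1": [0.0, 1.0]}
ratio_080 = pairs_sim_ratio(["X-1"], ["X-2", "Y-1"], toy_embeddings, threshold=0.80)
```

Here one of the two code pairs exceeds 0.80, so the ratio is 0.5, even though the services share no exact code.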

2. Delivery Method

is_virtual_service_diff: Boolean.
- Logic: Checks LOCATION_TYPE. If one service is "virtual" (hotline, online-only) and the other is physical (has a physical location type), this flag is True. This helps prevent false positives between a national hotline and a local branch.
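
A minimal sketch of this flag, assuming each service carries a list of location type strings and that the values "virtual" and "physical" identify the two cases. Both the value sets and the function names are assumptions made for illustration.

```python
VIRTUAL_TYPES = {"virtual"}    # assumed value for hotline / online-only services
PHYSICAL_TYPES = {"physical"}  # assumed value for services with a street location


def is_virtual_only(location_types) -> bool:
    """True if the service has at least one location type and all are virtual."""
    types = {t.lower() for t in location_types}
    return bool(types) and types <= VIRTUAL_TYPES


def has_physical(location_types) -> bool:
    return any(t.lower() in PHYSICAL_TYPES for t in location_types)


def is_virtual_service_diff(types_a, types_b) -> bool:
    """True when one service is virtual-only and the other has a physical location."""
    return (is_virtual_only(types_a) and has_physical(types_b)) or (
        is_virtual_only(types_b) and has_physical(types_a)
    )


hotline_vs_branch = is_virtual_service_diff(["virtual"], ["physical"])
```

In this sketch a service with no location type at all does not set the flag, since missing data is not evidence of a mismatch.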

C. Organization-Specific Features

These features are exclusive to the Organization Model and aggregate data from the organization's portfolio of services.

shared_service_names: Count of services in each organization's portfolio that have matching names.
max_service_name_jaccard: The highest token overlap score found between any two services in the respective portfolios.
shared_service_taxonomies: Count of taxonomy codes shared across all services offered by the organizations.
avg_taxonomies_per_service_diff: Difference in the average number of taxonomy codes per service, a proxy for service "width" (specialized vs generalist).
total_services: Combined number of services (proxy for org size).
num_services_diff: Difference in portfolio size.
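
The portfolio features can be sketched as aggregations over each organization's list of service names. This assumes exact name matches are case-insensitive and reuses a simple alphanumeric tokenizer; `portfolio_features` is a hypothetical helper and the service names are made up.

```python
import re


def tokens(name: str) -> set:
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def portfolio_features(services_a: list, services_b: list) -> dict:
    names_a = {n.strip().lower() for n in services_a}
    names_b = {n.strip().lower() for n in services_b}
    # Best token overlap between any service of A and any service of B.
    max_jaccard = max(
        (jaccard(tokens(x), tokens(y)) for x in services_a for y in services_b),
        default=0.0,
    )
    return {
        "shared_service_names": len(names_a & names_b),
        "max_service_name_jaccard": max_jaccard,
        "total_services": len(services_a) + len(services_b),
        "num_services_diff": abs(len(services_a) - len(services_b)),
    }


org_a = ["Food Pantry", "Youth Mentoring"]
org_b = ["food pantry", "Senior Meals", "Youth Mentoring Program"]
portfolio = portfolio_features(org_a, org_b)
```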

Disqualification Score (dq-encoder): This feature is currently disabled in v1.2. The model relies on the features above.

Feature Importance Insights

Analysis of organization_feature_selection_report.csv and service_feature_selection_report.csv reveals what drives the models:

Model	Top Predictive Features	Insight
Organization	`name_levenshtein`, `embedding_similarity`, `total_token_count`, `name_token_sort`	Name similarity is king. The model heavily scrutinizes the organization name (spelling, token order, length) above all else.
Service	`shared_address`, `embedding_similarity`, `fuzzy_name`, `same_org_name_fuzzy`	Location context is critical. For services, sharing an address or parent organization is a massive predictor of duplication, often more than the service name itself.

Usage Guide (Python)

To use these models at runtime, initialize the FeatureExtractor with the correct model_type so that the matching TF-IDF vectorizer is loaded. The files are laid out as follows (model and threshold shown for the organization model):

/your/app/
  ├── LightGBM_Train/
  │   └── tf_idf_models/
  │       ├── tfidf_vectorizer_organization.joblib
  │       └── tfidf_vectorizer_service.joblib
  ├── lgbm_model_organization.txt
  └── threshold_organization.json

Inference Code Example

import json

import lightgbm as lgb
from feature_extractor import FeatureExtractor

# 1. Configuration
MODEL_TYPE = 'organization' # or 'service'
MODEL_PATH = f"lgbm_model_{MODEL_TYPE}.txt"
THRESHOLD_PATH = f"threshold_{MODEL_TYPE}.json"

# 2. Initialize Feature Extractor
# This automatically loads the correct TF-IDF model from deduplication/LightGBM_Train/tf_idf_models/
extractor = FeatureExtractor(
    model_type=MODEL_TYPE, 
    # Optional: ensure it rolls up services for orgs
    rollup_services=(MODEL_TYPE == 'organization')
)

# 3. Load Model
bst = lgb.Booster(model_file=MODEL_PATH)

# 4. Load Threshold
with open(THRESHOLD_PATH, 'r') as f:
    threshold_config = json.load(f)
    THRESHOLD = threshold_config['threshold']

# 5. Extract Features & Predict
def predict_duplication(record_a, record_b):
    # Extract dictionary of features
    features_dict = extractor.extract_features(record_a, record_b)
    
    # LightGBM expects feature values in the same column order used in training.
    # extract_features returns a dict, so order the values by the feature names
    # stored in the model file. Features missing from the dict default to 0.0.
    
    feature_names = bst.feature_name()
    feature_values = [features_dict.get(name, 0.0) for name in feature_names]
    
    # Predict
    prob = bst.predict([feature_values])[0]
    is_duplicate = prob > THRESHOLD
    
    return {
        "is_duplicate": bool(is_duplicate),
        "score": float(prob),
        "threshold_used": THRESHOLD
    }
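
When scoring many pairs at once, it can help to build the whole feature matrix in the model's column order and to fail loudly when a feature is missing, since a silent 0.0 default can hide extraction bugs. The sketch below is a hypothetical helper, not part of the repository; in practice `feature_names` would come from `bst.feature_name()` and the resulting rows would be passed to `bst.predict`.

```python
def build_feature_matrix(feature_dicts, feature_names, strict=True):
    """Order each feature dict by `feature_names` and return a list of rows.

    With strict=True a missing feature raises KeyError; otherwise it defaults to 0.0.
    """
    rows = []
    for i, features in enumerate(feature_dicts):
        missing = [name for name in feature_names if name not in features]
        if missing and strict:
            raise KeyError(f"pair {i} is missing features: {missing}")
        rows.append([float(features.get(name, 0.0)) for name in feature_names])
    return rows


# Hypothetical example with two of the documented features.
names = ["name_levenshtein", "shared_phone"]
pairs = [
    {"shared_phone": True, "name_levenshtein": 0.9},
    {"shared_phone": False, "name_levenshtein": 0.2},
]
matrix = build_feature_matrix(pairs, names)
```

Booleans are cast to floats so that every row has a uniform numeric type regardless of how the extractor represents flags.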

Performance Metrics

Metric	Organization Model	Service Model
ROC AUC	0.9742	0.9924
Precision	0.96	0.96
Recall	0.98	0.98
F1-Score	0.97	0.97
Accuracy	0.96	0.96

Metrics based on latest validation reports.
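
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two rows. The helper name is hypothetical; the inputs are the rounded values from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall reported for both models.
recomputed_f1 = round(f1_score(0.96, 0.98), 2)
```

The recomputed value rounds to 0.97, which agrees with the F1-Score row.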
