---
Developed by: GivingTuesday Data Commons
license: apache-2.0
Model Type: Classifier (BERT)
language:
- en
---
**Technical Specifications Document**: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)
**Description**:
This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
- Local/Regional: Orgs operating primarily in one state or a few localities
- National: Orgs operating across many U.S. states
- International: Orgs with notable international grantmaking activity
------------------------------------------------------------------------------------------------------------------------------------------------------
# Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation
**Author**: Edward Moore \- GivingTuesday Data Commons
**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.
# 1\. Approach
## Definitions
This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
1. Local/Regional: Orgs operating primarily in one state or a few localities
2. National: Orgs operating across many U.S. states
3. International: Orgs with notable international grantmaking activity
## Variables
The variables that may be used in the classification approach include:
* FILERNAME1
* FILEREIN
* TAXYEAR
* FILERUSSTATE (990PF Basic Fields Data Mart: Header A \- Charity Location \- Domestic State)
* SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Zip or Postal Code)
* SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Country)
* SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address State or Province)
* SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A \- Col E \- Row 2 \- Amount)
* SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A \- Col C \- Row 2 \- Foundation Status of Recipient)
* SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address City)
* RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- State)
* RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D \- Amount of Cash)
* RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- City)
* Schedule F Variables TBD
## Approach
1. Identify funding orgs from Form 990 and 990PF (Part XIV-3a), including Schedules I and F
2. Standardize and enrich location data using state/country lookups and geocoding via Geopy
3. Calculate foreign grant percentage, max state concentration, number of distinct U.S. recipient states, and a composite score combining all three features
4. Assign each org to one of the 3 geographic categories (local, national, international) in one of two ways:
   * Apply KMeans to group orgs into 3 clusters reflecting geographic scope, **or**
   * Choose thresholds for the features (foreign grant percentage, max state concentration, number of states) to define the categories
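
The feature step can be illustrated with a minimal sketch. It assumes each grant has already been reduced to a `(country, state, amount)` tuple for one EIN and tax year. That record layout is hypothetical and is not the Data Mart schema. It also assumes shares are weighted by dollar amount and that state concentration is measured against domestic dollars with a known state. The source does not say which weighting the notebooks use, and the composite score is left out because its formula is not given.

```python
from collections import defaultdict


def scope_features(grants):
    """Compute geographic-scope features for one filer in one tax year.

    `grants` is an iterable of (country, state, amount) tuples.
    This layout is an assumption made for illustration only.
    """
    total = 0.0
    foreign = 0.0
    by_state = defaultdict(float)
    for country, state, amount in grants:
        total += amount
        if country != "US":
            foreign += amount
        elif state:
            # U.S. grants with a missing state count toward the total only.
            by_state[state] += amount
    if total <= 0:
        return None  # nothing to measure
    domestic = sum(by_state.values())
    return {
        # Share of all grant dollars sent outside the U.S.
        "foreign_pct": foreign / total,
        # Share of state-attributed domestic dollars going to the top state.
        "max_state_concentration": max(by_state.values()) / domestic if domestic else 0.0,
        # Number of distinct U.S. recipient states.
        "n_states": len(by_state),
    }


example = scope_features([
    ("US", "NY", 600.0),
    ("US", "NJ", 200.0),
    ("MX", "", 200.0),
])
print(example)
# {'foreign_pct': 0.2, 'max_state_concentration': 0.75, 'n_states': 2}
```

Counting grants instead of summing amounts would be an equally plausible reading of "percentage"; swapping `amount` for `1` in the loop gives that variant.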
# 2\. Code documentation
## Notebooks
1. Geopy Lookups
[Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865)
This notebook runs geolocation lookups for missing or ambiguous country/state data.
2. 990PF Cleaning
[990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865)
This notebook cleans and standardizes grants data from Form 990-PF.
3. 990PF Aggregation
[990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865)
This notebook aggregates grant data by state and calculates key features.
4. 990 Aggregation
[990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865)
This notebook performs similar aggregations for Form 990 Schedule I and F filers.
5. Classification
[Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865)
This notebook combines datasets, scales features, and runs KMeans clustering.
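
For readers without access to the notebooks, the scale-then-cluster step might look roughly like the sketch below. It assumes scikit-learn's `StandardScaler` and `KMeans`, a feature order of foreign share, max state concentration, and number of states, and a small made-up sample. The internal notebook may use different tooling, parameters, or features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_scope(features, n_clusters=3, seed=0):
    """Standardize features, fit KMeans, and return labels plus centroids.

    Centroids are mapped back to the original feature units so they can
    be read as foreign share, state concentration, and state count.
    """
    X = np.asarray(features, dtype=float)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per column
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X_scaled)
    centroids = scaler.inverse_transform(model.cluster_centers_)
    return labels, centroids


# Made-up rows: (foreign_pct, max_state_concentration, n_states)
rows = [
    (0.00, 0.95, 1), (0.02, 0.90, 2), (0.00, 1.00, 1),     # concentrated in one state
    (0.05, 0.15, 40), (0.10, 0.20, 35), (0.03, 0.10, 45),  # spread across many states
    (0.80, 0.30, 5), (0.90, 0.20, 3), (0.70, 0.40, 6),     # mostly foreign grants
]
labels, centroids = cluster_scope(rows)
print(labels)
print(centroids.round(2))
```

Scaling matters here because the state count ranges over tens while the two shares stay between 0 and 1; without it the state count would dominate the distance calculation.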
# 3\. Outputs and Results
## K-means Cluster Definitions
* Cluster 0: Local/Regional
  * High concentration in one state
  * Low foreign activity
* Cluster 1: International
  * High percentage of international grants
* Cluster 2: National
  * Wide U.S. state distribution
  * Moderate to low international share
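
KMeans numbers its clusters arbitrarily, so the ids listed above hold for one particular fit and can change when the model is refit. One way to keep the names stable is to derive them from the centroids. The sketch below is a heuristic that assumes centroids are expressed in original units as `(foreign_pct, max_state_concentration, n_states)`; the function name and the ordering rule are illustrative, not taken from the notebooks.

```python
def name_clusters(centroids):
    """Map each cluster id to a scope name using its centroid.

    `centroids` maps cluster id -> (foreign_pct, max_state_concentration, n_states).
    Rule (an assumption): the highest foreign share is International; of the
    other two, the higher state concentration is Local/Regional; the one
    left over is National.
    """
    if len(centroids) != 3:
        raise ValueError("expected exactly three clusters")
    remaining = dict(centroids)
    international = max(remaining, key=lambda c: remaining[c][0])
    del remaining[international]
    local = max(remaining, key=lambda c: remaining[c][1])
    del remaining[local]
    (national,) = remaining  # exactly one cluster id is left
    return {
        international: "International",
        local: "Local/Regional",
        national: "National",
    }


names = name_clusters({
    0: (0.01, 0.95, 1.3),
    1: (0.80, 0.30, 4.7),
    2: (0.06, 0.15, 40.0),
})
print(names)
# {1: 'International', 0: 'Local/Regional', 2: 'National'}
```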
# 4\. List of approaches that did not work
# 5\. Plan for Deploy, Scale, Archive, and Expose-data steps
* Easily extendable to additional years and EINs
* Output is an assigned geographic scope (local, national, or international) for each EIN per tax year
# 6\. Future work
* Incorporate Schedule F data once available
* Potentially explore labeling and supervised learning using curated examples rather than clustering
* Review k-means clustering approach and decide if it should be scaled or if a definitive feature-threshold approach should be used instead to define the 3 clusters
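
To make the feature-threshold alternative concrete, here is a minimal sketch of what a rule-based classifier over the three features could look like. Every cut-off in `ScopeThresholds` is a placeholder chosen for illustration; the source does not specify threshold values, and the names `ScopeThresholds` and `classify_scope` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeThresholds:
    # Placeholder cut-offs, NOT values from the project.
    min_foreign_pct: float = 0.25            # at or above: International
    min_national_states: int = 10            # at or above: candidate for National
    max_national_concentration: float = 0.5  # National must not exceed this


def classify_scope(foreign_pct, max_state_concentration, n_states,
                   t=ScopeThresholds()):
    """Assign a geographic scope from three features using fixed rules."""
    if foreign_pct >= t.min_foreign_pct:
        return "International"
    if (n_states >= t.min_national_states
            and max_state_concentration <= t.max_national_concentration):
        return "National"
    # Few states, or many states dominated by a single one.
    return "Local/Regional"


print(classify_scope(0.60, 0.30, 5))   # International
print(classify_scope(0.05, 0.15, 40))  # National
print(classify_scope(0.00, 0.95, 1))   # Local/Regional
```

Unlike clustering, rules of this kind give the same label for the same inputs in every tax year, which is one reason a threshold approach may be easier to extend to new years and EINs.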