---
Developed by: GivingTuesday Data Commons
license: apache-2.0
Model Type: Classifier (BERT)
language:
- en
---

**Technical Specifications Document**: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)

**Description**:

This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:

- Local/Regional: Orgs operating primarily in one state or a few localities
- National: Orgs operating across many U.S. states
- International: Orgs with notable international grantmaking activity

------------------------------------------------------------------------------------------------------------------------------------------------------

# Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation

**Author**: Edward Moore - GivingTuesday Data Commons

**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.

# 1. Approach

## Definitions

This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:

1. Local/Regional: Orgs operating primarily in one state or a few localities
2. National: Orgs operating across many U.S. states
3. International: Orgs with notable international grantmaking activity

## Variables

The variables that may be used in the classification approach include:

* FILERNAME1
* FILEREIN
* TAXYEAR
* FILERUSSTATE (990PF Basic Fields Data Mart: Header A - Charity Location - Domestic State)
* SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A - Col A - Row 2 - Address Zip or Postal Code)
* SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A - Col A - Row 2 - Address Country)
* SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A - Col A - Row 2 - Address State or Province)
* SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A - Col E - Row 2 - Amount)
* SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A - Col C - Row 2 - Foundation Status of Recipient)
* SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A - Col A - Row 2 - Address City)
* RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 - Address US - State)
* RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D - Amount of Cash)
* RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 - Address US - City)
* Schedule F Variables TBD
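
As a rough illustration of how the grant-level fields listed above might be brought into one shape before aggregation, the sketch below renames the 990PF and Schedule I columns to shared field names. Only the source column names come from the list above. The shared names (`recipient_state`, `amount`, and so on), `COLUMN_MAP`, `ID_COLUMNS`, and `to_common_record` are hypothetical and are not taken from the project notebooks. It also assumes that `FILEREIN` and `TAXYEAR` are present on every grant row.

```python
from typing import Any, Dict, Optional

# Source columns come from the variable list above; the target names are
# hypothetical and chosen only for this illustration.
COLUMN_MAP: Dict[str, Dict[str, str]] = {
    "990PF": {
        "recipient_city": "SIGOCPYRFACI",
        "recipient_state": "SIGOCPYRFAPO",
        "recipient_postal_code": "SIGOCPYRFAPC",
        "recipient_country": "SIGOCAFFRFACO",
        "amount": "SIGOCPYAMOUN",
    },
    # The Schedule I data mart covers domestic recipients, so no country
    # or postal code column is mapped for it here.
    "990_SCHEDULE_I": {
        "recipient_city": "RECTABADDCIT",
        "recipient_state": "RECTABADDSTA",
        "amount": "RETAAMOFCAGR",
    },
}

# Assumption: these identifying columns are present on every grant row.
ID_COLUMNS = ("FILEREIN", "TAXYEAR")

TARGET_FIELDS = (
    "recipient_city",
    "recipient_state",
    "recipient_postal_code",
    "recipient_country",
    "amount",
)


def to_common_record(row: Dict[str, Any], form: str) -> Dict[str, Optional[Any]]:
    """Rename one raw grant row to the shared field names for its form."""
    mapping = COLUMN_MAP[form]
    record: Dict[str, Optional[Any]] = {col: row.get(col) for col in ID_COLUMNS}
    record["form"] = form
    for target in TARGET_FIELDS:
        source = mapping.get(target)
        # Fields a form does not provide are filled with None.
        record[target] = row.get(source) if source else None
    return record
```

Calling `to_common_record(row, "990_SCHEDULE_I")` on a Schedule I row returns a dictionary with the same keys as a 990PF row, with `recipient_country` and `recipient_postal_code` set to `None`, so both sources can be stacked before features are computed.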

## Approach

1. Identify funding orgs from Form 990 and 990PF (Part XIV-3a), including Schedules I and F
2. Standardize and enrich location data using state/country lookups and geocoding via Geopy
3. Calculate foreign grant percentage, max state concentration, number of distinct U.S. recipient states, and a composite score combining all three features
4. Apply KMeans to group orgs into 3 clusters reflecting geographic scope

   **OR**

   Choose thresholds for the features (foreign grant percentage, max state concentration, number of states) to define the 3 geographic categories (local, national, international)
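
To make step 3 and the threshold alternative more concrete, here is a minimal sketch that aggregates grant rows into the three features per EIN and tax year and then applies fixed cut-offs. Several things are assumptions rather than facts from the project: the `Grant` and `Features` types, the function names, the choice to weight by grant dollars rather than grant counts, and above all the threshold values, which are placeholders. The composite score is left out because the source does not define how it is combined.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Grant(NamedTuple):
    ein: str
    tax_year: int
    country: str            # assumed already standardized, e.g. "US"
    state: Optional[str]    # U.S. state code, None for foreign grants
    amount: float


class Features(NamedTuple):
    foreign_pct: float      # share of grant dollars going outside the U.S.
    max_state_share: float  # largest single-state share of U.S. grant dollars
    n_states: int           # number of distinct U.S. recipient states


def compute_features(grants: Iterable[Grant]) -> Dict[Tuple[str, int], Features]:
    """Aggregate grant rows into one feature row per (EIN, tax year)."""
    total = defaultdict(float)
    foreign = defaultdict(float)
    by_state = defaultdict(lambda: defaultdict(float))
    for g in grants:
        key = (g.ein, g.tax_year)
        total[key] += g.amount
        if g.country != "US":
            foreign[key] += g.amount
        elif g.state:
            by_state[key][g.state] += g.amount

    features = {}
    for key, amount in total.items():
        if amount <= 0:
            continue  # skip filers with no positive grant dollars
        domestic = sum(by_state[key].values())
        max_share = max(by_state[key].values()) / domestic if domestic > 0 else 0.0
        features[key] = Features(foreign[key] / amount, max_share, len(by_state[key]))
    return features


# Placeholder cut-offs for illustration only; the source does not give values.
FOREIGN_PCT_MIN = 0.25
MAX_STATE_SHARE_MIN = 0.60
N_STATES_MAX_LOCAL = 5


def classify(f: Features) -> str:
    """Rule-based alternative to clustering, checked in priority order."""
    if f.foreign_pct >= FOREIGN_PCT_MIN:
        return "International"
    if f.max_state_share >= MAX_STATE_SHARE_MIN or f.n_states <= N_STATES_MAX_LOCAL:
        return "Local/Regional"
    return "National"
```

Checking the international rule first means an org with heavy foreign grantmaking is labeled International even if its domestic grants are concentrated in one state. That ordering is a design choice of this sketch and would need to be confirmed against the intended definitions.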

# 2. Code documentation

## Notebooks

1. Geopy Lookups: [Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865)

   This notebook runs geolocation lookups for missing or ambiguous country/state data.

2. 990PF Cleaning: [990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865)

   This notebook cleans and standardizes grants data from Form 990-PF.

3. 990PF Aggregation: [990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865)

   This notebook aggregates grant data by state and calculates key features.

4. 990 Aggregation: [990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865)

   This notebook performs similar aggregations for Form 990 Schedule I and F filers.

5. Classification: [Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865)

   This notebook combines datasets, scales features, and runs KMeans clustering.
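
Since the notebooks themselves may not be accessible, the following dependency-free sketch shows the general shape of the scale-then-cluster step described for the Classification notebook. It is not the notebook's code. It assumes min-max scaling (the source does not say which scaler is used), implements plain Lloyd's algorithm by hand, and uses deterministic farthest-point seeding in place of random seeding so that results are reproducible. All function names are hypothetical. In practice a library implementation such as scikit-learn's `KMeans` would usually be preferable.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[Point]:
    """Rescale every column to [0, 1] so no feature dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # constant column: avoid /0
    return [tuple((v - lo) / sp for v, lo, sp in zip(row, lows, spans)) for row in rows]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _farthest_point_init(points: Sequence[Point], k: int) -> List[Point]:
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(_sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points: Sequence[Point], k: int = 3, max_iter: int = 100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = _farthest_point_init(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

A typical call would be `labels, centroids = kmeans(min_max_scale(feature_rows), k=3)`, where each row of `feature_rows` holds foreign grant percentage, max state concentration, and number of states for one org. Scaling matters here because the number of states lives on a much larger numeric range than the two percentage features.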

# 3. Outputs and Results

## K-means Cluster Definitions

* Cluster 0: Local/Regional
  * High concentration in one state
  * Low foreign activity
* Cluster 1: International
  * High percentage of international grants
* Cluster 2: National
  * Wide U.S. state distribution
  * Moderate to low international share
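
K-means cluster numbers are arbitrary and can change between runs, so a mapping like the one above generally has to be re-derived from the cluster centroids. The sketch below shows one possible way to do that, following the characteristics listed for each cluster. The function name `name_clusters` and the assumed centroid layout `(foreign_pct, max_state_share, n_states)` are illustrative and not taken from the project code.

```python
from typing import Dict, Sequence


def name_clusters(centroids: Sequence[Sequence[float]]) -> Dict[int, str]:
    """Map cluster ids to scope labels.

    centroids[i] is assumed to be (foreign_pct, max_state_share, n_states)
    for cluster i.
    """
    if len(centroids) != 3:
        raise ValueError("expected exactly three clusters")
    ids = list(range(3))
    # Highest foreign share is treated as International.
    international = max(ids, key=lambda i: centroids[i][0])
    ids.remove(international)
    # Of the two remaining, the more state-concentrated one is Local/Regional.
    local = max(ids, key=lambda i: centroids[i][1])
    ids.remove(local)
    national = ids[0]
    return {local: "Local/Regional", international: "International", national: "National"}
```

For example, centroids of `(0.02, 0.9, 2)`, `(0.85, 0.5, 6)`, and `(0.03, 0.12, 40)` for clusters 0, 1, and 2 would reproduce the mapping listed above, and the same centroids in a different order would yield the same labels attached to different cluster numbers.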

# 4. List of approaches that did not work

# 5. Plan for Deploy, Scale, Archive, and Expose-data steps

* Easily extendable to additional years and EINs
* Output is an assigned geographic scope (local, national, or international) for each EIN per tax year

# 6. Future work

* Incorporate Schedule F data once available
* Potentially explore labeling and supervised learning using curated examples rather than clustering
* Review k-means clustering approach and decide if it should be scaled or if a definitive feature-threshold approach should be used instead to define the 3 clusters