---
Developed by: GivingTuesday Data Commons
license: apache-2.0
Model Type: Classifier (BERT)
language:
- en
---

**Technical Specifications Document**: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)

**Description**: This segmentation classifies funding organizations by geographic scope, based on their grantmaking behavior, into three categories:

- Local/Regional: Orgs operating primarily in one state or a few localities
- National: Orgs operating across many U.S. states
- International: Orgs with notable international grantmaking activity

------------------------------------------------------------------------------------------------------------------------------------------------------

# Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation

**Author**: Edward Moore \- GivingTuesday Data Commons

**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.

# 1\. Approach

## Definitions

This segmentation classifies funding organizations by geographic scope, based on their grantmaking behavior, into three categories:

1. Local/Regional: Orgs operating primarily in one state or a few localities
2. National: Orgs operating across many U.S. states
3. International: Orgs with notable international grantmaking activity

## Variables

The variables that may be used in the classification approach include:

* FILERNAME1
* FILEREIN
* TAXYEAR
* FILERUSSTATE (990PF Basic Fields Data Mart: Header A \- Charity Location \- Domestic State)
* SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Zip or Postal Code)
* SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Country)
* SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address State or Province)
* SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A \- Col E \- Row 2 \- Amount)
* SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A \- Col C \- Row 2 \- Foundation Status of Recipient)
* SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address City)
* RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- State)
* RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D \- Amount of Cash)
* RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- City)
* Schedule F Variables TBD

## Approach

1. Identify funding orgs from Form 990 and 990PF (Part XIV-3a), including Schedules I and F
2. Standardize and enrich location data using state/country lookups and geocoding via Geopy
3. Calculate three features per org: foreign grant percentage, max state concentration, and number of distinct U.S. recipient states, plus a composite score combining all three
4. Apply KMeans to group orgs into 3 clusters reflecting geographic scope **OR** choose thresholds for the features (foreign grant percentage, max state concentration, number of states) to define the 3 geographic categories (local, national, international)

# 2\. Code documentation

## Notebooks

1. Geopy Lookups
   [Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865)
   This notebook runs geolocation lookups for missing or ambiguous country/state data.
2. 990PF Cleaning
   [990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865)
   This notebook cleans and standardizes grants data from Form 990-PF.
3. 990PF Aggregation
   [990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865)
   This notebook aggregates grant data by state and calculates key features.
4. 990 Aggregation
   [990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865)
   This notebook performs similar aggregations for Form 990 Schedule I and F filers.
5. Classification
   [Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865)
   This notebook combines datasets, scales features, and runs KMeans clustering.

# 3\. Outputs and Results

## K-means Cluster Definitions

* Cluster 0: Local/Regional
  * High concentration in one state
  * Low foreign activity
* Cluster 1: International
  * High percentage of international grants
* Cluster 2: National
  * Wide U.S. state distribution
  * Moderate to low international share

# 4\. List of approaches that did not work

# 5\. Plan for Deploy, Scale, Archive, and Expose-data steps

* The pipeline can be extended to additional tax years and EINs
* Output is an assigned geographic scope (local, national, or international) for each EIN per tax year

# 6\. Future work

* Incorporate Schedule F data once available
* Potentially explore labeling and supervised learning using curated examples rather than clustering
* Review the k-means clustering approach and decide whether to scale it or to replace it with a fixed feature-threshold approach for defining the 3 categories
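
The feature-threshold alternative mentioned above could look roughly like the sketch below. It is an illustration only, not the project's implementation: the function names, the `(country, state, amount)` record layout, the use of `"US"` as the domestic country code, the choice to weight shares by dollar amount rather than grant count, and all three cut-off values are assumptions made for this example. The source does not specify any thresholds.

```python
from collections import defaultdict

# Hypothetical cut-offs for illustration only; the project has not chosen thresholds.
FOREIGN_PCT_MIN = 0.25      # share of dollars abroad at or above this -> International
MAX_STATE_SHARE_MIN = 0.60  # share of U.S. dollars in one state at or above this -> Local/Regional
N_STATES_MAX_LOCAL = 5      # this many recipient states or fewer -> Local/Regional


def grant_features(grants):
    """Compute the three features for one funder and tax year.

    `grants` is an iterable of (country, state, amount) tuples; this layout
    is an assumption and does not mirror the Data Mart column names.
    """
    foreign = 0.0
    by_state = defaultdict(float)
    for country, state, amount in grants:
        if country == "US":
            by_state[state] += amount
        else:
            foreign += amount
    domestic = sum(by_state.values())
    total = domestic + foreign
    return {
        # Share of all grant dollars sent to non-U.S. recipients.
        "foreign_pct": foreign / total if total else 0.0,
        # Largest single-state share of U.S. grant dollars.
        "max_state_share": max(by_state.values()) / domestic if domestic else 0.0,
        # Number of distinct U.S. recipient states.
        "n_states": len(by_state),
    }


def classify_scope(features):
    """Map the three features to a geographic scope using the assumed cut-offs."""
    if features["foreign_pct"] >= FOREIGN_PCT_MIN:
        return "International"
    if (features["max_state_share"] >= MAX_STATE_SHARE_MIN
            or features["n_states"] <= N_STATES_MAX_LOCAL):
        return "Local/Regional"
    return "National"


# Made-up grant records, not real filing data.
example_grants = [("US", "NY", 600.0), ("US", "NY", 200.0), ("US", "NJ", 200.0)]
print(classify_scope(grant_features(example_grants)))
```

Checking the international rule first means a funder with substantial foreign grantmaking is labeled International even if its domestic grants are concentrated in one state. Whether that ordering is appropriate is a design decision the threshold review would need to settle.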