---
Developed by: GivingTuesday Data Commons
license: apache-2.0
Model Type: Classifier (BERT)
language:
- en
---

**Technical Specifications Document**: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)



**Description**: 
This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
- Local/Regional: Orgs operating primarily in one state or a few localities
- National: Orgs operating across many U.S. states
- International: Orgs with notable international grantmaking activity

------------------------------------------------------------------------------------------------------------------------------------------------------

# Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation

**Author**: Edward Moore \- GivingTuesday Data Commons   
**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.

# 1\. Approach

## Definitions

This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:

1. Local/Regional: Orgs operating primarily in one state or a few localities  
2. National: Orgs operating across many U.S. states  
3. International: Orgs with notable international grantmaking activity

## Variables

The variables that may be used in the classification approach include:

* FILERNAME1   
* FILEREIN   
* TAXYEAR  
* FILERUSSTATE (990PF Basic Fields Data Mart: Header A \- Charity Location \- Domestic State)  
* SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Zip or Postal Code)  
* SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Country)  
* SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address State or Province)  
* SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A \- Col E \- Row 2 \- Amount)  
* SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A \- Col C \- Row 2 \- Foundation Status of Recipient)  
* SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address City)  
* RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- State)  
* RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D \- Amount of Cash)  
* RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- City)  
* Schedule F Variables TBD

## Approach

1. Identify funding orgs from Form 990 (Schedules I and F) and Form 990PF (Part XIV-3a)  
2. Standardize and enrich location data using state/country lookups and geocoding via Geopy  
3. Calculate foreign grant percentage, max state concentration, number of distinct U.S. recipient states, and a composite score combining all three features  
4. Apply KMeans to group orgs into 3 clusters reflecting geographic scope  
   **OR**  
   Choose thresholds for the features (foreign grant percentage, max state concentration, number of states) to define the 3 geographic categories (local, national, international)
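
Steps 3 and 4 (threshold variant) can be sketched as follows. This is a minimal illustration, not the project's notebook code: the row keys (`state`, `country`, `amount`), the dollar-weighted definitions of the features, and all three cutoff values are assumptions chosen for the example. The composite score is omitted because its formula is not given in this document.

```python
from collections import defaultdict


def geographic_features(grants):
    """Compute geographic-scope features for one filer and tax year.

    `grants` is an iterable of dicts with hypothetical keys
    'state', 'country' and 'amount' (already standardized).
    Shares are dollar-weighted, which is an assumption.
    """
    total = 0.0
    foreign = 0.0
    by_state = defaultdict(float)
    for g in grants:
        amount = float(g["amount"])
        total += amount
        if g["country"] != "US":
            foreign += amount
        elif g["state"]:
            by_state[g["state"]] += amount
    if total <= 0:
        return None  # no usable grant dollars for this filer
    domestic = sum(by_state.values())
    return {
        "foreign_pct": foreign / total,
        "max_state_share": max(by_state.values()) / domestic if domestic > 0 else 0.0,
        "n_states": len(by_state),
    }


def classify_by_threshold(f, foreign_cut=0.25, state_cut=0.75, min_states=10):
    """Assign a category from fixed cutoffs (placeholder values, not tuned)."""
    if f["foreign_pct"] >= foreign_cut:
        return "International"
    if f["max_state_share"] >= state_cut or f["n_states"] < min_states:
        return "Local/Regional"
    return "National"
```

Checking the foreign share first means an org with substantial international grantmaking is labeled International regardless of how its domestic grants are distributed; a different precedence is equally possible.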

# 2\. Code documentation

## Notebooks

1. Geopy Lookups  
   [Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865)  
   This notebook runs geolocation lookups for missing or ambiguous country/state data.  
2. 990PF Cleaning  
   [990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865)  
   This notebook cleans and standardizes grants data from Form 990-PF.  
3. 990PF Aggregation  
   [990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865)  
   This notebook aggregates grant data by state and calculates key features.  
4. 990 Aggregation  
   [990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865)  
   This notebook performs similar aggregations for Form 990 Schedule I and F filers.  
5. Classification  
   [Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865)  
   This notebook combines datasets, scales features, and runs KMeans clustering.
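
The lookup-then-geocode pattern behind the Geopy Lookups and cleaning notebooks might look like the sketch below. It is an illustration under stated assumptions, not the notebook code: the function name, the tiny `COUNTRY_ALIASES` table, and the `geocode` callable (which takes a free-text query and returns an ISO country code or `None`) are all hypothetical.

```python
# Two-letter codes for the 50 U.S. states plus DC.
US_STATES = set(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN "
    "MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA "
    "WV WI WY".split()
)

# Illustrative alias table only; a real lookup would be far larger.
COUNTRY_ALIASES = {
    "us": "US", "usa": "US", "united states": "US",
    "kenya": "KE", "united kingdom": "GB",
}


def standardize_location(state, country, geocode=None):
    """Return (state_code, country_code); either may be None if unresolved."""
    state_code = (state or "").strip().upper()
    country_key = (country or "").strip().lower()
    country_code = COUNTRY_ALIASES.get(country_key)

    # Assumption: a valid U.S. state code with no country given is domestic.
    if country_code is None and not country_key and state_code in US_STATES:
        country_code = "US"

    # Call the geocoder only when the lookup tables could not resolve a country.
    if country_code is None and geocode is not None:
        query = ", ".join(p for p in (state, country) if p)
        country_code = geocode(query) if query else None

    # Keep a state code only for recognized U.S. states of domestic recipients.
    if country_code != "US" or state_code not in US_STATES:
        state_code = None
    return state_code, country_code
```

Passing the geocoder in as a callable keeps network calls out of the lookup path and lets results be cached or mocked. With geopy, one possible `geocode` wrapper would call a geocoder's `geocode(query, addressdetails=True)` and read the country code from the returned address details; which geopy backend the notebook uses is not stated here.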

# 3\. Outputs and Results

## K-means Cluster Definitions

* Cluster 0: Local/Regional  
  * High concentration in one state  
  * Low foreign activity  
* Cluster 1: International  
  * High percentage of international grants  
* Cluster 2: National  
  * Wide U.S. state distribution  
  * Moderate to low international share
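
KMeans assigns cluster indices arbitrarily, so the numbering above can change between runs or when data is added. The sketch below shows one way to scale the three features, fit KMeans, and attach category names by inspecting the centroids instead of relying on fixed indices. The function name, the column order, and the centroid rules are assumptions for illustration and are not taken from the Classification notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_and_label(features, random_state=0):
    """Cluster orgs into 3 groups and name each group from its centroid.

    `features` has shape (n_orgs, 3) with assumed column order:
    foreign_pct, max_state_share, n_states.
    """
    X = np.asarray(features, dtype=float)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # put all features on a comparable scale

    km = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X_scaled)

    # Convert centroids back to original units so they are easy to interpret.
    centers = scaler.inverse_transform(km.cluster_centers_)

    # Highest foreign share -> International; of the remaining two clusters,
    # the one with the higher single-state concentration -> Local/Regional.
    intl = int(np.argmax(centers[:, 0]))
    remaining = [c for c in range(3) if c != intl]
    local = max(remaining, key=lambda c: centers[c, 1])
    national = next(c for c in remaining if c != local)

    names = {intl: "International", local: "Local/Regional", national: "National"}
    return [names[int(c)] for c in km.labels_], centers
```

Returning the centroids alongside the labels makes it possible to check that each cluster actually matches its description before the labels are published.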

# 4\. List of approaches that did not work

# 5\. Plan for Deploy, Scale, Archive, and Expose-data steps

* The pipeline can be extended to additional tax years and EINs  
* Output is an assigned geographic scope (local, national, or international) for each EIN per tax year

# 6\. Future work

* Incorporate Schedule F data once available  
* Potentially explore labeling and supervised learning using curated examples rather than clustering  
* Review the K-means clustering approach and decide whether to scale it up or to replace it with fixed feature thresholds that define the 3 categories