Update README.md
README.md
CHANGED
|
@@ -4,9 +4,114 @@ license: apache-2.0
| 4 |
Model Type: Classifier (BERT)
|
| 5 |
---
|
| 6 |
|
| 7 |
+
**Technical Specifications Document**: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
|
| 11 |
**Description**:
|
| 12 |
This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
|
| 13 |
- Local/Regional: Orgs operating primarily in one state or a few localities
|
| 14 |
- National: Orgs operating across many U.S. states
|
| 15 |
- International: Orgs with notable international grantmaking activity
|
| 16 |
|
| 17 |
+
------------------------------------------------------------------------------------------------------------------------------------------------------
|
| 18 |
+
|
| 19 |
+
# Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation
|
| 20 |
+
|
| 21 |
+
**Author**: Edward Moore \- GivingTuesday Data Commons
|
| 22 |
+
**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.
|
| 23 |
+
|
| 24 |
+
# 1\. Approach
|
| 25 |
+
|
| 26 |
+
## Definitions
|
| 27 |
+
|
| 28 |
+
The concept grew out of this Slack message from Marc:
|
| 29 |
+
*We (Marc, Annie, Ali) had a good check-in about nailing down the geographical stuff for this project. Because Annie's first and second attempts at classifying orgs as local / state / regional / national / international were not very successful, we simplified our definition (it still meets the needs of our own team's segmentation) and will redo the classifier on a larger set of narrower organizations. Namely:*
|
| 30 |
+
*local: IF an org is registered in only one state, we'll train a batch and:*
|
| 31 |
+
*determine city/county vs statewide*
|
| 32 |
+
*If org is registered in 10+ states, we'll treat as regional/national*
|
| 33 |
+
*If an org has a foreign office, we'll treat as international*
|
| 34 |
+
*anything in between, or orgs with a regional/branch office outside the US will be excluded from our "state/local" training set.*
|
| 35 |
+
*We don't try to determine if an org is in multiple counties but not the whole state, or is "regional" like one part of the US but not the whole US, because our data is noisiest for these. Placenames in text are used for many other things besides "this is where we work". Final geo-scope in our API output will be: city/county, state (or sub-state), regional/national, international. These are probably the ones in which we can be most confident. Grantmakers follow different logic, where we will use the locations of their grantees to define their scope. It might make more sense to just report a list of states where they've made grants instead of "regional/national" for these.*
|
| 36 |
+
|
| 37 |
+
This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
|
| 38 |
+
|
| 39 |
+
1. Local/Regional: Orgs operating primarily in one state or a few localities
|
| 40 |
+
2. National: Orgs operating across many U.S. states
|
| 41 |
+
3. International: Orgs with notable international grantmaking activity
|
| 42 |
+
|
| 43 |
+
## Variables
|
| 44 |
+
|
| 45 |
+
The variables that may be used in the classification approach include:
|
| 46 |
+
|
| 47 |
+
* FILERNAME1
|
| 48 |
+
* FILEREIN
|
| 49 |
+
* TAXYEAR
|
| 50 |
+
* FILERUSSTATE (990PF Basic Fields Data Mart: Header A \- Charity Location \- Domestic State)
|
| 51 |
+
* SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Zip or Postal Code)
|
| 52 |
+
* SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Country)
|
| 53 |
+
* SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address State or Province)
|
| 54 |
+
* SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A \- Col E \- Row 2 \- Amount)
|
| 55 |
+
* SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A \- Col C \- Row 2 \- Foundation Status of Recipient)
|
| 56 |
+
* SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address City)
|
| 57 |
+
* RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- State)
|
| 58 |
+
* RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D \- Amount of Cash)
|
| 59 |
+
* RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- City)
|
| 60 |
+
* Schedule F Variables TBD
|
| 61 |
+
|
| 62 |
+
## Approach
|
| 63 |
+
|
| 64 |
+
1. Identify funding orgs from Form 990 and 990PF (Part XIV-3a), including Schedules I and F
|
| 65 |
+
2. Standardize and enrich location data using state/country lookups and geocoding via Geopy
|
| 66 |
+
3. Calculate foreign grant percentage, max state concentration, number of distinct U.S. recipient states, and a composite score combining all three features
|
| 67 |
+
4. Apply KMeans to group orgs into 3 clusters reflecting geographic scope
|
| 68 |
+
**OR**
|
| 69 |
+
Choose thresholds for the features (foreign grant percentage, max state concentration, number of states) to define the 3 geographic categories (local, national, international)
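The feature step and the threshold alternative above can be sketched in plain Python. This is a minimal illustration, not the notebooks' code. It assumes each grant has already been cleaned into a tuple of `(ein, tax_year, country, state, amount)` and that shares are weighted by dollar amount rather than grant count; the source does not say which. The function names and the three cutoff values are hypothetical placeholders, and the composite score is left out because its formula is not given here.

```python
from collections import defaultdict


def grant_features(grants):
    """Aggregate grants into per-(EIN, tax year) features.

    grants: iterable of (ein, tax_year, country, state, amount) tuples.
    Assumption: country is "US" for domestic recipients.
    """
    by_org = defaultdict(list)
    for ein, year, country, state, amount in grants:
        by_org[(ein, year)].append((country, state, amount))

    features = {}
    for key, rows in by_org.items():
        total = sum(amount for _, _, amount in rows)
        if total <= 0:
            continue  # no usable grant dollars for this filing
        foreign = sum(amount for country, _, amount in rows if country != "US")
        by_state = defaultdict(float)
        for country, state, amount in rows:
            if country == "US" and state:
                by_state[state] += amount
        domestic = sum(by_state.values())
        features[key] = {
            "foreign_pct": foreign / total,
            "max_state_share": max(by_state.values()) / domestic if domestic else 0.0,
            "n_states": len(by_state),
        }
    return features


def classify_by_threshold(f, foreign_cutoff=0.25, state_share_cutoff=0.75, min_states=10):
    """Rule-based alternative to clustering; all cutoffs are illustrative only."""
    if f["foreign_pct"] >= foreign_cutoff:
        return "International"
    if f["n_states"] >= min_states and f["max_state_share"] < state_share_cutoff:
        return "National"
    return "Local/Regional"


# Toy data with made-up EINs, for illustration only.
sample_grants = [
    ("11-1111111", 2022, "US", "OH", 900.0),
    ("11-1111111", 2022, "US", "KY", 100.0),
    ("22-2222222", 2022, "KE", "", 600.0),
    ("22-2222222", 2022, "US", "NY", 400.0),
]
sample_features = grant_features(sample_grants)
sample_scopes = {key: classify_by_threshold(f) for key, f in sample_features.items()}
```

Note that this rule set sends every org that is neither International nor National to Local/Regional, including the "in between" cases; a production version might flag those separately.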
|
| 70 |
+
|
| 71 |
+
# 2\. Code documentation
|
| 72 |
+
|
| 73 |
+
## Notebooks
|
| 74 |
+
|
| 75 |
+
1. Geopy Lookups
|
| 76 |
+
[Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865)
|
| 77 |
+
This notebook runs geolocation lookups for missing or ambiguous country/state data.
|
| 78 |
+
2. 990PF Cleaning
|
| 79 |
+
[990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865)
|
| 80 |
+
This notebook cleans and standardizes grants data from Form 990-PF.
|
| 81 |
+
3. 990PF Aggregation
|
| 82 |
+
[990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865)
|
| 83 |
+
This notebook aggregates grant data by state and calculates key features.
|
| 84 |
+
4. 990 Aggregation
|
| 85 |
+
[990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865)
|
| 86 |
+
This notebook performs similar aggregations for Form 990 Schedule I and F filers.
|
| 87 |
+
5. Classification
|
| 88 |
+
[Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865)
|
| 89 |
+
This notebook combines datasets, scales features, and runs KMeans clustering.
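For readers without access to the Classification notebook, the scale-then-cluster step can be illustrated with a dependency-free stand-in. The sketch below is not the notebook's code: it replaces whatever library calls the notebook uses with a hand-written z-score scaler and plain Lloyd's k-means, and the feature rows and starting centroids are invented for illustration. The functions `standardize` and `kmeans` are hypothetical names.

```python
import math
import statistics


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(col) for col in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [statistics.fmean(col) for col in zip(*members)]
    return labels, centroids


# Columns: foreign_pct, max_state_share, n_states (made-up values).
rows = [
    [0.00, 0.95, 2],
    [0.02, 1.00, 1],
    [0.70, 0.60, 3],
    [0.85, 0.50, 4],
    [0.05, 0.15, 30],
    [0.10, 0.12, 42],
]
scaled = standardize(rows)
labels, centroids = kmeans(scaled, init_centroids=[scaled[0], scaled[2], scaled[4]])
```

Scaling matters here because `n_states` ranges into the tens while the two share features stay between 0 and 1; without it, the state count would drive nearly all of the distance.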
|
| 90 |
+
|
| 91 |
+
# 3\. Outputs and Results
|
| 92 |
+
|
| 93 |
+
## K-means Cluster Definitions
|
| 94 |
+
|
| 95 |
+
* Cluster 0: Local/Regional
|
| 96 |
+
* High concentration in one state
|
| 97 |
+
* Low foreign activity
|
| 98 |
+
* Cluster 1: International
|
| 99 |
+
* High percentage of international grants
|
| 100 |
+
* Cluster 2: National
|
| 101 |
+
* Wide U.S. state distribution
|
| 102 |
+
* Moderate to low international share
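Cluster indices from k-means generally depend on initialization, so the 0/1/2 numbering above may not hold on a rerun. One way to keep the labels stable is to name each cluster from its centroid rather than its index. The sketch below is one possible rule, not the project's method: it assumes centroids are expressed in original (unscaled) feature units as `(foreign_pct, max_state_share, n_states)`, and the `label_clusters` helper is hypothetical.

```python
def label_clusters(centroids):
    """Map cluster ids to scope names using centroid features.

    centroids: {cluster_id: (foreign_pct, max_state_share, n_states)}
    """
    if len(centroids) != 3:
        raise ValueError("expected exactly three clusters")
    remaining = dict(centroids)
    # Highest foreign share is taken to be the International cluster.
    international = max(remaining, key=lambda c: remaining[c][0])
    del remaining[international]
    # Of the other two, the wider state spread is taken to be National.
    national = max(remaining, key=lambda c: remaining[c][2])
    del remaining[national]
    (local,) = remaining
    return {
        international: "International",
        national: "National",
        local: "Local/Regional",
    }


# Made-up centroids that happen to match the numbering listed above.
example = label_clusters({
    0: (0.01, 0.97, 1.5),
    1: (0.78, 0.55, 3.5),
    2: (0.08, 0.14, 36.0),
})
```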
|
| 103 |
+
|
| 104 |
+
# 4\. List of approaches that did not work
|
| 105 |
+
|
| 106 |
+
# 5\. Plan for Deploy, Scale, Archive, and Expose-data steps
|
| 107 |
+
|
| 108 |
+
* The pipeline should extend easily to additional tax years and EINs
|
| 109 |
+
* Output is an assigned geographic scope (local, national, or international) for each EIN per tax year
|
| 110 |
+
|
| 111 |
+
# 6\. Future work
|
| 112 |
+
|
| 113 |
+
* Incorporate Schedule F data once available
|
| 114 |
+
* Potentially explore labeling and supervised learning using curated examples rather than clustering
|
| 115 |
+
* Review k-means clustering approach and decide if it should be scaled or if a definitive feature-threshold approach should be used instead to define the 3 clusters
|
| 116 |
+
|
| 117 |
+
|