hassaanulhaq01 committed on
Commit
03e5447
·
verified ·
1 Parent(s): 8930c1a

Update README.md

  1. README.md +105 -0
README.md CHANGED
@@ -4,9 +4,114 @@ license: apache-2.0
4
  Model Type: Classifier (BERT)
5
  ---
6
 
7
+ **Technical Specifications Document**: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)
8
+
9
+
10
+
11
  **Description**:
12
  This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
13
  - Local/Regional: Orgs operating primarily in one state or a few localities
14
  - National: Orgs operating across many U.S. states
15
  - International: Orgs with notable international grantmaking activity
16
 
17
+ ------------------------------------------------------------------------------------------------------------------------------------------------------
18
+
19
+ # Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation
20
+
21
+ **Author**: Edward Moore \- GivingTuesday Data Commons
22
+ **Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.
23
+
24
+ # 1\. Approach
25
+
26
+ ## Definitions
27
+
28
+ The concept grew out of this Slack message from Marc:
29
+ *We (Marc, Annie, Ali) had a good checkin about nailing down the geographical stuff for this project. Because Annie's first and second attempt at classifying orgs as local / state / regional / national / international was not very successful, we simplified our definition (but still meets the needs of our own team's segmentation) and will redo the classifier on a larger set of narrower organizations. Namely:*
30
+ *local: IF an org is registered in only one state, we'll train a batch and:*
31
+ *determine city/county vs statewide*
32
+ *If org is registered in 10+ states, we'll treat as regional/national*
33
+ *If an org has a foreign office, we'll treat as international*
34
+ *anything in between, or orgs with a regional/branch office outside the US will be excluded from our "state/local" training set.*
35
+ *We don't try to determine if an org is in multiple counties but not the whole state, or is "regional" like one part of the US but not the whole US, because our data is noisiest for these. Placenames in text are used for many other things besides "this is where we work". Final geo-scope in our API output will be: city/county, state (or sub-state), regional/national, international. These are probably the ones in which we can be most confident. Grantmakers: follow different logic, where we will use the locations of their grantees to define their scope. It might make more sense to just report a list of states where they've made grants instead of "regional/national" for these.*
36
+
37
+ This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
38
+
39
+ 1. Local/Regional: Orgs operating primarily in one state or a few localities
40
+ 2. National: Orgs operating across many U.S. states
41
+ 3. International: Orgs with notable international grantmaking activity
42
+
43
+ ## Variables
44
+
45
+ The variables that may be used in the classification approach include:
46
+
47
+ * FILERNAME1
48
+ * FILEREIN
49
+ * TAXYEAR
50
+ * FILERUSSTATE (990PF Basic Fields Data Mart: Header A \- Charity Location \- Domestic State)
51
+ * SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Zip or Postal Code)
52
+ * SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Country)
53
+ * SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address State or Province)
54
+ * SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A \- Col E \- Row 2 \- Amount)
55
+ * SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A \- Col C \- Row 2 \- Foundation Status of Recipient)
56
+ * SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address City)
57
+ * RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- State)
58
+ * RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D \- Amount of Cash)
59
+ * RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- City)
60
+ * Schedule F Variables TBD
61
+
62
+ ## Approach
63
+
64
+ 1. Identify funding orgs from Form 990 and 990PF (Part XIV-3a), including Schedules I and F
65
+ 2. Standardize and enrich location data using state/country lookups and geocoding via Geopy
66
+ 3. Calculate foreign grant percentage, max state concentration, number of distinct U.S. recipient states, and a composite score combining all three features
67
+ 4. Apply KMeans to group orgs into 3 clusters reflecting geographic scope
68
+ **OR**
69
+ Choose thresholds for the features (foreign grant percentage, max state concentration, number of states) to define the 3 geographic categories (local, national, international)
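
The feature step and the threshold alternative above can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the row keys (`ein`, `tax_year`, `state`, `country`, `amount`), the function names, and every cut-off value are hypothetical placeholders, and the shares are assumed to be dollar-weighted over all grants. The composite score is left out because the document does not define it.

```python
from collections import defaultdict


def grant_features(grants):
    """Compute per-(EIN, tax year) geographic features from grant rows.

    Each row is a dict with hypothetical keys: ein, tax_year, state
    (two-letter U.S. state, or None for a foreign recipient), country, amount.
    """
    totals = defaultdict(float)                          # all grant dollars
    foreign = defaultdict(float)                         # dollars sent abroad
    by_state = defaultdict(lambda: defaultdict(float))   # dollars per U.S. state
    for g in grants:
        key = (g["ein"], g["tax_year"])
        totals[key] += g["amount"]
        if g["country"] != "US":
            foreign[key] += g["amount"]
        else:
            by_state[key][g["state"]] += g["amount"]

    features = {}
    for key, total in totals.items():
        if total <= 0:
            continue  # skip filer-years with no positive grant dollars
        states = by_state[key]
        features[key] = {
            "foreign_pct": foreign[key] / total,
            "max_state_share": max(states.values(), default=0.0) / total,
            "n_states": len(states),
        }
    return features


def classify_scope(f, foreign_cut=0.25, state_cut=0.75, min_states=10):
    """Threshold rule; the three cut-offs are illustrative, not tuned values."""
    if f["foreign_pct"] >= foreign_cut:
        return "International"
    if f["max_state_share"] >= state_cut or f["n_states"] < min_states:
        return "Local/Regional"
    return "National"


example = [
    {"ein": "11", "tax_year": 2022, "state": "OH", "country": "US", "amount": 900.0},
    {"ein": "11", "tax_year": 2022, "state": "KY", "country": "US", "amount": 100.0},
    {"ein": "22", "tax_year": 2022, "state": None, "country": "KE", "amount": 600.0},
    {"ein": "22", "tax_year": 2022, "state": "NY", "country": "US", "amount": 400.0},
]
scopes = {k: classify_scope(f) for k, f in grant_features(example).items()}
```

Checking the international share first means a funder with large foreign grantmaking is labeled International even if its domestic grants are concentrated in one state; a different ordering would be equally defensible.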
70
+
71
+ # 2\. Code documentation
72
+
73
+ ## Notebooks
74
+
75
+ 1. Geopy Lookups
76
+ [Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865)
77
+ This notebook runs geolocation lookups for missing or ambiguous country/state data.
78
+ 2. 990PF Cleaning
79
+ [990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865)
80
+ This notebook cleans and standardizes grants data from Form 990-PF.
81
+ 3. 990PF Aggregation
82
+ [990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865)
83
+ This notebook aggregates grant data by state and calculates key features.
84
+ 4. 990 Aggregation
85
+ [990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865)
86
+ This notebook performs similar aggregations for Form 990 Schedule I and F filers.
87
+ 5. Classification
88
+ [Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865)
89
+ This notebook combines datasets, scales features, and runs KMeans clustering.
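
To illustrate what "scales features, and runs KMeans" involves, here is a dependency-free sketch: z-score scaling followed by Lloyd's algorithm, the iteration behind k-means. It is a stand-in, not the Classification notebook's code, which presumably calls a library implementation. The function names, the toy feature rows, and the choice to pass initial centroids explicitly (for reproducibility) are all assumptions.

```python
import math


def standardize(rows):
    """Z-score each column so n_states does not dominate the 0-1 share features."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def kmeans(points, init_centroids, n_iter=100):
    """Lloyd's algorithm: assign to nearest centroid, recompute means, repeat."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy rows: [foreign_pct, max_state_share, n_states], made up for illustration.
X = [
    [0.00, 0.95, 1],
    [0.02, 0.90, 2],
    [0.80, 0.15, 1],
    [0.70, 0.20, 2],
    [0.05, 0.10, 30],
    [0.03, 0.12, 25],
]
Z = standardize(X)
labels, centroids = kmeans(Z, init_centroids=[Z[0], Z[2], Z[4]])
```

Scaling matters here because the number of recipient states can run into the dozens while the two share features stay between 0 and 1; without it, the distance computation would be driven almost entirely by the state count.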
90
+
91
+ # 3\. Outputs and Results
92
+
93
+ ## K-means Cluster Definitions
94
+
95
+ * Cluster 0: Local/Regional
96
+ * High concentration in one state
97
+ * Low foreign activity
98
+ * Cluster 1: International
99
+ * High percentage of international grants
100
+ * Cluster 2: National
101
+ * Wide U.S. state distribution
102
+ * Moderate to low international share
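
Cluster ids produced by k-means are arbitrary and can change between runs, so the 0/1/2 numbering above holds only for the run it was read from. One hedged way to assign names reproducibly is to inspect the centroid feature means, following the definitions in the list above. The function name and the feature keys below are hypothetical, and the sketch assumes exactly three clusters with centroids expressed in original (unscaled) units.

```python
def name_clusters(centroids):
    """Map arbitrary cluster ids to scope labels from centroid feature means.

    centroids: {cluster_id: {"foreign_pct": ..., "max_state_share": ...}}
    Assumes exactly three clusters.
    """
    remaining = dict(centroids)
    # Highest international share -> International.
    intl = max(remaining, key=lambda c: remaining[c]["foreign_pct"])
    del remaining[intl]
    # Of the other two, highest single-state concentration -> Local/Regional.
    local = max(remaining, key=lambda c: remaining[c]["max_state_share"])
    del remaining[local]
    # Whatever is left spreads grants widely across U.S. states -> National.
    (national,) = remaining
    return {intl: "International", local: "Local/Regional", national: "National"}


# Made-up centroid values for illustration only.
cluster_names = name_clusters({
    0: {"foreign_pct": 0.01, "max_state_share": 0.92},
    1: {"foreign_pct": 0.75, "max_state_share": 0.18},
    2: {"foreign_pct": 0.04, "max_state_share": 0.11},
})
```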
103
+
104
+ # 4\. List of approaches that did not work
105
+
106
+ # 5\. Plan for Deploy, Scale, Archive, and Expose-data steps
107
+
108
+ * Easily extendable to additional years and EINs
109
+ * Output is an assigned geographic scope (local, national, or international) for each EIN per tax year
110
+
111
+ # 6\. Future work
112
+
113
+ * Incorporate Schedule F data once available
114
+ * Potentially explore labeling and supervised learning using curated examples rather than clustering
115
+ * Review k-means clustering approach and decide if it should be scaled or if a definitive feature-threshold approach should be used instead to define the 3 clusters
116
+
117
+