	---
	Developed by: GivingTuesday Data Commons
	license: apache-2.0
	Model Type: Classifier (BERT)
	language:
	- en
	---

	Technical Specifications Document: [Technical Specifications Link](https://docs.google.com/document/d/1cWLKdOmLH0-13HCLXtObx-WEAMKLClckGIYbD4NzgtU/edit?usp=sharing)



	Description:
	This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:
	- Local/Regional: Orgs operating primarily in one state or a few localities
	- National: Orgs operating across many U.S. states
	- International: Orgs with notable international grantmaking activity

	------------------------------------------------------------------------------------------------------------------------------------------------------

	# Non-Profit Mapping Project Documentation: Local vs National vs International Segmentation

	Author: Edward Moore \- GivingTuesday Data Commons

	Note for external readers: Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.

	# 1\. Approach

	## Definitions

	This segmentation aims to classify funding organizations by geographic scope based on their grantmaking behavior into three categories:

	1. Local/Regional: Orgs operating primarily in one state or a few localities
	2. National: Orgs operating across many U.S. states
	3. International: Orgs with notable international grantmaking activity

	## Variables

	The variables that may be used in the classification approach include:

	* FILERNAME1
	* FILEREIN
	* TAXYEAR
	* FILERUSSTATE (990PF Basic Fields Data Mart: Header A \- Charity Location \- Domestic State)
	* SIGOCPYRFAPC (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Zip or Postal Code)
	* SIGOCAFFRFACO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address Country)
	* SIGOCPYRFAPO (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address State or Province)
	* SIGOCPYAMOUN (990PF Current Grants Data Mart: Part 15 Table 3A \- Col E \- Row 2 \- Amount)
	* SIGOCPYRFSTA (990PF Current Grants Data Mart: Part 15 Table 3A \- Col C \- Row 2 \- Foundation Status of Recipient)
	* SIGOCPYRFACI (990PF Current Grants Data Mart: Part 15 Table 3A \- Col A \- Row 2 \- Address City)
	* RECTABADDSTA (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- State)
	* RETAAMOFCAGR (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col D \- Amount of Cash)
	* RECTABADDCIT (990 Schedule I Grants Domestic Orgs & Gov Data Mart: Schedule I Part 2 Table Col A Row 2 \- Address US \- City)
	* Schedule F Variables TBD

	## Approach

	1. Identify funding orgs from Form 990 and 990PF (Part XIV-3a), including Schedules I and F
	2. Standardize and enrich location data using state/country lookups and geocoding via Geopy
	3. Calculate foreign grant percentage, max state concentration, number of distinct U.S. recipient states, and a composite score combining all three features
	4. Assign each org to one of the 3 geographic categories (local, national, international) in one of two ways:
	    * Apply KMeans to group orgs into 3 clusters reflecting geographic scope, or
	    * Choose thresholds for the features (foreign grant percentage, max state concentration, number of states)
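
Steps 3 and 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's notebook code: the function names, the `(state, country, amount)` input shape, the choice to measure shares in dollars, and every threshold value are hypothetical. The composite score is left out because the source does not define it.

```python
from collections import defaultdict

# Hypothetical cut-offs for illustration only; the source does not give values.
FOREIGN_SHARE_MIN = 0.25    # share of grant dollars sent abroad
TOP_STATE_SHARE_MIN = 0.60  # share of domestic dollars in the top state
MAX_LOCAL_STATES = 5        # "a few" recipient states


def grant_features(grants):
    """Compute features for one EIN and tax year.

    grants: iterable of (state, country, amount) tuples, assumed to be
    already standardized so that U.S. recipients carry country "US".
    """
    total = 0.0
    foreign = 0.0
    by_state = defaultdict(float)
    for state, country, amount in grants:
        total += amount
        if country != "US":
            foreign += amount
        elif state:
            by_state[state] += amount
    domestic = sum(by_state.values())
    return {
        "foreign_pct": foreign / total if total else 0.0,
        "max_state_concentration": max(by_state.values()) / domestic if domestic else 0.0,
        "n_states": len(by_state),
    }


def classify(features):
    """Threshold alternative to clustering (assumed rule order)."""
    if features["foreign_pct"] >= FOREIGN_SHARE_MIN:
        return "international"
    if (features["max_state_concentration"] >= TOP_STATE_SHARE_MIN
            or features["n_states"] <= MAX_LOCAL_STATES):
        return "local"
    return "national"
```

For example, `classify(grant_features([("OH", "US", 900.0), ("KY", "US", 100.0)]))` returns `"local"`, because 90% of the domestic dollars go to one state.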

	# 2\. Code documentation

	## Notebooks

	1. [Geopy Lookups (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331532?o=4203893953353865): This notebook runs geolocation lookups for missing or ambiguous country/state data.
	2. [990PF Cleaning (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331544?o=4203893953353865): This notebook cleans and standardizes grants data from Form 990-PF.
	3. [990PF Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331577?o=4203893953353865): This notebook aggregates grant data by state and calculates key features.
	4. [990 Aggregation (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/3766548687331587?o=4203893953353865): This notebook performs similar aggregations for Form 990 Schedule I and F filers.
	5. [Classification (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/788755825032239?o=4203893953353865): This notebook combines datasets, scales features, and runs KMeans clustering.
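
The scale-then-cluster step of the Classification notebook can be illustrated with a dependency-free sketch. This is a plain-Python stand-in for a library KMeans, not the notebook's code: the function names are hypothetical, and the deterministic farthest-first initialization is an assumption chosen so the example is reproducible.

```python
import math


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def farthest_first(rows, k):
    """Deterministic seeding (assumption): start at row 0, then repeatedly
    add the row farthest from all centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(math.dist(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(rows, k=3, n_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first(rows, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(row, centroids[j]))
                      for row in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each input row is one org's feature vector, for example `(foreign_pct, max_state_concentration, n_states)`; calling `kmeans(standardize(rows))` returns one cluster index per row. The indices themselves carry no meaning until the clusters are inspected.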

	# 3\. Outputs and Results

	## K-means Cluster Definitions

	* Cluster 0: Local/Regional
	    * High concentration in one state
	    * Low foreign activity
	* Cluster 1: International
	    * High percentage of international grants
	* Cluster 2: National
	    * Wide U.S. state distribution
	    * Moderate to low international share
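
Because k-means assigns cluster indices arbitrarily, the mapping from index to name has to be re-derived from the cluster centers after each run. The sketch below shows one way to do that, assuming centers expressed in original feature units; `name_clusters` and its rule order are hypothetical and simply mirror the descriptions above.

```python
def name_clusters(centroids):
    """Map cluster ids to scope names.

    centroids: {cluster_id: (foreign_pct, max_state_concentration, n_states)}
    for exactly three clusters, in original (unscaled) units.
    """
    remaining = dict(centroids)
    names = {}
    # Highest foreign share -> International.
    intl = max(remaining, key=lambda c: remaining[c][0])
    names[intl] = "International"
    del remaining[intl]
    # Of the other two, highest single-state concentration -> Local/Regional.
    local = max(remaining, key=lambda c: remaining[c][1])
    names[local] = "Local/Regional"
    del remaining[local]
    # Whatever is left is the widely distributed, mostly domestic cluster.
    (national,) = remaining
    names[national] = "National"
    return names
```

With made-up centers such as `{0: (0.01, 0.9, 1.7), 1: (0.8, 0.5, 4.0), 2: (0.03, 0.15, 31.7)}`, the function returns `{1: "International", 0: "Local/Regional", 2: "National"}`.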

	# 4\. List of approaches that did not work

	# 5\. Plan for Deploy, Scale, Archive, and Expose-data steps

	* The approach is easily extendable to additional tax years and EINs
	* Output is an assigned geographic scope (local, national, or international) for each EIN per tax year

	# 6\. Future work

	* Incorporate Schedule F data once available
	* Potentially explore labeling and supervised learning using curated examples rather than clustering
	* Review the k-means clustering approach and decide whether to scale it up or to replace it with fixed feature thresholds that define the 3 categories