Update README.md
README.md
CHANGED
|
@@ -11,21 +11,20 @@ language:
|
|
| 11 |
---
|
| 12 |
**Technical Specifications Document is available at**: https://docs.google.com/document/d/1eLUFC-8FtJkaQT9dUhjwRRKn8bXrHaZsXdMlIvCoeT4/edit?usp=sharing
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 16 |
|
| 17 |
|
| 18 |
# Non-Profit Mapping Project Documentation: Religious Orgs Segmentation
|
| 19 |
|
| 20 |
-
Author
|
| 21 |
-
**Note for external readers
|
| 22 |
|
| 23 |
# 1\. Approach
|
| 24 |
|
| 25 |
## Definition
|
| 26 |
|
| 27 |
We categorize religious orgs using the following definition from the academic literature:
|
| 28 |
-
“Religious organizations are organizations whose identity and mission are derived from a religious or spiritual tradition and which operate as registered or unregistered, nonprofit, voluntary entities.” ([
|
| 29 |
This definition is operationalized in how we prompt GPT-4 to classify the training and testing datasets. Namely, we give it each org’s name, mission statement and key activities, and prompt it to find mentions, wording or terminology that reveal the org’s religious affiliations.
|
| 30 |
|
| 31 |
## Religious Recipient Orgs
|
|
@@ -46,17 +45,17 @@ All of the notebooks should be reasonably documented. Please message Zilun Lin i
|
|
| 46 |
|
| 47 |
This notebook randomly samples orgs from the 990 datamart and classifies them using GPT-4. It also generates a curated dataset of artificial orgs associated with under-represented religions. The two datasets are combined, formatted into an instruct-prompt-output format suitable for fine-tuning, and uploaded to HuggingFace. The final dataset has over 2k examples for training and validation, and 500 examples for testing.
|
| 48 |
|
| 49 |
-
[
|
| 50 |
|
| 51 |
## Fine-tuning the LLM and testing for accuracy
|
| 52 |
|
| 53 |
-
We downloaded the fine-tuning dataset from HuggingFace and fine-
|
| 54 |
|
| 55 |
(Llama Models)
|
| 56 |
-
[
|
| 57 |
|
| 58 |
(BERT models)
|
| 59 |
-
[
|
| 60 |
|
| 61 |
# 3\. Outputs and Results
|
| 62 |
|
|
@@ -89,7 +88,7 @@ In comparison, BERT is much faster for inference, thanks to its streamlined mode
|
|
| 89 |
|
| 90 |
# 4\. Deployment
|
| 91 |
|
| 92 |
-
The
|
| 93 |
|
| 94 |
The API endpoint will output five fields, three from the BERT classification and two based on Form 1023-EZ data availability:
|
| 95 |
BERT Natural Language Outputs:
|
|
@@ -98,3 +97,4 @@ BERT Natural Language Outputs:
|
|
| 98 |
(3) Classification probability for whether the organisation is religious or not
|
| 99 |
|
| 100 |
|
|
|
|
|
|
| 11 |
---
|
| 12 |
**Technical Specifications Document is available at**: https://docs.google.com/document/d/1eLUFC-8FtJkaQT9dUhjwRRKn8bXrHaZsXdMlIvCoeT4/edit?usp=sharing
|
| 13 |
|
| 14 |
+
---------------------------------------------------------------------------------------------------------------------------------------------------------
|
|
|
|
| 15 |
|
| 16 |
|
| 17 |
# Non-Profit Mapping Project Documentation: Religious Orgs Segmentation
|
| 18 |
|
| 19 |
+
**Author**: Zilun Lin \- GivingTuesday Data Commons
|
| 20 |
+
**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.
|
| 21 |
|
| 22 |
# 1\. Approach
|
| 23 |
|
| 24 |
## Definition
|
| 25 |
|
| 26 |
We categorize religious orgs using the following definition from the academic literature:
|
| 27 |
+
“Religious organizations are organizations whose identity and mission are derived from a religious or spiritual tradition and which operate as registered or unregistered, nonprofit, voluntary entities.” ([Source: Montclair.edu](https://www.montclair.edu/profilepages/media/11259/user/religiousorganizationsglobalencyclope.pdf))
|
| 28 |
This definition is operationalized in how we prompt GPT-4 to classify the training and testing datasets. Namely, we give it each org’s name, mission statement and key activities, and prompt it to find mentions, wording or terminology that reveal the org’s religious affiliations.
|
| 29 |
|
| 30 |
## Religious Recipient Orgs
|
|
|
|
| 45 |
|
| 46 |
This notebook randomly samples orgs from the 990 datamart and classifies them using GPT-4. It also generates a curated dataset of artificial orgs associated with under-represented religions. The two datasets are combined, formatted into an instruct-prompt-output format suitable for fine-tuning, and uploaded to HuggingFace. The final dataset has over 2k examples for training and validation, and 500 examples for testing.
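
As a rough illustration of the formatting step described above, the sketch below turns one labelled org into an instruct-prompt-output record and serializes it as JSON Lines. The field names (`instruction`, `input`, `output`), the prompt wording, the label strings and the example org are all assumptions made for illustration; the notebook's actual schema is not shown in this document.

```python
import json

# Hypothetical instruction text; the real prompt wording is not given here.
INSTRUCTION = (
    "Decide whether the organization described below is religious. "
    "Answer with 'religious' or 'not religious'."
)


def to_instruct_record(name, mission, activities, label):
    """Format one labelled org as an instruction/input/output example."""
    prompt = (
        f"Name: {name}\n"
        f"Mission statement: {mission}\n"
        f"Key activities: {activities}"
    )
    # Hypothetical field names, chosen only for this sketch.
    return {"instruction": INSTRUCTION, "input": prompt, "output": label}


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Made-up org used purely as an example.
example = to_instruct_record(
    name="Example Community Fellowship",
    mission="To share our faith through worship and service.",
    activities="Weekly services, food pantry",
    label="religious",
)
print(to_jsonl([example]))
```

Keeping the three org fields in a fixed order inside the prompt means the same template can be reused at inference time.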
|
| 47 |
|
| 48 |
+
[Link to EDA Notebook (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/1182041857993717?o=4203893953353865)
|
| 49 |
|
| 50 |
## Fine-tuning the LLM and testing for accuracy
|
| 51 |
|
| 52 |
+
We downloaded the fine-tuning dataset from HuggingFace and fine-tuned a set of LLMs. The resulting models were uploaded to HuggingFace. We also tested these models’ accuracy on an unseen testing dataset.
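
For readers unfamiliar with the metric, accuracy on a held-out test set is the share of examples where the predicted label matches the reference label. The sketch below is a minimal version of that calculation; the label strings and the toy predictions are invented for illustration and are not results from this project.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy data, not project results: three of four predictions are correct.
preds = ["religious", "not religious", "religious", "not religious"]
gold = ["religious", "not religious", "not religious", "not religious"]
score = accuracy(preds, gold)
print(f"accuracy = {score:.2f}")
```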
|
| 53 |
|
| 54 |
(Llama Models)
|
| 55 |
+
[Llama Model Fine-tuning (Google Colab)](https://colab.research.google.com/drive/1tZBVcQ_XQeb11HUBKxKjPTGBhMwJCiDF?usp=sharing)
|
| 56 |
|
| 57 |
(BERT models)
|
| 58 |
+
[BERT Model Fine-tuning (Google Colab)](https://colab.research.google.com/drive/1OaV9wwqCzWqRXFmKzzDYW3Hwq_zwaUE5?usp=sharing)
|
| 59 |
|
| 60 |
# 3\. Outputs and Results
|
| 61 |
|
|
|
|
| 88 |
|
| 89 |
# 4\. Deployment
|
| 90 |
|
| 91 |
+
The curated BERT model is now hosted on MLflow (Databricks) in the model registry under the name `religious_orgs_model`, and has been released to the public under the Apache 2.0 license on [HuggingFace](https://huggingface.co/GivingTuesday/religious_org_v1). The processed data will be available for download via a data mart or an API.
|
| 92 |
|
| 93 |
The API endpoint will output five fields, three from the BERT classification and two based on Form 1023-EZ data availability:
|
| 94 |
BERT Natural Language Outputs:
|
|
|
|
| 97 |
(3) Classification probability for whether the organisation is religious or not
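
One common way to obtain a classification probability from a two-class classifier is to apply a softmax to its raw output scores. The sketch below shows that calculation in plain Python. It assumes a two-class output; the label order, the output field names (`label`, `probability`) and the example scores are hypothetical and do not describe the actual API schema.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order; the deployed model's mapping may differ.
LABELS = ["not religious", "religious"]


def classify(logits):
    """Pick the most likely label and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "probability": probs[best]}


# Made-up scores for illustration only.
result = classify([-1.2, 2.3])
print(result["label"], round(result["probability"], 3))
```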
|
| 98 |
|
| 99 |
|
| 100 |
+
|