hassaanulhaq01 committed e4484d7 (verified) · 1 parent: 30b417a

Update README.md

Files changed: README.md (+10 −10)
---

**Technical Specifications Document is available at**: https://docs.google.com/document/d/1eLUFC-8FtJkaQT9dUhjwRRKn8bXrHaZsXdMlIvCoeT4/edit?usp=sharing

  # Non-Profit Mapping Project Documentation: Religious Orgs Segmentation

**Author**: Zilun Lin - GivingTuesday Data Commons
**Note for external readers:** Some Databricks links in this document point to internal notebooks and may not be accessible to people outside GivingTuesday.

# 1. Approach

  ## Definition

We use the following definition of religious orgs from the academic literature:

“Religious organizations are organizations whose identity and mission are derived from a religious or spiritual tradition and which operate as registered or unregistered, nonprofit, voluntary entities.” ([source](https://www.montclair.edu/profilepages/media/11259/user/religiousorganizationsglobalencyclope.pdf))

This definition is operationalized in how we prompt GPT-4 to classify the training and testing datasets. Namely, we give the model an org’s name, mission statement, and key activities, and prompt it to find mentions/wording/terminology that reveal the org’s religious affiliation.

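As a sketch of this prompting setup — the exact production prompt is not published here, so the wording and field layout below are our illustration:

```python
# Hypothetical sketch of the GPT-4 classification prompt described above;
# the project's actual prompt wording may differ.
def build_classification_prompt(name: str, mission: str, activities: str) -> str:
    """Assemble an org's profile into a single classification prompt."""
    return (
        "Definition: Religious organizations are organizations whose identity "
        "and mission are derived from a religious or spiritual tradition and "
        "which operate as registered or unregistered, nonprofit, voluntary "
        "entities.\n\n"
        "Using the definition above, decide whether the following org is "
        "religious. Look for mentions/wording/terminology that reveal a "
        "religious affiliation, and answer 'religious' or 'not religious'.\n\n"
        f"Name: {name}\n"
        f"Mission statement: {mission}\n"
        f"Key activities: {activities}"
    )

prompt = build_classification_prompt(
    "St. Mary's Food Pantry",
    "Feeding the hungry in Christ's name",
    "Food distribution",
)
```

The same template is applied to every sampled org, so the model's answers can be parsed into binary labels downstream.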
 
  ## Religious Recipient Orgs
[…] All of the notebooks should be reasonably documented. Please message Zilun Lin i…

This notebook randomly samples from the 990 datamart and classifies the sampled orgs using GPT-4. It also generates a curated dataset of artificial orgs associated with under-represented religions. These two datasets are combined, formatted into an instruct-prompt-output format for fine-tuning, and uploaded to HuggingFace. The final dataset has over 2,000 examples for training and validation, and 500 examples for testing.

[Link to EDA Notebook (Databricks)](https://dbc-3a4d04f2-8cab.cloud.databricks.com/editor/notebooks/1182041857993717?o=4203893953353865)

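A minimal sketch of one instruct-prompt-output record — the actual field names in the HuggingFace dataset are not shown in this document, so these are assumptions:

```python
# Hypothetical record layout for the instruct-prompt-output fine-tuning
# format; the published dataset's field names may differ.
def to_finetune_record(name: str, mission: str, activities: str, label: str) -> dict:
    """Combine one org's profile and its GPT-4 label into a training example."""
    return {
        "instruction": "Classify whether the organization is religious.",
        "prompt": f"Name: {name}\nMission: {mission}\nActivities: {activities}",
        "output": label,  # "religious" or "not religious"
    }

record = to_finetune_record(
    "Beth Shalom Community Center",
    "Jewish community programming",
    "Education and worship",
    "religious",
)
```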
  ## Fine-tuning the LLM and testing for accuracy

We downloaded the fine-tuning dataset from HuggingFace and fine-tuned a set of LLMs. The resulting models are uploaded to HuggingFace, and we test each model’s accuracy on an unseen testing dataset.

(Llama models)
[Llama Model Fine-tuning (Google Colab)](https://colab.research.google.com/drive/1tZBVcQ_XQeb11HUBKxKjPTGBhMwJCiDF?usp=sharing)

(BERT models)
[BERT Model Fine-tuning (Google Colab)](https://colab.research.google.com/drive/1OaV9wwqCzWqRXFmKzzDYW3Hwq_zwaUE5?usp=sharing)
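The held-out evaluation reduces to comparing each model's predicted labels against the gold labels of the 500 unseen test examples; a minimal sketch:

```python
def accuracy(predictions: list, gold: list) -> float:
    """Fraction of test examples where the model's label matches the gold label."""
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: 4 of 5 labels agree.
acc = accuracy(
    ["religious", "religious", "not religious", "religious", "not religious"],
    ["religious", "not religious", "not religious", "religious", "not religious"],
)
# acc == 0.8
```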
 
# 3. Outputs and Results

[…] In comparison, BERT is much faster for inference, thanks to its streamlined mode…

# 4. Deployment

The chosen BERT model is now hosted on MLflow (Databricks) in the model registry under the name `religious_orgs_model`, and has been released to the public under the Apache 2.0 license on [Huggingface](https://huggingface.co/GivingTuesday/religious_org_v1). The processed data will be available for download via a data mart or API.

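Loading the released model from HuggingFace can be sketched as below. The `text-classification` pipeline task is an assumption about how the model was exported, and the load itself needs network access plus the `transformers` package, so it is kept behind a function rather than run at import time:

```python
# Sketch of loading the public classifier from HuggingFace.
# The model id comes from the link above; the pipeline task name
# is an assumption about the export format.
HF_MODEL_ID = "GivingTuesday/religious_org_v1"

def load_classifier(model_id: str = HF_MODEL_ID):
    # Lazy import: requires `pip install transformers` and network access.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)

# Usage (downloads the model on first call):
#   clf = load_classifier()
#   clf("St. Mary's Food Pantry: feeding the hungry in Christ's name")
```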
The API endpoint will output five fields: three from the BERT classification and two based on 1023-EZ data availability.

BERT Natural Language Outputs:

[…]

(3) Classification probability for whether the organisation is religious or not
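Field (3) is a probability, which for a binary BERT classifier is conventionally obtained by applying softmax to the model's raw logits — a standard technique, not necessarily the exact deployment code:

```python
import math

def softmax(logits: list) -> list:
    """Convert raw classifier logits into class probabilities summing to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits ordered as (not religious, religious):
probs = softmax([-1.2, 2.3])
p_religious = probs[1]
```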
 