DagimB/ecfr-textcat

@@ -6,171 +6,271 @@ tags:
   - huggingface
 ---
-# prodigy-ecfr-textcat
-## About the Project
-Our goal is to organize financial institution rules and regulations so that institutions can review newly created rules, determine which departments should receive them, and retrieve them easily when needed. Text mining and information retrieval allow a large part of this process to be automated, which reduces the time and effort required of employees and frees them for other projects.
-## Table of Contents
-- [About the Project](#about-the-project)
-- [Getting Started](#getting-started)
-  - [Prerequisites](#prerequisites)
-  - [Installation](#installation)
-- [Usage](#usage)
-- [File Structure](#file-structure)
-- [License](#license)
-- [Acknowledgements](#acknowledgements)
-## Getting Started
-Instructions on setting up the project on a local machine.
-### Prerequisites
-Before running the project, ensure you have the following software dependencies installed:
-- [Python 3.x](https://www.python.org/downloads/)
-- [spaCy](https://spacy.io/usage)
-- [Prodigy](https://prodi.gy/docs/) (optional)
-### Installation
-Follow these step-by-step instructions to install and configure the project:
-1. **Clone this repository to your local machine.**
-   ```bash
-   git clone https://github.com/ManjinderUNCC/prodigy-ecfr-textcat.git
-   ```
-2. **Install the required dependencies.**
-   ```bash
-   pip install -r requirements.txt
-   ```
-## Usage
-To use the project, follow these steps:
-1. **Prepare your data:**
-   - Place your dataset files in the `/data` directory.
-   - Optionally, annotate your data using Prodigy and save the annotations in the `/data` directory.
-2. **Train the text classification model:**
-   - Run the training script located in the `/python_Code` directory.
-3. **Evaluate the model:**
-   - Use the evaluation script to assess the model's performance on labeled data.
-4. **Make predictions:**
-   - Apply the trained model to new, unlabeled data to classify it into relevant categories.
-## File Structure
-Describe the organization of files and directories within the project.
-- `/corpus`
-  - `/labels`
-    - `ner.json`
-    - `parser.json`
-    - `tagger.json`
-    - `textcat_multilabel.json`
-- `/data`
-  - `eval.jsonl`
-  - `firstStep_file.jsonl`
-  - `five_examples_annotated5.jsonl`
-  - `goldenEval.jsonl`
-  - `thirdStep_file.jsonl`
-  - `train.jsonl`
-  - `train200.jsonl`
-  - `train4465.jsonl`
-- `/my_trained_model`
-  - `/textcat_multilabel`
-    - `cfg`
-    - `model`
-  - `/vocab`
-    - `key2row`
-    - `lookups.bin`
-    - `strings.json`
-    - `vectors`
-    - `vectors.cfg`
-  - `config.cfg`
-  - `meta.json`
-  - `tokenizer`
-- `/output`
-  - `/experiment1`
-    - `/model-best`
-      - `/textcat_multilabel`
-        - `cfg`
-        - `model`
-      - `/vocab`
-        - `key2row`
-        - `lookups.bin`
-        - `strings.json`
-        - `vectors`
-        - `vectors.cfg`
-      - `config.cfg`
-      - `meta.json`
-      - `tokenizer`
-    - `/model-last`
-      - `/textcat_multilabel`
-        - `cfg`
-        - `model`
-      - `/vocab`
-        - `key2row`
-        - `lookups.bin`
-        - `strings.json`
-        - `vectors`
-        - `vectors.cfg`
-      - `config.cfg`
-      - `meta.json`
-      - `tokenizer`
-  - `/experiment3`
-    - `/model-best`
-      - `/textcat_multilabel`
-        - `cfg`
-        - `model`
-      - `/vocab`
-        - `key2row`
-        - `lookups.bin`
-        - `strings.json`
-        - `vectors`
-        - `vectors.cfg`
-      - `config.cfg`
-      - `meta.json`
-      - `tokenizer`
-    - `/model-last`
-      - `/textcat_multilabel`
-        - `cfg`
-        - `model`
-      - `/vocab`
-        - `key2row`
-        - `lookups.bin`
-        - `strings.json`
-        - `vectors`
-        - `vectors.cfg`
-      - `config.cfg`
-      - `meta.json`
-      - `tokenizer`
-- `/python_Code`
-  - `finalStep-formatLabel.py`
-  - `firstStep-format.py`
-  - `five_examples_annotated.ipynb`
-  - `secondStep-score.py`
-  - `thirdStep-label.py`
-  - `train_eval_split.ipynb`
-- `TerminalCode.txt`
-- `requirements.txt`
-- `Terminal Commands vs Project.yml`
-- `Project.yml`
-- `README.md`
-- `prodigy.json`
-## License
-- Package A: MIT License
-- Package B: Apache License 2.0
-## Acknowledgements
-Manjinder Sandhu, Dagim Bantikassegn, Alex Brooks, Tyler Dabbs

+<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->
+# 🪐 Weasel Project: Citations of ECFR Banking Regulation in a spaCy pipeline.
+Custom text classification project for spaCy v3 adapted from the spaCy v3
+## 📋 project.yml
+The [`project.yml`](project.yml) defines the data assets required by the
+project, as well as the available commands and workflows. For details, see the
+[Weasel documentation](https://github.com/explosion/weasel).
+### ⏯ Commands
+The following commands are defined by the project. They
+can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
+Commands are only re-run if their inputs have changed.
+| Command | Description |
+| --- | --- |
+| `format-script` | Execute the Python script `firstStep-format.py`, which performs the initial formatting of a dataset file for the first step of the project. This script extracts text and labels from a dataset file in JSONL format and writes them to a new JSONL file in a specific format.
+Usage:
+```
+spacy project run format-script
+```
+Explanation:
+- The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
+- It extracts text and labels from each JSON object in the dataset file.
+- If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and label.
+- If either text or label is missing in a JSON object, a warning message is printed.
+- Upon completion, the script prints a message confirming the processing and the path to the output file.
+ |
+| `train-text-classification-model` | Train the text classification model for the second step of the project using the `secondStep-score.py` script. This script loads a blank English spaCy model and adds a text classification pipeline to it. It then trains the model using the processed data from the first step.
+Usage:
+```
+spacy project run train-text-classification-model
+```
+Explanation:
+- The script `secondStep-score.py` loads a blank English spaCy model and adds a text classification pipeline to it.
+- It reads processed data from the file specified in the `processed_data_file` variable (`data/firstStep_file.jsonl` by default).
+- The processed data is converted to spaCy format for training the model.
+- The model is trained using the converted data for a specified number of iterations (`n_iter`).
+- Losses are printed for each iteration during training.
+- Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
+ |
+| `classify-unlabeled-data` | Classify the unlabeled data for the third step of the project using the `thirdStep-label.py` script. This script loads the trained spaCy model from the previous step and classifies each record in the unlabeled dataset.
+Usage:
+```
+spacy project run classify-unlabeled-data
+```
+Explanation:
+- The script `thirdStep-label.py` loads the trained spaCy model from the specified model directory (`./my_trained_model` by default).
+- It reads the unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/train.jsonl` by default).
+- Each record in the unlabeled data is classified using the loaded model.
+- The predicted labels for each record are extracted and stored along with the text.
+- The classified data is optionally saved to a file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
+ |
+| `format-labeled-data` | Format the labeled data for the final step of the project using the `finalStep-formatLabel.py` script. This script processes the classified data from the third step and transforms it into a specific format, considering a threshold for label acceptance.
+Usage:
+```
+spacy project run format-labeled-data
+```
+Explanation:
+- The script `finalStep-formatLabel.py` reads classified data from the file specified in the `input_file` variable (`data/thirdStep_file.jsonl` by default).
+- For each record, it determines accepted categories based on a specified threshold.
+- It constructs an output record containing the text, predicted labels, accepted categories, answer (accept/reject), and options with meta information.
+- The transformed data is written to the file specified in the `output_file` variable (`data/train4465.jsonl` by default).
+ |
+| `setup-environment` | Set up the Python virtual environment.
+ |
+| `review-evaluation-data` | Review the evaluation data in Prodigy and automatically accept annotations.
+Usage:
+```
+spacy project run review-evaluation-data
+```
+Explanation:
+- The command reviews the evaluation data in Prodigy.
+- It automatically accepts annotations made during the review process.
+- Only sessions allowed by the environment variable PRODIGY_ALLOWED_SESSIONS are permitted to review data. In this case, the session 'reviwer' is allowed.
+ |
+| `export-reviewed-evaluation-data` | Export the reviewed evaluation data from Prodigy to a JSONL file named 'goldenEval.jsonl'.
+Usage:
+```
+spacy project run export-reviewed-evaluation-data
+```
+Explanation:
+- The command exports the reviewed evaluation data from Prodigy to a JSONL file.
+- The data is exported from the Prodigy dataset named 'project3eval-review'.
+- The exported data is saved to the file 'goldenEval.jsonl'.
+- This command helps in preserving the reviewed annotations for further analysis or processing.
+ |
+| `import-training-data` | Import the training data into Prodigy from a JSONL file named 'train200.jsonl'.
+Usage:
+```
+spacy project run import-training-data
+```
+Explanation:
+- The command imports the training data into Prodigy from the specified JSONL file.
+- The data is imported into the Prodigy dataset named 'prodigy3train'.
+- This command prepares the training data for annotation and model training in Prodigy.
+ |
+| `import-golden-evaluation-data` | Import the golden evaluation data into Prodigy from a JSONL file named 'goldenEval.jsonl'.
+Usage:
+```
+spacy project run import-golden-evaluation-data
+```
+Explanation:
+- The command imports the golden evaluation data into Prodigy from the specified JSONL file.
+- The data is imported into the Prodigy dataset named 'golden3'.
+- This command prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
+ |
+| `train-model-experiment1` | Train a text classification model using Prodigy with the 'prodigy3train' dataset and evaluating on 'golden3'.
+Usage:
+```
+spacy project run train-model-experiment1
+```
+Explanation:
+- The command trains a text classification model using Prodigy.
+- It uses the 'prodigy3train' dataset for training and evaluates the model on the 'golden3' dataset.
+- The trained model is saved to the './output/experiment1' directory.
+ |
+| `download-model` | Download the English language model 'en_core_web_lg' from spaCy.
+Usage:
+```
+spacy project run download-model
+```
+Explanation:
+- The command downloads the English language model 'en_core_web_lg' from spaCy.
+- This model is used as the base model for further data processing and training in the project.
+ |
+| `convert-data-to-spacy-format` | Convert the annotated data from Prodigy to spaCy format using the 'prodigy3train' and 'golden3' datasets.
+Usage:
+```
+spacy project run convert-data-to-spacy-format
+```
+Explanation:
+- The command converts the annotated data from Prodigy to spaCy format.
+- It uses the 'prodigy3train' and 'golden3' datasets for conversion.
+- The converted data is saved to the './corpus' directory, using 'en_core_web_lg' as the base model.
+ |
+| `train-custom-model` | Train a custom text classification model using spaCy with the converted data in spaCy format.
+Usage:
+```
+spacy project run train-custom-model
+```
+Explanation:
+- The command trains a custom text classification model using spaCy.
+- It uses the converted data in spaCy format located in the './corpus' directory.
+- The model is trained using the configuration defined in 'corpus/config.cfg'.
+ |
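The first formatting step described for `format-script` can be sketched roughly as follows. This is a minimal illustration and not the contents of `firstStep-format.py`: the input keys `text` and `labels`, the output key `label`, and the function names are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator


def format_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a {"text", "label"} record for every JSONL line that has both."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        text = record.get("text")      # assumed input key
        labels = record.get("labels")  # assumed input key (list of labels)
        if text and labels:
            yield {"text": text, "label": labels}
        else:
            # Mirrors the warning behaviour described for the first step.
            print(f"Warning: line {line_number} is missing text or label; skipped")


def format_file(dataset_file: str, output_file: str) -> int:
    """Write the formatted records to output_file and return how many were written."""
    written = 0
    with open(dataset_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        for item in format_records(src):
            dst.write(json.dumps(item) + "\n")
            written += 1
    print(f"Processed {dataset_file}; wrote {written} records to {output_file}")
    return written
```

With the default paths named in the table, a call might look like `format_file("data/train200.jsonl", "data/firstStep_file.jsonl")`, assuming the input records use the keys shown here.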
+### ⏭ Workflows
+The following workflows are defined by the project. They
+can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run)
+and will run the specified commands in order. Commands are only re-run if their
+inputs have changed.
+| Workflow | Steps |
+| --- | --- |
+| `all` | `format-script` &rarr; `train-text-classification-model` &rarr; `classify-unlabeled-data` &rarr; `format-labeled-data` &rarr; `setup-environment` &rarr; `review-evaluation-data` &rarr; `export-reviewed-evaluation-data` &rarr; `import-training-data` &rarr; `import-golden-evaluation-data` &rarr; `train-model-experiment1` &rarr; `download-model` &rarr; `convert-data-to-spacy-format` &rarr; `train-custom-model` |
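The `format-labeled-data` step in this workflow accepts labels based on a threshold. A hedged sketch of that idea is shown below; it is not the code in `finalStep-formatLabel.py`. The input key `predicted_labels`, the output keys, the default threshold of 0.5, and the rule that a record with no accepted label is marked `reject` are all assumptions for illustration.

```python
def to_review_record(record: dict, threshold: float = 0.5) -> dict:
    """Turn one classified record into a review-style record.

    `record` is assumed to look like:
    {"text": "...", "predicted_labels": {"LABEL_A": 0.91, "LABEL_B": 0.20}}
    """
    scores = record["predicted_labels"]
    # Keep every label whose score reaches the threshold.
    accepted = [label for label, score in scores.items() if score >= threshold]
    return {
        "text": record["text"],
        "cats": scores,
        "accept": accepted,
        # Assumption: a record with no accepted label is marked as rejected.
        "answer": "accept" if accepted else "reject",
        # One option per label, with the score kept as meta information.
        "options": [
            {"id": label, "text": label, "meta": f"{score:.2f}"}
            for label, score in scores.items()
        ],
    }
```

Raising the threshold trades recall for precision: fewer labels are accepted automatically, so more records fall through to manual review.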
+### 🗂 Assets
+The following assets are defined by the project. They can
+be fetched by running [`weasel assets`](https://github.com/explosion/weasel/tree/main/docs/cli.md#open_file_folder-assets)
+in the project directory.
+| File | Source | Description |
+| --- | --- | --- |
+| [`corpus/labels/ner.json`](corpus/labels/ner.json) | Local | JSON file containing NER labels |
+| [`corpus/labels/parser.json`](corpus/labels/parser.json) | Local | JSON file containing parser labels |
+| [`corpus/labels/tagger.json`](corpus/labels/tagger.json) | Local | JSON file containing tagger labels |
+| [`corpus/labels/textcat_multilabel.json`](corpus/labels/textcat_multilabel.json) | Local | JSON file containing multilabel text classification labels |
+| [`data/eval.jsonl`](data/eval.jsonl) | Local | JSONL file containing evaluation data |
+| [`data/firstStep_file.jsonl`](data/firstStep_file.jsonl) | Local | JSONL file containing formatted data from the first step |
+| `data/five_examples_annotated5.jsonl` | Local | JSONL file containing five annotated examples |
+| [`data/goldenEval.jsonl`](data/goldenEval.jsonl) | Local | JSONL file containing golden evaluation data |
+| [`data/thirdStep_file.jsonl`](data/thirdStep_file.jsonl) | Local | JSONL file containing classified data from the third step |
+| [`data/train.jsonl`](data/train.jsonl) | Local | JSONL file containing training data |
+| [`data/train200.jsonl`](data/train200.jsonl) | Local | JSONL file containing initial training data |
+| [`data/train4465.jsonl`](data/train4465.jsonl) | Local | JSONL file containing formatted and labeled training data |
+| [`my_trained_model/textcat_multilabel/cfg`](my_trained_model/textcat_multilabel/cfg) | Local | Configuration files for the text classification model |
+| [`my_trained_model/textcat_multilabel/model`](my_trained_model/textcat_multilabel/model) | Local | Trained model files for the text classification model |
+| [`my_trained_model/vocab/key2row`](my_trained_model/vocab/key2row) | Local | Mapping from keys to row indices in the vocabulary |
+| [`my_trained_model/vocab/lookups.bin`](my_trained_model/vocab/lookups.bin) | Local | Binary lookups file for the vocabulary |
+| [`my_trained_model/vocab/strings.json`](my_trained_model/vocab/strings.json) | Local | JSON file containing string representations of the vocabulary |
+| [`my_trained_model/vocab/vectors`](my_trained_model/vocab/vectors) | Local | Directory containing vector files for the vocabulary |
+| [`my_trained_model/vocab/vectors.cfg`](my_trained_model/vocab/vectors.cfg) | Local | Configuration file for vectors in the vocabulary |
+| [`my_trained_model/config.cfg`](my_trained_model/config.cfg) | Local | Configuration file for the trained model |
+| [`my_trained_model/meta.json`](my_trained_model/meta.json) | Local | JSON file containing metadata for the trained model |
+| [`my_trained_model/tokenizer`](my_trained_model/tokenizer) | Local | Tokenizer files for the trained model |
+| [`output/experiment1/model-best/textcat_multilabel/cfg`](output/experiment1/model-best/textcat_multilabel/cfg) | Local | Configuration files for the best model in experiment 1 |
+| [`output/experiment1/model-best/textcat_multilabel/model`](output/experiment1/model-best/textcat_multilabel/model) | Local | Trained model files for the best model in experiment 1 |
+| [`output/experiment1/model-best/vocab/key2row`](output/experiment1/model-best/vocab/key2row) | Local | Mapping from keys to row indices in the vocabulary for the best model in experiment 1 |
+| [`output/experiment1/model-best/vocab/lookups.bin`](output/experiment1/model-best/vocab/lookups.bin) | Local | Binary lookups file for the vocabulary for the best model in experiment 1 |
+| [`output/experiment1/model-best/vocab/strings.json`](output/experiment1/model-best/vocab/strings.json) | Local | JSON file containing string representations of the vocabulary for the best model in experiment 1 |
+| [`output/experiment1/model-best/vocab/vectors`](output/experiment1/model-best/vocab/vectors) | Local | Directory containing vector files for the vocabulary for the best model in experiment 1 |
+| [`output/experiment1/model-best/vocab/vectors.cfg`](output/experiment1/model-best/vocab/vectors.cfg) | Local | Configuration file for vectors in the vocabulary for the best model in experiment 1 |
+| [`output/experiment1/model-best/config.cfg`](output/experiment1/model-best/config.cfg) | Local | Configuration file for the best model in experiment 1 |
+| [`output/experiment1/model-best/meta.json`](output/experiment1/model-best/meta.json) | Local | JSON file containing metadata for the best model in experiment 1 |
+| [`output/experiment1/model-best/tokenizer`](output/experiment1/model-best/tokenizer) | Local | Tokenizer files for the best model in experiment 1 |
+| [`output/experiment1/model-last/textcat_multilabel/cfg`](output/experiment1/model-last/textcat_multilabel/cfg) | Local | Configuration files for the last model in experiment 1 |
+| [`output/experiment1/model-last/textcat_multilabel/model`](output/experiment1/model-last/textcat_multilabel/model) | Local | Trained model files for the last model in experiment 1 |
+| [`output/experiment1/model-last/vocab/key2row`](output/experiment1/model-last/vocab/key2row) | Local | Mapping from keys to row indices in the vocabulary for the last model in experiment 1 |
+| [`output/experiment1/model-last/vocab/lookups.bin`](output/experiment1/model-last/vocab/lookups.bin) | Local | Binary lookups file for the vocabulary for the last model in experiment 1 |
+| [`output/experiment1/model-last/vocab/strings.json`](output/experiment1/model-last/vocab/strings.json) | Local | JSON file containing string representations of the vocabulary for the last model in experiment 1 |
+| [`output/experiment1/model-last/vocab/vectors`](output/experiment1/model-last/vocab/vectors) | Local | Directory containing vector files for the vocabulary for the last model in experiment 1 |
+| [`output/experiment1/model-last/vocab/vectors.cfg`](output/experiment1/model-last/vocab/vectors.cfg) | Local | Configuration file for vectors in the vocabulary for the last model in experiment 1 |
+| [`output/experiment1/model-last/config.cfg`](output/experiment1/model-last/config.cfg) | Local | Configuration file for the last model in experiment 1 |
+| [`output/experiment1/model-last/meta.json`](output/experiment1/model-last/meta.json) | Local | JSON file containing metadata for the last model in experiment 1 |
+| [`output/experiment1/model-last/tokenizer`](output/experiment1/model-last/tokenizer) | Local | Tokenizer files for the last model in experiment 1 |
+| [`output/experiment3/model-best/textcat_multilabel/cfg`](output/experiment3/model-best/textcat_multilabel/cfg) | Local | Configuration files for the best model in experiment 3 |
+| [`output/experiment3/model-best/textcat_multilabel/model`](output/experiment3/model-best/textcat_multilabel/model) | Local | Trained model files for the best model in experiment 3 |
+| [`output/experiment3/model-best/vocab/key2row`](output/experiment3/model-best/vocab/key2row) | Local | Mapping from keys to row indices in the vocabulary for the best model in experiment 3 |
+| [`output/experiment3/model-best/vocab/lookups.bin`](output/experiment3/model-best/vocab/lookups.bin) | Local | Binary lookups file for the vocabulary for the best model in experiment 3 |
+| [`output/experiment3/model-best/vocab/strings.json`](output/experiment3/model-best/vocab/strings.json) | Local | JSON file containing string representations of the vocabulary for the best model in experiment 3 |
+| [`output/experiment3/model-best/vocab/vectors`](output/experiment3/model-best/vocab/vectors) | Local | Directory containing vector files for the vocabulary for the best model in experiment 3 |
+| [`output/experiment3/model-best/vocab/vectors.cfg`](output/experiment3/model-best/vocab/vectors.cfg) | Local | Configuration file for vectors in the vocabulary for the best model in experiment 3 |
+| [`output/experiment3/model-best/config.cfg`](output/experiment3/model-best/config.cfg) | Local | Configuration file for the best model in experiment 3 |
+| [`output/experiment3/model-best/meta.json`](output/experiment3/model-best/meta.json) | Local | JSON file containing metadata for the best model in experiment 3 |
+| [`output/experiment3/model-best/tokenizer`](output/experiment3/model-best/tokenizer) | Local | Tokenizer files for the best model in experiment 3 |
+| [`output/experiment3/model-last/textcat_multilabel/cfg`](output/experiment3/model-last/textcat_multilabel/cfg) | Local | Configuration files for the last model in experiment 3 |
+| [`output/experiment3/model-last/textcat_multilabel/model`](output/experiment3/model-last/textcat_multilabel/model) | Local | Trained model files for the last model in experiment 3 |
+| [`output/experiment3/model-last/vocab/key2row`](output/experiment3/model-last/vocab/key2row) | Local | Mapping from keys to row indices in the vocabulary for the last model in experiment 3 |
+| [`output/experiment3/model-last/vocab/lookups.bin`](output/experiment3/model-last/vocab/lookups.bin) | Local | Binary lookups file for the vocabulary for the last model in experiment 3 |
+| [`output/experiment3/model-last/vocab/strings.json`](output/experiment3/model-last/vocab/strings.json) | Local | JSON file containing string representations of the vocabulary for the last model in experiment 3 |
+| [`output/experiment3/model-last/vocab/vectors`](output/experiment3/model-last/vocab/vectors) | Local | Directory containing vector files for the vocabulary for the last model in experiment 3 |
+| [`output/experiment3/model-last/vocab/vectors.cfg`](output/experiment3/model-last/vocab/vectors.cfg) | Local | Configuration file for vectors in the vocabulary for the last model in experiment 3 |
+| [`output/experiment3/model-last/config.cfg`](output/experiment3/model-last/config.cfg) | Local | Configuration file for the last model in experiment 3 |
+| [`output/experiment3/model-last/meta.json`](output/experiment3/model-last/meta.json) | Local | JSON file containing metadata for the last model in experiment 3 |
+| [`output/experiment3/model-last/tokenizer`](output/experiment3/model-last/tokenizer) | Local | Tokenizer files for the last model in experiment 3 |
+| [`python_Code/finalStep-formatLabel.py`](python_Code/finalStep-formatLabel.py) | Local | Python script for formatting labeled data in the final step |
+| [`python_Code/firstStep-format.py`](python_Code/firstStep-format.py) | Local | Python script for formatting data in the first step |
+| [`python_Code/five_examples_annotated.ipynb`](python_Code/five_examples_annotated.ipynb) | Local | Jupyter notebook containing five annotated examples |
+| [`python_Code/secondStep-score.py`](python_Code/secondStep-score.py) | Local | Python script for scoring data in the second step |
+| [`python_Code/thirdStep-label.py`](python_Code/thirdStep-label.py) | Local | Python script for labeling data in the third step |
+| [`python_Code/train_eval_split.ipynb`](python_Code/train_eval_split.ipynb) | Local | Jupyter notebook for training and evaluation data splitting |
+| [`TerminalCode.txt`](TerminalCode.txt) | Local | Text file containing terminal code |
+| [`README.md`](README.md) | Local | Markdown file containing project documentation |
+| [`prodigy.json`](prodigy.json) | Local | JSON file containing Prodigy configuration |
+<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->