---
license: mit
datasets:
- RaThorat/doc_chunks
language:
- nl
base_model:
- GroNLP/bert-base-dutch-cased
---
# Model Card for textcat_model
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
The goal is a scalable, privacy-friendly solution that uses public data from DUS-I (such as policy documents and news items) to inform employees quickly and accurately.
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/RaThorat/my-chatbot-project
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Question identification: common topics are subsidy information, policy developments, and manuals.
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Save time by using AI to deliver information to employees quickly.
## Training Details
### Training Data
46 documents (TXT, PDF, and ODT) from the DUS-I website were split into chunks of 200 words each and stored in JSON format.
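
The chunking step can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and a simple record layout; the field names `source`, `chunk_id`, and `text` and the sample document are hypothetical, and the project's own script may do this differently.

```python
import json


def chunk_words(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Hypothetical document with 450 words, standing in for extracted text.
document = {"source": "voorbeeld.txt", "text": "woord " * 450}

# One record per chunk; the field names are assumptions, not the project's schema.
chunks = [
    {"source": document["source"], "chunk_id": i, "text": chunk}
    for i, chunk in enumerate(chunk_words(document["text"]))
]

# Serialize all chunks as a single JSON array.
chunks_json = json.dumps(chunks, ensure_ascii=False, indent=2)
```

With a chunk size of 200, the last chunk of a document simply holds the remaining words, so chunks are not padded to a fixed length.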
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing [optional]
Documents were grouped (`groeperen_segment_text_to_jsonl.py`) under labels such as: PROJECT, HANDLEIDING, OVEREENKOMST, PLAN, BELEID, SUBSIDIE.
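
As a rough illustration of grouping documents under these labels, the sketch below assigns a label when its name appears in the file name. This keyword rule, the helper `label_for`, and the file names are all hypothetical; the actual logic lives in `groeperen_segment_text_to_jsonl.py` and is not reproduced here.

```python
from typing import Optional

# Labels listed in this model card.
LABELS = ["PROJECT", "HANDLEIDING", "OVEREENKOMST", "PLAN", "BELEID", "SUBSIDIE"]


def label_for(filename: str) -> Optional[str]:
    """Return the first label whose lowercase name occurs in the file name.

    Hypothetical rule for illustration only.
    """
    name = filename.lower()
    for label in LABELS:
        if label.lower() in name:
            return label
    return None


# Hypothetical file names.
documents = ["handleiding_aanvraag.pdf", "subsidieregeling_2024.odt", "jaarplan.txt"]

# Group file names by label, skipping files that match no label.
grouped: dict[str, list[str]] = {}
for filename in documents:
    label = label_for(filename)
    if label is not None:
        grouped.setdefault(label, []).append(filename)
```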
#### Training Hyperparameters
- **Training regime:** Run with the GroNLP/bert-base-dutch-cased model (110 million parameters). <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
### Results
[More Information Needed]
#### Summary
Script for the textcat model: https://github.com/RaThorat/my-chatbot-project/blob/main/scripts/train_textcat_model.py
## Technical Specifications [optional]
### Model Architecture and Objective
46 documents (TXT, PDF, and ODT) from the DUS-I website were split into chunks of 200 words each and stored in JSON format.
For the text categorization model, the same documents were converted to JSONL format.
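
JSONL stores one JSON object per line, which suits line-by-line reading of training examples. The sketch below shows what such a conversion might look like; the field names `text` and `label`, the sample sentences, and the helpers `to_jsonl` and `from_jsonl` are assumptions for illustration, not the project's actual schema.

```python
import json

# Hypothetical labeled chunks; the "text" and "label" field names are assumptions.
records = [
    {"text": "Deze handleiding beschrijft het aanvraagproces.", "label": "HANDLEIDING"},
    {"text": "De subsidie wordt verleend voor vier jaar.", "label": "SUBSIDIE"},
]


def to_jsonl(rows: list[dict]) -> str:
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"


def from_jsonl(data: str) -> list[dict]:
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in data.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
```

Passing `ensure_ascii=False` keeps Dutch characters such as `ë` readable in the output file instead of escaping them.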
### Compute Infrastructure
[More Information Needed]
#### Hardware
8 vCPUs and 64 GB of RAM were required.