|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- RaThorat/doc_chunks |
|
|
language: |
|
|
- nl |
|
|
base_model: |
|
|
- GroNLP/bert-base-dutch-cased |
|
|
--- |
|
|
|
|
|
# Model Card for Model ID |
|
|
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
|
|
This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
Het doel is een schaalbare, privacyschone oplossing die gebruik maakt van openbare gegevens van DUS-I (zoals beleidsdocumenten en nieuwsberichten) om medewerkers snel en accuraat te informeren. |
|
|
|
|
|
### Model Sources [optional] |
|
|
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
|
|
- **Repository:** https://github.com/RaThorat/my-chatbot-project |
|
|
|
|
|
## Uses |
|
|
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
Identificatie van vragen: Veelvoorkomende onderwerpen zijn subsidie-informatie, beleidsontwikkelingen en handleidingen. |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
|
|
Tijd besparen door snel informatie te leveren aan medewerkers via AI. |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
46 txt, pdf en odt documenten van de DUS-I website zijn gebruikt om Chunks (200 woorden per chunk) te maken in JSON-formaat. |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
|
|
#### Preprocessing [optional] |
|
|
|
|
|
Documenten gegroepeerd (groeperen_segment_text_to_jsonl.py) in labels zoals: PROJECT, HANDLEIDING, OVEREENKOMST, PLAN, BELEID, SUBSIDIE. |
|
|
|
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:** Uitgevoerd met GroNLP/bert-base-dutch-cased model (110 miljoen parameters). <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision --> |
|
|
|
|
|
|
|
|
### Results |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Summary |
|
|
|
|
|
Script voor textcat model: https://github.com/RaThorat/my-chatbot-project/blob/main/scripts/train_textcat_model.py |
|
|
|
|
|
|
|
|
## Technical Specifications [optional] |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
46 txt, pdf en odt documenten van de DUS-I website zijn gebruikt om Chunks (200 woorden per chunk) te maken in JSON-formaat. |
|
|
Voor text categorization model: dezelfde documenten omgezet naar JSONL-formaat. |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
8 vCPU's en 64 GB RAM was vereist. |