	---
	license: mit
	datasets:
	- RaThorat/doc_chunks
	language:
	- nl
	base_model:
	- GroNLP/bert-base-dutch-cased
	---

	# Model Card for textcat_model

	<!-- Provide a quick summary of what the model is/does. -->

	A Dutch text categorization (textcat) model built on GroNLP/bert-base-dutch-cased. It assigns labels such as PROJECT, HANDLEIDING, and SUBSIDIE to documents from the DUS-I website.

	## Model Details

	### Model Description

	<!-- Provide a longer summary of what this model is. -->
	The goal is a scalable, privacy-friendly solution that uses public DUS-I data (such as policy documents and news items) to inform employees quickly and accurately.

	### Model Sources [optional]

	<!-- Provide the basic links for the model. -->

	- Repository: https://github.com/RaThorat/my-chatbot-project

	## Uses

	<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
	Identifying questions: common topics are subsidy information, policy developments, and manuals.

	### Direct Use

	<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
	Saving time by delivering information to employees quickly via AI.

	[More Information Needed]


	## Training Details

	### Training Data

	46 txt, pdf, and odt documents from the DUS-I website were split into chunks of 200 words each and stored in JSON format.
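
The chunking step can be illustrated with a minimal sketch. It assumes the document text has already been extracted to a plain string; the function names and the JSON field names (`source`, `chunk_id`, `text`) are hypothetical and not taken from the project's scripts.

```python
import json


def chunk_words(text: str, chunk_size: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def chunks_to_json(doc_name: str, text: str, chunk_size: int = 200) -> str:
    """Serialize the chunks of one document as a JSON array.

    The record layout is an assumption made for illustration only.
    """
    records = [
        {"source": doc_name, "chunk_id": i, "text": chunk}
        for i, chunk in enumerate(chunk_words(text, chunk_size))
    ]
    # ensure_ascii=False keeps Dutch characters such as "ë" readable.
    return json.dumps(records, ensure_ascii=False, indent=2)
```

The last chunk of a document may be shorter than 200 words; this sketch keeps it rather than padding or dropping it.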

	[More Information Needed]

	### Training Procedure

	<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

	#### Preprocessing [optional]

	Documents were grouped (groeperen_segment_text_to_jsonl.py) under labels such as PROJECT, HANDLEIDING (manual), OVEREENKOMST (agreement), PLAN, BELEID (policy), and SUBSIDIE (subsidy).
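
As a rough illustration of this grouping step, the sketch below writes labeled texts as JSONL, one JSON object per line. It is not the contents of groeperen_segment_text_to_jsonl.py: the `to_jsonl` helper and the `text`/`label` field names are assumptions, and the label set holds only the examples named above.

```python
import json

# Example labels named in this card; the real label set may be larger.
LABELS = {"PROJECT", "HANDLEIDING", "OVEREENKOMST", "PLAN", "BELEID", "SUBSIDIE"}


def to_jsonl(records: list[tuple[str, str]]) -> str:
    """Turn (text, label) pairs into JSONL, one JSON object per line."""
    lines = []
    for text, label in records:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        lines.append(json.dumps({"text": text, "label": label}, ensure_ascii=False))
    # Terminate every record with a newline so files can be concatenated.
    return "".join(line + "\n" for line in lines)
```

Rejecting unknown labels early keeps typos in the label column from silently creating extra classes during training.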


	#### Training Hyperparameters

	- Training regime: Carried out with the GroNLP/bert-base-dutch-cased model (110 million parameters). <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
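
This card does not state which framework the training script uses. Purely as a sketch, and assuming the Hugging Face `transformers` library, the base model could be loaded with a classification head for the example labels as follows. The `load_classifier` helper is hypothetical, and the label list covers only the labels named in this card.

```python
# Example labels from this card; the actual training labels may differ.
LABELS = ["PROJECT", "HANDLEIDING", "OVEREENKOMST", "PLAN", "BELEID", "SUBSIDIE"]

# Mappings between label names and the integer class ids a classifier uses.
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def load_classifier(base_model: str = "GroNLP/bert-base-dutch-cased"):
    """Load tokenizer and base model with a freshly initialized classification head.

    Assumes `transformers` is installed; calling this downloads the weights.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The classification head is randomly initialized at this point and only becomes useful after fine-tuning on the labeled data.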


	### Results

	[More Information Needed]

	#### Summary

	Script for the textcat model: https://github.com/RaThorat/my-chatbot-project/blob/main/scripts/train_textcat_model.py


	## Technical Specifications [optional]

	### Model Architecture and Objective

	The 46 txt, pdf, and odt documents from the DUS-I website that were split into chunks (200 words per chunk) in JSON format were also converted to JSONL format for the text categorization model.

	### Compute Infrastructure

	[More Information Needed]

	#### Hardware

	8 vCPUs and 64 GB of RAM were required.