
Built with Axolotl

Axolotl config (axolotl version: 0.10.0):

base_model: google/gemma-3-4b-it

load_in_4bit: true

# gemma3 doesn't seem to play nice with ddp
ddp_find_unused_parameters: true

chat_template: gemma3
eot_tokens:
  - <end_of_turn>
datasets:
  - path: /data/meta-extractor/conversations/conversations_gemma-3-4b-it_qlora_pdf_metadata_extractor.jsonl
    type: chat_template
    train_on_inputs: false
    field_messages: conversations
    message_property_mappings:
      role: role
      content: content


dataset_prepared_path: last_run_prepared
val_set_size: 0.01

output_dir: /data/meta-extractor/models/gemma-3-4b-it_qlora_pdf_metadata_extractor

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: false


lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 1
flash_attention: false
eager_attention:

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

# save_first_step: true  # uncomment this to validate checkpoint saving works with your config
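The config above reads a chat dataset with `field_messages: conversations` and identity `role`/`content` mappings. A minimal sketch of what one line of such a JSONL file could look like — the prompt and metadata values here are made up for illustration, not taken from the actual training data:

```python
import json

# Hypothetical single training example in the `conversations` format
# expected by the Axolotl config above. The actual prompt wording and
# metadata values used for training are not public.
record = {
    "conversations": [
        {"role": "user", "content": "Extract metadata from this text: ..."},
        {
            "role": "assistant",
            "content": json.dumps(
                {"title": "Example Report", "publication_year": 2001}
            ),
        },
    ]
}

# Each record is serialized as one line of the JSONL dataset file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```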

Training data

This model was fine-tuned from google/gemma-3-4b-it on a dataset of texts extracted from public publications by Danish government agencies and municipalities published as PDFs.

Model description

This model was trained to extract metadata from Danish public publications from government agencies, ministries and municipalities published as PDFs. We created a dataset of 10,000 articles and extracted the text of the PDFs using fitz (PyMuPDF). We have metadata registrations of these articles in the Danish MARC format danMARC, which we converted to a common JSON format. We then trained the model to output metadata in JSON format given the text extracted from a PDF as input.
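The danMARC-to-JSON conversion step could be sketched as below. The tag/subfield codes and field mapping here are purely illustrative, not the actual mapping used in the meta-extractor repo (the preceding text-extraction step would typically be `page.get_text()` over a `fitz.open(path)` document):

```python
# Hypothetical sketch of mapping a flattened danMARC-style record to the
# common JSON metadata format. Tag/subfield codes are illustrative only.
def marc_to_json(fields: dict) -> dict:
    """Map a flattened MARC-like record to the common JSON format."""
    return {
        "title": fields.get("245a"),
        "subtitle": fields.get("245c"),
        "publication_year": int(fields["260c"]) if "260c" in fields else None,
        "publisher": [fields["260b"]] if "260b" in fields else [],
    }

record = {"245a": "Example Report", "260b": "Example Agency", "260c": "2001"}
print(marc_to_json(record))
```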

Intended uses & limitations

The model is intended to be used for creating metadata for library catalogues, and we made a custom prompt to be used with it (see link below).

The model was mainly trained on Danish publications, but will probably work for other languages as well.

A synthetic example of the simple JSON format we expect the model to output when used with this prompt:

https://github.com/DBCDK/meta-extractor/blob/main/src/meta_extractor/data/prompt_production.json

{
  "title": "National Public Health Preparedness Outlook",
  "subtitle": "Assessment of Response Protocols and Resource Readiness",
  "publication_year": 1998,
  "language_as_iso639-2": "eng",
  "identifiers": { "ISBN": "099-9-4444442-9" },
  "publisher": ["Health Security Authority"],
  "creator_persons": [
    "Marel Orit [author]",
    "Tilo Envar [editor]"
  ],
  "creator_corporations": [
    "Civic Emergency Planning Unit"
  ],
  "country_of_publication_iso639-1": ["us"]
}
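When consuming the model's output, it is prudent to parse the generated text as JSON and keep only the expected fields. This validator is our own minimal sketch, not part of the meta-extractor repo; the field names match the synthetic example above:

```python
import json

# Keep only the metadata fields the model is expected to produce;
# anything else in the generated JSON is dropped.
EXPECTED_FIELDS = {
    "title", "subtitle", "publication_year", "language_as_iso639-2",
    "identifiers", "publisher", "creator_persons",
    "creator_corporations", "country_of_publication_iso639-1",
}

def parse_metadata(raw: str) -> dict:
    data = json.loads(raw)
    return {k: v for k, v in data.items() if k in EXPECTED_FIELDS}

raw = '{"title": "Example", "publication_year": 2001, "unexpected": 1}'
print(parse_metadata(raw))
```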

Training and evaluation data

We do not have a license that allows us to share the training and evaluation data.

Training procedure

See https://github.com/DBCDK/meta-extractor for details on how we trained the model.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • total_eval_batch_size: 4
  • optimizer: ADAMW_BNB with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 54
  • training_steps: 543
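The derived values in the list follow from the config above. A quick check (the exact rounding used for warmup steps is our assumption):

```python
# From the Axolotl config above.
micro_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 2

# Effective training batch size across both GPUs.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 16

# warmup_ratio: 0.1 over 543 training steps.
warmup_ratio = 0.1
training_steps = 543
warmup_steps = round(warmup_ratio * training_steps)
print(warmup_steps)  # 54
```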

Evaluation results

We evaluated the performance of our model against a set of 990 PDFs not used for training, where we had a ground truth based on metadata created by professional librarians. The 990 PDFs had 10,890 metadata fields we evaluated against.

TP: True positive = The model suggested a value that was correct
FP: False positive = The model suggested an incorrect value – either there should not have been a value at all, or the value is wrong
FN: False negative = The model did not suggest anything, but it should have (however, it is not certain that the value exists in the text extracted from the PDF)
TN: True negative = The model did not suggest a value, and it should not have
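The micro-averaged scores in the results below follow directly from these counts; recomputing them from the reported overall TP/FP/FN makes the relation explicit:

```python
# Overall micro-averaged counts reported below: TP=6703 FP=1601 FN=1487.
tp, fp, fn = 6703, 1601, 1487

precision = tp / (tp + fp)  # 6703 / 8304
recall = tp / (tp + fn)     # 6703 / 8190
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2%} R={recall:.2%} F1={f1:.2%}")
# P=80.72% R=81.84% F1=81.28%
```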

########################################
Evaluation results:
########################################
-------------------EXACT MATCH-------------------
Overall accuracy: 82.26% (8958/10890)
Total hallucinations: 1442 | TP=6703 FP=1601 FN=1487 TN=3190
Micro P/R/F1: P=80.72% R=81.84% F1=81.28%
TITLE accuracy: 76.16% (754/990) | TP=754 FP=231 FN=5, TN=0 | P=76.55% R=99.34% F1=86.47% | Hallucinations=231
SUBTITLE accuracy: 71.01% (703/990) | TP=404 FP=231 FN=56, TN=299 | P=63.62% R=87.83% F1=73.79% | Hallucinations=231
PUBLICATION_YEAR accuracy: 94.24% (933/990) | TP=933 FP=52 FN=5, TN=0 | P=94.72% R=99.47% F1=97.04% | Hallucinations=52
LANGUAGE_AS_ISO639-2 accuracy: 98.18% (972/990) | TP=972 FP=13 FN=5, TN=0 | P=98.68% R=99.49% F1=99.08% | Hallucinations=13
IDENTIFIERS.ISBN accuracy: 81.31% (805/990) | TP=569 FP=140 FN=45, TN=236 | P=80.25% R=92.67% F1=86.02% | Hallucinations=140
PUBLISHER accuracy: 72.73% (720/990) | TP=800 FP=243 FN=282, TN=0 | P=76.70% R=73.94% F1=75.29% | Hallucinations=238
CREATOR_PERSONS accuracy: 68.99% (683/990) | TP=998 FP=288 FN=471, TN=280 | P=77.60% R=67.94% F1=72.45% | Hallucinations=177
CREATOR_CORPORATIONS accuracy: 60.10% (595/990) | TP=809 FP=341 FN=456, TN=37 | P=70.35% R=63.95% F1=67.00% | Hallucinations=300
COUNTRY_OF_PUBLICATION_ISO639-1 accuracy: 99.70% (987/990) | TP=0 FP=0 FN=4, TN=987 | P=n/a R=0.00% F1=n/a | Hallucinations=0
SERIES accuracy: 88.89% (880/990) | TP=289 FP=34 FN=101, TN=600 | P=89.47% R=74.10% F1=81.07% | Hallucinations=33
SERIES_NUMBER accuracy: 93.54% (926/990) | TP=175 FP=28 FN=57, TN=751 | P=86.21% R=75.43% F1=80.46% | Hallucinations=27

Framework versions

  • PEFT 0.15.2
  • Transformers 4.52.3
  • Pytorch 2.6.0+cu124
  • Datasets 3.6.0
  • Tokenizers 0.21.4

Model tree for DBCDigital/gemma-3-4b-it_qlora_pdf_metadata_extractor
