Upload folder using huggingface_hub

Files changed (6)

.gitattributes +1 -0
README.md +158 -0
model.npz.best-chrf.npz +3 -0
run_model.sh +17 -0
tiny.decoder.yml +5 -0
vocab.spm +3 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+vocab.spm filter=lfs diff=lfs merge=lfs -text

README.md ADDED

@@ -0,0 +1,158 @@

+---
+language:
+- eo
+- ca
+tags:
+- machine-translation
+- translation
+- marian
+- esperanto
+- catalan
+- neural-machine-translation
+library_name: marian
+pipeline_tag: translation
+license: apache-2.0
+---
+# Esperanto → Catalan Marian NMT model
+This repository contains a **Marian NMT** model for **Esperanto-to-Catalan** machine translation.
+## Overview
+This model was trained for the translation direction:
+- **Source language:** Esperanto (`eo`)
+- **Target language:** Catalan (`ca`)
+It is distributed in **Marian format** and is intended to be used with the **Marian decoder**.
+## Important note
+This model is **not intended for direct inference through the Hugging Face `transformers` library**.
+Use **Marian** for inference instead.
+## Repository contents
+The repository includes the following files:
+- `model.npz.best-chrf.npz` — trained Marian model checkpoint
+- `tiny.decoder.yml` — decoder configuration
+- `vocab.spm` — SentencePiece vocabulary
+## Requirements
+You need a working installation of **Marian NMT**.
+For example, on our system the decoder binary is located at:
+```bash
+/scratch/project_2005815/members/degibert/MTM25/marian/build/marian-decoder
+```
+## Inference
+Run decoding from inside the model directory:
+```bash
+marian-decoder \
+  -c tiny.decoder.yml \
+  --input input.epo \
+  --output output.cat \
+  --normalize \
+  -m model.npz.best-chrf.npz \
+  --vocabs vocab.spm vocab.spm \
+  --log decode.log \
+  --devices 0
+```
+## Example
+Input file `input.epo`:
+```text
+Ŝi amas danci.
+```
+Output file `output.cat`:
+```text
+Li encanta ballar.
+```
+## Example helper script
+You can also run the model with a small shell script such as:
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+MARIAN_BIN="/path/to/marian-decoder"
+MODEL_DIR="$(cd "$(dirname "$0")" && pwd)"
+INPUT="${1:-input.epo}"
+OUTPUT="${2:-output.cat}"
+LOG="${3:-decode.log}"
+"$MARIAN_BIN" \
+  -c "$MODEL_DIR/tiny.decoder.yml" \
+  --input "$MODEL_DIR/$INPUT" \
+  --output "$MODEL_DIR/$OUTPUT" \
+  --normalize \
+  -m "$MODEL_DIR/model.npz.best-chrf.npz" \
+  --vocabs "$MODEL_DIR/vocab.spm" "$MODEL_DIR/vocab.spm" \
+  --log "$MODEL_DIR/$LOG" \
+  --devices 0
+```
+## Intended use
+This model is intended for:
+* research on low-resource machine translation
+* Esperanto–Catalan translation experiments
+* reproducible Marian-based inference
+## Limitations
+This is a research model and may have limitations including:
+* reduced robustness outside the training domain
+* sensitivity to spelling variation and noisy input
+* lower quality on idiomatic, literary, or highly specialised text
+Outputs should be reviewed before use in high-stakes or publication settings.
+## Training and evaluation
+Add here any details you want to share, for example:
+* training corpus or data source
+* preprocessing pipeline
+* tokenisation / SentencePiece setup
+* evaluation sets
+* BLEU / chrF results
+Example placeholder text:
+This model was trained as part of research on low-resource translation involving Esperanto and Catalan. Evaluation was carried out on held-out test data using standard MT metrics.
+## Citation
+If you use this model, please cite the associated work.
+```bibtex
+@misc{degibert2026eo_ca_marian,
+  title  = {Esperanto to Catalan MarianMT Model},
+  author = {Degibert, [Your full name]},
+  year   = {2026},
+  note   = {Model distributed via Hugging Face}
+}
+```
+## Acknowledgements
+This model was trained using **Marian NMT**.

model.npz.best-chrf.npz ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c180e7307a456d968a9c6ed3cbb22707a8c7e676c279cac47e8c9acfb7c6c243
+size 68714917
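
The pointer above records the SHA-256 digest and byte size of the real checkpoint. A minimal sketch of checking a downloaded file against those two values follows. It assumes GNU coreutils `sha256sum` is available; `verify_lfs_object` is a hypothetical helper name, not part of Git LFS or this repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed only if FILE matches the digest and size
# recorded in a Git LFS pointer.
# Usage: verify_lfs_object FILE EXPECTED_SHA256 EXPECTED_SIZE
verify_lfs_object() {
  local file="$1" expected_oid="$2" expected_size="$3"
  local actual_oid actual_size
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual_oid="$(sha256sum "$file" | cut -d' ' -f1)"
  # Byte count of the file; strip padding that some wc builds emit.
  actual_size="$(wc -c < "$file" | tr -d ' ')"
  [ "$actual_oid" = "$expected_oid" ] && [ "$actual_size" = "$expected_size" ]
}

# Example call with the values from the pointer above (run it in the
# directory that holds the downloaded checkpoint):
# verify_lfs_object model.npz.best-chrf.npz \
#   c180e7307a456d968a9c6ed3cbb22707a8c7e676c279cac47e8c9acfb7c6c243 \
#   68714917 && echo "checkpoint OK"
```

A mismatch usually means the file on disk is still the small pointer text rather than the LFS object itself.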

run_model.sh ADDED

@@ -0,0 +1,17 @@

+#!/usr/bin/env bash
+set -euo pipefail
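+MARIAN_BIN="<path-to-marian>"       # replace with the path to marian-decoder
+MODEL_DIR="<path-to-model-dir>"     # replace with the directory holding the model files
+src_file="<path-to-src-file>"       # replace with the source text file
+tgt="<tgt-tag>" # Choose between cat, spa, eng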
+cat "$src_file" | sed "s/^/>>${tgt}<< /" | \
+  "$MARIAN_BIN" \
+  -c "$MODEL_DIR/tiny.decoder.yml" \
+  --output "$MODEL_DIR/test.epo-cat.cat.out" \
+  --normalize \
+  -m "$MODEL_DIR/model.npz.best-chrf.npz" \
+  --vocabs "$MODEL_DIR/vocab.spm" "$MODEL_DIR/vocab.spm" \
+  --log "$MODEL_DIR/test.log" \
+  --devices 0
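
The script above prepends a `>>tag<<` token to every source line before decoding. The sketch below isolates that step so it can be tried without Marian installed. `tag_lines` is a hypothetical helper name, and whether the model requires the tag is an assumption drawn from the script's own comment, not something the README states.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read lines on stdin and prefix each one with a
# target-language token of the form ">>tag<< ", as run_model.sh does.
tag_lines() {
  local tgt="$1"
  sed "s/^/>>${tgt}<< /"
}

# Demo with the README's example sentence and the "cat" tag.
printf '%s\n' 'Ŝi amas danci.' 'Saluton!' | tag_lines cat
```

The tagged output can then be piped into the decoder in place of the `cat ... | sed ...` stage of the script.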

tiny.decoder.yml ADDED

@@ -0,0 +1,5 @@

+beam-size: 1
+mini-batch: 32
+maxi-batch: 100
+maxi-batch-sort: src
+skip-cost: True

vocab.spm ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d561cdf0fc7ad693c1bf1fe21732c6434650623ec69dc712aceb36483587914d
+size 805644