odegiber committed on
Commit
c763a41
·
verified ·
1 Parent(s): 07769f4

Upload folder using huggingface_hub

Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +158 -0
  3. model.npz.best-chrf.npz +3 -0
  4. run_model.sh +17 -0
  5. tiny.decoder.yml +5 -0
  6. vocab.spm +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ vocab.spm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,158 @@
1
+ ---
2
+ language:
3
+ - eo
4
+ - ca
5
+ tags:
6
+ - machine-translation
7
+ - translation
8
+ - marian
9
+ - esperanto
10
+ - catalan
11
+ - neural-machine-translation
12
+ library_name: marian
13
+ pipeline_tag: translation
14
+ license: apache-2.0
15
+ ---
16
+
17
+ # Esperanto → Catalan MarianMT model
18
+
19
+ This repository contains a **Marian NMT** model for **Esperanto-to-Catalan** machine translation.
20
+
21
+ ## Overview
22
+
23
+ This model was trained for the translation direction:
24
+
25
+ - **Source language:** Esperanto (`eo`)
26
+ - **Target language:** Catalan (`ca`)
27
+
28
+ It is distributed in **Marian format** and is intended to be used with the **Marian decoder**.
29
+
30
+ ## Important note
31
+
32
+ This model is **not intended for direct inference through the Hugging Face `transformers` library**.
33
+
34
+ Use **Marian** for inference instead.
35
+
36
+ ## Repository contents
37
+
38
+ The repository includes the following files:
39
+
40
+ - `model.npz.best-chrf.npz` — trained Marian model checkpoint
41
+ - `tiny.decoder.yml` — decoder configuration
42
+ - `vocab.spm` — SentencePiece vocabulary
43
+
44
+ ## Requirements
45
+
46
+ You need a working installation of **Marian NMT**.
47
+
48
+ For example, on our system the decoder binary is located at:
49
+
50
+ ```bash
51
+ /scratch/project_2005815/members/degibert/MTM25/marian/build/marian-decoder
52
+ ```
53
+
54
+ ## Inference
55
+
56
+ Run decoding from inside the model directory:
57
+
58
+ ```bash
59
+ marian-decoder \
60
+ -c tiny.decoder.yml \
61
+ --input input.epo \
62
+ --output output.cat \
63
+ --normalize \
64
+ -m model.npz.best-chrf.npz \
65
+ --vocabs vocab.spm vocab.spm \
66
+ --log decode.log \
67
+ --devices 0
68
+ ```
69
+
70
+ ## Example
71
+
72
+ Input file `input.epo`:
73
+
74
+ ```text
75
+ Ŝi amas danci.
76
+ ```
77
+
78
+ Output file `output.cat`:
79
+
80
+ ```text
81
+ Li encanta ballar.
82
+ ```
83
+
84
+ ## Example helper script
85
+
86
+ You can also run the model with a small shell script such as:
87
+
88
+ ```bash
89
+ #!/usr/bin/env bash
90
+ set -euo pipefail
91
+
92
+ MARIAN_BIN="/path/to/marian-decoder"
93
+ MODEL_DIR="$(cd "$(dirname "$0")" && pwd)"
94
+
95
+ INPUT="${1:-input.epo}"
96
+ OUTPUT="${2:-output.cat}"
97
+ LOG="${3:-decode.log}"
98
+
99
+ "$MARIAN_BIN" \
100
+ -c "$MODEL_DIR/tiny.decoder.yml" \
101
+ --input "$MODEL_DIR/$INPUT" \
102
+ --output "$MODEL_DIR/$OUTPUT" \
103
+ --normalize \
104
+ -m "$MODEL_DIR/model.npz.best-chrf.npz" \
105
+ --vocabs "$MODEL_DIR/vocab.spm" "$MODEL_DIR/vocab.spm" \
106
+ --log "$MODEL_DIR/$LOG" \
107
+ --devices 0
108
+ ```
109
+
110
+ ## Intended use
111
+
112
+ This model is intended for:
113
+
114
+ * research on low-resource machine translation
115
+ * Esperanto–Catalan translation experiments
116
+ * reproducible Marian-based inference
117
+
118
+ ## Limitations
119
+
120
+ This is a research model and may have limitations including:
121
+
122
+ * reduced robustness outside the training domain
123
+ * sensitivity to spelling variation and noisy input
124
+ * lower quality on idiomatic, literary, or highly specialised text
125
+
126
+ Outputs should be reviewed before use in high-stakes or publication settings.
127
+
128
+ ## Training and evaluation
129
+
130
+ Add here any details you want to share, for example:
131
+
132
+ * training corpus or data source
133
+ * preprocessing pipeline
134
+ * tokenisation / SentencePiece setup
135
+ * evaluation sets
136
+ * BLEU / chrF results
137
+
138
+ Example placeholder text:
139
+
140
+ This model was trained as part of research on low-resource translation involving Esperanto and Catalan. Evaluation was carried out on held-out test data using standard MT metrics.
141
+
142
+ ## Citation
143
+
144
+ If you use this model, please cite the associated work.
145
+
146
+ ```bibtex
147
+ @misc{degibert2026eo_ca_marian,
148
+ title = {Esperanto to Catalan MarianMT Model},
149
+ author = {Degibert, [Your full name]},
150
+ year = {2026},
151
+ note = {Model distributed via Hugging Face}
152
+ }
153
+ ```
154
+
155
+ ## Acknowledgements
156
+
157
+ This model was trained using **Marian NMT**.
158
+
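The README's inference commands assume that the three files it lists sit together in the model directory. A minimal pre-flight check could look like the sketch below. `check_model_dir` is a hypothetical helper name, not something shipped in this commit, and the file names are taken from the README's "Repository contents" list.

```bash
#!/usr/bin/env bash
set -euo pipefail

# check_model_dir (hypothetical helper): succeed only if every file the
# README lists is present in the given directory; otherwise report the
# first missing file on stderr and return non-zero.
check_model_dir() {
  local dir="$1" f
  for f in model.npz.best-chrf.npz tiny.decoder.yml vocab.spm; do
    if [[ ! -f "$dir/$f" ]]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
}
```

Calling `check_model_dir "$MODEL_DIR"` before invoking the decoder turns a missing checkpoint or vocabulary into an early, explicit error. It only checks that the files exist, not that they are the real LFS objects rather than pointer stubs.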
model.npz.best-chrf.npz ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c180e7307a456d968a9c6ed3cbb22707a8c7e676c279cac47e8c9acfb7c6c243
3
+ size 68714917
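The three lines above are a Git LFS pointer: `oid` is the SHA-256 of the real file and `size` is its length in bytes. After downloading, a local copy can be checked against those two fields. The sketch below assumes GNU `sha256sum` is available (on macOS, `shasum -a 256` would be the usual substitute); `verify_lfs_object` is a hypothetical helper, not a git-lfs command.

```bash
#!/usr/bin/env bash
set -euo pipefail

# verify_lfs_object (hypothetical helper): compare a local file with the
# oid (SHA-256 hex digest) and size (bytes) recorded in an LFS pointer.
# Returns 0 only when both match.
verify_lfs_object() {
  local file="$1" expected_oid="$2" expected_size="$3"
  local actual_oid actual_size
  actual_oid="$(sha256sum "$file" | cut -d' ' -f1)"
  actual_size="$(wc -c < "$file" | tr -d '[:space:]')"
  [[ "$actual_oid" == "$expected_oid" && "$actual_size" == "$expected_size" ]]
}
```

For this checkpoint the call would be `verify_lfs_object model.npz.best-chrf.npz c180e7307a456d968a9c6ed3cbb22707a8c7e676c279cac47e8c9acfb7c6c243 68714917`. A size of roughly 130 bytes instead of 68714917 usually means the repository was cloned without Git LFS and the file is still a pointer stub.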
run_model.sh ADDED
@@ -0,0 +1,17 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ MARIAN_BIN="<path-to-marian>"
5
+ MODEL_DIR="<path-to-model-dir>"
6
+ src_file="<path-to-src-file>"
7
+ tgt="<tgt-tag>" # Choose between cat, spa, eng
8
+
9
+ cat "$src_file" | sed "s/^/>>${tgt}<< /" | \
10
+ "$MARIAN_BIN" \
11
+ -c "$MODEL_DIR/tiny.decoder.yml" \
12
+ --output "$MODEL_DIR/test.epo-cat.cat.out" \
13
+ --normalize \
14
+ -m "$MODEL_DIR/model.npz.best-chrf.npz" \
15
+ --vocabs "$MODEL_DIR/vocab.spm" "$MODEL_DIR/vocab.spm" \
16
+ --log "$MODEL_DIR/test.log" \
17
+ --devices 0
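`run_model.sh` prepends a target-language tag of the form `>>cat<<` to every source line before decoding, whereas the README's inference commands pass the input file untagged. The commit does not say which form the model expects. The sketch below isolates only the tagging step so it can be inspected without Marian; `add_target_tag` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# add_target_tag (hypothetical helper): read lines on stdin and prefix each
# with a target-language tag such as >>cat<<, followed by a single space.
add_target_tag() {
  local tgt="$1"
  sed "s/^/>>${tgt}<< /"
}

# The README's example sentence, tagged for Catalan output.
printf 'Ŝi amas danci.\n' | add_target_tag cat
# prints: >>cat<< Ŝi amas danci.
```

Piping the output of `add_target_tag` into `marian-decoder` (which reads standard input when no `--input` is given) reproduces what `run_model.sh` intends.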
tiny.decoder.yml ADDED
@@ -0,0 +1,5 @@
1
+ beam-size: 1
2
+ mini-batch: 32
3
+ maxi-batch: 100
4
+ maxi-batch-sort: src
5
+ skip-cost: True
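Marian configuration keys correspond to its long command-line options, so the five settings in `tiny.decoder.yml` can also be written as flags. The sketch below collects them in a bash array and prints them; it is an illustration of the mapping, not a replacement for the config file, and the array name `decoder_args` is an arbitrary choice.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Command-line form of tiny.decoder.yml:
#   beam-size 1         greedy search (one hypothesis per step)
#   mini-batch 32       sentences per batch
#   maxi-batch 100      mini-batches preloaded for length sorting
#   maxi-batch-sort src sort the preloaded sentences by source length
#   skip-cost           do not compute output scores
decoder_args=(
  --beam-size 1
  --mini-batch 32
  --maxi-batch 100
  --maxi-batch-sort src
  --skip-cost
)

printf '%s\n' "${decoder_args[*]}"
```

With this array, `marian-decoder "${decoder_args[@]}" -m model.npz.best-chrf.npz --vocabs vocab.spm vocab.spm` should behave like the `-c tiny.decoder.yml` invocation, under the assumption that no other defaults are overridden elsewhere.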
vocab.spm ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d561cdf0fc7ad693c1bf1fe21732c6434650623ec69dc712aceb36483587914d
3
+ size 805644