Update README and add a training doc
- README.md +25 -7
- training.md +7 -0
README.md
CHANGED
---
language: es
license: CC-BY 4.0
tags:
- spanish
- roberta
- vit
---

# CLIP-Spanish

CLIP-Spanish is a CLIP-like model for Spanish, composed of a RoBERTa-base text encoder and a ViT-B/32 image encoder implemented in [Flax](https://github.com/google/flax). Training scripts are included (see training.md).
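Like the original CLIP, a model of this kind is typically trained with a symmetric contrastive (InfoNCE) objective that pulls matching text/image embedding pairs together in a shared space. As a rough, framework-agnostic NumPy sketch of that objective (an illustration, not this repository's actual Flax training code):

```python
import numpy as np

def clip_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/image embeddings."""
    # L2-normalize so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    logits = t @ i.T / temperature  # (batch, batch) similarity matrix

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n = logits.shape[0]
    diag = np.arange(n)
    # Row-wise: each caption should pick out its own image; column-wise: vice versa.
    loss_t2i = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_i2t = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_t2i + loss_i2t) / 2
```

The loss is near zero when each caption's embedding is closest to its own image, and grows as matched pairs drift apart.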
This project is part of the [Flax/JAX Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organised by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

## Spanish WIT

We used a subset of 141,230 Spanish captions from the [WIT dataset](https://github.com/google-research-datasets/wit) for training.
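The WIT TSVs are multilingual, so extracting the Spanish subset amounts to filtering on the language column. A minimal pandas sketch (the column names `language` and `caption_reference_description` follow the public WIT schema, but this is an illustration, not the project's actual preprocessing code):

```python
import pandas as pd

def filter_spanish_captions(tsv_file):
    """Return the non-empty Spanish reference captions from one WIT TSV."""
    df = pd.read_csv(tsv_file, sep="\t")
    # Keep rows tagged as Spanish that actually carry a caption.
    mask = (df["language"] == "es") & df["caption_reference_description"].notna()
    return df.loc[mask, "caption_reference_description"].tolist()
```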
## Team members

- Eduardo González Ponferrada ([edugp](https://huggingface.co/edugp))
- Manu Romero ([mrm8488](https://huggingface.co/))
- María Grandury ([mariagrandury](https://huggingface.co/))

## Useful links

- [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- [Community Week thread](https://discuss.huggingface.co/t/bertin-pretrain-roberta-large-from-scratch-in-spanish/7125)
- [Community Week channel](https://discord.com/channels/858019234139602994/859113060068229190)
- [Hybrid CLIP example scripts](https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip)
- [Model repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
training.md
ADDED
# Training

* Download the TSV files from https://github.com/google-research-datasets/wit/blob/main/DATA.md.
* Use `prepare_wit.py` to download the images from Wikipedia annotated in each TSV file.
* Use `scale_converter.py` to remove corrupt images and resize suitable images to 224x224.
* Use `join_datasets_custom_split.py` to merge the JSON files from the different subsets of the dataset.
* Use `discard_incorrect_files.py` to filter out images that could not be converted.
* Finally, use `run-clip.sh` to train.
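As an illustration of the conversion step, a helper along these lines (a sketch of the idea, not the repository's actual `scale_converter.py`) could discard unreadable files and normalise the rest to 224x224 with Pillow:

```python
from io import BytesIO
from PIL import Image

def convert_image(raw_bytes, size=(224, 224)):
    """Return a resized RGB image, or None if the bytes are not a valid image."""
    try:
        img = Image.open(BytesIO(raw_bytes))
        img = img.convert("RGB")  # normalise palette/alpha modes to plain RGB
    except Exception:
        return None  # corrupt or unsupported file: discard it
    return img.resize(size, Image.BICUBIC)
```

Files for which this returns `None` are the ones a later filtering pass would drop.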