Skyler215 committed on Nov 25, 2024

Commit 4ac5dbc (verified) · 1 parent: 338906a

End of training

Files changed (17)

README.md +73 -0
added_tokens.json +6 -0
config.json +209 -0
generation_config.json +13 -0
merges.txt +0 -0
model.safetensors +3 -0
preprocessor_config.json +22 -0
runs/Nov25_12-48-04_07d4093be4c9/events.out.tfevents.1732538888.07d4093be4c9.249.0 +3 -0
runs/Nov25_12-48-48_07d4093be4c9/events.out.tfevents.1732538929.07d4093be4c9.249.1 +3 -0
runs/Nov25_12-49-51_07d4093be4c9/events.out.tfevents.1732539048.07d4093be4c9.249.2 +3 -0
runs/Nov25_12-53-13_07d4093be4c9/events.out.tfevents.1732539197.07d4093be4c9.249.3 +3 -0
runs/Nov25_12-55-24_07d4093be4c9/events.out.tfevents.1732539326.07d4093be4c9.249.4 +3 -0
special_tokens_map.json +44 -0
tokenizer.json +0 -0
tokenizer_config.json +86 -0
training_args.bin +3 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,73 @@

+---
+library_name: transformers
+tags:
+- generated_from_trainer
+metrics:
+- rouge
+model-index:
+- name: VIT_Captioning
+  results: []
+---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# VIT_Captioning
+This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+It achieves the following results on the evaluation set:
+- Loss: 3.1590
+- Rouge1: 0.3875
+- Rouge2: 0.1212
+- Rougel: 0.3156
+- Rougelsum: 0.3166
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 5e-05
+- train_batch_size: 8
+- eval_batch_size: 8
+- seed: 42
+- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 1024
+- num_epochs: 10
+- mixed_precision_training: Native AMP
+### Training results
+| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
+|:-------------:|:-----:|:----:|:---------------:|:------:|:------:|:------:|:---------:|
+| No log        | 1.0   | 13   | 4.3037          | 0.3448 | 0.0437 | 0.2181 | 0.2186    |
+| No log        | 2.0   | 26   | 4.2507          | 0.3448 | 0.0437 | 0.2181 | 0.2186    |
+| No log        | 3.0   | 39   | 4.1702          | 0.3448 | 0.0437 | 0.2181 | 0.2186    |
+| No log        | 4.0   | 52   | 4.0673          | 0.3448 | 0.0437 | 0.2181 | 0.2186    |
+| No log        | 5.0   | 65   | 3.9448          | 0.3643 | 0.0496 | 0.2480 | 0.2481    |
+| No log        | 6.0   | 78   | 3.8053          | 0.3653 | 0.0499 | 0.2464 | 0.2466    |
+| No log        | 7.0   | 91   | 3.6485          | 0.3653 | 0.0499 | 0.2464 | 0.2466    |
+| No log        | 8.0   | 104  | 3.4774          | 0.4061 | 0.0678 | 0.2583 | 0.2586    |
+| No log        | 9.0   | 117  | 3.3057          | 0.3700 | 0.0443 | 0.2441 | 0.2448    |
+| No log        | 10.0  | 130  | 3.1590          | 0.3875 | 0.1212 | 0.3156 | 0.3166    |
+### Framework versions
+- Transformers 4.46.2
+- Pytorch 2.5.1+cu121
+- Datasets 3.1.0
+- Tokenizers 0.20.3
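
The training-results table in the README above can be checked with a little arithmetic. The following is a minimal sketch with the rows copied by hand from that table. The dataset-size bounds assume a single device, no gradient accumulation, and a kept final partial batch, none of which the commit states. The learning-rate estimate assumes the usual linear warmup, in which the rate scales as step / warmup_steps.

```python
# (epoch, step, validation_loss, rouge1) copied from the README table above.
rows = [
    (1, 13, 4.3037, 0.3448),
    (2, 26, 4.2507, 0.3448),
    (3, 39, 4.1702, 0.3448),
    (4, 52, 4.0673, 0.3448),
    (5, 65, 3.9448, 0.3643),
    (6, 78, 3.8053, 0.3653),
    (7, 91, 3.6485, 0.3653),
    (8, 104, 3.4774, 0.4061),
    (9, 117, 3.3057, 0.3700),
    (10, 130, 3.1590, 0.3875),
]

# Hyperparameters copied from the README.
TRAIN_BATCH_SIZE = 8
LEARNING_RATE = 5e-05
WARMUP_STEPS = 1024

steps_per_epoch = rows[0][1] // rows[0][0]

# Assumption: one device, no gradient accumulation, last partial batch kept.
min_examples = (steps_per_epoch - 1) * TRAIN_BATCH_SIZE + 1
max_examples = steps_per_epoch * TRAIN_BATCH_SIZE

# Does validation loss fall at every epoch?
losses = [row[2] for row in rows]
loss_strictly_decreasing = all(a > b for a, b in zip(losses, losses[1:]))

# Which epoch has the highest Rouge1?
best_rouge1_epoch = max(rows, key=lambda row: row[3])[0]

# Assumption: linear warmup, so the rate at step s is LEARNING_RATE * s / WARMUP_STEPS.
total_steps = rows[-1][1]
lr_at_last_step = LEARNING_RATE * min(total_steps, WARMUP_STEPS) / WARMUP_STEPS

print(steps_per_epoch, min_examples, max_examples)
print(loss_strictly_decreasing, best_rouge1_epoch, lr_at_last_step)
```

Under these assumptions the run ends inside the warmup phase (130 of 1024 warmup steps), so the configured learning rate of 5e-05 would never have been reached. The highest Rouge1 is at epoch 8 rather than at the final checkpoint.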

added_tokens.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "<|endoftext|>": 50257,
+  "[CLS]": 50258,
+  "[PAD]": 50259,
+  "[SEP]": 50260
+}

config.json ADDED

	@@ -0,0 +1,209 @@

+{
+  "architectures": [
+    "VisionEncoderDecoderModel"
+  ],
+  "decoder": {
+    "_attn_implementation_autoset": true,
+    "_name_or_path": "NlpHUST/gpt2-vietnamese",
+    "activation_function": "gelu_new",
+    "add_cross_attention": true,
+    "architectures": [
+      "GPT2LMHeadModel"
+    ],
+    "attn_pdrop": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": 50256,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "embd_pdrop": 0.0,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": 50256,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "initializer_range": 0.02,
+    "is_decoder": true,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_epsilon": 1e-05,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "gpt2",
+    "n_ctx": 1024,
+    "n_embd": 768,
+    "n_head": 12,
+    "n_inner": null,
+    "n_layer": 12,
+    "n_positions": 1024,
+    "no_repeat_ngram_size": 0,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "reorder_and_upcast_attn": false,
+    "repetition_penalty": 1.0,
+    "resid_pdrop": 0.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "scale_attn_by_inverse_layer_idx": false,
+    "scale_attn_weights": true,
+    "sep_token_id": null,
+    "summary_activation": null,
+    "summary_first_dropout": 0.1,
+    "summary_proj_to_labels": true,
+    "summary_type": "cls_index",
+    "summary_use_proj": true,
+    "suppress_tokens": null,
+    "task_specific_params": {
+      "text-generation": {
+        "do_sample": true,
+        "max_length": 50
+      }
+    },
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": null,
+    "torchscript": false,
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_cache": true,
+    "vocab_size": 50261
+  },
+  "decoder_start_token_id": 50258,
+  "early_stopping": null,
+  "encoder": {
+    "_attn_implementation_autoset": true,
+    "_name_or_path": "google/vit-base-patch16-224-in21k",
+    "add_cross_attention": false,
+    "architectures": [
+      "ViTModel"
+    ],
+    "attention_probs_dropout_prob": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "encoder_stride": 16,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "gelu",
+    "hidden_dropout_prob": 0.0,
+    "hidden_size": 768,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "image_size": 224,
+    "initializer_range": 0.02,
+    "intermediate_size": 3072,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-12,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "vit",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 12,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_channels": 3,
+    "num_hidden_layers": 12,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "patch_size": 16,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "qkv_bias": true,
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": null,
+    "torchscript": false,
+    "typical_p": 1.0,
+    "use_bfloat16": false
+  },
+  "eos_token_id": 50260,
+  "is_encoder_decoder": true,
+  "length_penalty": null,
+  "max_length": null,
+  "model_type": "vision-encoder-decoder",
+  "no_repeat_ngram_size": null,
+  "num_beams": null,
+  "pad_token_id": 50259,
+  "quantization_config": {
+    "_load_in_4bit": true,
+    "_load_in_8bit": false,
+    "bnb_4bit_compute_dtype": "float16",
+    "bnb_4bit_quant_storage": "uint8",
+    "bnb_4bit_quant_type": "nf4",
+    "bnb_4bit_use_double_quant": true,
+    "llm_int8_enable_fp32_cpu_offload": false,
+    "llm_int8_has_fp16_weight": false,
+    "llm_int8_skip_modules": null,
+    "llm_int8_threshold": 6.0,
+    "load_in_4bit": true,
+    "load_in_8bit": false,
+    "quant_method": "bitsandbytes"
+  },
+  "tie_encoder_decoder": true,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.2",
+  "use_cache": false,
+  "vocab_size": 50257
+}
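
The encoder and decoder sections of config.json above imply the shape of the sequence the decoder attends to. The following is a small sketch of that arithmetic, using only values copied from the config. The extra position for a class embedding reflects how ViT is normally described and is an assumption here, not something the diff states.

```python
# Values copied from the "encoder" and "decoder" sections of config.json above.
IMAGE_SIZE = 224
PATCH_SIZE = 16
ENCODER_HIDDEN_SIZE = 768
DECODER_N_EMBD = 768

# A 224x224 image cut into 16x16 patches gives a 14x14 grid.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2

# Assumption: ViT prepends one class embedding to the patch sequence.
encoder_sequence_length = num_patches + 1

# Equal widths mean encoder states already have the decoder's embedding size.
widths_match = ENCODER_HIDDEN_SIZE == DECODER_N_EMBD

print(patches_per_side, num_patches, encoder_sequence_length, widths_match)
```

Because both widths are 768 and `cross_attention_hidden_size` is null, the decoder's cross-attention can presumably consume the encoder output directly, without a projection between the two.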

generation_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "bos_token_id": 50256,
+  "decoder_start_token_id": 50258,
+  "early_stopping": true,
+  "eos_token_id": 50260,
+  "length_penalty": 2.0,
+  "max_length": 29,
+  "no_repeat_ngram_size": 3,
+  "num_beams": 4,
+  "pad_token_id": 50259,
+  "transformers_version": "4.46.2",
+  "use_cache": false
+}
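
The token ids in generation_config.json can be cross-checked against added_tokens.json from the same commit. The following is a minimal sketch with both mappings copied by hand. An id that resolves to `None` is simply absent from added_tokens.json; it may live in the base vocabulary (vocab.json), whose diff is not rendered on this page.

```python
# Copied from added_tokens.json in this commit.
added_tokens = {
    "<|endoftext|>": 50257,
    "[CLS]": 50258,
    "[PAD]": 50259,
    "[SEP]": 50260,
}

# Copied from generation_config.json above.
generation_ids = {
    "bos_token_id": 50256,
    "decoder_start_token_id": 50258,
    "eos_token_id": 50260,
    "pad_token_id": 50259,
}

# Invert the added-token mapping so ids can be looked up.
id_to_token = {token_id: token for token, token_id in added_tokens.items()}

# Resolve every generation id; None marks an id not listed among the added tokens.
resolved = {name: id_to_token.get(token_id) for name, token_id in generation_ids.items()}

for name, token in resolved.items():
    print(f"{name}: {token}")
```

In this commit, generation starts from `[CLS]`, stops at `[SEP]`, and pads with `[PAD]`. The `bos_token_id` of 50256 does not match the id 50257 that added_tokens.json assigns to `<|endoftext|>`, which may be worth checking before relying on the BOS token.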

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f08b62e0db86d85d9385c5e51409d7af7f172ba1d5b0776b62014e22f2998f66
+size 956847808
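
The LFS pointer above gives only a byte count, but it allows a rough parameter estimate. The following sketch assumes every weight is stored as 4-byte float32, as the `torch_dtype` in config.json suggests. It ignores the small safetensors header, so it slightly overestimates.

```python
# Size in bytes, copied from the Git LFS pointer above.
SIZE_BYTES = 956_847_808

# Assumption: all tensors are float32 (4 bytes each); the file header is ignored.
BYTES_PER_PARAM = 4

approx_params = SIZE_BYTES // BYTES_PER_PARAM
approx_params_millions = approx_params / 1e6

print(f"about {approx_params_millions:.0f}M parameters")
```

If the checkpoint actually holds 4-bit quantized tensors, as the `quantization_config` block might suggest, this estimate would not apply.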

preprocessor_config.json ADDED

	@@ -0,0 +1,22 @@

+{
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_processor_type": "ViTFeatureExtractor",
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "resample": 2,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "height": 224,
+    "width": 224
+  }
+}
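
The preprocessing settings above amount to a simple per-channel transform. The following is a minimal sketch of the rescale and normalize steps for a single 8-bit channel value, using only the numbers in preprocessor_config.json. Resizing to 224x224 is left out, and `preprocess_value` is a hypothetical helper name, not a library function.

```python
import math

# Values copied from preprocessor_config.json above (same for all three channels).
RESCALE_FACTOR = 0.00392156862745098
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def preprocess_value(pixel: int) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize it."""
    rescaled = pixel * RESCALE_FACTOR
    return (rescaled - IMAGE_MEAN) / IMAGE_STD


# The rescale factor is 1/255 written out as a decimal.
factor_is_one_over_255 = math.isclose(RESCALE_FACTOR, 1 / 255)

darkest = preprocess_value(0)
brightest = preprocess_value(255)

print(factor_is_one_over_255, darkest, brightest)
```

With a mean and standard deviation of 0.5, pixel values end up in roughly the range -1 to 1. The `resample` value of 2 corresponds to bilinear interpolation in Pillow's numbering.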

runs/Nov25_12-48-04_07d4093be4c9/events.out.tfevents.1732538888.07d4093be4c9.249.0 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cd9dd061236445c41609113526d5590541e4b73a62c333a1c7bed64ae5eb45d3
+size 10101

runs/Nov25_12-48-48_07d4093be4c9/events.out.tfevents.1732538929.07d4093be4c9.249.1 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c22bf437727d6555645644c1d578eae68ba54bc97c21b1616be60ee2f2cdeef8
+size 10101

runs/Nov25_12-49-51_07d4093be4c9/events.out.tfevents.1732539048.07d4093be4c9.249.2 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a49ca0ab5109331f3bcd56a075a52a1395c209a6f47430ea2435f40201f7d21d
+size 10101

runs/Nov25_12-53-13_07d4093be4c9/events.out.tfevents.1732539197.07d4093be4c9.249.3 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8dd71f8226546e087777736df4ad38dc24ad80abc61780ef22fb5adab31e4f31
+size 10101

runs/Nov25_12-55-24_07d4093be4c9/events.out.tfevents.1732539326.07d4093be4c9.249.4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:00c1b3721f84b35f66f8f320818ac2b0694f47635f0a9a9ca1dae8f6abbf82e5
+size 15114

special_tokens_map.json ADDED

	@@ -0,0 +1,44 @@

+{
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,86 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<mask>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50257": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50258": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50259": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50260": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "eos_token": "<|endoftext|>",
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}
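
The `model_max_length` value above looks arbitrary but is the integer form of the float `1e30`, which reads as a "no limit configured" placeholder rather than a real context size. The following short sketch checks that and compares it with the decoder's `n_positions` from config.json. Treating `n_positions` as the practical ceiling is an assumption.

```python
# Copied from tokenizer_config.json above.
MODEL_MAX_LENGTH = 1000000000000000019884624838656

# Copied from the decoder section of config.json in this commit.
DECODER_N_POSITIONS = 1024

# The odd trailing digits come from converting the float 1e30 to an integer.
is_float_placeholder = MODEL_MAX_LENGTH == int(1e30)

# Assumption: the decoder's position table is the real upper bound on sequence length.
practical_limit = min(MODEL_MAX_LENGTH, DECODER_N_POSITIONS)

print(is_float_placeholder, practical_limit)
```

Since the tokenizer itself enforces no useful limit, callers that truncate captions would likely need to pass an explicit maximum length.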

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3250168532d7a398ffe371b69ed5814ef5808e6d1166e9f0afc312e49f36e8e7
+size 5432

vocab.json ADDED

The diff for this file is too large to render. See raw diff