mrdbourke committed on
Commit
0faa204
·
verified ·
1 Parent(s): 6246f77

Uploading FoodExtract-Vision demo app.py

Files changed (3):
  1. README.md +20 -6
  2. app.py +97 -0
  3. requirements.txt +5 -0
README.md CHANGED
@@ -1,12 +1,26 @@
  ---
- title: FoodExtract Vision V1
- emoji: 🚀
- colorFrom: indigo
- colorTo: purple
  sdk: gradio
- sdk_version: 6.3.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

  ---
+ title: FoodExtract-Vision Fine-tuned VLM Structured Data Extractor
+ emoji: 🍟➡️📝
+ colorFrom: green
+ colorTo: blue
  sdk: gradio
  app_file: app.py
  pinned: false
+ license: apache-2.0
  ---

+ """
+ Fine-tuned SmolVLM2-500M to extract food and drink items from images.
+
+ Input can be any kind of image, and the output will be a formatted string such as the following:
+
+ ```json
+ {'is_food': 0, 'image_title': '', 'food_items': [], 'drink_items': []}
+ ```
+
+ Or, for an image of food:
+
+ ```json
+ {'is_food': 1, 'image_title': 'fried calamari', 'food_items': ['fried calamari'], 'drink_items': []}
+ ```
+ """
app.py ADDED
@@ -0,0 +1,97 @@
+ import torch
+ import gradio as gr
+
+ from transformers import pipeline
+
+ BASE_MODEL_ID = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
+ FINE_TUNED_MODEL_ID = "mrdbourke/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v1"
+ OUTPUT_TOKENS = 256
+
+ # Load original base model (no fine-tuning)
+ original_pipeline = pipeline(
+     "image-text-to-text",
+     model=BASE_MODEL_ID,
+     dtype=torch.bfloat16,
+     device_map="auto"
+ )
+
+ # Load fine-tuned model
+ ft_pipe = pipeline(
+     "image-text-to-text",
+     model=FINE_TUNED_MODEL_ID,
+     dtype=torch.bfloat16,
+     device_map="auto"
+ )
+
+ def create_message(input_image):
+     return [{'role': 'user',
+              'content': [{'type': 'image',
+                           'image': input_image},
+                          {'type': 'text',
+                           'text': "Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.\n\nOnly return valid JSON in the following form:\n\n```json\n{\n 'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)\n 'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present\n 'food_items': [], # list[str] - list of visible edible food item nouns\n 'drink_items': [] # list[str] - list of visible edible drink item nouns\n}\n```\n"}]}]
+
+ def extract_foods_from_image(input_image):
+     input_image = input_image.resize(size=(512, 512))
+     input_message = create_message(input_image=input_image)
+
+     # Get outputs from base model (not fine-tuned)
+     original_pipeline_output = original_pipeline(text=[input_message],
+                                                  max_new_tokens=OUTPUT_TOKENS)
+     outputs_pretrained = original_pipeline_output[0][0]["generated_text"][-1]["content"]
+
+     # Get outputs from fine-tuned model (fine-tuned on food images)
+     # (store in a separate variable so the ft_pipe pipeline isn't overwritten)
+     ft_pipe_output = ft_pipe(text=[input_message],
+                              max_new_tokens=OUTPUT_TOKENS)
+     outputs_fine_tuned = ft_pipe_output[0][0]["generated_text"][-1]["content"]
+
+     return outputs_pretrained, outputs_fine_tuned
+
+ demo_title = "🥑➡️📝 FoodExtract-Vision with a fine-tuned SmolVLM2-500M"
+ demo_description = """* **Base model:** https://huggingface.co/HuggingFaceTB/SmolVLM-500M-Instruct
+ * **Fine-tuning dataset:** https://huggingface.co/datasets/mrdbourke/FoodExtract-1k-Vision (1k food images and 500 not-food images)
+ * **Fine-tuned model:** https://huggingface.co/mrdbourke/FoodExtract-Vision-SmolVLM2-500M-fine-tune-v1
+
+ ## Overview
+
+ Extract food and drink items in a structured way from images.
+
+ The original model's outputs fail to capture the desired structure, but the fine-tuned model sticks to the output structure quite well.
+
+ However, the fine-tuned model could definitely be improved with respect to its ability to extract the right food/drink items.
+
+ Both models use the input prompt:
+
+ ````
+ Classify the given input image into food or not and if edible food or drink items are present, extract those to a list. If no food/drink items are visible, return empty lists.
+
+ Only return valid JSON in the following form:
+
+ ```json
+ {
+  'is_food': 0, # int - 0 or 1 based on whether food/drinks are present (0 = no foods visible, 1 = foods visible)
+  'image_title': '', # str - short food-related title for what foods/drinks are visible in the image, leave blank if no foods present
+  'food_items': [], # list[str] - list of visible edible food item nouns
+  'drink_items': [] # list[str] - list of visible edible drink item nouns
+ }
+ ```
+ ````
+
+ Except one model has been fine-tuned on the structured data whereas the other hasn't.
+
+ Notable next steps would be:
+ * **Remove the input prompt:** Just train the model to go straight from image -> text (no text prompt on input), which would save on inference tokens.
+ * **Fine-tune on more real-world data:** Right now the model is only trained on 1k food images (from Food101) and 500 not-food images (random internet images); training on real-world data would likely significantly improve performance.
+ """
+
+ demo = gr.Interface(
+     fn=extract_foods_from_image,
+     inputs=gr.Image(type="pil"),
+     title=demo_title,
+     description=demo_description,
+     outputs=[gr.Textbox(lines=4, label="Original Model (not fine-tuned)"),
+              gr.Textbox(lines=4, label="Fine-tuned Model")]
+ )
+
+ if __name__ == "__main__":
+     demo.launch()
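Since the demo returns the raw output strings from both models, the claim that the fine-tuned model "sticks to the output structure" can be made measurable with a small schema check. A sketch with a hypothetical `is_valid_extract` helper (not part of the app), assuming the four-key schema from the prompt:

```python
import ast

# The four keys the prompt asks the model to return
EXPECTED_KEYS = {"is_food", "image_title", "food_items", "drink_items"}

def is_valid_extract(output_str: str) -> bool:
    """Return True if a model output string matches the prompt's schema."""
    try:
        parsed = ast.literal_eval(output_str.strip())
    except (ValueError, SyntaxError):
        return False  # free-form text, not a parseable dict literal
    return (isinstance(parsed, dict)
            and set(parsed) == EXPECTED_KEYS
            and parsed["is_food"] in (0, 1)
            and isinstance(parsed["image_title"], str)
            and isinstance(parsed["food_items"], list)
            and isinstance(parsed["drink_items"], list))

# A well-formed structured output passes; free-form base-model text fails.
print(is_valid_extract("{'is_food': 0, 'image_title': '', 'food_items': [], 'drink_items': []}"))  # True
print(is_valid_extract("The image shows a plate of fried calamari."))  # False
```

Running this over a held-out set of images would give a structure-adherence rate for each model.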
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ num2words
+ transformers
+ torch
+ accelerate
+ gradio