cfchase committed · verified
Commit 06d699c · 1 Parent(s): afe11c6

Upload README.md with huggingface_hub

# OpenShift AI Demo: Text-to-Image Generation

This demonstration showcases the complete machine learning workflow in Red Hat OpenShift AI, taking you from initial experimentation to production deployment. Using Stable Diffusion for text-to-image generation, you'll learn how to experiment with models, fine-tune them with custom data, create automated pipelines, and deploy models as scalable services.

## What You'll Learn

- **Data Science Projects**: Creating and managing ML workspaces in OpenShift AI
- **GPU-Accelerated Workbenches**: Leveraging NVIDIA GPUs for model training and inference
- **Model Experimentation**: Working with pre-trained models from Hugging Face
- **Fine-Tuning**: Customizing models with your own data using Dreambooth
- **Pipeline Automation**: Building repeatable ML workflows with Data Science Pipelines
- **Model Serving**: Deploying models as REST APIs using KServe
- **Production Integration**: Connecting served models to applications

## Prerequisites

### Platform Requirements
- Red Hat OpenShift cluster (4.12+)
- Red Hat OpenShift AI installed (2.9+)
  - For managed service: available as an add-on for OpenShift Dedicated or ROSA
  - For self-managed: install from OperatorHub
- GPU node with at least 45GB memory (NVIDIA L40S recommended, A10G minimum for smaller models)

### Storage Requirements
- S3-compatible object storage (MinIO, AWS S3, or Ceph)
- Two buckets configured:
  - `pipeline-artifacts`: for pipeline execution artifacts
  - `models`: for storing trained models

### Access Requirements
- OpenShift AI Dashboard access
- Ability to create Data Science Projects
- (Optional) Hugging Face account with API token for model downloads

## Quick Start

1. **Access OpenShift AI Dashboard**
   - Navigate to your OpenShift console
   - Click the application launcher (9-dot grid)
   - Select "Red Hat OpenShift AI"

2. **Create a Data Science Project**
   - Click "Data Science Projects"
   - Create a new project named `image-generation`

3. **Set Up Storage**
   - Import `setup/setup-s3.yaml` to create local S3 storage (for demos)
   - Or configure your own S3-compatible storage connections

4. **Create a Workbench**
   - Select the PyTorch notebook image
   - Allocate GPU resources
   - Add environment variables (including `HF_TOKEN` if available)
   - Attach data connections

5. **Clone This Repository**
   ```bash
   git clone https://github.com/cfchase/text-to-image-demo.git
   cd text-to-image-demo
   ```

6. **Follow the Notebooks**
   - `1_experimentation.ipynb`: Initial model testing
   - `2_fine_tuning.ipynb`: Training with custom data
   - `3_remote_inference.ipynb`: Testing deployed models

## Key Components

- **Workbenches**: Jupyter notebook environments for development
- **Pipelines**: Automated ML workflows
- **Model Serving**: Deploy models as REST APIs
- **Storage**: S3-compatible object storage for data and models

## Detailed Setup Instructions

### 1. Storage Configuration

#### Option A: Demo Setup (Local S3)
```bash
oc apply -f setup/setup-s3.yaml
```

This creates:
- MinIO deployment for S3-compatible storage
- Two PVCs for buckets
- Data connections for workbench and pipeline access

#### Option B: Production Setup (External S3)
Create data connections with your S3 credentials:
- Connection 1: "My Storage" - for workbench access
- Connection 2: "Pipeline Artifacts" - for pipeline server
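
Under the hood, an OpenShift AI data connection is a Kubernetes `Secret` carrying the S3 credentials plus dashboard labels and annotations. A minimal sketch is below; the secret name and all values are placeholders, and the label/annotation keys should be verified against a connection created through the dashboard in your OpenShift AI version:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-connection-my-storage   # hypothetical name
  labels:
    opendatahub.io/dashboard: "true"      # makes it visible in the dashboard
  annotations:
    opendatahub.io/connection-type: s3
    openshift.io/display-name: My Storage
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access-key>
  AWS_SECRET_ACCESS_KEY: <secret-key>
  AWS_S3_ENDPOINT: <s3-endpoint-url>
  AWS_DEFAULT_REGION: us-east-1
  AWS_S3_BUCKET: models
```

Creating the connection through the dashboard produces an equivalent secret, so either route works.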

### 2. Workbench Configuration

When creating your workbench:

**Notebook Image**: Choose based on your needs
- Standard Data Science: Basic Python environment
- PyTorch: Includes PyTorch, CUDA support (recommended for this demo)
- TensorFlow: For TensorFlow-based workflows
- Custom: Use your own image with specific dependencies

**Resources**:
- Small: 2 CPUs, 8Gi memory
- Medium: 7 CPUs, 24Gi memory
- Large: 14 CPUs, 56Gi memory
- GPU: Add 1-2 NVIDIA GPUs (required for this demo)

**Environment Variables**:
```
HF_TOKEN=<your-huggingface-token>   # For model downloads
AWS_S3_ENDPOINT=<s3-endpoint-url>   # Auto-configured if using data connections
AWS_ACCESS_KEY_ID=<access-key>      # Auto-configured if using data connections
AWS_SECRET_ACCESS_KEY=<secret-key>  # Auto-configured if using data connections
AWS_S3_BUCKET=<bucket-name>         # Auto-configured if using data connections
```
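
Inside the workbench, the notebooks read these variables from the environment. A small sanity check at the top of a notebook catches missing configuration early; the variable names match the list above, while `require_env` is a hypothetical helper, not something shipped in this repo:

```python
import os

# The S3 variables the notebooks need; HF_TOKEN is handled separately below.
REQUIRED = [
    "AWS_S3_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET",
]

def require_env(names, env=os.environ):
    """Return the values of the given variables, reporting all missing ones at once."""
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: env[n] for n in names}

# HF_TOKEN is optional, so warn instead of failing.
if not os.environ.get("HF_TOKEN"):
    print("HF_TOKEN not set; gated Hugging Face models will not download.")
```

Failing fast here is cheaper than discovering a typo in a bucket name halfway through a training run.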

### 3. Pipeline Server Setup

1. In your Data Science Project, go to "Pipelines" → "Create pipeline server"
2. Select the "Pipeline Artifacts" data connection
3. Wait for the server to be ready (2-3 minutes)

### 4. Model Serving Configuration

After training your model:

1. Deploy the custom Diffusers runtime:
   ```bash
   cd diffusers-runtime
   make build
   make push
   oc apply -f templates/serving-runtime.yaml
   ```

2. Create a model server in the OpenShift AI dashboard:
   - Model framework: "Custom"
   - Model location: S3 path to your trained model
   - Select the Diffusers serving runtime
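
Once the inference service is up, it can be called over plain REST. The sketch below assumes a KServe v1-style `{"instances": [...]}` request with a `prompt` field and a base64-encoded image in the response; the URL and exact field names are placeholders to verify against `diffusers-runtime/model.py` and `3_remote_inference.ipynb`:

```python
import base64
import json
import urllib.request

def build_payload(prompt: str) -> bytes:
    # KServe v1 predict protocol: a list of instances, one per request item.
    return json.dumps({"instances": [{"prompt": prompt}]}).encode("utf-8")

def predict(url: str, prompt: str) -> bytes:
    """POST a prompt to the inference endpoint and return decoded image bytes."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumed response shape: {"predictions": [{"image": {"b64": "..."}}]}
    return base64.b64decode(body["predictions"][0]["image"]["b64"])

# Example (placeholder URL, KServe v1 ":predict" path):
# png = predict("https://<inference-route>/v1/models/<model-name>:predict",
#               "a photo of teddy the dog in times square")
# open("teddy.png", "wb").write(png)
```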

## Project Structure

```
text-to-image-demo/
├── README.md                  # This file
├── ARCHITECTURE.md            # Technical architecture details
├── PIPELINES.md               # Pipeline automation guide
├── SERVING.md                 # Model serving guide
├── DEMO_SCRIPT.md             # Step-by-step demo script
│
├── 1_experimentation.ipynb    # Initial model testing
├── 2_fine_tuning.ipynb        # Custom training workflow
├── 3_remote_inference.ipynb   # Testing served models
│
├── requirements-base.txt      # Base Python dependencies
├── requirements-gpu.txt       # GPU-specific packages
│
├── finetuning_pipeline/       # Kubeflow pipeline components
│   ├── Dreambooth.pipeline    # Pipeline definition
│   ├── get_data.ipynb         # Data preparation step
│   ├── train.ipynb            # Training execution step
│   └── upload.ipynb           # Model upload step
│
├── diffusers-runtime/         # Custom KServe runtime
│   ├── Dockerfile             # Runtime container definition
│   ├── model.py               # KServe predictor implementation
│   └── templates/             # Kubernetes manifests
│
└── setup/                     # Deployment configurations
    └── setup-s3.yaml          # Demo S3 storage setup
```

## Workflow Overview

### 1. Experimentation Phase
- Load pre-trained Stable Diffusion model
- Test basic text-to-image generation
- Identify limitations with generic models
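
The experimentation step boils down to a few lines with Hugging Face `diffusers`. This is a sketch assuming a GPU workbench; the model ID is an example, and `1_experimentation.ipynb` may pin a different checkpoint:

```python
def generate(prompt, model_id="stabilityai/stable-diffusion-2-1-base"):
    """Load a pretrained Stable Diffusion pipeline and render one image."""
    # Imported lazily so the sketch is readable without the GPU stack installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")  # requires an NVIDIA GPU in the workbench
    return pipe(prompt).images[0]  # a PIL.Image

# image = generate("a photo of a dog in times square")
# image.save("generated.png")
```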

### 2. Training Phase
- Prepare custom training data (images of "Teddy")
- Fine-tune model using Dreambooth technique
- Save trained weights to S3 storage

### 3. Pipeline Automation
- Convert notebooks to pipeline steps
- Create repeatable training workflow
- Enable parameter tuning and experimentation

### 4. Model Serving
- Deploy custom KServe runtime
- Create inference service
- Expose REST API endpoint

### 5. Application Integration
- Test model via REST API
- Integrate with applications
- Monitor performance

## Troubleshooting

### GPU Issues
- **No GPU detected**: Ensure your node has GPU support and correct drivers
- **Out of memory**: Reduce batch size or use gradient checkpointing
- **CUDA errors**: Verify PyTorch and CUDA versions match

### Storage Issues
- **S3 connection failed**: Check credentials and endpoint URL
- **Permission denied**: Verify bucket policies and access keys
- **Upload timeouts**: Check network connectivity and proxy settings

### Pipeline Issues
- **Pipeline server not starting**: Check data connection configuration
- **Pipeline runs failing**: Review logs in pipeline run details
- **Missing artifacts**: Verify S3 bucket permissions

### Serving Issues
- **Model not loading**: Check S3 path and model format
- **Inference errors**: Review KServe pod logs
- **Timeout errors**: Increase resource limits or timeout values

## Additional Resources

- [Red Hat OpenShift AI Documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed)
- [OpenShift AI Learning Resources](https://developers.redhat.com/products/red-hat-openshift-ai/overview)
- [KServe Documentation](https://kserve.github.io/website/)
- [Hugging Face Diffusers](https://huggingface.co/docs/diffusers)

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests to improve this demo.

## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.