Rishabh2095 committed

Commit daec63e · Parent(s): e2af417

Add URL support for resume files - enables remote resume access via HTTP/HTTPS URLs

RESUME_STORAGE_GUIDE.md ADDED
@@ -0,0 +1,239 @@
# Resume Storage Options for HF Spaces Deployment

This guide explains different ways to store and access your resume file for the deployed LangGraph application on HuggingFace Spaces.

## Problem

HuggingFace Spaces doesn't allow binary files (PDFs) in git repositories. We removed `resume.pdf` from git, but the workflow still needs access to it.

## Solution Options

### ✅ Option 1: URL Support (Easiest - Already Implemented!)

**Status:** ✅ **Code updated - now supports URLs!**

You can now provide a resume URL instead of a file path. The code will automatically download it.

**Supported URL formats:**
- `https://example.com/resume.pdf` - Direct HTTP/HTTPS links
- `https://github.com/username/repo/raw/main/resume.pdf` - GitHub raw files
- `https://drive.google.com/uc?export=download&id=FILE_ID` - Google Drive (public)
- Any publicly accessible URL

**How to use:**

1. **Upload resume to a public location:**
   - GitHub: Upload to a repo and use the "raw" file URL
   - Google Drive: Make the file public, get a shareable link
   - Dropbox: Get a public link
   - Any web server or CDN

2. **Use the URL in your API call:**
   ```json
   {
     "assistant_id": "job_app_graph",
     "input": {
       "resume_path": "https://github.com/username/repo/raw/main/resume.pdf",
       "job_description_source": "https://example.com/job",
       "content_category": "cover_letter"
     }
   }
   ```

**Pros:**
- ✅ No code changes needed (already implemented)
- ✅ Works with any public URL
- ✅ No additional services required
- ✅ Easy to update (just replace the file at the URL)

**Cons:**
- ⚠️ File must be publicly accessible
- ⚠️ Requires internet connection to download

---

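Under the hood, the updated `parse_resume()` decides between downloading and reading locally by checking the URL scheme. A minimal sketch of that dispatch (the function name `classify_resume_source` is illustrative, not from the codebase):

```python
def classify_resume_source(path_or_url: str) -> str:
    """Return 'url' for remote sources, 'local' for filesystem paths.

    Mirrors the scheme check the updated parse_resume() performs before
    downloading the resume to a temporary file.
    """
    if str(path_or_url).startswith(("http://", "https://", "s3://", "gs://")):
        return "url"
    return "local"
```

Anything without a recognized scheme is treated as a local path, so existing callers that pass file paths keep working unchanged.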
### Option 2: HuggingFace Hub Dataset (Recommended for Production)

Store your resume in HF Hub as a dataset - native integration with HF Spaces.

**Steps:**

1. **Install HF Hub CLI:**
   ```bash
   pip install huggingface_hub
   ```

2. **Login to HF:**
   ```bash
   huggingface-cli login
   ```

3. **Create a dataset and upload resume:**
   ```bash
   # Create dataset (one-time)
   huggingface-cli repo create resume-dataset --type dataset

   # Upload resume (--repo-type dataset is needed; the default repo type is model)
   huggingface-cli upload Rishabh2095/resume-dataset resume.pdf resume.pdf --repo-type dataset
   ```

4. **Access in code (add to workflow):**
   ```python
   from huggingface_hub import hf_hub_download

   # Download resume from HF Hub (repo_type="dataset" for dataset repos)
   resume_path = hf_hub_download(
       repo_id="Rishabh2095/resume-dataset",
       filename="resume.pdf",
       repo_type="dataset",
       cache_dir="/tmp"
   )
   ```

5. **Use in API call** (the `resume_path` below is the local path returned by the download step; JSON itself does not allow inline comments):
   ```json
   {
     "assistant_id": "job_app_graph",
     "input": {
       "resume_path": "/tmp/resume.pdf",
       "job_description_source": "https://example.com/job",
       "content_category": "cover_letter"
     }
   }
   ```

**Pros:**
- ✅ Native HF integration
- ✅ Private datasets supported
- ✅ Version control for resume
- ✅ No external dependencies

**Cons:**
- ⚠️ Requires code modification to download from HF Hub
- ⚠️ Slight overhead for downloading

---

### Option 3: Object Storage (S3, GCS, Azure Blob)

Use cloud object storage for production scalability.

**Example: AWS S3**

1. **Upload to S3:**
   ```bash
   aws s3 cp resume.pdf s3://your-bucket/resume.pdf --acl public-read
   ```

2. **Use public URL:**
   ```json
   {
     "resume_path": "https://your-bucket.s3.amazonaws.com/resume.pdf"
   }
   ```

**For private S3 (requires credentials):**
- Add AWS credentials as HF Space secrets
- Use `boto3` to download in code

**Pros:**
- ✅ Scalable and reliable
- ✅ Supports private files with auth
- ✅ Industry standard

**Cons:**
- ⚠️ Requires cloud account setup
- ⚠️ May incur costs
- ⚠️ More complex setup

---

### Option 4: HF Spaces Persistent Storage

HF Spaces containers expose a writable `/tmp` directory. Note that it is ephemeral - it is wiped when the Space restarts - so the file must either be baked into the image or re-fetched at startup (true persistent storage is a separate Space feature mounted at `/data`).

**Steps:**

1. **Upload file via API or during build:**
   - Add the file to the Docker image (but this increases image size)
   - Or download during container startup

2. **Use in code:**
   ```python
   # In your workflow initialization
   DEFAULT_RESUME_PATH = "/tmp/resume.pdf"
   ```

**Pros:**
- ✅ No external dependencies
- ✅ Fast access (local file)

**Cons:**
- ⚠️ File must be in the Docker image (increases size)
- ⚠️ Not easily updatable without rebuild
- ⚠️ `/tmp` contents do not survive restarts

---

### Option 5: Environment Variable with URL

Store the resume URL as an HF Space secret.

**Steps:**

1. **Add to HF Space Secrets:**
   - Go to Space Settings → Variables and secrets
   - Add: `RESUME_URL=https://example.com/resume.pdf`

2. **Use in code:**
   ```python
   import os
   resume_path = os.getenv("RESUME_URL", "default_path_or_url")
   ```

**Pros:**
- ✅ Easy to update (change secret, no code deploy)
- ✅ Can point to any URL
- ✅ Works with Option 1 (URL support)

**Cons:**
- ⚠️ Requires code modification to read the env var

---

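Combined with Option 1, the lookup is a one-liner and the value can be either a local path or a URL (a sketch; the fallback path is a placeholder):

```python
import os


def resume_source() -> str:
    """Prefer the RESUME_URL Space secret; fall back to a bundled default."""
    return os.getenv("RESUME_URL", "/tmp/resume.pdf")
```

Because `parse_resume()` accepts both forms, the workflow does not need to know which one the secret holds.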
## Recommended Approach

**For Quick Start:** Use **Option 1 (URL Support)** - just upload your resume to GitHub, Google Drive, or any public URL and use that URL in your API calls.

**For Production:** Use **Option 2 (HF Hub Dataset)** - native integration, private support, version control.

## Implementation Status

- ✅ **URL Support:** Implemented in the `parse_resume()` function
- ⏳ **HF Hub Integration:** Can be added if needed
- ⏳ **Environment Variable:** Can be added if needed

## Testing

Test with a public resume URL:

```powershell
# Test with GitHub raw file URL
$body = @{
    assistant_id = "job_app_graph"
    input = @{
        resume_path = "https://github.com/username/repo/raw/main/resume.pdf"
        job_description_source = "https://example.com/job"
        content_category = "cover_letter"
    }
} | ConvertTo-Json

Invoke-RestMethod -Uri "https://rishabh2095-agentworkflowjobapplications.hf.space/runs/wait" `
    -Method POST -Body $body -ContentType "application/json"
```

## Next Steps

1. Upload your resume to a public location (GitHub, Google Drive, etc.)
2. Get the public URL
3. Use that URL in your API calls as `resume_path`
4. The code will automatically download and process it!
src/job_writing_agent/utils/document_processing.py CHANGED
```diff
@@ -258,11 +258,46 @@ def _is_heading(line: str) -> bool:
     return line.isupper() and len(line.split()) <= 5 and not re.search(r"\d", line)
 
 
-def parse_resume(file_path: str | Path) -> list[Document]:
+def parse_resume(file_path_or_url: str | Path) -> list[Document]:
     """
-    Load a résumé from PDF or TXT file → list[Document] chunks
+    Load a résumé from PDF or TXT file or URL → list[Document] chunks
     (≈400 chars, 50‑char overlap) with {source, section} metadata.
+
+    Supports:
+    - Local file paths: "/path/to/resume.pdf"
+    - URLs: "https://example.com/resume.pdf" or "s3://bucket/resume.pdf"
     """
+    import tempfile
+    import urllib.request
+
+    # Handle URLs
+    file_path = str(file_path_or_url)
+    is_url = file_path.startswith(("http://", "https://", "s3://", "gs://"))
+    tmp_file_path = None
+
+    if is_url:
+        logger.info(f"Downloading resume from URL: {file_path}")
+        # Create temporary file for downloaded resume
+        file_extension = Path(urlparse(file_path).path).suffix.lower()
+        if not file_extension:
+            file_extension = ".pdf"  # Default to PDF if extension not in URL
+
+        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=file_extension)
+        tmp_file_path = tmp_file.name
+        tmp_file.close()
+
+        try:
+            # Download file from URL
+            urllib.request.urlretrieve(file_path, tmp_file_path)
+            file_path = tmp_file_path
+            logger.info(f"Resume downloaded to temporary file: {file_path}")
+        except Exception as e:
+            # Clean up temp file on error
+            if tmp_file_path and os.path.exists(tmp_file_path):
+                os.unlink(tmp_file_path)
+            logger.error(f"Failed to download resume from URL: {e}")
+            raise ValueError(f"Could not download resume from URL {file_path_or_url}: {e}")
+
     file_extension = Path(file_path).suffix.lower()
 
     # Handle different file types
@@ -301,8 +336,18 @@ def parse_resume(file_path: str | Path) -> list[Document]:
         for chunk in splitter.split_text(md_text)
     ]  # Attach metadata
     for doc in chunks:
-        doc.metadata.setdefault("source", str(file_path))
+        # Use original source (URL or path) in metadata, not temp file path
+        doc.metadata.setdefault("source", str(file_path_or_url))
         # section already present if header‑splitter was used
+
+    # Clean up temporary file if it was downloaded from URL
+    if tmp_file_path and os.path.exists(tmp_file_path):
+        try:
+            os.unlink(tmp_file_path)
+            logger.debug(f"Cleaned up temporary file: {tmp_file_path}")
+        except Exception as e:
+            logger.warning(f"Failed to clean up temporary file {tmp_file_path}: {e}")
+
     return chunks
```