xeeshan404 committed on
Commit
471cec2
·
verified ·
1 Parent(s): 6bbc247

UI updates

Files changed (2)
  1. README.md +8 -9
  2. app.py +34 -34
README.md CHANGED
@@ -1,6 +1,5 @@
1
  ---
2
  title: Agentic Image2Word Converter
3
- emoji: 🧠
4
  colorFrom: indigo
5
  colorTo: blue
6
  sdk: gradio
@@ -10,16 +9,16 @@ pinned: false
10
  license: mit
11
  ---
12
 
13
- # 🧠 Agentic Image2Word Converter
14
 
15
  An AI-powered document conversion system that transforms scanned images into formatted Word documents using an **agentic architecture**. Built as a Phase 2 evolution of a traditional OCR converter, this system features autonomous decision-making, adaptive preprocessing, and learning memory.
16
 
17
- ## 🏗️ Architecture - Agentic System
18
 
19
  Unlike a traditional static pipeline, this application operates as a **Goal-Based Learning Agent** with five autonomous processing stages:
20
 
21
  ```
22
- 📷 Image → 👁️ PERCEIVE → 🔬 ANALYZE → 🧠 DECIDE → ⚡ ACT → πŸ“ LEARN → 📄 DOCX
23
  ```
24
 
25
  | Stage | Description |
@@ -48,7 +47,7 @@ Unlike a traditional static pipeline, this application operates as a **Goal-Base
48
  | Document Output | **python-docx** |
49
  | Deployment | **Hugging Face Spaces** |
50
 
51
- ## 🚀 Quick Start
52
 
53
  ### Prerequisites
54
  - Python 3.10+
@@ -81,7 +80,7 @@ Open `http://localhost:7860` in your browser.
81
  4. **Review**: Check the preview, edit extracted text if needed
82
  5. **Download**: Get your formatted .docx file
83
 
84
- ## 🛡️ Ethical Design
85
 
86
  | Concern | Mitigation |
87
  |---------|------------|
@@ -90,7 +89,7 @@ Open `http://localhost:7860` in your browser.
90
  | **User Control** | Human-in-the-loop text editing before final output |
91
  | **Safety** | Confidence scoring with quality alerts |
92
 
93
- ## πŸ“ Project Structure
94
 
95
  ```
96
  PPIT/
@@ -111,11 +110,11 @@ PPIT/
111
  └── README.md
112
  ```
113
 
114
- ## 🌐 Deployment
115
 
116
  This application is deployed on **Hugging Face Spaces**:
117
  - Live URL: [https://huggingface.co/spaces/xeeshan404/agentic-image2word](https://huggingface.co/spaces/xeeshan404/agentic-image2word)
118
 
119
- ## 📄 License
120
 
121
  MIT License - Open Source
 
1
  ---
2
  title: Agentic Image2Word Converter
 
3
  colorFrom: indigo
4
  colorTo: blue
5
  sdk: gradio
 
9
  license: mit
10
  ---
11
 
12
+ # Agentic Image2Word Converter
13
 
14
  An AI-powered document conversion system that transforms scanned images into formatted Word documents using an **agentic architecture**. Built as a Phase 2 evolution of a traditional OCR converter, this system features autonomous decision-making, adaptive preprocessing, and learning memory.
15
 
16
+ ## Architecture - Agentic System
17
 
18
  Unlike a traditional static pipeline, this application operates as a **Goal-Based Learning Agent** with five autonomous processing stages:
19
 
20
  ```
21
+ Image → PERCEIVE → ANALYZE → DECIDE → ACT → LEARN → DOCX
22
  ```
23
 
24
  | Stage | Description |
 
47
  | Document Output | **python-docx** |
48
  | Deployment | **Hugging Face Spaces** |
49
 
50
+ ## Quick Start
51
 
52
  ### Prerequisites
53
  - Python 3.10+
 
80
  4. **Review**: Check the preview, edit extracted text if needed
81
  5. **Download**: Get your formatted .docx file
82
 
83
+ ## Ethical Design
84
 
85
  | Concern | Mitigation |
86
  |---------|------------|
 
89
  | **User Control** | Human-in-the-loop text editing before final output |
90
  | **Safety** | Confidence scoring with quality alerts |
91
 
92
+ ## Project Structure
93
 
94
  ```
95
  PPIT/
 
110
  └── README.md
111
  ```
112
 
113
+ ## Deployment
114
 
115
  This application is deployed on **Hugging Face Spaces**:
116
  - Live URL: [https://huggingface.co/spaces/xeeshan404/agentic-image2word](https://huggingface.co/spaces/xeeshan404/agentic-image2word)
117
 
118
+ ## License
119
 
120
  MIT License - Open Source
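
The five-stage flow named in the README hunks above (PERCEIVE, ANALYZE, DECIDE, ACT, LEARN) can be pictured roughly as the loop below. This is only a sketch under assumptions: the stage functions, the `State` dictionary, and `run_pipeline` are hypothetical names chosen for illustration, not code from this repository. Only the stage names and the `processing_log` key appear in the diff, and the shape of the log entries here is a guess.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]

# Hypothetical placeholder stages. Each one receives the shared state,
# records its result, and returns the state for the next stage.
def perceive(state: State) -> State:
    state["properties"] = {"width": None, "height": None}  # image properties
    return state

def analyze(state: State) -> State:
    state["text"] = ""  # OCR output would be stored here
    return state

def decide(state: State) -> State:
    state["format_plan"] = []  # formatting decisions would be stored here
    return state

def act(state: State) -> State:
    state["docx_path"] = None  # path of the generated .docx would go here
    return state

def learn(state: State) -> State:
    state["remembered"] = True  # a real agent would persist what it learned
    return state

# Stage order follows the README diagram.
STAGES: List[Tuple[str, Callable[[State], State]]] = [
    ("perceive", perceive),
    ("analyze", analyze),
    ("decide", decide),
    ("act", act),
    ("learn", learn),
]

def run_pipeline(image_path: str) -> State:
    """Run every stage in order and log which stages were executed."""
    state: State = {"image_path": image_path, "processing_log": []}
    for name, stage in STAGES:
        state = stage(state)
        state["processing_log"].append(name)
    return state

if __name__ == "__main__":
    print(run_pipeline("scan.png")["processing_log"])
```

Keeping the stages in an ordered list makes the order explicit and lets the log be built in one place; the real `run_agent` may be organized quite differently.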
app.py CHANGED
@@ -140,13 +140,13 @@ def process_image(image_path, api_key, llm_provider, progress=gr.Progress()):
140
  if image_path is None:
141
  return (
142
  None,
143
- '<div style="color: #f87171; padding: 20px;">⚠️ Please upload an image first.</div>',
144
  "",
145
  "",
146
  ""
147
  )
148
 
149
- progress(0.1, desc="πŸ” Agent: Perceiving image...")
150
 
151
  # Determine provider
152
  provider = "none"
@@ -160,19 +160,19 @@ def process_image(image_path, api_key, llm_provider, progress=gr.Progress()):
160
  key = "http://localhost:11434"
161
 
162
  try:
163
- progress(0.2, desc="🧠 Agent: Analyzing image properties...")
164
 
165
  # Run the full agentic workflow
166
  result = run_agent(image_path, api_key=key, llm_provider=provider)
167
 
168
- progress(0.9, desc="✅ Agent: Finalizing...")
169
 
170
  # Extract results
171
  error = result.get("error")
172
  if error:
173
  return (
174
  None,
175
- f'<div style="color: #f87171; padding: 20px;">❌ {error}</div>',
176
  "",
177
  _format_agent_log(result.get("processing_log", [])),
178
  ""
@@ -225,7 +225,7 @@ def _build_status_html(confidence, quality, result):
225
  html = f"""
226
  <div style="padding: 16px; font-family: Inter, sans-serif;">
227
  <div style="display: flex; align-items: center; gap: 12px; margin-bottom: 12px;">
228
- <span style="font-size: 1.5em;">{'✅' if confidence >= 0.7 else '⚠️' if confidence >= 0.5 else '❌'}</span>
229
  <div>
230
  <div style="font-weight: 600; font-size: 1.1em; color: {color};">
231
  {quality.upper()} Quality - {conf_pct}% Confidence
@@ -239,10 +239,10 @@ def _build_status_html(confidence, quality, result):
239
  <div class="confidence-fill" style="width: {conf_pct}%; background: {color};"></div>
240
  </div>
241
  <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 8px; margin-top: 12px; font-size: 0.85em; color: #94a3b8;">
242
- <div>πŸ“ Resolution: {props.get('width', '?')}×{props.get('height', '?')}</div>
243
- <div>🎯 DPI: {props.get('resolution_dpi', '?')}</div>
244
- <div>🔆 Contrast: {props.get('contrast_score', 0):.0%}</div>
245
- <div>📊 Noise: {props.get('noise_level', 0):.0%}</div>
246
  </div>
247
  {'<div style="margin-top: 12px; padding: 8px; background: rgba(251, 191, 36, 0.1); border-radius: 8px; color: #fbbf24; font-size: 0.85em;">⚠️ Low confidence - consider reviewing the extracted text before downloading.</div>' if result.get("needs_human_review") else ''}
248
  </div>
@@ -257,11 +257,11 @@ def _format_agent_log(log):
257
 
258
  lines = []
259
  step_icons = {
260
- "perceive": "👁️",
261
- "analyze": "🔬",
262
- "decide": "🧠",
263
- "act": "⚡",
264
- "learn": "πŸ“",
265
  }
266
 
267
  for entry in log:
@@ -348,7 +348,7 @@ def build_ui():
348
  # Header
349
  gr.HTML("""
350
  <div class="header-banner">
351
- <h1>🧠 Agentic Image2Word Converter</h1>
352
  <p>AI-powered document conversion with adaptive OCR, intelligent formatting, and learning memory</p>
353
  </div>
354
  """)
@@ -356,14 +356,14 @@ def build_ui():
356
  with gr.Tabs() as tabs:
357
 
358
  # ── Tab 1: Converter ──────────────────────────────────────────
359
- with gr.TabItem("🔄 Convert", id="convert-tab"):
360
  with gr.Row(equal_height=False):
361
 
362
  # Left Column - Input & Settings
363
  with gr.Column(scale=1):
364
  img_input = gr.Image(
365
  type="filepath",
366
- label="📷 Upload Image",
367
  height=300,
368
  sources=["upload", "clipboard"],
369
  )
@@ -383,7 +383,7 @@ def build_ui():
383
  )
384
 
385
  btn_convert = gr.Button(
386
- "🚀 Convert with Agent",
387
  variant="primary",
388
  size="lg",
389
  )
@@ -397,13 +397,13 @@ def build_ui():
397
  # Right Column β€” Output
398
  with gr.Column(scale=1):
399
  with gr.Tabs():
400
- with gr.TabItem("📄 Preview"):
401
  preview_output = gr.HTML(
402
  value='<div style="padding: 40px; color: #94a3b8; text-align: center; font-style: italic;">Document preview will appear here...</div>',
403
  label="Document Preview",
404
  )
405
 
406
- with gr.TabItem("πŸ“ Raw Text"):
407
  text_output = gr.Textbox(
408
  label="Extracted Text (Editable)",
409
  lines=12,
@@ -412,7 +412,7 @@ def build_ui():
412
  info="Edit text here before generating the final document",
413
  )
414
 
415
- with gr.TabItem("🤖 Agent Log"):
416
  agent_log = gr.Textbox(
417
  label="Agent Decision Log",
418
  lines=12,
@@ -433,10 +433,10 @@ def build_ui():
433
  )
434
 
435
  # ── Tab 2: Memory & History ───────────────────────────────────
436
- with gr.TabItem("📊 History & Memory", id="history-tab"):
437
  gr.HTML("""
438
  <div style="padding: 12px 0;">
439
- <h3 style="margin: 0; color: #e2e8f0;">🧠 Agent Memory</h3>
440
  <p style="color: #94a3b8; margin: 4px 0 0;">
441
  The agent learns from each document it processes, adapting its preprocessing
442
  and formatting decisions based on past results.
@@ -444,11 +444,11 @@ def build_ui():
444
  </div>
445
  """)
446
  history_display = gr.HTML(value=get_history_html())
447
- btn_refresh = gr.Button("🔄 Refresh History", size="sm")
448
  btn_refresh.click(fn=get_history_html, outputs=[history_display])
449
 
450
  # ── Tab 3: About ──────────────────────────────────────────────
451
- with gr.TabItem("ℹ️ About", id="about-tab"):
452
  gr.HTML("""
453
  <div style="padding: 20px; line-height: 1.8;">
454
  <h2 style="color: #e2e8f0;">Agentic Image2Word Converter</h2>
@@ -457,36 +457,36 @@ def build_ui():
457
  Word documents using an <b>agentic AI architecture</b>.
458
  </p>
459
 
460
- <h3 style="color: #818cf8; margin-top: 20px;">🏗️ Architecture</h3>
461
  <div style="display: grid; grid-template-columns: repeat(5, 1fr); gap: 8px; margin: 12px 0;">
462
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
463
- <div style="font-size: 1.5em;">👁️</div>
464
  <div style="font-weight: 600; color: #e2e8f0;">Perceive</div>
465
  <div style="font-size: 0.75em; color: #94a3b8;">Analyze image</div>
466
  </div>
467
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
468
- <div style="font-size: 1.5em;">🔬</div>
469
  <div style="font-weight: 600; color: #e2e8f0;">Analyze</div>
470
  <div style="font-size: 0.75em; color: #94a3b8;">Run OCR</div>
471
  </div>
472
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
473
- <div style="font-size: 1.5em;">🧠</div>
474
  <div style="font-weight: 600; color: #e2e8f0;">Decide</div>
475
  <div style="font-size: 0.75em; color: #94a3b8;">Format text</div>
476
  </div>
477
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
478
- <div style="font-size: 1.5em;">⚡</div>
479
  <div style="font-weight: 600; color: #e2e8f0;">Act</div>
480
  <div style="font-size: 0.75em; color: #94a3b8;">Generate DOCX</div>
481
  </div>
482
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
483
- <div style="font-size: 1.5em;">πŸ“</div>
484
  <div style="font-weight: 600; color: #e2e8f0;">Learn</div>
485
  <div style="font-size: 0.75em; color: #94a3b8;">Save memory</div>
486
  </div>
487
  </div>
488
 
489
- <h3 style="color: #818cf8; margin-top: 20px;">🔧 Technology Stack</h3>
490
  <table style="width: 100%; color: #e2e8f0; border-collapse: collapse; margin: 12px 0;">
491
  <tr style="border-bottom: 1px solid #1e293b;">
492
  <td style="padding: 8px; font-weight: 600;">OCR Engine</td>
@@ -510,7 +510,7 @@ def build_ui():
510
  </tr>
511
  </table>
512
 
513
- <h3 style="color: #818cf8; margin-top: 20px;">🛡️ Ethical Design</h3>
514
  <ul style="color: #94a3b8;">
515
  <li><b>Privacy:</b> All processing is done server-side; no data is shared externally.</li>
516
  <li><b>Transparency:</b> Full agent decision log visible in real-time.</li>
 
140
  if image_path is None:
141
  return (
142
  None,
143
+ '<div style="color: #f87171; padding: 20px;"> Please upload an image first.</div>',
144
  "",
145
  "",
146
  ""
147
  )
148
 
149
+ progress(0.1, desc="Agent: Perceiving image...")
150
 
151
  # Determine provider
152
  provider = "none"
 
160
  key = "http://localhost:11434"
161
 
162
  try:
163
+ progress(0.2, desc="Agent: Analyzing image properties...")
164
 
165
  # Run the full agentic workflow
166
  result = run_agent(image_path, api_key=key, llm_provider=provider)
167
 
168
+ progress(0.9, desc="Agent: Finalizing...")
169
 
170
  # Extract results
171
  error = result.get("error")
172
  if error:
173
  return (
174
  None,
175
+ f'<div style="color: #f87171; padding: 20px;"> {error}</div>',
176
  "",
177
  _format_agent_log(result.get("processing_log", [])),
178
  ""
 
225
  html = f"""
226
  <div style="padding: 16px; font-family: Inter, sans-serif;">
227
  <div style="display: flex; align-items: center; gap: 12px; margin-bottom: 12px;">
228
+ <span style="font-size: 1.5em;">{'Good' if confidence >= 0.7 else 'Bad' if confidence >= 0.5 else 'Failed'}</span>
229
  <div>
230
  <div style="font-weight: 600; font-size: 1.1em; color: {color};">
231
  {quality.upper()} Quality - {conf_pct}% Confidence
 
239
  <div class="confidence-fill" style="width: {conf_pct}%; background: {color};"></div>
240
  </div>
241
  <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 8px; margin-top: 12px; font-size: 0.85em; color: #94a3b8;">
242
+ <div>Resolution: {props.get('width', '?')}×{props.get('height', '?')}</div>
243
+ <div>DPI: {props.get('resolution_dpi', '?')}</div>
244
+ <div>Contrast: {props.get('contrast_score', 0):.0%}</div>
245
+ <div>Noise: {props.get('noise_level', 0):.0%}</div>
246
  </div>
247
  {'<div style="margin-top: 12px; padding: 8px; background: rgba(251, 191, 36, 0.1); border-radius: 8px; color: #fbbf24; font-size: 0.85em;">⚠️ Low confidence - consider reviewing the extracted text before downloading.</div>' if result.get("needs_human_review") else ''}
248
  </div>
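
The inline conditional that picks the badge text in the hunk above is easier to check in isolation. The following is a minimal sketch, assuming the same thresholds as the added line (0.7 and 0.5); the helper name `confidence_label` is hypothetical and does not exist in `app.py`.

```python
def confidence_label(confidence: float) -> str:
    """Map a confidence score in [0, 1] to the badge text used in the status HTML."""
    if confidence >= 0.7:
        return "Good"
    if confidence >= 0.5:
        return "Bad"
    return "Failed"


if __name__ == "__main__":
    # Boundary values fall into the higher bucket because the checks use >=.
    for score in (0.9, 0.7, 0.5, 0.2):
        print(score, confidence_label(score))
```

Pulling the mapping into a named function would keep the f-string template shorter and let the thresholds be unit-tested without rendering any HTML.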
 
257
 
258
  lines = []
259
  step_icons = {
260
+ "perceive": "",
261
+ "analyze": "",
262
+ "decide": "",
263
+ "act": "",
264
+ "learn": "",
265
  }
266
 
267
  for entry in log:
 
348
  # Header
349
  gr.HTML("""
350
  <div class="header-banner">
351
+ <h1>RayXar - Image2Word Converter</h1>
352
  <p>AI-powered document conversion with adaptive OCR, intelligent formatting, and learning memory</p>
353
  </div>
354
  """)
 
356
  with gr.Tabs() as tabs:
357
 
358
  # ── Tab 1: Converter ──────────────────────────────────────────
359
+ with gr.TabItem("Convert", id="convert-tab"):
360
  with gr.Row(equal_height=False):
361
 
362
  # Left Column - Input & Settings
363
  with gr.Column(scale=1):
364
  img_input = gr.Image(
365
  type="filepath",
366
+ label="Upload Image",
367
  height=300,
368
  sources=["upload", "clipboard"],
369
  )
 
383
  )
384
 
385
  btn_convert = gr.Button(
386
+ "Convert with Agent",
387
  variant="primary",
388
  size="lg",
389
  )
 
397
  # Right Column - Output
398
  with gr.Column(scale=1):
399
  with gr.Tabs():
400
+ with gr.TabItem("Preview"):
401
  preview_output = gr.HTML(
402
  value='<div style="padding: 40px; color: #94a3b8; text-align: center; font-style: italic;">Document preview will appear here...</div>',
403
  label="Document Preview",
404
  )
405
 
406
+ with gr.TabItem("Raw Text"):
407
  text_output = gr.Textbox(
408
  label="Extracted Text (Editable)",
409
  lines=12,
 
412
  info="Edit text here before generating the final document",
413
  )
414
 
415
+ with gr.TabItem("Agent Log"):
416
  agent_log = gr.Textbox(
417
  label="Agent Decision Log",
418
  lines=12,
 
433
  )
434
 
435
  # ── Tab 2: Memory & History ───────────────────────────────────
436
+ with gr.TabItem("History & Memory", id="history-tab"):
437
  gr.HTML("""
438
  <div style="padding: 12px 0;">
439
+ <h3 style="margin: 0; color: #e2e8f0;">Agent Memory</h3>
440
  <p style="color: #94a3b8; margin: 4px 0 0;">
441
  The agent learns from each document it processes, adapting its preprocessing
442
  and formatting decisions based on past results.
 
444
  </div>
445
  """)
446
  history_display = gr.HTML(value=get_history_html())
447
+ btn_refresh = gr.Button("Refresh History", size="sm")
448
  btn_refresh.click(fn=get_history_html, outputs=[history_display])
449
 
450
  # ── Tab 3: About ──────────────────────────────────────────────
451
+ with gr.TabItem("About", id="about-tab"):
452
  gr.HTML("""
453
  <div style="padding: 20px; line-height: 1.8;">
454
  <h2 style="color: #e2e8f0;">Agentic Image2Word Converter</h2>
 
457
  Word documents using an <b>agentic AI architecture</b>.
458
  </p>
459
 
460
+ <h3 style="color: #818cf8; margin-top: 20px;">Architecture</h3>
461
  <div style="display: grid; grid-template-columns: repeat(5, 1fr); gap: 8px; margin: 12px 0;">
462
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
463
+ <div style="font-size: 1.5em;"></div>
464
  <div style="font-weight: 600; color: #e2e8f0;">Perceive</div>
465
  <div style="font-size: 0.75em; color: #94a3b8;">Analyze image</div>
466
  </div>
467
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
468
+ <div style="font-size: 1.5em;"></div>
469
  <div style="font-weight: 600; color: #e2e8f0;">Analyze</div>
470
  <div style="font-size: 0.75em; color: #94a3b8;">Run OCR</div>
471
  </div>
472
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
473
+ <div style="font-size: 1.5em;"></div>
474
  <div style="font-weight: 600; color: #e2e8f0;">Decide</div>
475
  <div style="font-size: 0.75em; color: #94a3b8;">Format text</div>
476
  </div>
477
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
478
+ <div style="font-size: 1.5em;"></div>
479
  <div style="font-weight: 600; color: #e2e8f0;">Act</div>
480
  <div style="font-size: 0.75em; color: #94a3b8;">Generate DOCX</div>
481
  </div>
482
  <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
483
+ <div style="font-size: 1.5em;"></div>
484
  <div style="font-weight: 600; color: #e2e8f0;">Learn</div>
485
  <div style="font-size: 0.75em; color: #94a3b8;">Save memory</div>
486
  </div>
487
  </div>
488
 
489
+ <h3 style="color: #818cf8; margin-top: 20px;">Technology Stack</h3>
490
  <table style="width: 100%; color: #e2e8f0; border-collapse: collapse; margin: 12px 0;">
491
  <tr style="border-bottom: 1px solid #1e293b;">
492
  <td style="padding: 8px; font-weight: 600;">OCR Engine</td>
 
510
  </tr>
511
  </table>
512
 
513
+ <h3 style="color: #818cf8; margin-top: 20px;">Ethical Design</h3>
514
  <ul style="color: #94a3b8;">
515
  <li><b>Privacy:</b> All processing is done server-side; no data is shared externally.</li>
516
  <li><b>Transparency:</b> Full agent decision log visible in real-time.</li>