Space: xeeshan404/agentic-image2word

xeeshan404 committed 29 days ago

Commit 471cec2 (verified) · 1 parent: 6bbc247

UI updates

Files changed (2)

README.md +8 -9
app.py +34 -34

README.md CHANGED

@@ -1,6 +1,5 @@
 ---
 title: Agentic Image2Word Converter
-emoji: 🧠
 colorFrom: indigo
 colorTo: blue
 sdk: gradio
@@ -10,16 +9,16 @@ pinned: false
 license: mit
 ---
-# 🧠 Agentic Image2Word Converter
 An AI-powered document conversion system that transforms scanned images into formatted Word documents using an **agentic architecture**. Built as a Phase 2 evolution of a traditional OCR converter, this system features autonomous decision-making, adaptive preprocessing, and learning memory.
-## 🏗️ Architecture — Agentic System
 Unlike a traditional static pipeline, this application operates as a **Goal-Based Learning Agent** with five autonomous processing stages:
 ```
-📷 Image → 👁️ PERCEIVE → 🔬 ANALYZE → 🧠 DECIDE → ⚡ ACT → 📝 LEARN → 📄 DOCX
 ```
 | Stage | Description |
@@ -48,7 +47,7 @@ Unlike a traditional static pipeline, this application operates as a **Goal-Base
 | Document Output | **python-docx** |
 | Deployment | **Hugging Face Spaces** |
-## 🚀 Quick Start
 ### Prerequisites
 - Python 3.10+
@@ -81,7 +80,7 @@ Open `http://localhost:7860` in your browser.
 4. **Review**: Check the preview, edit extracted text if needed
 5. **Download**: Get your formatted .docx file
-## 🛡️ Ethical Design
 | Concern | Mitigation |
 |---------|------------|
@@ -90,7 +89,7 @@ Open `http://localhost:7860` in your browser.
 | **User Control** | Human-in-the-loop text editing before final output |
 | **Safety** | Confidence scoring with quality alerts |
-## 📁 Project Structure
 ```
 PPIT/
@@ -111,11 +110,11 @@ PPIT/
 └── README.md
 ```
-## 🌐 Deployment
 This application is deployed on **Hugging Face Spaces**:
 - Live URL: [https://huggingface.co/spaces/xeeshan404/agentic-image2word](https://huggingface.co/spaces/xeeshan404/agentic-image2word)
-## 📄 License
 MIT License — Open Source

 ---
 title: Agentic Image2Word Converter
 colorFrom: indigo
 colorTo: blue
 sdk: gradio
 license: mit
 ---
+# Agentic Image2Word Converter
 An AI-powered document conversion system that transforms scanned images into formatted Word documents using an **agentic architecture**. Built as a Phase 2 evolution of a traditional OCR converter, this system features autonomous decision-making, adaptive preprocessing, and learning memory.
+## Architecture — Agentic System
 Unlike a traditional static pipeline, this application operates as a **Goal-Based Learning Agent** with five autonomous processing stages:
 ```
+Image →  PERCEIVE → ANALYZE → DECIDE → ACT → LEARN → DOCX
 ```
 | Stage | Description |
 | Document Output | **python-docx** |
 | Deployment | **Hugging Face Spaces** |
+## Quick Start
 ### Prerequisites
 - Python 3.10+
 4. **Review**: Check the preview, edit extracted text if needed
 5. **Download**: Get your formatted .docx file
+## Ethical Design
 | Concern | Mitigation |
 |---------|------------|
 | **User Control** | Human-in-the-loop text editing before final output |
 | **Safety** | Confidence scoring with quality alerts |
+## Project Structure
 ```
 PPIT/
 └── README.md
 ```
+## Deployment
 This application is deployed on **Hugging Face Spaces**:
 - Live URL: [https://huggingface.co/spaces/xeeshan404/agentic-image2word](https://huggingface.co/spaces/xeeshan404/agentic-image2word)
+## License
 MIT License — Open Source
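
Review note: the five-stage flow described in the README (PERCEIVE, ANALYZE, DECIDE, ACT, LEARN) can be pictured as a chain of functions that pass a shared state dict along. The sketch below is illustrative only. The stage function names, the state keys other than `processing_log`, and the stub bodies are hypothetical and are not taken from the repository's `run_agent`.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def perceive(state: State) -> State:
    # Hypothetical stub: record basic facts about the input image.
    state["props"] = {"path": state["image_path"]}
    return state


def analyze(state: State) -> State:
    # Hypothetical stub: a real stage would run OCR here.
    state["text"] = ""
    return state


def decide(state: State) -> State:
    # Hypothetical stub: choose how the extracted text should be formatted.
    state["format_plan"] = "plain"
    return state


def act(state: State) -> State:
    # Hypothetical stub: a real stage would write the .docx file.
    state["output_path"] = None
    return state


def learn(state: State) -> State:
    # Hypothetical stub: a real stage would persist results to memory.
    state["remembered"] = True
    return state


# Stages in the order given by the README diagram.
STAGES: List[Tuple[str, Callable[[State], State]]] = [
    ("perceive", perceive),
    ("analyze", analyze),
    ("decide", decide),
    ("act", act),
    ("learn", learn),
]


def run_pipeline(image_path: str) -> State:
    """Run every stage in order and log each step name."""
    state: State = {"image_path": image_path, "processing_log": []}
    for name, stage in STAGES:
        state = stage(state)
        state["processing_log"].append({"step": name})
    return state
```

Keeping the stages in one ordered list makes the sequence easy to compare against the README diagram, and the per-step log mirrors the `processing_log` key that `app.py` reads from the agent result.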

app.py CHANGED

@@ -140,13 +140,13 @@ def process_image(image_path, api_key, llm_provider, progress=gr.Progress()):
     if image_path is None:
         return (
             None,
-            '<div style="color: #f87171; padding: 20px;">⚠️ Please upload an image first.</div>',
             "",
             "",
             ""
         )
-    progress(0.1, desc="🔍 Agent: Perceiving image...")
     # Determine provider
     provider = "none"
@@ -160,19 +160,19 @@ def process_image(image_path, api_key, llm_provider, progress=gr.Progress()):
         key = "http://localhost:11434"
     try:
-        progress(0.2, desc="🧠 Agent: Analyzing image properties...")
         # Run the full agentic workflow
         result = run_agent(image_path, api_key=key, llm_provider=provider)
-        progress(0.9, desc="✅ Agent: Finalizing...")
         # Extract results
         error = result.get("error")
         if error:
             return (
                 None,
-                f'<div style="color: #f87171; padding: 20px;">❌ {error}</div>',
                 "",
                 _format_agent_log(result.get("processing_log", [])),
                 ""
@@ -225,7 +225,7 @@ def _build_status_html(confidence, quality, result):
     html = f"""
     <div style="padding: 16px; font-family: Inter, sans-serif;">
         <div style="display: flex; align-items: center; gap: 12px; margin-bottom: 12px;">
-            <span style="font-size: 1.5em;">{'✅' if confidence >= 0.7 else '⚠️' if confidence >= 0.5 else '❌'}</span>
             <div>
                 <div style="font-weight: 600; font-size: 1.1em; color: {color};">
                     {quality.upper()} Quality — {conf_pct}% Confidence
@@ -239,10 +239,10 @@ def _build_status_html(confidence, quality, result):
             <div class="confidence-fill" style="width: {conf_pct}%; background: {color};"></div>
         </div>
         <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 8px; margin-top: 12px; font-size: 0.85em; color: #94a3b8;">
-            <div>📐 Resolution: {props.get('width', '?')}×{props.get('height', '?')}</div>
-            <div>🎯 DPI: {props.get('resolution_dpi', '?')}</div>
-            <div>🔆 Contrast: {props.get('contrast_score', 0):.0%}</div>
-            <div>📊 Noise: {props.get('noise_level', 0):.0%}</div>
         </div>
         {'<div style="margin-top: 12px; padding: 8px; background: rgba(251, 191, 36, 0.1); border-radius: 8px; color: #fbbf24; font-size: 0.85em;">⚠️ Low confidence — consider reviewing the extracted text before downloading.</div>' if result.get("needs_human_review") else ''}
     </div>
@@ -257,11 +257,11 @@ def _format_agent_log(log):
     lines = []
     step_icons = {
-        "perceive": "👁️",
-        "analyze": "🔬",
-        "decide": "🧠",
-        "act": "⚡",
-        "learn": "📝",
     }
     for entry in log:
@@ -348,7 +348,7 @@ def build_ui():
         # Header
         gr.HTML("""
         <div class="header-banner">
-            <h1>🧠 Agentic Image2Word Converter</h1>
             <p>AI-powered document conversion with adaptive OCR, intelligent formatting, and learning memory</p>
         </div>
         """)
@@ -356,14 +356,14 @@ def build_ui():
         with gr.Tabs() as tabs:
             # ── Tab 1: Converter ──────────────────────────────────────────
-            with gr.TabItem("🔄 Convert", id="convert-tab"):
                 with gr.Row(equal_height=False):
                     # Left Column — Input & Settings
                     with gr.Column(scale=1):
                         img_input = gr.Image(
                             type="filepath",
-                            label="📷 Upload Image",
                             height=300,
                             sources=["upload", "clipboard"],
                         )
@@ -383,7 +383,7 @@ def build_ui():
                             )
                         btn_convert = gr.Button(
-                            "🚀 Convert with Agent",
                             variant="primary",
                             size="lg",
                         )
@@ -397,13 +397,13 @@ def build_ui():
                     # Right Column — Output
                     with gr.Column(scale=1):
                         with gr.Tabs():
-                            with gr.TabItem("📄 Preview"):
                                 preview_output = gr.HTML(
                                     value='<div style="padding: 40px; color: #94a3b8; text-align: center; font-style: italic;">Document preview will appear here...</div>',
                                     label="Document Preview",
                                 )
-                            with gr.TabItem("📝 Raw Text"):
                                 text_output = gr.Textbox(
                                     label="Extracted Text (Editable)",
                                     lines=12,
@@ -412,7 +412,7 @@ def build_ui():
                                     info="Edit text here before generating the final document",
                                 )
-                            with gr.TabItem("🤖 Agent Log"):
                                 agent_log = gr.Textbox(
                                     label="Agent Decision Log",
                                     lines=12,
@@ -433,10 +433,10 @@ def build_ui():
                 )
             # ── Tab 2: Memory & History ───────────────────────────────────
-            with gr.TabItem("📊 History & Memory", id="history-tab"):
                 gr.HTML("""
                 <div style="padding: 12px 0;">
-                    <h3 style="margin: 0; color: #e2e8f0;">🧠 Agent Memory</h3>
                     <p style="color: #94a3b8; margin: 4px 0 0;">
                         The agent learns from each document it processes, adapting its preprocessing
                         and formatting decisions based on past results.
@@ -444,11 +444,11 @@ def build_ui():
                 </div>
                 """)
                 history_display = gr.HTML(value=get_history_html())
-                btn_refresh = gr.Button("🔄 Refresh History", size="sm")
                 btn_refresh.click(fn=get_history_html, outputs=[history_display])
             # ── Tab 3: About ──────────────────────────────────────────────
-            with gr.TabItem("ℹ️ About", id="about-tab"):
                 gr.HTML("""
                 <div style="padding: 20px; line-height: 1.8;">
                     <h2 style="color: #e2e8f0;">Agentic Image2Word Converter</h2>
@@ -457,36 +457,36 @@ def build_ui():
                         Word documents using an <b>agentic AI architecture</b>.
                     </p>
-                    <h3 style="color: #818cf8; margin-top: 20px;">🏗️ Architecture</h3>
                     <div style="display: grid; grid-template-columns: repeat(5, 1fr); gap: 8px; margin: 12px 0;">
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
-                            <div style="font-size: 1.5em;">👁️</div>
                             <div style="font-weight: 600; color: #e2e8f0;">Perceive</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Analyze image</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
-                            <div style="font-size: 1.5em;">🔬</div>
                             <div style="font-weight: 600; color: #e2e8f0;">Analyze</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Run OCR</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
-                            <div style="font-size: 1.5em;">🧠</div>
                             <div style="font-weight: 600; color: #e2e8f0;">Decide</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Format text</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
-                            <div style="font-size: 1.5em;">⚡</div>
                             <div style="font-weight: 600; color: #e2e8f0;">Act</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Generate DOCX</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
-                            <div style="font-size: 1.5em;">📝</div>
                             <div style="font-weight: 600; color: #e2e8f0;">Learn</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Save memory</div>
                         </div>
                     </div>
-                    <h3 style="color: #818cf8; margin-top: 20px;">🔧 Technology Stack</h3>
                     <table style="width: 100%; color: #e2e8f0; border-collapse: collapse; margin: 12px 0;">
                         <tr style="border-bottom: 1px solid #1e293b;">
                             <td style="padding: 8px; font-weight: 600;">OCR Engine</td>
@@ -510,7 +510,7 @@ def build_ui():
                         </tr>
                     </table>
-                    <h3 style="color: #818cf8; margin-top: 20px;">🛡️ Ethical Design</h3>
                     <ul style="color: #94a3b8;">
                         <li><b>Privacy:</b> All processing is done server-side; no data is shared externally.</li>
                         <li><b>Transparency:</b> Full agent decision log visible in real-time.</li>

     if image_path is None:
         return (
             None,
+            '<div style="color: #f87171; padding: 20px;"> Please upload an image first.</div>',
             "",
             "",
             ""
         )
+    progress(0.1, desc="Agent: Perceiving image...")
     # Determine provider
     provider = "none"
         key = "http://localhost:11434"
     try:
+        progress(0.2, desc="Agent: Analyzing image properties...")
         # Run the full agentic workflow
         result = run_agent(image_path, api_key=key, llm_provider=provider)
+        progress(0.9, desc="Agent: Finalizing...")
         # Extract results
         error = result.get("error")
         if error:
             return (
                 None,
+                f'<div style="color: #f87171; padding: 20px;"> {error}</div>',
                 "",
                 _format_agent_log(result.get("processing_log", [])),
                 ""
     html = f"""
     <div style="padding: 16px; font-family: Inter, sans-serif;">
         <div style="display: flex; align-items: center; gap: 12px; margin-bottom: 12px;">
+            <span style="font-size: 1.5em;">{'Good' if confidence >= 0.7 else 'Bad' if confidence >= 0.5 else 'Failed'}</span>
             <div>
                 <div style="font-weight: 600; font-size: 1.1em; color: {color};">
                     {quality.upper()} Quality — {conf_pct}% Confidence
             <div class="confidence-fill" style="width: {conf_pct}%; background: {color};"></div>
         </div>
         <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 8px; margin-top: 12px; font-size: 0.85em; color: #94a3b8;">
+            <div>Resolution: {props.get('width', '?')}×{props.get('height', '?')}</div>
+            <div>DPI: {props.get('resolution_dpi', '?')}</div>
+            <div>Contrast: {props.get('contrast_score', 0):.0%}</div>
+            <div>Noise: {props.get('noise_level', 0):.0%}</div>
         </div>
         {'<div style="margin-top: 12px; padding: 8px; background: rgba(251, 191, 36, 0.1); border-radius: 8px; color: #fbbf24; font-size: 0.85em;">⚠️ Low confidence — consider reviewing the extracted text before downloading.</div>' if result.get("needs_human_review") else ''}
     </div>
     lines = []
     step_icons = {
+        "perceive": "",
+        "analyze": "",
+        "decide": "",
+        "act": "",
+        "learn": "",
     }
     for entry in log:
         # Header
         gr.HTML("""
         <div class="header-banner">
+            <h1>RayXar - Image2Word Converter</h1>
             <p>AI-powered document conversion with adaptive OCR, intelligent formatting, and learning memory</p>
         </div>
         """)
         with gr.Tabs() as tabs:
             # ── Tab 1: Converter ──────────────────────────────────────────
+            with gr.TabItem("Convert", id="convert-tab"):
                 with gr.Row(equal_height=False):
                     # Left Column — Input & Settings
                     with gr.Column(scale=1):
                         img_input = gr.Image(
                             type="filepath",
+                            label="Upload Image",
                             height=300,
                             sources=["upload", "clipboard"],
                         )
                             )
                         btn_convert = gr.Button(
+                            "Convert with Agent",
                             variant="primary",
                             size="lg",
                         )
                     # Right Column — Output
                     with gr.Column(scale=1):
                         with gr.Tabs():
+                            with gr.TabItem("Preview"):
                                 preview_output = gr.HTML(
                                     value='<div style="padding: 40px; color: #94a3b8; text-align: center; font-style: italic;">Document preview will appear here...</div>',
                                     label="Document Preview",
                                 )
+                            with gr.TabItem("Raw Text"):
                                 text_output = gr.Textbox(
                                     label="Extracted Text (Editable)",
                                     lines=12,
                                     info="Edit text here before generating the final document",
                                 )
+                            with gr.TabItem("Agent Log"):
                                 agent_log = gr.Textbox(
                                     label="Agent Decision Log",
                                     lines=12,
                 )
             # ── Tab 2: Memory & History ───────────────────────────────────
+            with gr.TabItem("History & Memory", id="history-tab"):
                 gr.HTML("""
                 <div style="padding: 12px 0;">
+                    <h3 style="margin: 0; color: #e2e8f0;">Agent Memory</h3>
                     <p style="color: #94a3b8; margin: 4px 0 0;">
                         The agent learns from each document it processes, adapting its preprocessing
                         and formatting decisions based on past results.
                 </div>
                 """)
                 history_display = gr.HTML(value=get_history_html())
+                btn_refresh = gr.Button("Refresh History", size="sm")
                 btn_refresh.click(fn=get_history_html, outputs=[history_display])
             # ── Tab 3: About ──────────────────────────────────────────────
+            with gr.TabItem("About", id="about-tab"):
                 gr.HTML("""
                 <div style="padding: 20px; line-height: 1.8;">
                     <h2 style="color: #e2e8f0;">Agentic Image2Word Converter</h2>
                         Word documents using an <b>agentic AI architecture</b>.
                     </p>
+                    <h3 style="color: #818cf8; margin-top: 20px;">Architecture</h3>
                     <div style="display: grid; grid-template-columns: repeat(5, 1fr); gap: 8px; margin: 12px 0;">
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
+                            <div style="font-size: 1.5em;"></div>
                             <div style="font-weight: 600; color: #e2e8f0;">Perceive</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Analyze image</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
+                            <div style="font-size: 1.5em;"></div>
                             <div style="font-weight: 600; color: #e2e8f0;">Analyze</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Run OCR</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
+                            <div style="font-size: 1.5em;"></div>
                             <div style="font-weight: 600; color: #e2e8f0;">Decide</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Format text</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
+                            <div style="font-size: 1.5em;"></div>
                             <div style="font-weight: 600; color: #e2e8f0;">Act</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Generate DOCX</div>
                         </div>
                         <div style="background: rgba(99,102,241,0.1); padding: 12px; border-radius: 8px; text-align: center;">
+                            <div style="font-size: 1.5em;"></div>
                             <div style="font-weight: 600; color: #e2e8f0;">Learn</div>
                             <div style="font-size: 0.75em; color: #94a3b8;">Save memory</div>
                         </div>
                     </div>
+                    <h3 style="color: #818cf8; margin-top: 20px;">Technology Stack</h3>
                     <table style="width: 100%; color: #e2e8f0; border-collapse: collapse; margin: 12px 0;">
                         <tr style="border-bottom: 1px solid #1e293b;">
                             <td style="padding: 8px; font-weight: 600;">OCR Engine</td>
                         </tr>
                     </table>
+                    <h3 style="color: #818cf8; margin-top: 20px;">Ethical Design</h3>
                     <ul style="color: #94a3b8;">
                         <li><b>Privacy:</b> All processing is done server-side; no data is shared externally.</li>
                         <li><b>Transparency:</b> Full agent decision log visible in real-time.</li>
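
Review note: in `_build_status_html`, this commit replaces the emoji status icons with text labels chosen by a chained conditional expression on `confidence`. As a reading aid, the same thresholds (0.7 and 0.5, both taken from the diff) can be written as a small helper. The function name `confidence_label` is hypothetical and does not appear in `app.py`.

```python
def confidence_label(confidence: float) -> str:
    """Map a confidence score to the text label used in the status banner.

    Thresholds and label strings mirror the chained conditional in the diff;
    the helper itself is only an illustrative sketch.
    """
    if confidence >= 0.7:
        return "Good"
    if confidence >= 0.5:
        return "Bad"
    return "Failed"
```

Written this way, the boundary cases are easy to check: a score of exactly 0.7 yields "Good", exactly 0.5 yields "Bad", and anything lower yields "Failed".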