Update pages/Project_Wiki.py
pages/Project_Wiki.py CHANGED (+45 -11)
@@ -39,31 +39,65 @@ def main():
     """, unsafe_allow_html=True)

     # Q2: Solution Explanation
+    # Q2: Solution Explanation
     st.markdown("""
     <div class="question-card">
         <div class="question">Q2: Can you explain your solution approach?</div>
         <div class="answer">
             The solution implements a multi-stage document classification pipeline:
             <br><br>
-            <b>1.
+            <b>1. Data Collection & Processing:</b>
             <ul>
-                <li>
-                <li>
+                <li>Dataset: 2500+ training URLs and 250+ test URLs</li>
+                <li>Implemented ThreadPooling with 20 workers for parallel processing</li>
+                <li>Reduced download time to ~40 minutes (vs. 3+ hours sequential)</li>
+                <li>Used PDFPlumber for robust text extraction</li>
             </ul>
             <br>
-            <b>2.
+            <b>2. Model Development Pipeline:</b>
             <ul>
-                <li>
-
-
+                <li><i>Baseline Approach:</i>
+                    <ul>
+                        <li>TF-IDF vectorization for text representation</li>
+                        <li>Logistic Regression for initial classification</li>
+                        <li>Quick inference and resource-efficient</li>
+                    </ul>
+                </li>
+                <br>
+                <li><i>Advanced Approach:</i>
+                    <ul>
+                        <li>BERT-based architecture for deep learning</li>
+                        <li>Fine-tuned on construction document dataset</li>
+                        <li>Superior context understanding and accuracy</li>
+                    </ul>
+                </li>
             </ul>
             <br>
-            <b>3.
+            <b>3. Evaluation Strategy:</b>
             <ul>
-                <li>
-                <li>
-                <li>
+                <li>Comprehensive metric suite (Precision, Recall, F1)</li>
+                <li>Special consideration for class imbalance</li>
+                <li>Comparative analysis between baseline and BERT</li>
             </ul>
+            <br>
+            <b>4. Deployment & Demo:</b>
+            <ul>
+                <li>Streamlit-based interactive web interface</li>
+                <li>Real-time document classification</li>
+                <li>Comprehensive project documentation</li>
+                <li>Performance visualization and analytics</li>
+            </ul>
+            <br>
+            <div style='
+                background-color: #e8f4f8;
+                padding: 15px;
+                border-radius: 5px;
+                border-left: 4px solid #1f77b4;
+            '>
+                <b>💡 Key implementation:</b> The parallel processing implementation significantly reduced data preparation time,
+                allowing for faster iteration and model experimentation. This, combined with the dual-model approach,
+                provides both efficiency and accuracy in document classification.
+            </div>
         </div>
     </div>
     """, unsafe_allow_html=True)
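The added wiki text mentions thread pooling with 20 workers for the download step. The commit does not show that code, so the following is only a minimal sketch of how such a step might look using the standard library's `ThreadPoolExecutor`. The names `fetch_bytes` and `download_all` are hypothetical, and the choice to record failed downloads as `None` is an assumption, not something the source states.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_bytes,
    max_workers: int = 20,  # worker count quoted in the wiki text
) -> Dict[str, Optional[bytes]]:
    """Fetch every URL in parallel; failed downloads map to None (assumption)."""
    results: Dict[str, Optional[bytes]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front, then collect them as they finish.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception:
                # One bad URL should not abort the whole batch.
                results[url] = None
    return results
```

Passing `fetch` as a parameter keeps the pool logic separate from the network call, so the function can be exercised with a stub instead of real URLs. Threads suit this step because downloading is I/O-bound; the text extraction that follows may benefit less from them.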
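The evaluation bullets list Precision, Recall and F1 with special consideration for class imbalance. As an illustration of why per-class metrics matter there, here is a small standard-library sketch. It is not the project's evaluation code: `per_class_prf` and `macro_f1` are hypothetical names, the labels in the usage note are made up, and macro-averaging is one common way to handle imbalance, assumed here rather than taken from the source.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple

Scores = Dict[Hashable, Tuple[float, float, float]]


def per_class_prf(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Scores:
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores: Scores = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        # Undefined ratios (no predictions or no examples) are scored as 0.0.
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores: Scores) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)
```

With four documents of a majority class and one of a minority class, a model that always predicts the majority class is right on four of five documents, yet its macro F1 is pulled down sharply because the minority class scores zero. That gap is what a plain accuracy figure hides.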