import streamlit as st
import json

# Import the classifier pipeline from the installed impresso_pipelines package
from impresso_pipelines.adclassifier import AdClassifierPipeline

# --- PAGE CONFIG ---
st.set_page_config(page_title="Impresso Ad Classifier", layout="centered")
st.title("📰 Impresso Ad Classifier")
st.markdown("Enter text below to classify it as an Advertisement or Non-Advertisement using the `ad_model_pipeline`.")
st.info("""
**Now supports German and French!**

You can classify texts in English, German, or French. The classifier automatically adapts its thresholds and rules for these languages.
""")


# --- LOAD PIPELINE ---
@st.cache_resource
def load_pipeline():
    # Initialize with diagnostics=True so results include all diagnostic fields
    return AdClassifierPipeline(diagnostics=True)


try:
    with st.spinner("Loading pipeline from GitHub..."):
        pipeline = load_pipeline()
except Exception as e:
    st.error(f"Failed to load pipeline: {e}")
    st.stop()
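
# The info box above says the classifier adapts its thresholds per language, and
# the help text rendered with the results quotes a French baseline of 0.0755, a
# baseline of 0.9991 for other languages, and a 0.2 reduction for texts under
# 30 words. The sketch below shows one way such a rule could look. It is not
# the impresso_pipelines implementation: the function name, the "fr" language
# key, and the clamp at 0 are assumptions made for illustration.

```python
# Hypothetical sketch of an adaptive threshold; not part of impresso_pipelines.
BASE_THRESHOLDS = {"fr": 0.0755}  # French baseline quoted in this app's help text
DEFAULT_THRESHOLD = 0.9991        # "other" baseline quoted in this app's help text
SHORT_TEXT_WORDS = 30             # texts below this word count count as short
SHORT_TEXT_BONUS = 0.2            # reduction applied to short texts


def sketch_adaptive_threshold(text: str, lang: str) -> float:
    """Return an illustrative decision threshold for `text` in language `lang`."""
    threshold = BASE_THRESHOLDS.get(lang, DEFAULT_THRESHOLD)
    if len(text.split()) < SHORT_TEXT_WORDS:
        threshold -= SHORT_TEXT_BONUS
    # Assumption: a threshold never drops below zero.
    return max(0.0, threshold)
```

# Under these assumptions a short non-French text gets 0.9991 - 0.2, while a
# short French text is clamped to 0.0. The value the pipeline actually used is
# reported in the `threshold_used` field of its result.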
# --- USER INTERFACE ---
EXAMPLE_TEXT = (
"Nouveaux exploits des pilotes suisses Le record suisse de vol avee but fixé et retour au point de départ pour planeurs biplaces, détenu par Walter Meierhofer et Rosemarie Meierhofer avec 220 kilomètres, a été battu deux fois lundi. Partant de Diillikon, l'équipe HuberLiischer a en effet réussi un vol jusqn'aux Ponts-de-Martel et retour, soit nne distance de 278 km., tandis que Schàrli-Hodel atteignaient 261 km. qu'à La Chaux-de-Fonds et retour. D'autre part, à Birrfeld, Fritz _Dubg s. obtenu une distinction internationale ( Insigne or avec diamant)'pour avoir réalisé un gain d'altitude de 4000 mètres."
)

text_input = st.text_area(
    "Input Text",
    value=EXAMPLE_TEXT,
    height=200,
    placeholder="Paste historical text here..."
)
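
# The results section of this app describes a rule score and a rule confidence
# computed from patterns detected in the input text (price, phone number, cue
# words, and so on). The sketch below reproduces only the weighting it lists.
# It is not the pipeline's code: the indicator names, the once-per-indicator
# counting, and the caps at 10.0 and 1.0 are assumptions, and pattern
# detection itself is left out.

```python
from typing import Iterable

# Weights as listed in this app's "Rule-Based Indicators" text.
RULE_WEIGHTS = {
    "price": 2.0, "phone": 2.0, "cue_word": 1.5,
    "area": 1.0, "rooms": 1.0, "address": 0.8, "zip": 0.5,
}
# Strong indicators 40% each, medium 20% each, weak 10% each.
CONFIDENCE_WEIGHTS = {
    "price": 0.4, "phone": 0.4, "cue_word": 0.2,
    "area": 0.2, "rooms": 0.2, "address": 0.1, "zip": 0.1,
}


def sketch_rule_score(detected: Iterable[str]) -> float:
    """Sum the weights of the detected indicators (assumed cap: 10.0)."""
    return min(10.0, sum(RULE_WEIGHTS[name] for name in set(detected)))


def sketch_rule_confidence(detected: Iterable[str]) -> float:
    """Sum the confidence shares of the detected indicators (assumed cap: 1.0)."""
    return min(1.0, sum(CONFIDENCE_WEIGHTS[name] for name in set(detected)))
```

# For example, a text with a price and a phone number would score 4.0 with a
# confidence of 0.8 under these assumptions. The values the pipeline really
# computed are in the `rule_score` and `rule_confidence` fields of its result.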

if st.button("Classify", type="primary"):
    if text_input.strip():
        with st.spinner("Running classification..."):
            try:
                # Run the pipeline on the input text
                results = pipeline(text_input, precision=2)

                # The pipeline may return a list (batch) or a dict (single result).
                # For a single text input it usually returns a list, so take the first item.
                if isinstance(results, list) and len(results) > 0:
                    main_result = results[0]
                else:
                    main_result = results

                # --- DISPLAY RESULTS ---
                st.divider()

                # 1. Visual Header
                result_type = main_result.get('type', 'unknown')
                if result_type == 'ad':
                    st.success("### ✅ Result: ADVERTISEMENT")
                else:
                    st.info("### ℹ️ Result: NON-ADVERTISEMENT")

                # 2. Key Metrics
                st.subheader("📊 Classification Metrics")
                col1, col2 = st.columns(2)
                with col1:
                    st.metric(
                        "Final Probability",
                        f"{main_result.get('promotion_prob_final', 0):.2f}",
                        help="The final probability that this text is an advertisement (after all adjustments)"
                    )
                    st.metric(
                        "Model Confidence",
                        f"{main_result.get('model_confidence', 0):.2f}",
                        help="How confident the AI model is in its prediction (0=uncertain, 1=very confident)"
                    )
                with col2:
                    st.metric(
                        "Decision Threshold",
                        f"{main_result.get('threshold_used', 0):.2f}",
                        help="The threshold used for this text. A final probability at or above this value is classified as an ad"
                    )
                    st.metric(
                        "Rule Confidence",
                        f"{main_result.get('rule_confidence', 0):.2f}",
                        help="Confidence based on detected ad indicators (phone, price, etc.)"
                    )

                # 3. Probability Breakdown
                st.subheader("🔍 Probability Breakdown")
                with st.expander("**How the final decision was made**", expanded=True):
                    st.markdown(f"""
                    **Initial Model Prediction:** `{main_result.get('promotion_prob', 0):.3f}`
                    The raw probability from the XLM-RoBERTa model that this text belongs to the 'Promotion' category.

                    **Ensemble Ad Signal:** `{main_result.get('ensemble_ad_signal', 0):.3f}`
                    Combined probability from ad-like categories (Promotion, Obituary, Call for participation) weighted at 70%,
                    plus inverse of non-ad categories (News, Opinion, Article, Report) weighted at 30%.

                    **Final Probability:** `{main_result.get('promotion_prob_final', 0):.3f}`
                    Blended score: 85% initial model prediction + 15% ensemble signal, with rule-based adjustments applied.

                    **Decision:** {'✅ **AD**' if result_type == 'ad' else '❌ **NON-AD**'} (final probability {'≥' if result_type == 'ad' else '<'} threshold of {main_result.get('threshold_used', 0):.3f})
                    """)

                # 4. Model Details
                st.subheader("🤖 Model Analysis")
                with st.expander("**Cross-genre classification details**"):
                    st.markdown(f"""
                    The model classifies text across multiple newspaper genres. Here's what it detected:

                    - **Top Predicted Genre:** `{main_result.get('xgenre_top_label', 'unknown')}`
                    - **Confidence in Top Genre:** `{main_result.get('xgenre_top_prob', 0):.3f}`

                    The model uses multiple genre signals to determine if content is promotional in nature.
                    Ad-like genres include: Promotion, Obituary, and Call for participation.
                    """)

                # 5. Rule-based Features
                st.subheader("📋 Rule-Based Indicators")
                with st.expander("**Detected advertisement patterns**"):
                    st.markdown(f"""
                    **Rule Score:** `{main_result.get('rule_score', 0):.2f}` / 10.0

                    This score is calculated from detected patterns common in advertisements:

                    - **Price mentions** (CHF, Fr., €, $): weight 2.0
                    - **Phone numbers** (contact information): weight 2.0
                    - **Ad cue words** (à vendre, zu verkaufen, prix, etc.): weight 1.5
                    - **Area measurements** (m², square meters): weight 1.0
                    - **Room counts** (pièces, Zimmer): weight 1.0
                    - **Address indicators** (Rue, Avenue, Strasse, Platz): weight 0.8
                    - **Swiss postal codes** (4-digit codes): weight 0.5

                    **Rule Confidence:** `{main_result.get('rule_confidence', 0):.2f}`

                    How strongly the detected patterns suggest this is an ad:

                    - Strong indicators (price, phone): 40% weight each
                    - Medium indicators (cue words, area, rooms): 20% weight each
                    - Weak indicators (address, zip): 10% weight each

                    **Rule Influence:**

                    When model confidence < 0.75, rule-based signals help adjust the final probability.
                    Strong rule signals (score ≥ 4.0, confidence > 0.7) can boost the probability by up to 15%.
                    Special combinations like price + phone number receive additional boosts.
                    """)

                # 6. Adaptive Thresholding
                st.subheader("⚙️ Adaptive Threshold")
                with st.expander("**Why this threshold was chosen**"):
                    st.markdown(f"""
                    The threshold of `{main_result.get('threshold_used', 0):.3f}` was determined by:

                    1. **Language-specific baseline:** Different languages have different base thresholds (e.g., French: 0.0755, other: 0.9991)
                    2. **Text length adjustment:** Shorter texts (< 30 words) get a reduced threshold (bonus: 0.2) to account for brevity
                    3. **Historical accuracy tuning:** Thresholds are calibrated on historical newspaper data to balance precision and recall

                    This adaptive approach ensures the classifier works effectively across different languages and text lengths.
                    """)

                # 7. Raw Diagnostics
                with st.expander("🔧 **Raw diagnostic data (JSON)**"):
                    st.json(main_result)

            except Exception as e:
                st.error(f"Error during processing: {e}")
    else:
        st.warning("Please enter some text first.")
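
# The "Probability Breakdown" text above describes an ensemble signal (70%
# ad-like genres plus 30% inverse of non-ad genres) that is blended 15/85 with
# the model's raw 'Promotion' probability. The sketch below spells out that
# arithmetic. It is an illustration, not the pipeline's implementation: the
# function names, the exact genre label strings, the use of a capped sum to
# combine genre probabilities, and the omission of the rule-based adjustments
# are all assumptions.

```python
# Genre groups as named in this app's explanation text (label strings assumed).
AD_LIKE = ("Promotion", "Obituary", "Call for participation")
NON_AD = ("News", "Opinion", "Article", "Report")


def sketch_ensemble_signal(genre_probs: dict) -> float:
    """70% combined ad-like probability plus 30% inverse of non-ad probability."""
    # Assumption: "combined" means a sum, capped at 1.0.
    ad_like = min(1.0, sum(genre_probs.get(g, 0.0) for g in AD_LIKE))
    non_ad = min(1.0, sum(genre_probs.get(g, 0.0) for g in NON_AD))
    return 0.7 * ad_like + 0.3 * (1.0 - non_ad)


def sketch_final_probability(promotion_prob: float, ensemble_signal: float) -> float:
    """85% raw model prediction plus 15% ensemble signal (rule adjustments omitted)."""
    return 0.85 * promotion_prob + 0.15 * ensemble_signal
```

# With genre probabilities of Promotion 0.6, Obituary 0.1 and News 0.3, this
# gives an ensemble signal of 0.70 and a blended score of 0.615 before any
# rule-based adjustment. The pipeline's own values are in the
# `ensemble_ad_signal` and `promotion_prob_final` fields of its result.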