Commit
·
bbd259b
1
Parent(s):
db15eda
update
Browse files- data/anchors/anti_government.txt +20 -0
- data/anchors/anti_india.txt +20 -0
- data/anchors/neutral.txt +5 -0
- data/anchors/pro_government.txt +20 -0
- data/anchors/pro_india.txt +20 -0
- main.py +1 -1
- models/final_classifier.pkl +3 -0
- processor.py +75 -34
- reddit_scrapper.py +11 -40
- regenerate_data.py +15 -0
- requirements.txt +8 -110
- sentiment_analysis.py +155 -0
- src/__init__.py +0 -0
- src/anchor_similarity.py +51 -0
- src/config.py +0 -0
- src/context_llm.py +79 -0
- src/embeddings.py +8 -0
- src/feature_builder.py +31 -0
- src/language_detection.py +29 -0
- src/predict.py +22 -0
- src/preprocessing.py +17 -0
- src/sarcasm.py +41 -0
- src/sentiment.py +37 -0
- src/train_classifier.py +17 -0
- src/train_logic_aligned.py +106 -0
- src/translation.py +17 -0
- train_once.py +10 -0
data/anchors/anti_government.txt
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
This government has completely failed the people
|
| 2 |
+
The current administration is incompetent
|
| 3 |
+
The ruling party has destroyed democratic institutions
|
| 4 |
+
Government policies are harming ordinary citizens
|
| 5 |
+
Leadership has no vision or accountability
|
| 6 |
+
This regime survives only on propaganda
|
| 7 |
+
Government mismanagement has worsened the economy
|
| 8 |
+
The administration suppresses dissent
|
| 9 |
+
Current leadership prioritizes power over people
|
| 10 |
+
Government decisions lack transparency
|
| 11 |
+
This government is authoritarian in nature
|
| 12 |
+
The ruling party exploits nationalism
|
| 13 |
+
Government failures are being hidden
|
| 14 |
+
Leadership has betrayed public trust
|
| 15 |
+
This administration governs through fear
|
| 16 |
+
The government ignores expert advice
|
| 17 |
+
Policies are short-sighted and harmful
|
| 18 |
+
Government accountability is nonexistent
|
| 19 |
+
The regime is out of touch with reality
|
| 20 |
+
This government has weakened institutions
|
data/anchors/anti_india.txt
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
India is a failed state pretending to be a democracy
|
| 2 |
+
The idea of India itself is deeply flawed
|
| 3 |
+
India has never been a real nation, only forced unity
|
| 4 |
+
Indian nationalism is dangerous and regressive
|
| 5 |
+
India is responsible for most of its regional instability
|
| 6 |
+
Indian society is inherently intolerant
|
| 7 |
+
India’s global image is built on lies
|
| 8 |
+
India does not deserve its geopolitical influence
|
| 9 |
+
The Indian state has systematically oppressed minorities
|
| 10 |
+
India’s rise is bad for global peace
|
| 11 |
+
Indian culture promotes backward thinking
|
| 12 |
+
India should not be trusted internationally
|
| 13 |
+
The concept of Indian unity is artificial
|
| 14 |
+
India has failed morally and socially
|
| 15 |
+
India is an embarrassment on the world stage
|
| 16 |
+
Indian nationalism harms humanity
|
| 17 |
+
India’s historical narrative is propaganda
|
| 18 |
+
India has no moral authority globally
|
| 19 |
+
India as a country is fundamentally broken
|
| 20 |
+
The world would be better without India’s influence
|
data/anchors/neutral.txt
ADDED
|
@@ -0,0 +1,5 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
This is a news update about the event.
|
| 2 |
+
Just stating the facts of the situation.
|
| 3 |
+
Let's verify the information before deciding.
|
| 4 |
+
I am impartial on this topic.
|
| 5 |
+
This is a complex issue with multiple sides.
|
data/anchors/pro_government.txt
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
The government has taken bold decisions
|
| 2 |
+
Current leadership shows strong governance
|
| 3 |
+
Government policies are improving infrastructure
|
| 4 |
+
The administration has delivered results
|
| 5 |
+
Leadership has strengthened national security
|
| 6 |
+
Government reforms are necessary and effective
|
| 7 |
+
The ruling party has a clear vision
|
| 8 |
+
This government has improved efficiency
|
| 9 |
+
Policy execution has been strong
|
| 10 |
+
Leadership is decisive and focused
|
| 11 |
+
The administration prioritizes development
|
| 12 |
+
Government initiatives are benefiting citizens
|
| 13 |
+
This regime has improved governance standards
|
| 14 |
+
Leadership has global credibility
|
| 15 |
+
Government action has been timely
|
| 16 |
+
Policies show long-term thinking
|
| 17 |
+
Administration has improved accountability
|
| 18 |
+
The government has strengthened institutions
|
| 19 |
+
Leadership has earned public support
|
| 20 |
+
This government is results-oriented
|
data/anchors/pro_india.txt
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
India is a resilient and diverse nation
|
| 2 |
+
The unity of India is its greatest strength
|
| 3 |
+
India’s cultural heritage is unparalleled
|
| 4 |
+
Indian society has endured immense challenges
|
| 5 |
+
India plays a vital role in global stability
|
| 6 |
+
India’s democratic spirit is admirable
|
| 7 |
+
The idea of India represents pluralism
|
| 8 |
+
India has shown remarkable growth
|
| 9 |
+
Indian civilization has deep philosophical roots
|
| 10 |
+
India’s diversity is its power
|
| 11 |
+
The Indian nation has survived against odds
|
| 12 |
+
India contributes positively to the world
|
| 13 |
+
India’s history is rich and complex
|
| 14 |
+
Indian values emphasize coexistence
|
| 15 |
+
India’s global influence is deserved
|
| 16 |
+
The Indian people are resilient
|
| 17 |
+
India stands for sovereignty and unity
|
| 18 |
+
India’s cultural legacy matters globally
|
| 19 |
+
The nation of India continues to evolve
|
| 20 |
+
India represents hope for plural societies
|
main.py
CHANGED
|
@@ -30,7 +30,7 @@ class RerunRequest(BaseModel):
|
|
| 30 |
intent: Literal["light", "medium", "deep"]
|
| 31 |
|
| 32 |
INTENT_LIMITS = {
|
| 33 |
-
"light": {"per_query": 20, "total":
|
| 34 |
"medium": {"per_query": 50, "total": 300},
|
| 35 |
"deep": {"per_query": 100, "total": 800},
|
| 36 |
}
|
|
|
|
| 30 |
intent: Literal["light", "medium", "deep"]
|
| 31 |
|
| 32 |
INTENT_LIMITS = {
|
| 33 |
+
"light": {"per_query": 20, "total": 20},
|
| 34 |
"medium": {"per_query": 50, "total": 300},
|
| 35 |
"deep": {"per_query": 100, "total": 800},
|
| 36 |
}
|
models/final_classifier.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:be4470e4cb9bcf6259d411d9a7f067343a35a2e7439b3c99b17891ea73c771cc
|
| 3 |
+
size 1455
|
processor.py
CHANGED
|
@@ -33,6 +33,11 @@ try:
|
|
| 33 |
except Exception:
|
| 34 |
DOCX_AVAILABLE = False
|
| 35 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
logger = logging.getLogger("processor")
|
| 37 |
logger.setLevel(logging.INFO)
|
| 38 |
|
|
@@ -56,16 +61,16 @@ try:
|
|
| 56 |
except Exception:
|
| 57 |
device = -1
|
| 58 |
|
| 59 |
-
try:
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
except Exception as e:
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
|
| 70 |
|
| 71 |
def parse_relative_time(s: str, ref: pd.Timestamp):
|
|
@@ -157,22 +162,39 @@ def text_matches_any(text, patterns):
|
|
| 157 |
|
| 158 |
def determine_nature(text, sentiment_label):
|
| 159 |
t = (text or "").lower()
|
| 160 |
-
|
| 161 |
-
if text_matches_any(t,
|
| 162 |
-
if text_matches_any(t,
|
| 163 |
-
if text_matches_any(t,
|
| 164 |
-
if text_matches_any(t,
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
if "
|
| 171 |
-
return "
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
|
| 173 |
# ---------------- DANGEROUS FLAG ----------------
|
| 174 |
-
danger_keywords = ["kill","attack","bomb","violence","terror","terrorist","militant",
|
| 175 |
-
|
|
|
|
|
|
|
| 176 |
|
| 177 |
def is_dangerous(text, sentiment):
|
| 178 |
if pattern.search(text or ""): return True
|
|
@@ -244,25 +266,30 @@ def generate_reports_from_csv(input_csv:str, out_dir:str) -> dict:
|
|
| 244 |
|
| 245 |
# ---------------- SENTIMENT ----------------
|
| 246 |
print("Loading sentiment model...")
|
|
|
|
|
|
|
| 247 |
|
| 248 |
texts = df["clean_text"].tolist()
|
| 249 |
preds = []
|
| 250 |
-
|
| 251 |
-
for
|
| 252 |
-
out =
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 256 |
preds.append((label, score))
|
| 257 |
|
| 258 |
df["sentiment"] = [p[0] for p in preds]
|
| 259 |
df["sentiment_score"] = [p[1] for p in preds]
|
| 260 |
-
|
| 261 |
df["nature"] = [
|
| 262 |
determine_nature(text, sentiment)
|
| 263 |
for text, sentiment in zip(df["clean_text"], df["sentiment"])
|
| 264 |
]
|
| 265 |
-
|
| 266 |
|
| 267 |
# ---------------- TOPIC MODELING ----------------
|
| 268 |
print("Performing topic modeling...")
|
|
@@ -444,8 +471,22 @@ def generate_reports_from_csv(input_csv:str, out_dir:str) -> dict:
|
|
| 444 |
csv_out = out_dir/"analysis_output.csv"
|
| 445 |
df_out = df.copy()
|
| 446 |
df_out["created_at_str"] = df_out["created_at"].apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S") if pd.notna(x) else "")
|
| 447 |
-
|
| 448 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 449 |
|
| 450 |
|
| 451 |
# ---------------- DOCX EXPORT (optional) ----------------
|
|
|
|
| 33 |
except Exception:
|
| 34 |
DOCX_AVAILABLE = False
|
| 35 |
|
| 36 |
+
try:
|
| 37 |
+
import sentiment_analysis
|
| 38 |
+
except Exception as e:
|
| 39 |
+
raise RuntimeError(f"Failed to import sentiment_analysis.py: {e}")
|
| 40 |
+
|
| 41 |
logger = logging.getLogger("processor")
|
| 42 |
logger.setLevel(logging.INFO)
|
| 43 |
|
|
|
|
| 61 |
except Exception:
|
| 62 |
device = -1
|
| 63 |
|
| 64 |
+
# try:
|
| 65 |
+
# sentiment_model = pipeline("sentiment-analysis",
|
| 66 |
+
# model="cardiffnlp/twitter-roberta-base-sentiment-latest",
|
| 67 |
+
# device=device)
|
| 68 |
+
# except Exception as e:
|
| 69 |
+
# print("Failed to load requested model:", e)
|
| 70 |
+
# try:
|
| 71 |
+
# sentiment_model = pipeline("sentiment-analysis", device=device)
|
| 72 |
+
# except Exception as ex:
|
| 73 |
+
# print("Final sentiment pipeline fallback failed:", ex); sys.exit(1)
|
| 74 |
|
| 75 |
|
| 76 |
def parse_relative_time(s: str, ref: pd.Timestamp):
|
|
|
|
| 162 |
|
| 163 |
def determine_nature(text, sentiment_label):
|
| 164 |
t = (text or "").lower()
|
| 165 |
+
# 1. High-priority flags (dangerous or specific categories)
|
| 166 |
+
if text_matches_any(t, SEPARATIST_RE): return "Separatist"
|
| 167 |
+
if text_matches_any(t, CALL_TO_ACTION_RE): return "Call-to-Action"
|
| 168 |
+
if text_matches_any(t, COMMUNAL_RE): return "Communal"
|
| 169 |
+
if text_matches_any(t, CONSPIRACY_RE): return "Conspiratorial"
|
| 170 |
+
|
| 171 |
+
# 2. Trust the advanced model's label if available
|
| 172 |
+
s = str(sentiment_label)
|
| 173 |
+
# The sentiment labels are Title-Cased (Pro-India, Anti-India, etc.)
|
| 174 |
+
# We return them as-is or ensure they match the nature output convention.
|
| 175 |
+
if s == "Pro-India": return "Pro-India"
|
| 176 |
+
if s == "Anti-India": return "Anti-India"
|
| 177 |
+
if s == "Pro-Government": return "Pro-Government"
|
| 178 |
+
if s == "Anti-Government": return "Anti-Government"
|
| 179 |
+
|
| 180 |
+
# 3. Fallback to Regex for other cases or if model was Neutral
|
| 181 |
+
if text_matches_any(t, ANTI_INDIA_RE): return "Anti-India"
|
| 182 |
+
if text_matches_any(t, PRO_INDIA_RE): return "Pro-India"
|
| 183 |
+
if text_matches_any(t, CRITICAL_GOVT_RE): return "Critical-of-Government"
|
| 184 |
+
if text_matches_any(t, SUPPORT_OPPOSITION_RE): return "Supportive-of-Opposition"
|
| 185 |
+
|
| 186 |
+
# 4. Fallback to generic POS/NEG (legacy)
|
| 187 |
+
s_upper = s.upper()
|
| 188 |
+
if "POS" in s_upper: return "Supportive"
|
| 189 |
+
if "NEG" in s_upper: return "Critical"
|
| 190 |
+
|
| 191 |
+
return "Neutral"
|
| 192 |
|
| 193 |
# ---------------- DANGEROUS FLAG ----------------
|
| 194 |
+
danger_keywords = ["kill","attack","bomb","violence","terror","terrorist","militant",
|
| 195 |
+
"insurgency","boycott","protest","call to action"]
|
| 196 |
+
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, danger_keywords)) + r')\b',
|
| 197 |
+
flags=re.IGNORECASE)
|
| 198 |
|
| 199 |
def is_dangerous(text, sentiment):
|
| 200 |
if pattern.search(text or ""): return True
|
|
|
|
| 266 |
|
| 267 |
# ---------------- SENTIMENT ----------------
|
| 268 |
print("Loading sentiment model...")
|
| 269 |
+
# Initialize anchors (required for classification)
|
| 270 |
+
sentiment_analysis.init_anchors()
|
| 271 |
|
| 272 |
texts = df["clean_text"].tolist()
|
| 273 |
preds = []
|
| 274 |
+
|
| 275 |
+
for text in texts:
|
| 276 |
+
out = sentiment_analysis.classify(text)
|
| 277 |
+
|
| 278 |
+
# Handle error or valid result
|
| 279 |
+
if "error" in out:
|
| 280 |
+
preds.append(("Neutral", 0.0))
|
| 281 |
+
else:
|
| 282 |
+
label = out.get("label", "Neutral")
|
| 283 |
+
score = float(out.get("confidence", 0.0))
|
| 284 |
preds.append((label, score))
|
| 285 |
|
| 286 |
df["sentiment"] = [p[0] for p in preds]
|
| 287 |
df["sentiment_score"] = [p[1] for p in preds]
|
| 288 |
+
|
| 289 |
df["nature"] = [
|
| 290 |
determine_nature(text, sentiment)
|
| 291 |
for text, sentiment in zip(df["clean_text"], df["sentiment"])
|
| 292 |
]
|
|
|
|
| 293 |
|
| 294 |
# ---------------- TOPIC MODELING ----------------
|
| 295 |
print("Performing topic modeling...")
|
|
|
|
| 471 |
csv_out = out_dir/"analysis_output.csv"
|
| 472 |
df_out = df.copy()
|
| 473 |
df_out["created_at_str"] = df_out["created_at"].apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S") if pd.notna(x) else "")
|
| 474 |
+
|
| 475 |
+
import time
|
| 476 |
+
for attempt in range(3):
|
| 477 |
+
try:
|
| 478 |
+
df_out.to_csv(csv_out, index=False, encoding="utf-8")
|
| 479 |
+
print("✅ Enriched CSV saved as:", csv_out)
|
| 480 |
+
break
|
| 481 |
+
except PermissionError:
|
| 482 |
+
if attempt < 2:
|
| 483 |
+
print(f"⚠️ Permission denied saving CSV (file locked?). Retrying {attempt+1}/3 in 1s...")
|
| 484 |
+
time.sleep(1)
|
| 485 |
+
else:
|
| 486 |
+
print("❌ FAILED to save CSV. The file is likely open in another program (Excel/VS Code).")
|
| 487 |
+
# We don't raise here to allow PDF generation/return to complete,
|
| 488 |
+
# but the CSV won't be updated.
|
| 489 |
+
|
| 490 |
|
| 491 |
|
| 492 |
# ---------------- DOCX EXPORT (optional) ----------------
|
reddit_scrapper.py
CHANGED
|
@@ -17,46 +17,17 @@ logger.setLevel(logging.INFO)
|
|
| 17 |
load_dotenv()
|
| 18 |
|
| 19 |
# default queries (copied from your Selenium version)
|
| 20 |
-
political_queries: List[str] = [
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
"india caste violence",
|
| 32 |
-
"india hate speech",
|
| 33 |
-
"india freedom struggle",
|
| 34 |
-
"india human rights violation",
|
| 35 |
-
"india farmers protest",
|
| 36 |
-
"india caa protest",
|
| 37 |
-
"india nrc protest",
|
| 38 |
-
"india modi resign",
|
| 39 |
-
"india bjp fail",
|
| 40 |
-
"india rss agenda",
|
| 41 |
-
"india fake news",
|
| 42 |
-
"india propaganda",
|
| 43 |
-
"india media blackout",
|
| 44 |
-
"boycott india",
|
| 45 |
-
"boycott indian products",
|
| 46 |
-
"boycott bollywood",
|
| 47 |
-
"kashmir freedom",
|
| 48 |
-
"kashmir human rights",
|
| 49 |
-
"kashmir india occupation",
|
| 50 |
-
"kashmir protest",
|
| 51 |
-
"khalistan movement",
|
| 52 |
-
"punjab separatism",
|
| 53 |
-
"anti national india",
|
| 54 |
-
"down with india",
|
| 55 |
-
"stop india aggression",
|
| 56 |
-
"india pakistan conflict",
|
| 57 |
-
"china india border",
|
| 58 |
-
"india brutality",
|
| 59 |
-
"india minority oppression"
|
| 60 |
]
|
| 61 |
|
| 62 |
def _init_reddit():
|
|
|
|
| 17 |
load_dotenv()
|
| 18 |
|
| 19 |
# default queries (copied from your Selenium version)
|
| 20 |
+
political_queries: List[str] = ["india politics","india protest","india government fail","india corruption",
|
| 21 |
+
"india democracy threat","india dictatorship","india religious violence",
|
| 22 |
+
"india communal riots","india anti muslim","india anti sikh","india caste violence",
|
| 23 |
+
"india hate speech","india freedom struggle","india human rights violation",
|
| 24 |
+
"india farmers protest","india caa protest","india nrc protest","india modi resign",
|
| 25 |
+
"india bjp fail","india rss agenda","india fake news","india propaganda",
|
| 26 |
+
"india media blackout","boycott india","boycott indian products","boycott bollywood",
|
| 27 |
+
"kashmir freedom","kashmir human rights","kashmir india occupation","kashmir protest",
|
| 28 |
+
"khalistan movement","punjab separatism","anti national india","down with india",
|
| 29 |
+
"stop india aggression","india pakistan conflict","china india border",
|
| 30 |
+
"india brutality","india minority oppression"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
]
|
| 32 |
|
| 33 |
def _init_reddit():
|
regenerate_data.py
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import processor
|
| 2 |
+
from pathlib import Path
|
| 3 |
+
import os
|
| 4 |
+
|
| 5 |
+
# Define paths
|
| 6 |
+
base_dir = Path(r"d:\CIIS\server")
|
| 7 |
+
input_csv = base_dir / "storage" / "latest" / "scraped_input.csv"
|
| 8 |
+
output_dir = base_dir / "storage" / "latest"
|
| 9 |
+
|
| 10 |
+
if input_csv.exists():
|
| 11 |
+
print(f"Regenerating report from {input_csv}...")
|
| 12 |
+
processor.generate_reports_from_csv(str(input_csv), str(output_dir))
|
| 13 |
+
print("Regeneration complete.")
|
| 14 |
+
else:
|
| 15 |
+
print(f"Input file not found: {input_csv}")
|
requirements.txt
CHANGED
|
@@ -20,114 +20,12 @@ tokenizers
|
|
| 20 |
|
| 21 |
tqdm
|
| 22 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
# absl-py==2.3.1
|
| 39 |
-
# annotated-types==0.7.0
|
| 40 |
-
# anyio==4.10.0
|
| 41 |
-
# astunparse==1.6.3
|
| 42 |
-
# attrs==25.3.0
|
| 43 |
-
# certifi==2025.8.3
|
| 44 |
-
# cffi==1.17.1
|
| 45 |
-
# charset-normalizer==3.4.3
|
| 46 |
-
# click==8.2.1
|
| 47 |
-
# colorama==0.4.6
|
| 48 |
-
# contourpy==1.3.3
|
| 49 |
-
# cycler==0.12.1
|
| 50 |
-
# fastapi==0.116.1
|
| 51 |
-
# filelock==3.19.1
|
| 52 |
-
# flatbuffers==25.2.10
|
| 53 |
-
# fonttools==4.59.2
|
| 54 |
-
# fsspec==2025.7.0
|
| 55 |
-
# gast==0.6.0
|
| 56 |
-
# google-pasta==0.2.0
|
| 57 |
-
# grpcio==1.74.0
|
| 58 |
-
# h11==0.16.0
|
| 59 |
-
# h5py==3.14.0
|
| 60 |
-
# huggingface-hub==0.34.4
|
| 61 |
-
# idna==3.10
|
| 62 |
-
# Jinja2==3.1.4
|
| 63 |
-
# joblib==1.5.2
|
| 64 |
-
# kiwisolver==1.4.9
|
| 65 |
-
# libclang==18.1.1
|
| 66 |
-
# lxml==6.0.1
|
| 67 |
-
# Markdown==3.8.2
|
| 68 |
-
# markdown-it-py==4.0.0
|
| 69 |
-
# matplotlib==3.10.8
|
| 70 |
-
# mdurl==0.1.2
|
| 71 |
-
# ml_dtypes==0.5.3
|
| 72 |
-
# mpmath==1.3.0
|
| 73 |
-
# namex==0.1.0
|
| 74 |
-
# networkx==3.3
|
| 75 |
-
# numpy==2.3.2
|
| 76 |
-
# opt_einsum==3.4.0
|
| 77 |
-
# optree==0.17.0
|
| 78 |
-
# outcome==1.3.0.post0
|
| 79 |
-
# packaging==25.0
|
| 80 |
-
# pandas==2.3.2
|
| 81 |
-
# pillow==12.1.0
|
| 82 |
-
# praw==7.8.1
|
| 83 |
-
# prawcore==2.4.0
|
| 84 |
-
# protobuf==6.32.0
|
| 85 |
-
# pycparser==2.22
|
| 86 |
-
# pydantic==2.11.7
|
| 87 |
-
# pydantic_core==2.33.2
|
| 88 |
-
# Pygments==2.19.2
|
| 89 |
-
# pyparsing==3.2.3
|
| 90 |
-
# PySocks==1.7.1
|
| 91 |
-
# python-dateutil==2.9.0.post0
|
| 92 |
-
# python-docx==1.2.0
|
| 93 |
-
# python-dotenv==1.2.1
|
| 94 |
-
# pytz==2025.2
|
| 95 |
-
# PyYAML==6.0.2
|
| 96 |
-
# regex==2025.8.29
|
| 97 |
-
# reportlab==4.4.3
|
| 98 |
-
# requests==2.32.5
|
| 99 |
-
# rich==14.1.0
|
| 100 |
-
# safetensors==0.6.2
|
| 101 |
-
# scikit-learn==1.7.1
|
| 102 |
-
# scipy==1.16.1
|
| 103 |
-
# selenium==4.35.0
|
| 104 |
-
# setuptools==80.9.0
|
| 105 |
-
# six==1.17.0
|
| 106 |
-
# sniffio==1.3.1
|
| 107 |
-
# sortedcontainers==2.4.0
|
| 108 |
-
# starlette==0.47.3
|
| 109 |
-
# sympy==1.13.3
|
| 110 |
-
# tensorboard==2.20.0
|
| 111 |
-
# tensorboard-data-server==0.7.2
|
| 112 |
-
# termcolor==3.1.0
|
| 113 |
-
# threadpoolctl==3.6.0
|
| 114 |
-
# tokenizers==0.22.0
|
| 115 |
-
# torch==2.8.0+cpu
|
| 116 |
-
# torchaudio==2.8.0+cpu
|
| 117 |
-
# torchvision==0.23.0+cpu
|
| 118 |
-
# tqdm==4.67.1
|
| 119 |
-
# transformers==4.56.0
|
| 120 |
-
# trio==0.30.0
|
| 121 |
-
# trio-websocket==0.12.2
|
| 122 |
-
# typing-inspection==0.4.1
|
| 123 |
-
# typing_extensions==4.15.0
|
| 124 |
-
# tzdata==2025.2
|
| 125 |
-
# update-checker==0.18.0
|
| 126 |
-
# urllib3==2.5.0
|
| 127 |
-
# uvicorn==0.35.0
|
| 128 |
-
# websocket-client==1.8.0
|
| 129 |
-
# Werkzeug==3.1.3
|
| 130 |
-
# wheel==0.45.1
|
| 131 |
-
# wordcloud==1.9.4
|
| 132 |
-
# wrapt==1.17.3
|
| 133 |
-
# wsproto==1.2.0
|
|
|
|
| 20 |
|
| 21 |
tqdm
|
| 22 |
|
| 23 |
+
# Core ML & NLP
|
| 24 |
+
# torch>=2.0.0
|
| 25 |
+
transformers
|
| 26 |
+
sentence-transformers
|
| 27 |
+
joblib
|
| 28 |
|
| 29 |
+
# Language Detection & Translation
|
| 30 |
+
langdetect
|
| 31 |
+
deep-translator
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
sentiment_analysis.py
ADDED
|
@@ -0,0 +1,155 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import sys
|
| 2 |
+
import os
|
| 3 |
+
|
| 4 |
+
# ---- PERMANENT IMPORT FIX ----
|
| 5 |
+
ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
|
| 6 |
+
sys.path.insert(0, ROOT_DIR)
|
| 7 |
+
|
| 8 |
+
from src.language_detection import detect_language
|
| 9 |
+
from src.preprocessing import clean_text
|
| 10 |
+
from src.predict import predict
|
| 11 |
+
from src.feature_builder import build_features
|
| 12 |
+
from src.anchor_similarity import compute_similarity
|
| 13 |
+
from src.embeddings import embedder
|
| 14 |
+
from src.sarcasm import sarcasm_score
|
| 15 |
+
from src.sentiment import sentiment_scores
|
| 16 |
+
from src.translation import translate_to_english
|
| 17 |
+
from src.context_llm import get_context_probs
|
| 18 |
+
|
| 19 |
+
# ---- SUPPORTED LANGUAGES ----
|
| 20 |
+
SUPPORTED_LANGS = {"en", "hi", "ta", "ur", "bn", "te", "ml", "gu", "kn", "mr"}
|
| 21 |
+
|
| 22 |
+
LABELS = [
|
| 23 |
+
"Pro-India",
|
| 24 |
+
"Anti-India",
|
| 25 |
+
"Pro-Government",
|
| 26 |
+
"Anti-Government",
|
| 27 |
+
"Neutral"
|
| 28 |
+
]
|
| 29 |
+
|
| 30 |
+
def init_anchors():
|
| 31 |
+
"""
|
| 32 |
+
Load anchor text from data/anchors/, encode them, and inject into anchor_similarity module.
|
| 33 |
+
"""
|
| 34 |
+
print("[INIT] Loading anchor embeddings...")
|
| 35 |
+
anchor_dir = os.path.join(ROOT_DIR, "data", "anchors")
|
| 36 |
+
|
| 37 |
+
# Map keys to filenames
|
| 38 |
+
keys = ["pro_india", "anti_india", "pro_government", "anti_government", "neutral"]
|
| 39 |
+
loaded_anchors = {}
|
| 40 |
+
|
| 41 |
+
for key in keys:
|
| 42 |
+
file_path = os.path.join(anchor_dir, f"{key}.txt")
|
| 43 |
+
if not os.path.exists(file_path):
|
| 44 |
+
print(f"[WARNING] Anchor file missing: {file_path}")
|
| 45 |
+
continue
|
| 46 |
+
|
| 47 |
+
with open(file_path, "r", encoding="utf-8") as f:
|
| 48 |
+
lines = [line.strip() for line in f if line.strip()]
|
| 49 |
+
|
| 50 |
+
if not lines:
|
| 51 |
+
print(f"[WARNING] Anchor file empty: {key}")
|
| 52 |
+
continue
|
| 53 |
+
|
| 54 |
+
# Encode (batch)
|
| 55 |
+
# embedder is from src.embeddings
|
| 56 |
+
embeddings_matrix = embedder.encode(lines)
|
| 57 |
+
loaded_anchors[key] = embeddings_matrix
|
| 58 |
+
print(f" - Loaded {key}: {len(lines)} examples")
|
| 59 |
+
|
| 60 |
+
# Inject into module
|
| 61 |
+
from src.anchor_similarity import load_anchor_embeddings
|
| 62 |
+
load_anchor_embeddings(loaded_anchors)
|
| 63 |
+
print("[INIT] Anchor embeddings initialized.\n")
|
| 64 |
+
|
| 65 |
+
def classify(text: str):
|
| 66 |
+
# 1. Clean text
|
| 67 |
+
text = clean_text(text)
|
| 68 |
+
|
| 69 |
+
if len(text.strip()) == 0:
|
| 70 |
+
return {"error": "Empty input text"}
|
| 71 |
+
|
| 72 |
+
# 2. Language detection
|
| 73 |
+
lang, prob = detect_language(text)
|
| 74 |
+
|
| 75 |
+
# DEBUG (you can remove later)
|
| 76 |
+
print(f"[DEBUG] Detected language: {lang}, confidence: {round(prob, 3)}")
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
# 2.5 Translation (if not English)
|
| 80 |
+
# We use English for processing because the Sarcasm/Sentiment models are English-specific
|
| 81 |
+
# and the Anchors are in English.
|
| 82 |
+
processing_text = text
|
| 83 |
+
if lang != 'en':
|
| 84 |
+
print(f"[INFO] Translating {lang} to en...")
|
| 85 |
+
translated = translate_to_english(text, source=lang)
|
| 86 |
+
print(f" -> {translated}")
|
| 87 |
+
processing_text = translated
|
| 88 |
+
|
| 89 |
+
# 3. Sentence embedding
|
| 90 |
+
text_embedding = embedder.encode(processing_text, normalize_embeddings=True)
|
| 91 |
+
|
| 92 |
+
# 4. Cosine similarity with anchors
|
| 93 |
+
similarity_scores = compute_similarity(
|
| 94 |
+
text_embedding=text_embedding,
|
| 95 |
+
anchor_embeddings=None # handled internally if global
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
+
# 5. Sentiment + sarcasm
|
| 99 |
+
sentiment = sentiment_scores(processing_text) # [neg, neutral, pos]
|
| 100 |
+
sarcasm = sarcasm_score(processing_text) # float 0–1
|
| 101 |
+
|
| 102 |
+
# 5.5 LLM Context Analysis
|
| 103 |
+
context_probs = get_context_probs(processing_text)
|
| 104 |
+
|
| 105 |
+
# 6. Feature vector
|
| 106 |
+
features = build_features(
|
| 107 |
+
similarity=similarity_scores,
|
| 108 |
+
sentiment=sentiment,
|
| 109 |
+
sarcasm=sarcasm,
|
| 110 |
+
context_probs=context_probs
|
| 111 |
+
)
|
| 112 |
+
|
| 113 |
+
# 7. Final prediction
|
| 114 |
+
label_idx, confidence = predict(features)
|
| 115 |
+
|
| 116 |
+
return {
|
| 117 |
+
"text": text,
|
| 118 |
+
"label": LABELS[label_idx],
|
| 119 |
+
"confidence": round(confidence, 3),
|
| 120 |
+
"language": lang,
|
| 121 |
+
"sarcasm_score": round(sarcasm, 3),
|
| 122 |
+
"sentiment": {
|
| 123 |
+
"negative": round(sentiment[0], 3),
|
| 124 |
+
"neutral": round(sentiment[1], 3),
|
| 125 |
+
"positive": round(sentiment[2], 3),
|
| 126 |
+
}
|
| 127 |
+
}
|
| 128 |
+
|
| 129 |
+
# ---- ENTRY POINT ----
|
| 130 |
+
if __name__ == "__main__":
|
| 131 |
+
init_anchors()
|
| 132 |
+
|
| 133 |
+
# Process test.txt if it exists
|
| 134 |
+
if os.path.exists("test.txt"):
|
| 135 |
+
print("Processing test.txt...")
|
| 136 |
+
with open("test.txt","r") as f:
|
| 137 |
+
for line in f:
|
| 138 |
+
if line.strip():
|
| 139 |
+
result= classify(line)
|
| 140 |
+
print(result)
|
| 141 |
+
print("-" * 50)
|
| 142 |
+
|
| 143 |
+
print("\n🔍 Reddit Political Stance Classifier")
|
| 144 |
+
print("Type 'exit' to quit\n")
|
| 145 |
+
|
| 146 |
+
while True:
|
| 147 |
+
text = input("Enter Reddit post: ").strip()
|
| 148 |
+
|
| 149 |
+
if text.lower() == "exit":
|
| 150 |
+
break
|
| 151 |
+
|
| 152 |
+
result = classify(text)
|
| 153 |
+
print("\nResult:")
|
| 154 |
+
print(result)
|
| 155 |
+
print("-" * 50)
|
src/__init__.py
ADDED
|
File without changes
|
src/anchor_similarity.py
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import numpy as np
|
| 2 |
+
from sklearn.metrics.pairwise import cosine_similarity
|
| 3 |
+
|
| 4 |
+
print("anchor_similarity module loaded")
|
| 5 |
+
|
| 6 |
+
# --------------------------------------------------
|
| 7 |
+
# GLOBAL ANCHOR EMBEDDINGS
|
| 8 |
+
# --------------------------------------------------
|
| 9 |
+
# These must be filled during initialization
|
| 10 |
+
# Example structure:
|
| 11 |
+
# {
|
| 12 |
+
# "pro_india": np.ndarray,
|
| 13 |
+
# "anti_india": np.ndarray,
|
| 14 |
+
# "pro_government": np.ndarray,
|
| 15 |
+
# "anti_government": np.ndarray,
|
| 16 |
+
# "neutral": np.ndarray
|
| 17 |
+
# }
|
| 18 |
+
|
| 19 |
+
ANCHOR_EMBEDDINGS = {}
|
| 20 |
+
|
| 21 |
+
def load_anchor_embeddings(anchor_embeddings: dict):
|
| 22 |
+
"""
|
| 23 |
+
Load precomputed anchor embeddings once at startup
|
| 24 |
+
"""
|
| 25 |
+
global ANCHOR_EMBEDDINGS
|
| 26 |
+
ANCHOR_EMBEDDINGS = anchor_embeddings
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def compute_similarity(text_embedding: np.ndarray, anchor_embeddings=None) -> dict:
|
| 30 |
+
"""
|
| 31 |
+
Compute cosine similarity between text embedding and anchor sets
|
| 32 |
+
"""
|
| 33 |
+
|
| 34 |
+
# Use global anchors if not explicitly passed
|
| 35 |
+
anchors = anchor_embeddings if anchor_embeddings is not None else ANCHOR_EMBEDDINGS
|
| 36 |
+
|
| 37 |
+
if not anchors:
|
| 38 |
+
raise ValueError("Anchor embeddings not loaded")
|
| 39 |
+
|
| 40 |
+
scores = {}
|
| 41 |
+
|
| 42 |
+
for label, vectors in anchors.items():
|
| 43 |
+
sims = cosine_similarity(
|
| 44 |
+
text_embedding.reshape(1, -1),
|
| 45 |
+
vectors
|
| 46 |
+
)[0]
|
| 47 |
+
|
| 48 |
+
# top-k mean similarity
|
| 49 |
+
scores[label] = float(np.mean(np.sort(sims)[-5:]))
|
| 50 |
+
|
| 51 |
+
return scores
|
src/config.py
ADDED
|
File without changes
|
src/context_llm.py
ADDED
|
@@ -0,0 +1,79 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from transformers import pipeline
|
| 2 |
+
import torch
|
| 3 |
+
|
| 4 |
+
print("context_llm module loaded (Zero-Shot BART)")
|
| 5 |
+
|
| 6 |
+
# Lazily-initialised zero-shot classification pipeline (None until loaded).
classifier = None


def load_context_model():
    """Lazily initialise the zero-shot classification pipeline.

    Loads a distilled BART-MNLI model exactly once; subsequent calls are
    no-ops. A load failure is non-fatal: `classifier` stays None and
    callers fall back to neutral scores.
    """
    global classifier
    if classifier is not None:
        return  # already loaded

    try:
        # Prefer GPU when available; device -1 selects CPU for the pipeline.
        device = 0 if torch.cuda.is_available() else -1

        print("[LLM] Loading valhalla/distilbart-mnli-12-3 (Distilled) for context analysis...")
        classifier = pipeline(
            "zero-shot-classification",
            model="valhalla/distilbart-mnli-12-3",
            device=device,
        )
        print("[LLM] Context model loaded successfully.")
    except Exception as e:
        # Non-fatal: downstream callers will just see neutral scores.
        print(f"[LLM] CRITICAL ERROR: {e}")


def get_context_probs(text: str) -> list:
    """Score *text* against four stance hypotheses via zero-shot NLI.

    Returns probabilities in fixed order:
        0: "criticism of the government"  (anti-govt)
        1: "criticism of the country"     (anti-India)
        2: "praise of the government"     (pro-govt)
        3: "praise of the country"        (pro-India)
    Falls back to a uniform [0.25] * 4 whenever the model is unavailable
    or inference fails.
    """
    if classifier is None:
        load_context_model()

    if classifier is None:
        # Model failed to load: return an uninformative uniform prior.
        return [0.25, 0.25, 0.25, 0.25]

    labels = [
        "criticism of the government",  # 0
        "criticism of the country",     # 1
        "praise of the government",     # 2
        "praise of the country",        # 3
    ]

    try:
        result = classifier(text, candidate_labels=labels, multi_label=False)

        # The pipeline returns labels/scores sorted by probability;
        # re-map them to our fixed label order.
        by_label = dict(zip(result["labels"], result["scores"]))
        return [by_label.get(lbl, 0.0) for lbl in labels]
    except Exception as e:
        print(f"[LLM] Inference failed: {e}")
        return [0.25, 0.25, 0.25, 0.25]
|
src/embeddings.py
ADDED
|
@@ -0,0 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from sentence_transformers import SentenceTransformer

print("embeddings module loaded")

# Multilingual sentence embedding model
# NOTE(review): chosen presumably for Hindi/English Reddit text — it supports
# 50+ languages; confirm the anchor embeddings were built with the same model.
EMBEDDING_MODEL_NAME = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"

# Shared encoder instance, loaded eagerly at import time (downloads weights
# on first run).
embedder = SentenceTransformer(EMBEDDING_MODEL_NAME)
|
src/feature_builder.py
ADDED
|
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import numpy as np
|
| 2 |
+
|
| 3 |
+
print("feature_builder module loaded")
|
| 4 |
+
|
| 5 |
+
def build_features(similarity: dict, sentiment: list, sarcasm: float, context_probs: list) -> np.ndarray:
    """Assemble the 13-dim feature vector fed to the stance classifier.

    Args:
        similarity: anchor-similarity scores keyed by stance label
            (pro_india, anti_india, pro_government, anti_government, neutral).
        sentiment: [negative, neutral, positive] probabilities.
        sarcasm: sarcasm probability in [0, 1].
        context_probs: [political criticism, national criticism,
            political praise, national praise] from the zero-shot model.

    Returns:
        np.ndarray of shape (13,), dtype float32, in the fixed order the
        classifier was trained on.
    """
    vec = [
        similarity["pro_india"],
        similarity["anti_india"],
        similarity["pro_government"],
        similarity["anti_government"],
        similarity["neutral"],
    ]
    vec += [sentiment[0], sentiment[1], sentiment[2]]  # neg, neu, pos
    vec.append(sarcasm)
    vec += [context_probs[i] for i in range(4)]        # pol/nat crit, pol/nat praise

    return np.array(vec, dtype=np.float32)
|
src/language_detection.py
ADDED
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from langdetect import detect_langs, DetectorFactory
|
| 2 |
+
|
| 3 |
+
# Enforce determinism
|
| 4 |
+
DetectorFactory.seed = 0
|
| 5 |
+
|
| 6 |
+
# High-frequency English function words used as a cheap short-text heuristic.
_ENGLISH_HINTS = frozenset(
    {"the", "is", "are", "and", "of", "to", "in", "it", "has", "have", "for", "on", "with"}
)


def detect_language(text: str):
    """Detect the language of a Reddit comment.

    Returns a (language_code, probability) tuple. Short English texts are
    caught by a stopword heuristic first, because statistical detection is
    unreliable on few tokens (e.g. "India has deep flaws" -> Spanish).
    """
    # 1) Heuristic: any common English stopword => call it English outright.
    tokens = set(text.lower().split())
    if tokens & _ENGLISH_HINTS:
        return "en", 1.0

    # 2) Statistical detection via langdetect (best candidate wins).
    try:
        candidates = detect_langs(text)  # sorted best-first
        top = candidates[0]
        return top.lang, top.prob
    except Exception:
        # Empty, numeric, or otherwise undetectable text.
        return "unknown", 0.0
|
| 29 |
+
|
src/predict.py
ADDED
|
@@ -0,0 +1,22 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import joblib
import numpy as np

print("predict module loaded")

# Serialized stance classifier produced by the training scripts
# (src/train_logic_aligned.py / src/train_classifier.py).
MODEL_PATH = "models/final_classifier.pkl"

# Loaded eagerly at import time; raises FileNotFoundError if the model
# has not been trained yet.
clf = joblib.load(MODEL_PATH)
|
| 9 |
+
|
| 10 |
+
def predict(features: np.ndarray):
    """Predict the stance label index and a confidence score.

    Confidence is the relative margin between the two most probable
    classes, (p1 - p2) / p1, so it lies in [0, 1): near 0 when the top
    two classes are nearly tied, near 1 when the winner dominates.
    """
    probs = clf.predict_proba([features])[0]

    order = np.argsort(probs)  # ascending
    top = order[-1]
    runner_up = order[-2]

    margin = (probs[top] - probs[runner_up]) / probs[top]
    return top, float(margin)
|
src/preprocessing.py
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
print("preprocessing module loaded")

import re


def clean_text(text: str) -> str:
    """Basic text normalization for Reddit posts.

    Removes URLs, collapses runs of whitespace to single spaces, and
    strips leading/trailing whitespace.
    """
    # FIX: the module previously defined clean_text (and imported re) twice;
    # the duplicate definitions were identical, so they are merged here.
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"\s+", " ", text)     # normalize spaces
    return text.strip()
|
src/sarcasm.py
ADDED
|
@@ -0,0 +1,41 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

print("sarcasm module loaded (BERT Sarcasm Detector)")

# FIX: Use a Twitter-based Irony model (RoBERTa) which is better for social media/Reddit
MODEL_NAME = "cardiffnlp/twitter-roberta-base-irony"

try:
    # FIX: Force use_fast=False to avoid Windows rust-tokenizer crashes
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    # Inference-only module: switch off dropout/batch-norm training behaviour.
    model.eval()
except Exception as e:
    # Fail fast: the rest of the pipeline cannot run without this model.
    print(f"CRITICAL ERROR loading sarcasm model: {e}")
    raise e
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def sarcasm_score(text: str) -> float:
    """Return the probability (0-1) that *text* is sarcastic/ironic.

    Scored with the RoBERTa irony model configured above
    (cardiffnlp/twitter-roberta-base-irony); label index 1 is the
    "ironic" class.
    """
    encoded = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128,
    )

    with torch.no_grad():
        logits = model(**encoded).logits

    probs = torch.softmax(logits, dim=1)
    # Probability of the positive (ironic) class.
    return float(probs[0][1])
|
src/sentiment.py
ADDED
|
@@ -0,0 +1,37 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

print("sentiment module loaded (English RoBERTa)")

# FIX: Use the standard (older) model which definitely has support for slow tokenizers
# The 'latest' version sometimes lacks full file support for use_fast=False on all setups
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"

try:
    # FIX: Force use_fast=False to avoid Windows rust-tokenizer crashes
    # This uses the stable Python-based tokenizer (Byte-Level BPE)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    # Inference-only module: switch off dropout/batch-norm training behaviour.
    model.eval()
except Exception as e:
    # Fail fast: the rest of the pipeline cannot run without this model.
    print(f"CRITICAL ERROR loading sentiment model: {e}")
    raise e
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def sentiment_scores(text: str):
    """Return sentiment probabilities as [negative, neutral, positive]."""
    encoded = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=128,
    )

    with torch.no_grad():
        logits = model(**encoded).logits

    # Model label order: negative, neutral, positive.
    return torch.softmax(logits, dim=1)[0].tolist()
|
src/train_classifier.py
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import joblib
|
| 2 |
+
from sklearn.linear_model import LogisticRegression
|
| 3 |
+
|
| 4 |
+
print("train_classifier module loaded")
|
| 5 |
+
|
| 6 |
+
def train_and_save(X, y):
    """Train the final stance classifier and persist it to disk.

    Args:
        X: array-like of shape (n_samples, n_features) — feature vectors
           (assumed to come from feature_builder.build_features; confirm
           with callers).
        y: array-like of shape (n_samples,) — integer stance labels.

    Side effects:
        Writes the fitted model to models/final_classifier.pkl, creating
        the directory first so joblib.dump cannot fail on a fresh checkout.
    """
    import os  # local import keeps the module's public import surface unchanged

    # NOTE: multi_class="multinomial" was removed — the parameter is
    # deprecated since scikit-learn 1.5 (removed in 1.7), and multinomial
    # is already the behaviour of the default lbfgs solver for multiclass y.
    clf = LogisticRegression(max_iter=2000)
    clf.fit(X, y)

    os.makedirs("models", exist_ok=True)
    joblib.dump(clf, "models/final_classifier.pkl")
    print("✅ Model trained and saved")
|
src/train_logic_aligned.py
ADDED
|
@@ -0,0 +1,106 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import numpy as np
import joblib
import os
from sklearn.linear_model import LogisticRegression

# Output path, resolved relative to this file so the script works from any CWD.
MODEL_PATH = os.path.join(os.path.dirname(__file__), "..", "models", "final_classifier.pkl")
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)

print(">>> Generating Synthetic Logic-Aligned Training Data...")

# Feature layout — must stay in sync with feature_builder.build_features:
# 0: sim_pro_india
# 1: sim_anti_india
# 2: sim_pro_govt
# 3: sim_anti_govt
# 4: sim_neutral
# 5: neg
# 6: neu
# 7: pos
# 8: sarcasm
# 9: context_pol_crit (Anti-Govt)
# 10: context_nat_crit (Anti-India)
# 11: context_pol_praise (Pro-Govt)
# 12: context_nat_praise (Pro-India)
| 27 |
+
# Per-label feature overrides applied on top of the baseline noise.
# Each entry is (feature_index, low, high) for np.random.uniform, listed
# in the same order the original branches drew them (keeps RNG streams
# reproducible under a fixed seed).
_LABEL_OVERRIDES = {
    # 0: Pro-India — high pro-India sim, positive sentiment, national praise.
    0: [(0, 0.6, 1.0), (7, 0.5, 1.0), (5, 0.0, 0.2), (8, 0.0, 0.2),
        (12, 0.7, 1.0), (9, 0.0, 0.2)],
    # 1: Anti-India — high anti-India sim, negative sentiment, national criticism.
    1: [(1, 0.6, 1.0), (5, 0.5, 1.0), (7, 0.0, 0.2), (8, 0.0, 0.2),
        (10, 0.7, 1.0), (9, 0.0, 0.3)],
    # 2: Pro-Government — high pro-govt sim, positive sentiment, political praise.
    2: [(2, 0.6, 1.0), (7, 0.5, 1.0), (5, 0.0, 0.2), (8, 0.0, 0.2),
        (11, 0.7, 1.0), (10, 0.0, 0.2)],
    # 3: Anti-Government — high anti-govt sim, negative sentiment, political criticism.
    3: [(3, 0.6, 1.0), (5, 0.5, 1.0), (7, 0.0, 0.2), (8, 0.0, 0.2),
        (9, 0.7, 1.0), (10, 0.0, 0.4)],
    # 4: Neutral — high neutral sim/sentiment, everything else kept low.
    4: [(4, 0.5, 1.0), (6, 0.5, 1.0), (5, 0.0, 0.2), (7, 0.0, 0.2),
        (8, 0.0, 0.1), (9, 0.0, 0.3), (10, 0.0, 0.3)],
}


def generate_sample(label_idx):
    """Draw one synthetic 13-dim feature vector for class *label_idx*.

    Starts from uniform baseline noise in [0, 0.3) and then overwrites the
    label-characteristic features with stronger draws from _LABEL_OVERRIDES.
    Sarcasm (index 8) is deliberately kept low for every class.
    """
    feats = np.random.uniform(0.0, 0.3, 13)
    for idx, low, high in _LABEL_OVERRIDES.get(label_idx, ()):
        feats[idx] = np.random.uniform(low, high)
    return feats
|
| 83 |
+
|
| 84 |
+
# Generate a balanced synthetic dataset: SAMPLES_PER_CLASS rows per label.
X = []
y = []
SAMPLES_PER_CLASS = 500

for label in range(5):
    for _ in range(SAMPLES_PER_CLASS):
        X.append(generate_sample(label))
        y.append(label)

X = np.array(X)
y = np.array(y)

print(f"Training Logistic Regression on {len(X)} synthetic samples (13 features)...")
# NOTE: multi_class='multinomial' was removed — the parameter is deprecated
# since scikit-learn 1.5 (removed in 1.7), and multinomial is already the
# behaviour of the lbfgs solver for multiclass targets.
clf = LogisticRegression(max_iter=1000, solver='lbfgs')
clf.fit(X, y)

print(f"Accuracy on Training Set: {clf.score(X, y):.4f}")

print(f"Saving model to {MODEL_PATH}...")
joblib.dump(clf, MODEL_PATH)
print("DONE.")
|
src/translation.py
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from deep_translator import GoogleTranslator
|
| 2 |
+
import time
|
| 3 |
+
|
| 4 |
+
def translate_to_english(text: str, source="auto") -> str:
    """Translate input text to English using Google Translator.

    Retries once on failure (with a brief pause), falling back to the
    original text if both attempts fail.

    Args:
        text: the text to translate.
        source: source language code, or "auto" for auto-detection.
    """
    # FIX: the docstring promised a retry but the old code attempted the
    # translation only once (and `import time` was unused).
    last_error = None
    for attempt in range(2):
        try:
            translator = GoogleTranslator(source=source, target='en')
            return translator.translate(text)
        except Exception as e:
            last_error = e
            if attempt == 0:
                time.sleep(0.2)  # brief pause before the single retry
    print(f"[WARNING] Translation failed: {last_error}")
    # Fallback to original text if translation fails
    return text
|
train_once.py
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import numpy as np

from src.train_classifier import train_and_save

# DUMMY FEATURES — 13 features to match feature_builder.build_features
# (5 anchor similarities + 3 sentiment + 1 sarcasm + 4 context probs).
# FIX: was 9, which would train a model incompatible with the live pipeline.
X = np.random.rand(20, 13)

# DUMMY LABELS (5 classes)
y = np.random.randint(0, 5, size=20)

train_and_save(X, y)
|