subhamb04 committed · Commit 01812ff · verified · 1 Parent(s): 4b1a03d

Upload 21 files
.gitignore ADDED
@@ -0,0 +1,122 @@
+ # ------------------------------
+ # Python
+ # ------------------------------
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+ *.pyd
+ *.dll
+
+ # ------------------------------
+ # Environments
+ # ------------------------------
+ .venv/
+ venv/
+ env/
+ ENV/
+ .venv*/
+ venv*/
+ env*/
+ ENV*/
+ .python-version
+
+ # ------------------------------
+ # Distribution / packaging
+ # ------------------------------
+ .Python
+ build/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ sdist/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+ pip-wheel-metadata/
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # ------------------------------
+ # Unit test / coverage reports
+ # ------------------------------
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .pytest_cache/
+ junit*.xml
+
+ # ------------------------------
+ # Type checkers / linters
+ # ------------------------------
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+ .pyre/
+ .pytype/
+ .ruff_cache/
+
+ # ------------------------------
+ # PyInstaller
+ # ------------------------------
+ *.manifest
+ *.spec
+
+ # ------------------------------
+ # Jupyter
+ # ------------------------------
+ .ipynb_checkpoints/
+
+ # ------------------------------
+ # Logs and runtime files
+ # ------------------------------
+ logs/
+ *.log
+ *.pid
+ *.pid.lock
+
+ # ------------------------------
+ # Local environment variables & secrets
+ # ------------------------------
+ .env
+ .env.*
+ !.env.example
+
+ # ------------------------------
+ # Editors / IDEs / Tooling
+ # ------------------------------
+ .idea/
+ *.iml
+ .vscode/
+ .history/
+ .cursor/
+ *.code-workspace
+
+ # ------------------------------
+ # OS-specific
+ # ------------------------------
+ .DS_Store
+ Thumbs.db
+ ehthumbs.db
+ Desktop.ini
+
+ # ------------------------------
+ # Optional local data & temp
+ # ------------------------------
+ tmp/
+ temp/
+ data/
README.md CHANGED
@@ -1,20 +1,148 @@
- ---
- title: Stableai
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
-   - streamlit
- pinned: false
- short_description: LLM consistency & predictability analysis
- license: mit
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).
+ ## StableAI – LLM Consistency Demo
+
+ A small Streamlit app that showcases several practical mechanisms to improve LLM predictability and consistency. It runs locally and calls Groq-hosted models via an OpenAI-compatible API.
+
+ ### What this app demonstrates
+ - **Baseline (Raw LLM)**: Direct model call without safeguards.
+ - **Caching & Replay**: Deterministic replay for identical prompts in a session.
+ - **Historical Consistency**: Reuses prior answers for similar prompts using a fuzzy matcher.
+ - **Cross-Model Consensus**: Gathers answers from multiple models and asks a judge model to summarize consensus.
+ - **Constraint Validation (Schema)**: Forces JSON output and validates it via Pydantic.
+ - **Predictability Index**: Runs the same prompt multiple times and scores similarity between outputs.
+
+ ---
+
+ ## Architecture at a glance
+ - `app.py`: Streamlit UI; wires user input to registered mechanisms, exposes buttons for cache/history management.
+ - `config.py`: Loads environment, initializes OpenAI-compatible client pointing at Groq (`GROQ_API_KEY`, base URL `https://api.groq.com/openai/v1`).
+ - `utils/llm_utils.py`: Thin wrapper `call_model(prompt, model)` and `get_hash(text)` for caching keys.
+ - `mechanisms/` (registry in `__init__.py`):
+   - `baseline.py`: Simple call to `llama-3.1-8b-instant`.
+   - `caching.py`: In-memory cache in `st.session_state.cache` keyed by SHA-256 of the prompt.
+   - `historical.py`: Similarity lookup over `st.session_state.history` using `difflib.SequenceMatcher`.
+   - `consensus.py`: Calls multiple models, then a judge model to assess/summarize consensus.
+   - `constraint.py`: Prompts for strict JSON, validates with a Pydantic model.
+   - `predictability.py`: N repeated calls; pairwise similarity to compute a predictability score.
+
+ Notes:
+ - Session state is ephemeral (cleared when the Streamlit session resets).
+ - Network calls go through the Groq API using the OpenAI SDK interface.
+
+ ---
+
+ ## Prerequisites
+ - Python 3.9+
+ - A Groq API key (`GROQ_API_KEY`)
+
+ ---
+
+ ## Local setup
+ 1) Clone and enter the project directory.
+
+ 2) (Recommended) Create and activate a virtual environment:
+ ```bash
+ python -m venv .venv
+ # Windows PowerShell
+ . .venv\Scripts\Activate.ps1
+ # macOS/Linux
+ source .venv/bin/activate
+ ```
+
+ 3) Install dependencies (a pinned `requirements.txt` is included in this commit):
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 4) Create a `.env` file in the project root with your key:
+ ```bash
+ GROQ_API_KEY=your_groq_api_key_here
+ ```
+
+ ---
+
+ ## Running the app
+ From the project root:
+ ```bash
+ streamlit run app.py
+ ```
+
+ Then open the URL Streamlit prints (typically `http://localhost:8501`).
+
+ ---
+
+ ## Using the app
+ 1) Choose a mechanism from the radio options.
+ 2) Enter your query.
+ 3) Click “Ask”.
+ 4) Use “Clear Cache” / “Clear History” as needed to reset state.
+
+ Model notes:
+ - Default model for most calls is `llama-3.1-8b-instant`.
+ - Cross-model consensus uses `openai/gpt-oss-20b` and `llama-3.3-70b-versatile` and judges with `llama-3.1-8b-instant`.
+
+ ---
+
+ ## How it works (brief)
+ - The UI passes the prompt to a selected mechanism via a registry (`MECHANISMS`).
+ - Each mechanism composes a request using `utils/llm_utils.call_model` (OpenAI SDK → Groq endpoint).
+ - Some mechanisms store/retrieve answers from `st.session_state` for caching and history.
+
+ ---
+
+ ## Limitations and improvement ideas
+
+ ### Similarity and retrieval
+ - Replace `difflib.SequenceMatcher` with **embeddings + cosine similarity**:
+   - Use sentence embeddings (e.g., `text-embedding-3-large`, or any Groq-supported embedding model) to encode prompts/answers.
+   - Compute cosine similarity for robust semantic matching over historical prompts, not just character overlap.
+   - Persist vectors and metadata in a vector store (e.g., FAISS, Chroma, pgvector) for efficient nearest-neighbor search.
+   - Benefit: better recall/precision for paraphrases and longer contexts.
+
+ ### Caching and persistence
+ - Move from in-memory `st.session_state` to a persistent cache (Redis, SQLite) with TTLs and size limits.
+ - Cache by normalized prompt + key parameters (model, temperature, system prompt) to avoid accidental collisions.
+ - Add cache warming and background refresh for hot prompts.
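The normalized-key idea can be sketched as follows. `cache_key` is a hypothetical helper (not part of this repo), and the exact parameter set is an assumption; the point is that the hash covers everything that changes model behavior:

```python
import hashlib
import json

def cache_key(prompt: str, model: str, temperature: float, system: str = "") -> str:
    # Normalize whitespace and case so trivially different prompts share one entry,
    # then hash the prompt together with the parameters that affect the answer.
    normalized = " ".join(prompt.strip().lower().split())
    payload = json.dumps(
        {"prompt": normalized, "model": model, "temperature": temperature, "system": system},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

Two prompts differing only in whitespace map to the same key, while the same prompt at a different temperature does not.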
+
+ ### Determinism and variance control
+ - Expose decoding params (temperature, top_p, seed if supported) in the UI.
+ - For predictability scoring, fix seeds where supported to separate model stochasticity from service variance.
+ - Compute additional stability metrics (e.g., ROUGE-L, BERTScore) between runs.
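As a cheap, stdlib-only stand-in for heavier metrics like ROUGE-L or BERTScore, a token-set Jaccard score can complement `difflib`'s character ratio. `mean_pairwise_stability` below is a hypothetical helper, not existing code:

```python
import itertools

def jaccard(a: str, b: str) -> float:
    # Token-set overlap: crude but order-insensitive, unlike difflib's character ratio.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def mean_pairwise_stability(answers: list) -> float:
    # Average similarity over all unordered pairs of repeated runs of the same prompt.
    pairs = list(itertools.combinations(answers, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)
```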
+
+ ### Robust output contracts
+ - Expand `constraint.py` to support multiple schemas and strict parsing with function calling / JSON mode if available.
+ - Add a retry/repair loop when JSON validation fails (ask the model to fix its output).
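One stdlib-only sketch of such a repair loop. `constrained_call` and its prompts are hypothetical; the model call is injected as a callable so the loop can be exercised without network access (the real `constraint.py` validates with Pydantic instead of manual checks):

```python
import json
from typing import Callable

def constrained_call(prompt: str, call: Callable[[str], str], max_repairs: int = 2) -> dict:
    # First attempt: ask for strict JSON.
    raw = call(f'Respond strictly as JSON: {{"answer": "<concise answer>"}}\nUser query: {prompt}')
    for attempt in range(max_repairs + 1):
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and isinstance(data.get("answer"), str):
                return data
            problem = 'missing string field "answer"'
        except json.JSONDecodeError as exc:
            problem = str(exc)
        if attempt < max_repairs:
            # Repair attempt: show the model its own invalid output and the parse error.
            raw = call(
                f'Your previous output was invalid ({problem}). '
                f'Return only valid JSON of the form {{"answer": "..."}}. '
                f'Previous output: {raw}'
            )
    raise ValueError("model never produced valid JSON")
```

With a stub that first returns garbage and then valid JSON, the loop recovers on the second attempt.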
+
+ ### Evaluation and CI
+ - Add an evaluation harness with a small prompt-answer dataset:
+   - Track accuracy/consistency per mechanism over time.
+   - Save run artifacts (inputs, outputs, scores) for regression checks.
+ - Provide unit tests for key utilities and mechanisms; mock network calls.
+
+ ### Observability
+ - Add logging/telemetry for latency, token usage, and error rates.
+ - Surface metrics in the UI (per mechanism) to understand trade-offs.
+
+ ### UX improvements
+ - Show per-mechanism explanations next to results.
+ - Allow exporting session cache/history and reloading it.
+ - Provide an advanced settings accordion (models, decoding params, thresholds).
+
+ ---
+
+ ## Example: swapping fuzzy matcher for embeddings
+ High-level steps to upgrade `historical.py`:
+ 1) Add an embedding helper (e.g., `get_embedding(text) -> List[float]`).
+ 2) On first-seen prompts, store `{prompt, answer, prompt_embedding}` in a persistent store.
+ 3) On each new query, compute its embedding and run a top-k nearest-neighbor search by cosine similarity.
+ 4) If similarity > threshold (e.g., 0.85), return the historical answer; otherwise call the model and insert the new row.
+
+ This yields more robust reuse across paraphrases and longer prompts compared to `difflib`.
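The lookup side of this upgrade might look like the following plain-Python sketch. `nearest_answer` is hypothetical, the embedding call itself is omitted (in practice `query_vec` would come from an embedding API), and a real deployment would use a vector index rather than a linear scan:

```python
import math
from typing import List, Optional, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_answer(
    query_vec: List[float],
    store: List[Tuple[str, str, List[float]]],  # (prompt, answer, embedding) rows
    threshold: float = 0.85,
) -> Optional[str]:
    # Linear scan; a vector store (FAISS, Chroma, pgvector) would replace this loop.
    best_answer, best_sim = None, 0.0
    for _prompt, answer, vec in store:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_answer, best_sim = answer, sim
    return best_answer if best_sim >= threshold else None
```

Queries close to a stored embedding return the stored answer; anything below the threshold falls through to a fresh model call.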
+
+ ---
+
+ ## Troubleshooting
+ - 401/403 errors: verify `GROQ_API_KEY` and `.env` loading; confirm the base URL matches Groq’s OpenAI-compatible endpoint.
+ - Streamlit can reuse state across reruns; use the provided buttons to clear cache/history.
+ - If models change or rate limits apply, consensus may show partial errors; the UI surfaces them inline.
app.py ADDED
@@ -0,0 +1,28 @@
+ import streamlit as st
+ from mechanisms import MECHANISMS
+
+ if "cache" not in st.session_state:
+     st.session_state.cache = {}
+ if "history" not in st.session_state:
+     st.session_state.history = {}
+
+ st.title("LLM Consistency Demo")
+ st.markdown("Explore mechanisms to improve LLM predictability & consistency.")
+
+ mode = st.radio("Choose Mechanism:", list(MECHANISMS.keys()))
+ user_prompt = st.text_input("Enter your query:")
+
+ if st.button("Ask"):
+     if user_prompt.strip():
+         answer = MECHANISMS[mode](user_prompt)
+
+         st.markdown("### Response:")
+         st.write(answer)
+
+ if st.button("Clear Cache"):
+     st.session_state.cache.clear()
+     st.success("Cache cleared!")
+
+ if st.button("Clear History"):
+     st.session_state.history.clear()
+     st.success("History cleared!")
config.py ADDED
@@ -0,0 +1,12 @@
+ import os
+ from dotenv import load_dotenv
+ from openai import OpenAI
+
+ load_dotenv(override=True)
+
+ GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+
+ client = OpenAI(
+     api_key=GROQ_API_KEY,
+     base_url="https://api.groq.com/openai/v1"
+ )
mechanisms/__init__.py ADDED
@@ -0,0 +1,15 @@
+ from .baseline import baseline
+ from .caching import caching
+ from .historical import historical
+ from .consensus import cross_model
+ from .constraint import constraint
+ from .predictability import predictability
+
+ MECHANISMS = {
+     "Baseline (Raw LLM)": baseline,
+     "Caching & Replay": caching,
+     "Historical Consistency": historical,
+     "Cross-Model Consensus": cross_model,
+     "Constraint Validation (Schema)": constraint,
+     "Predictability Index": predictability,
+ }
mechanisms/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (656 Bytes).
mechanisms/__pycache__/baseline.cpython-311.pyc ADDED
Binary file (498 Bytes).
mechanisms/__pycache__/caching.cpython-311.pyc ADDED
Binary file (835 Bytes).
mechanisms/__pycache__/consensus.cpython-311.pyc ADDED
Binary file (2.23 kB).
mechanisms/__pycache__/constraint.cpython-311.pyc ADDED
Binary file (1.49 kB).
mechanisms/__pycache__/historical.cpython-311.pyc ADDED
Binary file (1.2 kB).
mechanisms/__pycache__/predictability.cpython-311.pyc ADDED
Binary file (1.93 kB).
mechanisms/baseline.py ADDED
@@ -0,0 +1,4 @@
+ from utils.llm_utils import call_model
+
+ def baseline(prompt: str) -> str:
+     return call_model(prompt, "llama-3.1-8b-instant") + " \n⚡ (fresh answer)"
mechanisms/caching.py ADDED
@@ -0,0 +1,11 @@
+ from utils.llm_utils import call_model, get_hash
+ import streamlit as st
+
+ def caching(prompt: str) -> str:
+     key = get_hash(prompt)
+     if key in st.session_state.cache:
+         return st.session_state.cache[key] + " \n✅ (from cache)"
+     else:
+         ans = call_model(prompt)
+         st.session_state.cache[key] = ans
+         return ans + " \n💡 (new answer cached)"
mechanisms/consensus.py ADDED
@@ -0,0 +1,33 @@
+ from utils.llm_utils import call_model
+
+ def judge_consensus(prompt: str, responses: dict, judge_model="llama-3.1-8b-instant") -> str:
+     judge_prompt = f"""
+     You are a judge LLM. The user asked: "{prompt}"
+
+     Here are the responses from different models:
+     {chr(10).join([f"- {m}: {ans}" for m, ans in responses.items()])}
+
+     Task:
+     1. Decide if the answers are essentially saying the same thing.
+     2. If yes, summarize the consensus in 2-3 lines.
+     3. If no, state clearly that there is no consensus and why.
+
+     Answer format:
+     Consensus: <your summary OR "No consensus">
+     """
+     return call_model(judge_prompt, judge_model)
+
+ def cross_model(prompt: str):
+     models = ["openai/gpt-oss-20b", "llama-3.3-70b-versatile"]
+     responses = {}
+     for m in models:
+         try:
+             responses[m] = call_model(prompt, m)
+         except Exception as e:
+             responses[m] = f"⚠️ Error: {str(e)}"
+     consensus = judge_consensus(prompt, responses)
+     out = "### Model Responses:\n"
+     for m, ans in responses.items():
+         out += f"- **{m}**: {ans}\n\n"
+     out += "\n### Judge Decision:\n" + consensus
+     return out
mechanisms/constraint.py ADDED
@@ -0,0 +1,21 @@
+ from pydantic import BaseModel, ValidationError
+ from utils.llm_utils import call_model
+ import json
+
+ class AnswerSchema(BaseModel):
+     answer: str
+
+ def constraint(prompt: str) -> str:
+     schema_instruction = f"""
+     Respond strictly in JSON format:
+     {{
+         "answer": "<your concise answer>"
+     }}
+     User query: {prompt}
+     """
+     raw = call_model(schema_instruction)
+     try:
+         data = AnswerSchema.parse_raw(raw)
+         return json.dumps(data.dict(), indent=2) + "\n✅ Schema valid"
+     except ValidationError as e:
+         return raw + f"\n⚠️ Schema validation failed\n{e}"
mechanisms/historical.py ADDED
@@ -0,0 +1,16 @@
+ import difflib
+ from utils.llm_utils import call_model
+ import streamlit as st
+
+ def historical(prompt: str) -> str:
+     best_match, best_ratio = None, 0.0
+     for old_q, old_a in st.session_state.history.items():
+         ratio = difflib.SequenceMatcher(None, prompt, old_q).ratio()
+         if ratio > best_ratio:
+             best_match, best_ratio = old_q, ratio
+     if best_match and best_ratio > 0.8:
+         return st.session_state.history[best_match] + f"\n✅ Historical match (from: '{best_match}')"
+     else:
+         ans = call_model(prompt)
+         st.session_state.history[prompt] = ans
+         return ans + "\n💡 Stored in history"
mechanisms/predictability.py ADDED
@@ -0,0 +1,16 @@
+ import difflib
+ from utils.llm_utils import call_model
+
+ def predictability(prompt: str, runs: int = 3) -> str:
+     answers = [call_model(prompt) for _ in range(runs)]
+     ratios = []
+     for i in range(len(answers)):
+         for j in range(i + 1, len(answers)):
+             ratio = difflib.SequenceMatcher(None, answers[i], answers[j]).ratio()
+             ratios.append(ratio)
+     score = round(sum(ratios) / len(ratios) * 100, 2)
+     return (
+         f"Answers across {runs} runs:\n" +
+         "\n".join([f"- Run {i+1}: {ans}" for i, ans in enumerate(answers)]) +
+         f"\n\n🔢 Predictability Index: {score}%"
+     )
requirements.txt CHANGED
@@ -1,3 +1,5 @@
- altair
- pandas
- streamlit
+ streamlit
+ python-dotenv
+ openai>=1.30.0
+ pydantic>=1.10,<2.0
utils/__pycache__/llm_utils.cpython-311.pyc ADDED
Binary file (1.04 kB).
utils/llm_utils.py ADDED
@@ -0,0 +1,12 @@
+ import hashlib
+ from config import client
+
+ def call_model(prompt: str, model="llama-3.1-8b-instant") -> str:
+     response = client.responses.create(
+         model=model,
+         input=prompt
+     )
+     return response.output_text.strip()
+
+ def get_hash(text: str) -> str:
+     return hashlib.sha256(text.encode()).hexdigest()