Space: st192011/Bitnet-Socratic-1-Bit

st192011 committed 4 days ago

Commit 6f0c92d (verified) · 1 Parent(s): a6168f7

Update app.py

Files changed (1)

app.py +53 -31

@@ -9,22 +9,24 @@ if os.path.exists("/content/BitNet"):
 # ==============================================================================
 # CONSTANTS & CONFIGURATION
 # ==============================================================================
-SYSTEM_INSTRUCTION = (
     "You are a Socratic assistant. Do not answer questions directly. "
     "Instead, respond exclusively with 3 deep, reflective questions. "
     "Then generate stop token"
 )
-MODEL_PATH = "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf"
 # ==============================================================================
 # STREAMING ENGINE WITH LOOKAHEAD BUFFER
 # ==============================================================================
-def socratic_streaming_chat(user_query):
     if not user_query.strip():
         yield "Please enter a valid question."
         return
-    formatted_chat_prompt = f"System: {SYSTEM_INSTRUCTION}\nUser: {user_query}\nAssistant:"
     cmd = [
         "python3", "run_inference.py",
@@ -32,13 +34,13 @@ def socratic_streaming_chat(user_query):
         "-p", formatted_chat_prompt,
         "-n", "120",
         "-temp", "0.4",
-        "-t", "2"
     ]
     process = subprocess.Popen(
         cmd,
         stdout=subprocess.PIPE,
-        stderr=subprocess.DEVNULL,  # Keeps system logs hidden
         text=True,
         bufsize=1
     )
@@ -47,6 +49,7 @@ def socratic_streaming_chat(user_query):
     prompt_cleared = False
     LOOKAHEAD_SIZE = 45
     stop_markers = [
         "Stop token", "stop token",
         "Stop.", "stop.",
@@ -61,24 +64,24 @@ def socratic_streaming_chat(user_query):
         accumulator += char
-        # --- THE FIX: SWALLOW THE ECHOED PROMPT ---
         if not prompt_cleared:
             if "Assistant:" in accumulator:
                 prompt_cleared = True
-                # Delete the prompt and keep only what comes after "Assistant:"
                 accumulator = accumulator.split("Assistant:")[-1].lstrip()
-            continue  # Do not yield text until the prompt is fully swallowed
-        # Scan for structural collapse boundaries
         stop_triggered = False
         for marker in stop_markers:
             if marker in accumulator:
                 accumulator = accumulator.split(marker)[0]
                 stop_triggered = True
                 break
         if stop_triggered:
-            process.terminate()
             break
         # Stream text safely outside the trailing boundary window
@@ -92,7 +95,7 @@ def socratic_streaming_chat(user_query):
 # TECHNICAL REPORT MARKDOWN TEXT
 # ==============================================================================
 TECHNICAL_REPORT_MD = """
-## 📋 Project Technical Report: 1-Bit LLM Socratic Refinement Pipeline
 **Architecture Core:** Ternary Quantized (1.58-bit) Matrix Processing
 ---
@@ -106,7 +109,7 @@ The goal of this initiative was to engineer a hyper-lightweight, lightning-fast
 ---
 ### 2. Model Training Matrix & Evaluation Phase
-Our initial strategy focused on fine-tuning custom models directly on our targeted Socratic dataset. The results exposed clear engineering trade-offs between unquantized fine-tuning weight adjustments and customized binary compilation layers:
 | Model Identifier | Architecture Configuration | Operational Performance | Qualitative Evaluation |
 | :--- | :--- | :--- | :--- |
@@ -114,51 +117,69 @@ Our initial strategy focused on fine-tuning custom models directly on our target
 | **st192011/socratic-bitnet-2b** | Quantized Ternary Representation Variant of custom weights. | **Critically Poor** | Suffered extreme degradation. The model experienced severe structural collapse, outputting infinite semantic loops or unreadable token gibberish. |
 #### Analysis of Quantization Collapse
-The stark failure of `st192011/socratic-bitnet-2b` highlights a common hurdle in customized 1-bit AI development. When a model's weights are aggressively compressed down to simple ternary values (-1, 0, 1), the mathematical boundaries become extremely rigid. Standard quantization tools often distort the delicate behavioral traits introduced during fine-tuning, leading to a complete breakdown of language modeling capabilities.
 ---
 ### 3. Strategy Pivot: Pretrained Weights + Structural Prompt Anchoring
-To avoid the quantization bugs of custom fine-tuned weights, we pivoted to an elegant hybrid solution: **combining the official, verified pretrained base weights from Microsoft with precision prompt engineering.**
-We deployed the official `microsoft/bitnet-b1.58-2B-4T-gguf` base model. While this preserved its deep, foundational knowledge base, it introduced a new challenge: **Base models do not natively know when to stop generating.**
 #### The Stop-Token Anchor Hack
-To enforce structure without re-training the model, we modified the `SYSTEM_INSTRUCTION` block to force the model to declare its own stopping point:
 > *"You are a Socratic assistant... Respond exclusively with 3 deep, reflective questions. Then generate stop token"*
-This instruction forces the model's text-prediction engine to anchor itself on a predictable phrase as soon as its linguistic objective is met. Our test iterations confirmed that while the model still experiences trailing token hallucinations (e.g., repeating `Stop. Stop. Stop. Response: 1.`), it prints a recognizable marker *immediately after* providing the high-quality questions.
 ---
 ### 4. Production Pipeline Architecture
-To deliver a flawless user experience, we implemented a **Programmatic UX Stream Filter** running on the host system. This layer completely isolates the user from any underlying engine instability:
-* **The Lookahead Buffer Zone:** The streaming engine retains the trailing 45 characters of generation inside a private memory array, evaluating it for known stop-sequences before releasing clean text to the UI.
-* **Process Resource Reclamation:** The moment a marker is tripped, a background system command kills the active process (`process.terminate()`). This prevents the model from wasting CPU cycles on hallucinated loops, maximizing host performance.
-* **Flawless Formatting Output:** The final user interface performs at near real-time speeds on commodity hardware, delivering clean, high-precision Socratic prompts with zero visual clutter.
 """
 # ==============================================================================
 # GRADIO INTERFACE LAYOUT (TABBED WINDOWS)
 # ==============================================================================
 with gr.Blocks(theme=gr.themes.Soft()) as demo:
-    gr.Markdown("# 🧠 High-Performance 1-Bit Socratic Workspace")
     with gr.Tabs():
         # --- TAB 1: INTERACTIVE APP ---
-        with gr.TabItem("Socratic Assistant"):
-            gr.Markdown("### Real-Time Socratic Exploration")
-            gr.Markdown("Real-time token streaming powered by Microsoft's 1.58-bit BitNet GGUF kernel alongside dynamic programmatic output filtering.")
             with gr.Row():
                 with gr.Column(scale=4):
                     input_text = gr.Textbox(
-                        label="Concept Prompt",
-                        placeholder="What concept do you wish to dissect? (e.g., What makes something responsibility?)",
                         lines=2
                     )
-                    submit_btn = gr.Button("Dissect Concept via Socratic Dialogue", variant="primary")
                 with gr.Column(scale=5):
                     output_text = gr.Textbox(
                         label="Cleaned Real-Time Streaming Output",
@@ -166,8 +187,9 @@ with gr.Blocks(theme=gr.themes.Soft()) as demo:
                         interactive=False
                     )
-            submit_btn.click(fn=socratic_streaming_chat, inputs=input_text, outputs=output_text)
-            input_text.submit(fn=socratic_streaming_chat, inputs=input_text, outputs=output_text)
         # --- TAB 2: TECHNICAL REPORT ---
         with gr.TabItem("Technical Report"):
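
Note on the streaming hunks above: the diff elides the part of the loop that actually releases text to the UI, so the following is only a minimal sketch of how prompt swallowing, stop-marker slicing, and a lookahead window could fit together. The function name `filter_stream` and its parameters are hypothetical and do not appear in `app.py`. It assumes the generator yields the cumulative cleaned text on each step, and it cuts at the earliest marker position rather than the first marker in list order as the loop in the diff does.

```python
from typing import Iterable, Iterator, Sequence


def filter_stream(
    chars: Iterable[str],
    stop_markers: Sequence[str],
    prompt_end: str = "Assistant:",
    lookahead: int = 45,
) -> Iterator[str]:
    """Yield the cleaned text so far, holding back the last `lookahead` characters.

    Hypothetical helper: a standalone version of the loop shown in the diff.
    """
    accumulator = ""
    prompt_cleared = False

    for char in chars:
        accumulator += char

        # Swallow everything up to and including the echoed prompt.
        if not prompt_cleared:
            if prompt_end in accumulator:
                prompt_cleared = True
                accumulator = accumulator.split(prompt_end)[-1].lstrip()
            continue

        # Drop the whitespace that usually follows the prompt label.
        accumulator = accumulator.lstrip()

        # Cut at the earliest stop marker, wherever it sits in the buffer.
        cuts = [accumulator.find(m) for m in stop_markers if m in accumulator]
        if cuts:
            accumulator = accumulator[: min(cuts)]
            break

        # Release only the text that lies outside the trailing window.
        if len(accumulator) > lookahead:
            yield accumulator[:-lookahead]

    # Flush what remains once the stream ends or a marker was hit.
    yield accumulator.rstrip()


if __name__ == "__main__":
    raw = "System: s\nUser: u\nAssistant: Q1? Q2? Q3? Stop token Stop. Stop."
    for partial in filter_stream(raw, ["Stop token", "Stop."], lookahead=10):
        print(repr(partial))
```

For a partially typed marker never to reach the UI, the window has to be at least as long as the longest marker. The 45-character value in the diff leaves a wide margin over `"Stop token"`.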

 # ==============================================================================
 # CONSTANTS & CONFIGURATION
 # ==============================================================================
+MODEL_PATH = "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf"
+DEFAULT_SYSTEM_PROMPT = (
     "You are a Socratic assistant. Do not answer questions directly. "
     "Instead, respond exclusively with 3 deep, reflective questions. "
     "Then generate stop token"
 )
 # ==============================================================================
 # STREAMING ENGINE WITH LOOKAHEAD BUFFER
 # ==============================================================================
+def streaming_chat(user_query, system_prompt):
     if not user_query.strip():
         yield "Please enter a valid question."
         return
+    # Dynamically inject the user's custom system instruction
+    formatted_chat_prompt = f"System: {system_prompt}\nUser: {user_query}\nAssistant:"
     cmd = [
         "python3", "run_inference.py",
         "-p", formatted_chat_prompt,
         "-n", "120",
         "-temp", "0.4",
+        "-t", "2"  # Optimized for Hugging Face free-tier dual-core CPUs
     ]
     process = subprocess.Popen(
         cmd,
         stdout=subprocess.PIPE,
+        stderr=subprocess.DEVNULL,  # Hide system logs
         text=True,
         bufsize=1
     )
     prompt_cleared = False
     LOOKAHEAD_SIZE = 45
+    # These are the markers our Python function uses to slice the text
     stop_markers = [
         "Stop token", "stop token",
         "Stop.", "stop.",
         accumulator += char
+        # --- SWALLOW THE ECHOED PROMPT ---
         if not prompt_cleared:
             if "Assistant:" in accumulator:
                 prompt_cleared = True
                 accumulator = accumulator.split("Assistant:")[-1].lstrip()
+            continue
+        # --- THE CLEANING FUNCTION: Scan for structural boundaries ---
         stop_triggered = False
         for marker in stop_markers:
             if marker in accumulator:
+                # The moment a marker is found, slice the text and trigger the kill switch
                 accumulator = accumulator.split(marker)[0]
                 stop_triggered = True
                 break
         if stop_triggered:
+            process.terminate()  # Hard-kill the engine
             break
         # Stream text safely outside the trailing boundary window
 # TECHNICAL REPORT MARKDOWN TEXT
 # ==============================================================================
 TECHNICAL_REPORT_MD = """
+## 📋 Technical Report: 1-Bit LLM Socratic Refinement Pipeline
 **Architecture Core:** Ternary Quantized (1.58-bit) Matrix Processing
 ---
 ---
 ### 2. Model Training Matrix & Evaluation Phase
+Our initial strategy focused on fine-tuning custom models directly on our targeted Socratic dataset. The results exposed clear engineering trade-offs:
 | Model Identifier | Architecture Configuration | Operational Performance | Qualitative Evaluation |
 | :--- | :--- | :--- | :--- |
 | **st192011/socratic-bitnet-2b** | Quantized Ternary Representation Variant of custom weights. | **Critically Poor** | Suffered extreme degradation. The model experienced severe structural collapse, outputting infinite semantic loops or unreadable token gibberish. |
 #### Analysis of Quantization Collapse
+The stark failure of `st192011/socratic-bitnet-2b` highlights a common hurdle in customized 1-bit AI development. When a model's weights are aggressively compressed down to simple ternary values (-1, 0, 1), the mathematical boundaries become extremely rigid. Standard quantization tools often distort the delicate behavioral traits introduced during fine-tuning.
 ---
 ### 3. Strategy Pivot: Pretrained Weights + Structural Prompt Anchoring
+To avoid the quantization bugs of custom fine-tuned weights, we pivoted to a hybrid solution: **combining the official pretrained base weights from Microsoft with precision prompt engineering.**
+We deployed `microsoft/bitnet-b1.58-2B-4T-gguf`. While this preserved its foundational knowledge base, it introduced a new challenge: **Base models do not natively know when to stop generating.**
 #### The Stop-Token Anchor Hack
+To enforce structure, we modified the System Prompt to force the model to declare its own stopping point:
 > *"You are a Socratic assistant... Respond exclusively with 3 deep, reflective questions. Then generate stop token"*
+This instruction forces the text-prediction engine to anchor itself on a predictable phrase. While the model still experiences trailing hallucinations, it prints a recognizable marker *immediately after* providing the high-quality questions.
 ---
 ### 4. Production Pipeline Architecture
+To deliver a flawless UX, we implemented a **Programmatic UX Stream Filter**:
+* **The Lookahead Buffer Zone:** The streaming engine retains the trailing 45 characters inside a private memory array, evaluating it for known stop-sequences before releasing clean text to the UI.
+* **Process Resource Reclamation:** The moment a marker is tripped, a background system command kills the active process (`process.terminate()`).
 """
 # ==============================================================================
 # GRADIO INTERFACE LAYOUT (TABBED WINDOWS)
 # ==============================================================================
 with gr.Blocks(theme=gr.themes.Soft()) as demo:
+    gr.Markdown("# 🧠 High-Performance 1-Bit AI Sandbox")
     with gr.Tabs():
         # --- TAB 1: INTERACTIVE APP ---
+        with gr.TabItem("Experimental Interface"):
+            gr.Markdown("### Real-Time 1.58-bit Prompting Sandbox")
+            gr.Markdown("Test the limits of Microsoft's BitNet GGUF kernel. Change the persona, modify the rules, and see how the Python cleaning function reacts in real-time.")
+            with gr.Row():
+                with gr.Column(scale=1):
+                    gr.Markdown("### 🛠️ The \"Stop Token\" Hack")
+                    gr.Markdown(
+                        "**Base models don't know how to stop talking!**\n\n"
+                        "To prevent infinite loops, our system prompt instructs the model to literally type the words `Stop token` when it is finished. "
+                        "Our Python backend uses a **Lookahead Buffer** to watch for those words. If it sees them, it instantly slices them out and kills the engine.\n\n"
+                        "*🧪 Try deleting the words `'Then generate stop token'` from the prompt below and see what happens!*"
+                    )
+                with gr.Column(scale=2):
+                    system_prompt_input = gr.Textbox(
+                        label="System Instruction (Editable)",
+                        value=DEFAULT_SYSTEM_PROMPT,
+                        lines=3
+                    )
+            gr.Markdown("---")
             with gr.Row():
                 with gr.Column(scale=4):
                     input_text = gr.Textbox(
+                        label="User Query",
+                        placeholder="e.g., What makes something responsibility?",
                         lines=2
                     )
+                    submit_btn = gr.Button("Generate Response", variant="primary")
                 with gr.Column(scale=5):
                     output_text = gr.Textbox(
                         label="Cleaned Real-Time Streaming Output",
                         interactive=False
                     )
+            # Wire up the inputs to include the system prompt
+            submit_btn.click(fn=streaming_chat, inputs=[input_text, system_prompt_input], outputs=output_text)
+            input_text.submit(fn=streaming_chat, inputs=[input_text, system_prompt_input], outputs=output_text)
         # --- TAB 2: TECHNICAL REPORT ---
         with gr.TabItem("Technical Report"):
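
Note on the `process.terminate()` call in this commit: the new comment describes it as a hard kill, but on POSIX systems `Popen.terminate()` sends SIGTERM, which a child process can handle or ignore, and the diff does not show the process being waited on afterwards. The sketch below shows one possible way to stop and reap the child. The helper name `stop_process` and the two-second grace period are assumptions for illustration, not part of `app.py`.

```python
import subprocess
import sys


def stop_process(process: subprocess.Popen, grace_seconds: float = 2.0) -> int:
    """Ask a child process to exit, escalate if it does not, and reap it.

    Hypothetical helper; returns the child's exit code.
    """
    if process.poll() is None:  # Child is still running.
        process.terminate()  # SIGTERM on POSIX: a request, not a forced kill.
        try:
            process.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            process.kill()  # SIGKILL on POSIX: cannot be caught or ignored.
            process.wait()

    # Close the pipe so the file descriptor is released promptly.
    if process.stdout is not None:
        process.stdout.close()
    return process.returncode


if __name__ == "__main__":
    # Stand-in for the inference engine: a child that would sleep for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        stdout=subprocess.PIPE,
        text=True,
    )
    print("exit code:", stop_process(child))
```

Waiting after `terminate()` avoids leaving a zombie process behind on each request, which could matter on a small shared host such as the dual-core tier mentioned in the diff.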