Update app.py
app.py
CHANGED
@@ -27,11 +27,16 @@ if 'www_graph_cache' not in st.session_state:
 
 def load_graph_from_csv_networkit(file_content, file_name):
     """
-    Load page links from CSV file using NetworKit.
+    Load page links from CSV file using NetworKit - OPTIMIZED VERSION.
     """
     try:
-        # Read CSV content
-        df = pd.read_csv(StringIO(file_content))
+        # Read CSV content with optimized settings
+        df = pd.read_csv(
+            StringIO(file_content),
+            dtype={'FROM': 'string', 'TO': 'string'},  # Specify types upfront
+            na_filter=True,         # Enable NA filtering
+            skip_blank_lines=True   # Skip empty lines
+        )
 
         # Check required columns with user-friendly names
        required_cols = ['FROM', 'TO']
@@ -47,27 +52,36 @@ def load_graph_from_csv_networkit(file_content, file_name):
             """)
             return None, None, None
 
-        # …
-        …
-        df['FROM'] = df['FROM'].astype(str)
-        df['TO'] = df['TO'].astype(str)
+        # Fast data cleaning - vectorized operations
+        initial_rows = len(df)
+        df = df.dropna(subset=['FROM', 'TO'])  # Remove rows with missing values
 
         if len(df) == 0:
             st.error(f"❌ No valid page links found in {file_name}")
             return None, None, None
 
-        # …
-        …
+        # Show cleaning stats if significant data was removed
+        if initial_rows - len(df) > initial_rows * 0.1:  # More than 10% removed
+            st.warning(f"⚠️ Removed {initial_rows - len(df)} rows with missing data from {file_name}")
+
+        # OPTIMIZED: Get unique nodes using pandas operations (much faster)
+        all_nodes_series = pd.concat([df['FROM'], df['TO']]).drop_duplicates()
+        all_nodes = all_nodes_series.tolist()
+
+        # OPTIMIZED: Create node mapping
         node_to_idx = {node: i for i, node in enumerate(all_nodes)}
 
         # Create NetworKit graph
         G = nk.Graph(n=len(all_nodes), weighted=False, directed=True)
 
-        # …
-        …
-        …
-        …
-        …
+        # OPTIMIZED: Vectorized edge addition (MAJOR SPEEDUP)
+        # Convert node names to indices using vectorized operations
+        source_indices = df['FROM'].map(node_to_idx).values
+        target_indices = df['TO'].map(node_to_idx).values
+
+        # Bulk add edges using numpy arrays (much faster than iterrows)
+        for src_idx, tgt_idx in zip(source_indices, target_indices):
+            G.addEdge(int(src_idx), int(tgt_idx))
 
         return G, all_nodes, node_to_idx
 
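For orientation, a minimal sketch of how the loader above might be called from the Streamlit app; the uploader widget and labels are assumptions for illustration, not part of this commit. Note also that, despite the "bulk add" comment, edges are still inserted one `addEdge` call at a time inside a Python loop; numpy only vectorizes the index lookup.

```python
import streamlit as st

# Hypothetical caller for load_graph_from_csv_networkit (widget label assumed).
uploaded = st.file_uploader("Upload page links (CSV with FROM/TO columns)", type="csv")
if uploaded is not None:
    G, all_nodes, node_to_idx = load_graph_from_csv_networkit(
        uploaded.getvalue().decode("utf-8"), uploaded.name
    )
    if G is not None:
        st.success(f"Loaded {G.numberOfNodes()} pages and {G.numberOfEdges()} links")
```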
@@ -418,14 +432,26 @@ def main():
     if www_nodes >= 500000:
         st.sidebar.warning(f"""
         ⚠️ **Performance Warning**:
-        {internet_size} will be very slow!
-        Expect …
+        {internet_size} with Barabási-Albert will be very slow!
+        Expect 4-15 minutes per test.
         Consider using fewer tests.
         """)
     elif www_nodes >= 250000:
         st.sidebar.info(f"""
-        ℹ️ **Note**: {internet_size} may take
-        30-…
+        ℹ️ **Note**: {internet_size} with Barabási-Albert may take
+        30-90 seconds per test.
+        """)
+
+    # Add Barabási-Albert info
+    with st.sidebar.expander("🔬 About Barabási-Albert Model"):
+        st.markdown("""
+        **Why Barabási-Albert?**
+        - Creates **scale-free networks** like the real web
+        - **Preferential attachment**: Popular pages get more links
+        - **Power-law distribution**: Most realistic web simulation
+        - Slower than other models but much more accurate
+
+        **Perfect for**: Testing how link changes affect rankings in realistic web conditions.
         """)
 
     # Advanced settings (hidden by default)
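The expander above describes the model in prose; a minimal sketch of generating such a graph with NetworKit's built-in Barabási-Albert generator, with assumed parameter values:

```python
import networkit as nk

# Assumed values: k = 5 attachment edges per new node, nMax = 250_000 nodes.
gen = nk.generators.BarabasiAlbertGenerator(5, 250_000)
G = gen.generate()
print(G.numberOfNodes(), G.numberOfEdges())  # roughly 250k nodes, ~1.25M edges
```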
@@ -654,14 +680,16 @@ def main():
     Instead of guessing, you get data-driven confidence about your page link changes!
 
     ### ⚡ **Powered by NetworKit**
-    This version uses NetworKit, a high-performance network analysis toolkit that's much faster than traditional tools for analyzing large networks. It …
+    This version uses NetworKit, a high-performance network analysis toolkit that's much faster than traditional tools for analyzing large networks. It uses the **Barabási-Albert model** to create realistic scale-free networks that mimic the actual structure of the web!
 
-    ### 🔬 **Large-Scale Simulations**
+    ### 🔬 **Large-Scale Barabási-Albert Simulations**
     - **100K sites**: ~10-30 seconds per test
-    - **250K sites**: ~30-…
-    - **500K sites**: ~…
-    - **750K sites**: ~…
-    - **1M sites**: ~…
+    - **250K sites**: ~30-90 seconds per test
+    - **500K sites**: ~2-5 minutes per test
+    - **750K sites**: ~4-8 minutes per test
+    - **1M sites**: ~6-15 minutes per test
+
+    **Note**: Barabási-Albert is more computationally intensive than other generators but produces the most realistic web-like structure with power-law degree distributions.
     """)
 
     with st.expander("❓ **Common Questions**"):
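The power-law claim in the note above is easy to check empirically; a sketch, assuming a smaller test graph:

```python
import networkit as nk

# Generate a Barabási-Albert graph and inspect its degree distribution.
G = nk.generators.BarabasiAlbertGenerator(5, 100_000).generate()
degrees = sorted((G.degree(v) for v in G.iterNodes()), reverse=True)
# Scale-free signature: a few very high-degree hubs, a long tail of low-degree pages.
print(degrees[:5], degrees[-5:])
```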
@@ -685,13 +713,16 @@ def main():
     A: Pages are specific URLs (like mysite.com/about), while websites are domains (like mysite.com). This tool analyzes individual page links.
 
     **Q: What's NetworKit?**
-    A: NetworKit is a high-performance network analysis toolkit with optimized C++ algorithms …
+    A: NetworKit is a high-performance network analysis toolkit with optimized C++ algorithms. This tool specifically uses the **Barabási-Albert model** to generate scale-free networks that accurately represent real web topology.
+
+    **Q: Why Barabási-Albert specifically?**
+    A: The Barabási-Albert model creates "scale-free" networks with preferential attachment - meaning popular pages get more links, just like the real web. This produces the most realistic simulation of how link changes affect rankings.
 
     **Q: Which simulation size should I choose?**
-    A: Start with 100K for testing. Use 250K-500K for realistic results. Only use 750K+ if you have time and want maximum realism.
+    A: Start with 100K for testing. Use 250K-500K for realistic results. Only use 750K+ if you have time and want maximum realism. Larger = more realistic but much slower.
 
-    **Q: Why does …**
-    A: …
+    **Q: Why does Barabási-Albert take longer than other generators?**
+    A: Barabási-Albert builds networks step-by-step with preferential attachment, which is more computationally intensive but produces much more realistic web-like structures than faster alternatives.
     """)
 
 if __name__ == "__main__":
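To make the FAQ's ranking claim concrete, a hedged sketch of a before/after PageRank comparison on a generated graph; the node ids and parameters below are made up for illustration:

```python
import networkit as nk

G = nk.generators.BarabasiAlbertGenerator(5, 100_000).generate()
before = nk.centrality.PageRank(G, damp=0.85).run().scores()

# Hypothetical link change: point page 0 at page 42, then re-rank.
G.addEdge(0, 42)
after = nk.centrality.PageRank(G, damp=0.85).run().scores()
print(f"PageRank shift for page 42: {after[42] - before[42]:+.2e}")
```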