Upload NN_Classification_of_3D_Double_Helix_V0.1.py

V0.1 (manually debugged)
The only significant change is fixing the test_size parameter in the two train_test_split calls. I have also added comments to highlight the fix.
The Bug:
In the V0.0 train_test_split function calls, the parameter was set as:
test_size = 1 - VALIDATION_SPLIT
With VALIDATION_SPLIT = 0.2, this meant test_size = 0.8.
This single line caused two major problems:
1. It inverted the dataset split. Instead of training on 80% of the data and testing on 20%, the model was trained on only 20% of the data and tested on 80%.
2. It starved the model. The "Informed" model is simple and should learn quickly, but with so little training data an unlucky random initialization of weights can cause the optimizer to converge to a completely wrong solution (such as predicting the opposite class). The extreme learning seen in the loss graph combined with the abysmal accuracy is a classic symptom of this: the model found a "perfect" solution for the tiny training set, but that solution was wrong for the general problem.
The original issue was not with the concept but with a simple (and easy to make!) bug in the implementation.
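To make the inversion concrete, here is a minimal sketch of the split arithmetic. It assumes the default dataset size of 2,500 points (N_POINTS_PER_BIN = 25 times Z_BINS = 100) and assumes that train_test_split rounds the test count up and gives the remaining rows to training. split_counts is a hypothetical helper for illustration, not part of the script.

```python
import math


def split_counts(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: assumes the test count is rounded up and the
    remaining samples go to the training set.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


N_SAMPLES = 25 * 100      # N_POINTS_PER_BIN * Z_BINS with the default settings
VALIDATION_SPLIT = 0.2    # V0.0 hyperparameter
TEST_SET_SIZE = 0.2       # V0.1 hyperparameter

# V0.0 passed test_size = 1 - VALIDATION_SPLIT, i.e. 0.8
buggy = split_counts(N_SAMPLES, 1 - VALIDATION_SPLIT)
# V0.1 passes test_size = TEST_SET_SIZE, i.e. 0.2
fixed = split_counts(N_SAMPLES, TEST_SET_SIZE)

print("V0.0 (train, test):", buggy)
print("V0.1 (train, test):", fixed)
```

Under these assumptions V0.0 trains on 500 points and evaluates on 2,000, while V0.1 trains on 2,000 and evaluates on 500.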

Now:
#VALIDATION_SPLIT = 0.2 #commented out
TEST_SET_SIZE = 0.2 # Use 20% of the data for testing/validation

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(X_informed, y, test_size=TEST_SET_SIZE, random_state=RANDOM_STATE)

X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X, y, test_size=TEST_SET_SIZE, random_state=RANDOM_STATE)

# Train the informed model
history_informed = model_informed.fit(X_train_i, y_train_i,
                                      epochs=EPOCHS,
                                      batch_size=BATCH_SIZE,
                                      validation_data=(X_test_i, y_test_i),
                                      verbose=1)

# Train the naive model
history_naive = model_naive.fit(X_train_n, y_train_n,
                                epochs=EPOCHS,
                                batch_size=BATCH_SIZE,
                                validation_data=(X_test_n, y_test_n),
                                verbose=1)
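
Both train_test_split calls rely on the shared random_state to select the same rows for the informed and naive feature sets. A possible alternative that makes the alignment explicit is to split the row indices once and reuse them everywhere. The sketch below uses only the standard library; split_indices and the toy data are hypothetical stand-ins, not code from the script.

```python
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices once and return (train_idx, test_idx)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


# Toy stand-ins for X (raw), X_informed (engineered) and y (labels).
X_raw = [(float(i), 0.0, float(i)) for i in range(10)]
X_feat = [(float(i), 10.0 - i) for i in range(10)]
y = [i % 2 for i in range(10)]

train_idx, test_idx = split_indices(len(y), test_size=0.2, seed=42)

# Every array is indexed with the same row lists, so the splits cannot drift apart.
X_train_n = [X_raw[i] for i in train_idx]
X_train_i = [X_feat[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test_n = [X_raw[i] for i in test_idx]
X_test_i = [X_feat[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

With NumPy arrays the list comprehensions reduce to indexing such as X[train_idx] and X_informed[train_idx], and a single y_train / y_test pair serves both models.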

Files changed (1)

NN_Classification_of_3D_Double_Helix_V0.1.py +522 -0

NN_Classification_of_3D_Double_Helix_V0.1.py ADDED

	@@ -0,0 +1,522 @@

+# =============================================================================
+#
+#   Neural Network Classification of a 3D Double Helix
+#   Proposed by Martial Terran of https://huggingface.co/MartialTerran
+#
+#   This script demonstrates a key concept in machine learning: the power of
+#   feature engineering. It tackles a 3D classification problem where data
+#   is arranged in two intertwining helices.
+#
+#   We will compare two models:
+#   1. The "Naive" Model: A standard Multi-Layer Perceptron (MLP) that receives
+#      raw (x, y, z) coordinates. It struggles to learn the rotational
+#      geometry.
+#   2. The "Informed" Model: A very simple network that receives engineered
+#      features. We transform the (x, y, z) coordinates into the distances
+#      from the point to the center of each helix at that point's z-level.
+#      This "unrolls" the problem, making it trivially easy to solve.
+#
+#
+#=============================================================================
+"""
+V0.1 (manually debugged)
+The only significant change is fixing the test_size parameter in the two train_test_split calls. I have also added comments to highlight the fix.
+The Bug:
+In the V0.0 train_test_split function calls, the parameter was set as:
+test_size = 1 - VALIDATION_SPLIT
+With VALIDATION_SPLIT = 0.2, this meant test_size = 0.8.
+This single line caused two major problems:
+It inverted the dataset split. Instead of training on 80% of the data and testing on 20%, the model was being trained on a tiny 20% of the data and tested on 80%.
+It starved the model. While the "Informed" model is simple and should learn quickly, giving it such a small portion of the data can, with an unlucky random initialization of weights, cause the optimizer to converge to a completely wrong solution (like predicting the opposite class). The extreme learning seen in the loss graph combined with the abysmal accuracy is a classic symptom of this. The model found a "perfect" solution for the tiny training set, but that solution was perfectly wrong for the general problem.
+The original issue was not with the concept, but a simple (and easy to make!) bug in the implementation.
+Now:
+#VALIDATION_SPLIT = 0.2 #commented out
+TEST_SET_SIZE = 0.2  # Use 20% of the data for testing/validation
+X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(X_informed, y, test_size=TEST_SET_SIZE, random_state=RANDOM_STATE)
+X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X, y, test_size=TEST_SET_SIZE, random_state=RANDOM_STATE)
+# Train the informed model
+history_informed = model_informed.fit(X_train_i, y_train_i,
+                                      epochs=EPOCHS,
+                                      batch_size=BATCH_SIZE,
+                                      validation_data=(X_test_i, y_test_i),
+                                      verbose=1)
+# Train the naive model
+history_naive = model_naive.fit(X_train_n, y_train_n,
+                                epochs=EPOCHS,
+                                batch_size=BATCH_SIZE,
+                                validation_data=(X_test_n, y_test_n),
+                                verbose=1)
+"""
+print("# start loading libraries--- Imports ---")
+import os
+import sys
+import zipfile
+import numpy as np
+import tensorflow as tf
+from tensorflow import keras
+from tensorflow.keras import layers
+import matplotlib.pyplot as plt
+from mpl_toolkits.mplot3d import Axes3D
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import classification_report, confusion_matrix
+print("done loading libraries")
+# --- Check for Google Colab Environment for Zipping Results ---
+try:
+    import google.colab
+    IN_COLAB = True
+    print(" Colab detected: IN_COLAB = True")
+except ImportError:
+    IN_COLAB = False
+# ==============================================================================
+# === HYPERPARAMETERS & SETUP ===
+# ==============================================================================
+# --- Data Generation ---
+N_POINTS_PER_BIN = 25      # Number of data points per vertical Z-bin
+Z_BINS = 100               # Number of Z-bins to generate data in (controls length of helix)
+HELIX_RADIUS = 5.0         # The radius of the central helix path
+DATA_CLOUD_RADIUS = 1.5    # The radius of the data cloud around each helix point
+GAP_FACTOR = 1.2           # A factor > 1 to create a gap between class boundaries
+Z_CYCLES = 2.5             # Number of full 360-degree cycles the helices should make
+NOISE_LEVEL = 0.1          # A small amount of random noise to add to all coordinates
+# --- Model & Training ---
+EPOCHS = 40
+BATCH_SIZE = 32
+#VALIDATION_SPLIT = 0.2
+TEST_SET_SIZE = 0.2  # Use 20% of the data for testing/validation
+RANDOM_STATE = 42          # For reproducible train/test splits
+# --- File & Folder Management ---
+DATASET_FOLDER = "dataset"
+PLOTS_FOLDER = "plots"
+DATASET_FILENAME = "double_helix_data.npz"
+DATASET_PATH = os.path.join(DATASET_FOLDER, DATASET_FILENAME)
+# Create output directories if they don't exist
+os.makedirs(DATASET_FOLDER, exist_ok=True)
+os.makedirs(PLOTS_FOLDER, exist_ok=True)
+# ==============================================================================
+# === PART 1: DATA GENERATION & LOADING ===
+# ==============================================================================
+def generate_double_helix_data():
+    """Generates the synthetic 3D double helix dataset."""
+    print("Generating new double helix dataset...")
+    points = []
+    labels = []
+    # Radius boundaries for each class
+    radius_class_0_max = DATA_CLOUD_RADIUS
+    radius_class_1_min = DATA_CLOUD_RADIUS * GAP_FACTOR
+    radius_class_1_max = DATA_CLOUD_RADIUS * (GAP_FACTOR + 1.0)
+    z_values = np.linspace(0, Z_BINS, Z_BINS)
+    for z in z_values:
+        for _ in range(N_POINTS_PER_BIN):
+            # Angular position along the helix
+            angle_rad = 2 * np.pi * Z_CYCLES * z / Z_BINS
+            # Centroid of Helix 1 (Class 0)
+            x1_c = HELIX_RADIUS * np.cos(angle_rad)
+            y1_c = HELIX_RADIUS * np.sin(angle_rad)
+            # Centroid of Helix 2 (Class 1) - 180 degrees out of phase
+            x2_c = -x1_c
+            y2_c = -y1_c
+            # Randomly assign a class
+            label = np.random.randint(0, 2)
+            # Generate a point within the class's data cloud
+            point_angle = np.random.rand() * 2 * np.pi
+            if label == 0:
+                point_radius = np.random.uniform(0, radius_class_0_max)
+                cx, cy = x1_c, y1_c
+            else: # label == 1
+                point_radius = np.random.uniform(radius_class_1_min, radius_class_1_max)
+                cx, cy = x2_c, y2_c
+            px = cx + point_radius * np.cos(point_angle)
+            py = cy + point_radius * np.sin(point_angle)
+            pz = z
+            # Add noise
+            noise = np.random.randn(3) * NOISE_LEVEL
+            points.append([px + noise[0], py + noise[1], pz + noise[2]])
+            labels.append(label)
+    X = np.array(points)
+    y = np.array(labels)
+    print(f"Dataset generated with {len(X)} points.")
+    return X, y
+# --- Main Data Loading/Generation Logic ---
+if os.path.exists(DATASET_PATH):
+    print(f"Loading existing dataset from '{DATASET_PATH}'...")
+    with np.load(DATASET_PATH) as data:
+        X, y = data['X'], data['y']
+    print(f"Dataset loaded with {len(X)} points.")
+else:
+    X, y = generate_double_helix_data()
+    np.savez(DATASET_PATH, X=X, y=y)
+    print(f"Dataset saved to '{DATASET_PATH}'.")
+# --- Visualize the initial dataset ---
+print("\nVisualizing the 3D dataset...")
+fig = plt.figure(figsize=(10, 8))
+ax = fig.add_subplot(111, projection='3d')
+scatter = ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='viridis', marker='.')
+ax.set_xlabel('X Axis')
+ax.set_ylabel('Y Axis')
+ax.set_zlabel('Z Axis')
+ax.set_title('Synthetic Double Helix Dataset')
+legend1 = ax.legend(*scatter.legend_elements(), title="Classes")
+ax.add_artist(legend1)
+plt.savefig(os.path.join(PLOTS_FOLDER, '01_initial_data_3d.png'))
+print("\nClose the 3D dataset plot window to continue the script.")
+plt.show()
+# ==============================================================================
+# === PART 2: THE "INFORMED" MODEL (WITH HELIX KERNEL FEATURES) ===
+# ==============================================================================
+def helix_feature_transform(X_data):
+    """
+    Transforms (x, y, z) into a feature space based on distance to helix centroids.
+    This is the "secret sauce" that makes the problem easy.
+    """
+    X_data = np.asarray(X_data, dtype=float)
+    px, py, pz = X_data[:, 0], X_data[:, 1], X_data[:, 2]
+    # Angular position of the helices at each point's Z-level
+    angle_rad = 2 * np.pi * Z_CYCLES * pz / Z_BINS
+    # Centroid of Helix 1 at each Z-level
+    x1_c = HELIX_RADIUS * np.cos(angle_rad)
+    y1_c = HELIX_RADIUS * np.sin(angle_rad)
+    # Centroid of Helix 2 at each Z-level (180 degrees out of phase)
+    x2_c = -x1_c
+    y2_c = -y1_c
+    # Euclidean distance in the XY plane to each centroid, vectorized over all points
+    dist_to_h1 = np.hypot(px - x1_c, py - y1_c)
+    dist_to_h2 = np.hypot(px - x2_c, py - y2_c)
+    return np.column_stack([dist_to_h1, dist_to_h2])
+print("\n--- Training Model 1: The 'Informed' Model with Helix Features ---")
+# 1. Transform the features
+X_informed = helix_feature_transform(X)
+# 2. Split data
+#X_train_i, X_test_i, y_train, y_test = train_test_split( X_informed, y, test_size=1-VALIDATION_SPLIT, random_state=RANDOM_STATE)
+# ***** FIX: Corrected the test_size parameter *****
+# We now correctly use 80% of data for training and 20% for testing.
+X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
+    X_informed, y, test_size=TEST_SET_SIZE, random_state=RANDOM_STATE
+)
+# 3. Define the simple model
+model_informed = keras.Sequential([
+    layers.Input(shape=(2,), name='informed_input'),
+    layers.Dense(1, activation='sigmoid', name='output')
+], name="Informed_Model")
+model_informed.compile(optimizer='adam',
+                       loss='binary_crossentropy',
+                       metrics=['accuracy'])
+model_informed.summary()
+# 4. Train the model
+history_informed = model_informed.fit(X_train_i, y_train_i,
+                                      epochs=EPOCHS,
+                                      batch_size=BATCH_SIZE,
+                                      validation_data=(X_test_i, y_test_i),
+                                      verbose=1)
+# ==============================================================================
+# === PART 3: THE "NAIVE" MODEL (STANDARD MLP) ===
+# ==============================================================================
+print("\n\n--- Training Model 2: The 'Naive' Model with Raw (x, y, z) ---")
+# 1. Split the original, untransformed data
+# We use the same random_state to ensure the splits are comparable
+#X_train_n, X_test_n, y_train, y_test = train_test_split( X, y, test_size=1-VALIDATION_SPLIT, random_state=RANDOM_STATE)
+# ***** FIX: Corrected the test_size parameter and use distinct y-variables *****
+# Using the same random_state ensures the same data points are in each split.
+X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
+    X, y, test_size=TEST_SET_SIZE, random_state=RANDOM_STATE
+)
+# 2. Define the deeper MLP model
+model_naive = keras.Sequential([
+    layers.Input(shape=(3,), name='naive_input'),
+    layers.Dense(32, activation='relu'),
+    layers.Dense(16, activation='relu'),
+    layers.Dense(1, activation='sigmoid', name='output')
+], name="Naive_Model")
+model_naive.compile(optimizer='adam',
+                    loss='binary_crossentropy',
+                    metrics=['accuracy'])
+model_naive.summary()
+# 3. Train the model
+history_naive = model_naive.fit(X_train_n, y_train_n,
+                                epochs=EPOCHS,
+                                batch_size=BATCH_SIZE,
+                                validation_data=(X_test_n, y_test_n),
+                                verbose=1)
+# ==============================================================================
+# === PART 4: EVALUATION AND COMPARISON ===
+# ==============================================================================
+print("\n\n" + "="*50)
+print("=== MODEL EVALUATION & COMPARISON ===")
+print("="*50)
+# --- Performance Metrics ---
+print("\n--- Model 1 (Informed) Performance ---")
+loss_i, acc_i = model_informed.evaluate(X_test_i, y_test_i, verbose=0)
+print(f"Test Accuracy: {acc_i:.4f}")
+print(f"Test Loss: {loss_i:.4f}")
+y_pred_i = (model_informed.predict(X_test_i) > 0.5).astype("int32")
+print("\nClassification Report:")
+print(classification_report(y_test_i, y_pred_i))
+print("\nConfusion Matrix:")
+print(confusion_matrix(y_test_i, y_pred_i))
+print("\n--- Model 2 (Naive) Performance ---")
+loss_n, acc_n = model_naive.evaluate(X_test_n, y_test_n, verbose=0)
+print(f"Test Accuracy: {acc_n:.4f}")
+print(f"Test Loss: {loss_n:.4f}")
+y_pred_n = (model_naive.predict(X_test_n) > 0.5).astype("int32")
+print("\nClassification Report:")
+print(classification_report(y_test_n, y_pred_n))
+print("\nConfusion Matrix:")
+print(confusion_matrix(y_test_n, y_pred_n))
+# --- Training History Visualization ---
+plt.figure(figsize=(14, 6))
+plt.subplot(1, 2, 1)
+plt.plot(history_informed.history['accuracy'], label='Informed Train Acc')
+plt.plot(history_informed.history['val_accuracy'], label='Informed Val Acc', linestyle='--')
+plt.plot(history_naive.history['accuracy'], label='Naive Train Acc')
+plt.plot(history_naive.history['val_accuracy'], label='Naive Val Acc', linestyle='--')
+plt.title('Model Accuracy Comparison')
+plt.ylabel('Accuracy')
+plt.xlabel('Epoch')
+plt.legend()
+plt.grid(True)
+plt.subplot(1, 2, 2)
+plt.plot(history_informed.history['loss'], label='Informed Train Loss')
+plt.plot(history_informed.history['val_loss'], label='Informed Val Loss', linestyle='--')
+plt.plot(history_naive.history['loss'], label='Naive Train Loss')
+plt.plot(history_naive.history['val_loss'], label='Naive Val Loss', linestyle='--')
+plt.title('Model Loss Comparison')
+plt.ylabel('Loss')
+plt.xlabel('Epoch')
+plt.legend()
+plt.grid(True)
+plt.tight_layout()
+plt.savefig(os.path.join(PLOTS_FOLDER, '02_training_history.png'))
+plt.show()
+# ==============================================================================
+# === PART 5: DECISION BOUNDARY VISUALIZATION ===
+# ==============================================================================
+def plot_decision_boundary_slice(model, X_data, y_data, z_value, title, transform_func=None):
+    """
+    Visualizes the model's decision boundary on a 2D slice of the 3D space.
+    """
+    fig, ax = plt.subplots(figsize=(8, 7))
+    # Create a grid of points in the XY plane
+    x_min, x_max = X_data[:, 0].min() - 1, X_data[:, 0].max() + 1
+    y_min, y_max = X_data[:, 1].min() - 1, X_data[:, 1].max() + 1
+    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 150),
+                         np.linspace(y_min, y_max, 150))
+    # Create 3D points at the specified Z-level
+    grid_points_3d = np.c_[xx.ravel(), yy.ravel(), np.full_like(xx.ravel(), z_value)]
+    # Prepare data for the model (apply transform if necessary)
+    if transform_func:
+        grid_for_model = transform_func(grid_points_3d)
+    else:
+        grid_for_model = grid_points_3d
+    # Get model predictions
+    Z = model.predict(grid_for_model)
+    Z = Z.reshape(xx.shape)
+    # Plot the decision boundary
+    ax.contourf(xx, yy, Z, alpha=0.4, cmap='viridis')
+    # Scatter plot the actual data points near this Z-slice
+    slice_mask = np.abs(X_data[:, 2] - z_value) < 1.0 # Z-levels are spaced about 1.0 unit apart
+    ax.scatter(X_data[slice_mask, 0], X_data[slice_mask, 1], c=y_data[slice_mask],
+               s=20, edgecolor='k', cmap='viridis')
+    ax.set_title(title)
+    ax.set_xlabel('X Axis')
+    ax.set_ylabel('Y Axis')
+    plt.savefig(os.path.join(PLOTS_FOLDER, f"03_{title.replace(' ', '_').replace('=', '')}.png"))
+    plt.show()
+print("\nVisualizing Decision Boundaries at different Z-levels...")
+z_slices = [0, Z_BINS * 0.5, Z_BINS * 0.9]
+for z_slice in z_slices:
+    # Model 1 (Informed)
+    plot_decision_boundary_slice(model_informed, X, y, z_slice,
+                                 title=f"Informed Model Boundary at Z={z_slice:.1f}",
+                                 transform_func=helix_feature_transform)
+    # Model 2 (Naive)
+    plot_decision_boundary_slice(model_naive, X, y, z_slice,
+                                 title=f"Naive Model Boundary at Z={z_slice:.1f}")
+# ==============================================================================
+# === PART 6: FINAL 3D VISUALIZATION OF CLASSIFICATION RESULTS ===
+# ==============================================================================
+def plot_3d_classification_results(model, X_test_raw, y_test, title, transform_func=None):
+    """Plots a 3D scatter plot colored by correct/incorrect classification."""
+    # Prepare test data for the given model
+    if transform_func:
+        X_test_for_model = transform_func(X_test_raw)
+    else:
+        X_test_for_model = X_test_raw
+    # Get predictions
+    y_pred = (model.predict(X_test_for_model) > 0.5).astype("int32").flatten()
+    # Determine correct and incorrect classifications
+    correct_mask = (y_pred == y_test)
+    fig = plt.figure(figsize=(12, 10))
+    ax = fig.add_subplot(111, projection='3d')
+    # Plot correctly classified points (green)
+    ax.scatter(X_test_raw[correct_mask, 0], X_test_raw[correct_mask, 1], X_test_raw[correct_mask, 2],
+               c='green', marker='.', alpha=0.5, label='Correct')
+    # Plot incorrectly classified points (red)
+    ax.scatter(X_test_raw[~correct_mask, 0], X_test_raw[~correct_mask, 1], X_test_raw[~correct_mask, 2],
+               c='red', marker='x', s=50, label='Incorrect')
+    ax.set_xlabel('X Axis')
+    ax.set_ylabel('Y Axis')
+    ax.set_zlabel('Z Axis')
+    ax.set_title(title)
+    ax.legend()
+    plt.savefig(os.path.join(PLOTS_FOLDER, f"04_{title.replace(' ', '_')}.png"))
+    plt.show()
+print("\nVisualizing final classification results on the test set...")
+# Use the 'naive' split's raw X_test_n for both plots to compare on the same data
+plot_3d_classification_results(model_informed, X_test_n, y_test_n,
+                               title="Informed Model Classification Results",
+                               transform_func=helix_feature_transform)
+plot_3d_classification_results(model_naive, X_test_n, y_test_n,
+                               title="Naive Model Classification Results")
+# ==============================================================================
+# === PART 7: FINAL SUMMARY & CONCLUSION ===
+# ==============================================================================
+print("\n\n" + "="*50)
+print("=== FINAL CONCLUSION ===")
+print("="*50)
+print(f"""
+This experiment clearly demonstrates the critical role of feature engineering.
+MODEL 1 (Informed Model):
+- Accuracy: {acc_i:.4f}
+- How it works: We transformed the (x, y, z) coordinates into a new feature
+  space: [distance_to_helix_1, distance_to_helix_2]. In this space, the
+  problem becomes trivial. A point is Class 0 if its distance to helix 1
+  is small, and Class 1 if its distance to helix 2 is small.
+- Result: The model is expected to reach near-perfect accuracy because the
+  data becomes linearly separable in this feature space. The decision
+  boundary plots should show a clean separator between the two clouds at
+  every Z-level, which indicates that the model generalizes across Z.
+MODEL 2 (Naive Model):
+- Accuracy: {acc_n:.4f}
+- How it works: This standard MLP was given only the raw (x, y, z) data.
+  It tried to find a complex, 3D surface to separate the two twisting helices.
+- Result: This model has the harder task. Compare its accuracy above with
+  Model 1: any gap reflects the cost of learning the twisting boundary from
+  raw coordinates. Its decision boundary plots tend to show contorted shapes
+  fitted to the Z-levels seen in training, so it should not be expected to
+  learn the underlying rotational geometry or to generalize beyond that range.
+ANSWER TO THE CORE QUESTION:
+High accuracy classification over an arbitrary range of Z is accomplished
+by transforming the input coordinates into a feature space that reflects the
+inherent geometry of the problem, effectively "unrolling" the helices and
+making the classes easily separable.
+""")
+# ==============================================================================
+# === PART 8: ZIP RESULTS FOR GOOGLE COLAB ===
+# ==============================================================================
+def zip_results_for_colab(plots_folder, dataset_path):
+    """Zips all generated plot files and the dataset for easy download in Colab."""
+    zip_filename = "double_helix_nn_results.zip"
+    files_to_zip = []
+    # Add all plots from the plots folder
+    for filename in os.listdir(plots_folder):
+        if filename.endswith(".png"):
+            files_to_zip.append(os.path.join(plots_folder, filename))
+    # Add the dataset file
+    if os.path.exists(dataset_path):
+        files_to_zip.append(dataset_path)
+    print(f"\nZipping {len(files_to_zip)} result files into '{zip_filename}'...")
+    with zipfile.ZipFile(zip_filename, 'w') as zf:
+        for file in files_to_zip:
+            zf.write(file, os.path.basename(file))
+    print("Zipping complete. Triggering download...")
+    from google.colab import files
+    files.download(zip_filename)
+if IN_COLAB:
+    zip_results_for_colab(PLOTS_FOLDER, DATASET_PATH)
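
The script's conclusion states that the engineered features make the two classes linearly separable. That claim can be checked without training anything. The sketch below is a standard-library reimplementation under stated assumptions: it copies the script's constants, approximates the generator with integer Z-levels and Python's random module, and applies a hand-written rule (predict class 0 when d1 < d2) in place of the trained Keras model. helix_features and sample_point are illustrative names, not functions from the script.

```python
import math
import random

# Constants copied from the script's hyperparameter section.
HELIX_RADIUS = 5.0
DATA_CLOUD_RADIUS = 1.5
GAP_FACTOR = 1.2
Z_CYCLES = 2.5
Z_BINS = 100
NOISE_LEVEL = 0.1


def helix_features(px, py, pz):
    """Return (distance to helix 1, distance to helix 2) in the XY plane."""
    angle = 2 * math.pi * Z_CYCLES * pz / Z_BINS
    x1_c = HELIX_RADIUS * math.cos(angle)
    y1_c = HELIX_RADIUS * math.sin(angle)
    # Helix 2 is 180 degrees out of phase, so its centroid is (-x1_c, -y1_c).
    return (math.hypot(px - x1_c, py - y1_c),
            math.hypot(px + x1_c, py + y1_c))


def sample_point(rng, z):
    """Draw one labelled point, following the generator's recipe (assumed)."""
    label = rng.randint(0, 1)
    angle = 2 * math.pi * Z_CYCLES * z / Z_BINS
    sign = 1.0 if label == 0 else -1.0
    cx = sign * HELIX_RADIUS * math.cos(angle)
    cy = sign * HELIX_RADIUS * math.sin(angle)
    if label == 0:
        r = rng.uniform(0, DATA_CLOUD_RADIUS)
    else:
        r = rng.uniform(DATA_CLOUD_RADIUS * GAP_FACTOR,
                        DATA_CLOUD_RADIUS * (GAP_FACTOR + 1.0))
    theta = rng.uniform(0, 2 * math.pi)
    point = (cx + r * math.cos(theta) + rng.gauss(0, NOISE_LEVEL),
             cy + r * math.sin(theta) + rng.gauss(0, NOISE_LEVEL),
             z + rng.gauss(0, NOISE_LEVEL))
    return point, label


rng = random.Random(0)
# Integer Z-levels 0..99 approximate the script's np.linspace(0, Z_BINS, Z_BINS).
samples = [sample_point(rng, z) for z in range(Z_BINS) for _ in range(5)]

# Hand-written linear rule in feature space: class 0 when d1 < d2, else class 1.
correct = 0
for point, label in samples:
    d1, d2 = helix_features(*point)
    predicted = 0 if d1 < d2 else 1
    correct += int(predicted == label)

accuracy = correct / len(samples)
print(f"Rule 'd1 < d2' accuracy: {accuracy:.3f}")
```

Because the two centroids are 10 units apart while the clouds extend at most about 3.3 units from their own centroid, the rule separates the classes with a wide margin, which is why a single Dense(1) sigmoid layer is enough for the informed model.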