Spaces:
Configuration error
Configuration error
| # NeurIPS Open Polymer Prediction 2025 - Complete Solution Documentation | |
| ## Competition Overview | |
| The NeurIPS Open Polymer Prediction 2025 competition was to predict five polymer properties from SMILES (Simplified Molecular Input Line Entry System) representations: | |
| - **Tg**: Glass Transition Temperature | |
| - **Tc**: Crystallization Temperature | |
| - **FFV**: Fractional Free Volume | |
| - **Density**: Polymer Density | |
| - **Rg**: Radius of Gyration | |
| The evaluation metric was weighted Mean Absolute Error (wMAE), with FFV having roughly 10 times the weight of other properties, making it important in the final score. | |
| ## My Approach and Solution Architecture | |
| After not getting promising results from Transformers-based and GB-based models, I developed a comprehensive multi-model approach, using different Graph Neural Network (GNN) architectures optimized for different polymer properties based on their unique characteristics and data distributions. | |
| ### Property-Specific Model Selection | |
| #### 1. **Rg and Density** β MyGNN (Custom Implementation) | |
| - **Location**: `/MY_GNN/` | |
| - **Architecture**: Custom Graph Neural Network designed specifically for polymer molecular graphs | |
| - **Design Philosophy**: These properties are directly related to molecular structure and spatial arrangements, requiring a custom approach that could capture geometric and topological features effectively | |
| - **Key Features**: | |
| - Custom message passing mechanism | |
| - Specialized node and edge feature representations | |
| #### 2. **FFV and Tg** β MolecularGNN_SMILES | |
| - **Location**: `/NIPS_GNN/` | |
| - **Base Repository**: [masashitsubaki/molecularGNN_smiles](https://github.com/masashitsubaki/molecularGNN_smiles) | |
| - **Architecture**: Graph Neural Network based on learning representations of r-radius subgraphs (molecular fingerprints) | |
| - **Design Philosophy**: FFV and Tg are thermal and mechanical properties that correlate well with local chemical environments and substructural patterns | |
| - **Key Features**: | |
| - Fingerprints representation learning | |
| - Proven performance on molecular property prediction tasks | |
| #### 3. **Tc** β DataAugmentation4SmallData (Modified) | |
| - **Location**: `/DA_GNN/` | |
| - **Base Repository**: [hkqiu/DataAugmentation4SmallData](https://github.com/hkqiu/DataAugmentation4SmallData) | |
| - **Architecture**: Neural network with data augmentation techniques for small datasets | |
| - **Modifications Made**: | |
| - Adjusted layer sizes for deeper polymer-specific features | |
| - Modified augmentation strategies for chemical data | |
| ## Technical Implementation Details | |
| ### Data Preprocessing | |
| - **Input Format**: SMILES strings representing polymer structures | |
| - **Graph Construction**: Molecular graphs with atoms as nodes and bonds as edges | |
| - **Feature Engineering** using `rdkit` and `networkx` modules: | |
| - Atomic features (element type, hybridization, formal charge, etc.) and Bond features (bond type, conjugation, ring membership, etc.) | |
| - Global molecular descriptors where applicable | |
| ### Model Training Strategy | |
| - **Cross-Validation**: 5-fold cross-validation for robust performance estimation | |
| - **Optimization**: Adam optimizer with learning rate scheduling | |
| - **Loss Function**: Mean Absolute Error (MAE) to match competition metric | |
| - **Early Stopping**: Implemented to prevent overfitting | |
| ### Performance Achieved | |
| - **Best Public LB Score**: 0.065 wMAE, which later scored **0.083** wMAE on the Private LB, among the [top 10 performers](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/leaderboard). | |
| - **Model Type**: Ensemble of the three GNN approaches | |
| ## Repository Structure | |
| ``` | |
| NeurIPS-Open-Polymer-Prediction-2025/ | |
| βββ MY_GNN/ # Custom GNN for Rg and Density | |
| β βββ train.py # Training Script | |
| β βββ inference.py # Inference Script | |
| β βββ trained_models/ # Trained Models | |
| βββ NIPS_GNN/ # MolecularGNN for FFV and Tg | |
| βββ DA_GNN/ # Data Augmentation GNN for Tc | |
| βββ notebooks/ # Jupyter submitted notebooks | |
| βββ Datasets/ # Competition Datasets and External ones as well, along with training scripts | |
| βββ README.md # This file | |
| ``` | |
| ## What Went Wrong: The Final Day Mistake | |
| Despite achieving a strong public leaderboard score of 0.065 with my GNN ensemble, I made a critical error on the final submission day that cost me the competition. | |
| On the last day of the competition, influenced by discussion threads suggesting that models performing poorly on the public leaderboard (which used only ~8% of test data) might perform better on the private leaderboard (remaining ~92%), I decided to submit a different, inferior model (of 0.070 Public LB Score) instead of my best-performing GNN solution. | |
| ### The Reasoning (Flawed) | |
| - **Public Leaderboard Overfitting Concerns**: The competition discussion was filled with warnings about leaderboard overfitting due to the small public test split | |
| - **Last-Minute Decision**: I chose to submit what I thought was a "safer" model | |
| ### The Reality | |
| - **Private Leaderboard Results**: The model I submitted performed significantly worse on the private leaderboard | |
| - **Statistical Truth**: In most Kaggle competitions, the discrepancy between public and private leaderboards is typically only 5-10% | |
| - **Lesson Learned**: Strong cross-validation and consistent public performance are usually the best predictors of private performance | |
| ### Impact | |
| This single decision transformed what could have been a successful competition result into a disappointing outcome, despite months of dedicated work and model development. | |
| ## Umm, Key Learnings and Insights | |
| ### Technical Insights | |
| 1. **Property-Specific Modeling**: Different polymer properties benefit from different neural network architectures | |
| 2. **Data Augmentation Value**: For properties with limited data (like Tc), augmentation techniques are crucial | |
| 3. **Ensemble Benefits**: Combining specialized models for different properties improves overall performance | |
| ### Competition Strategy Insights | |
| 1. **Trust Your CV**: Strong CV performance is usually the best predictor of final performance | |
| 2. **Avoid Last-Minute Changes**: Stick to your best-validated approach rather than making strategy changes under pressure | |
| 3. **Discussion Forum Caution**: While community discussions provide valuable insights, they can also lead to overthinking and poor decisions | |
| ## Conclusion | |
| This competition was both a technical challenge and a lesson in decision-making under pressure. | |
| While my GNN ensemble achieved strong performance (0.065 wMAE), the final submission mistake serves as a reminder that technical excellence must be paired with sound strategic judgment. | |
| The multi-model approach proved effective, with each specialized GNN architecture capturing different aspects of polymer structure-property relationships. | |
| ## Resources and References | |
| ### Competition Links | |
| - [NeurIPS Open Polymer Prediction 2025](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025) | |
| - [Competition Kaggle Notebook](https://www.kaggle.com/code/fridaycode/neurips-gnn-models) | |
| - [Competition Solution WriteUp](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/writeups/friday-code-gnn-based-solution) | |
| ### Referenced Repositories | |
| - [masashitsubaki/molecularGNN_smiles](https://github.com/masashitsubaki/molecularGNN_smiles) | |
| - [hkqiu/DataAugmentation4SmallData](https://github.com/hkqiu/DataAugmentation4SmallData) | |
| ### Key Papers | |
| - "Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences" (Tsubaki et al.) | |
| - "Heat-Resistant Polymer Discovery by Utilizing Interpretable Graph Neural Network with Small Data" (Haoke Qiu, Jingying Wang, ...) | |
| --- | |