|
|
--- |
|
|
license: gpl-3.0 |
|
|
metrics: |
|
|
- mae
|
|
pipeline_tag: tabular-regression |
|
|
tags: |
|
|
- chemistry |
|
|
- polymer |
|
|
- neurips |
|
|
- kaggle |
|
|
--- |
|
|
# NeurIPS Open Polymer Prediction 2025 - Complete Solution Documentation |
|
|
|
|
|
## Competition Overview |
|
|
|
|
|
The NeurIPS Open Polymer Prediction 2025 competition challenged participants to predict five polymer properties from SMILES (Simplified Molecular Input Line Entry System) representations:
|
|
- **Tg**: Glass Transition Temperature |
|
|
- **Tc**: Crystallization Temperature |
|
|
- **FFV**: Fractional Free Volume |
|
|
- **Density**: Polymer Density |
|
|
- **Rg**: Radius of Gyration |
|
|
|
|
|
The evaluation metric was weighted Mean Absolute Error (wMAE), with FFV carrying roughly 10 times the weight of the other properties, making it disproportionately important to the final score.
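For intuition, here is a minimal sketch of a weighted MAE. The weights below are purely illustrative (the official metric also includes per-property normalization); FFV is simply given roughly 10 times the weight of the others, and rows with missing labels for a property are skipped.

```python
import numpy as np

def weighted_mae(y_true: dict, y_pred: dict, weights: dict) -> float:
    """Weighted average of per-property MAEs, ignoring missing (NaN) targets."""
    total, weight_sum = 0.0, 0.0
    for prop, w in weights.items():
        t = np.asarray(y_true[prop], dtype=float)
        p = np.asarray(y_pred[prop], dtype=float)
        mask = ~np.isnan(t)               # many rows label only a subset of properties
        if mask.any():
            total += w * np.abs(t[mask] - p[mask]).mean()
            weight_sum += w
    return total / weight_sum

# Illustrative weights only: FFV weighted ~10x more than the other properties.
weights = {"Tg": 1.0, "Tc": 1.0, "FFV": 10.0, "Density": 1.0, "Rg": 1.0}
```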
|
|
|
|
|
## My Approach and Solution Architecture |
|
|
|
|
|
After not getting promising results from Transformer-based and gradient-boosting (GB) models, I developed a multi-model approach that uses a different Graph Neural Network (GNN) architecture for each polymer property, chosen to match that property's characteristics and data distribution.
|
|
|
|
|
### Property-Specific Model Selection |
|
|
|
|
|
#### 1. **Rg and Density** → MyGNN (Custom Implementation)
|
|
- **Location**: `/MY_GNN/` |
|
|
- **Architecture**: Custom Graph Neural Network designed specifically for polymer molecular graphs |
|
|
- **Design Philosophy**: These properties are directly related to molecular structure and spatial arrangements, requiring a custom approach that could capture geometric and topological features effectively |
|
|
- **Key Features**: |
|
|
- Custom message passing mechanism (an illustrative layer is sketched after this list)
|
|
- Specialized node and edge feature representations |
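To make the "custom message passing" idea concrete, here is a hypothetical edge-conditioned message-passing layer in PyTorch. It is a sketch of the general pattern (bond-aware messages summed at each atom, followed by a gated node update), not the actual `MY_GNN` code; the layer name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class EdgeConditionedMPLayer(nn.Module):
    """One round of message passing where messages depend on neighbor and bond features."""

    def __init__(self, node_dim: int, edge_dim: int, hidden: int):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(node_dim + edge_dim, hidden), nn.ReLU())
        self.update = nn.GRUCell(hidden, node_dim)

    def forward(self, x, edge_index, edge_attr):
        # x: [num_atoms, node_dim], edge_index: [2, num_bonds], edge_attr: [num_bonds, edge_dim]
        src, dst = edge_index
        messages = self.msg(torch.cat([x[src], edge_attr], dim=-1))  # one message per bond
        agg = torch.zeros(x.size(0), messages.size(-1), device=x.device)
        agg.index_add_(0, dst, messages)                             # sum messages at each atom
        return self.update(agg, x)                                   # GRU-style node update
```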
|
|
|
|
|
#### 2. **FFV and Tg** → MolecularGNN_SMILES
|
|
- **Location**: `/NIPS_GNN/` |
|
|
- **Base Repository**: [masashitsubaki/molecularGNN_smiles](https://github.com/masashitsubaki/molecularGNN_smiles) |
|
|
- **Architecture**: Graph Neural Network based on learning representations of r-radius subgraphs (molecular fingerprints) |
|
|
- **Design Philosophy**: FFV and Tg are thermal and mechanical properties that correlate well with local chemical environments and substructural patterns |
|
|
- **Key Features**: |
|
|
- Fingerprint representation learning (see the sketch after this list)
|
|
- Proven performance on molecular property prediction tasks |
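The core idea of the r-radius subgraph fingerprints can be illustrated with a Weisfeiler-Lehman-style relabeling: each atom's ID is repeatedly rehashed together with its neighbors' IDs and the connecting bond types, so after r rounds the ID encodes the atom's r-radius environment. The sketch below is a simplification of the scheme in `molecularGNN_smiles`, not a drop-in replacement.

```python
from rdkit import Chem

def r_radius_fingerprints(smiles: str, radius: int = 2):
    """Assign each atom an ID describing its r-radius neighborhood (WL-style relabeling)."""
    mol = Chem.MolFromSmiles(smiles)
    ids = [atom.GetSymbol() for atom in mol.GetAtoms()]   # radius 0: element symbols
    for _ in range(radius):
        new_ids = []
        for atom in mol.GetAtoms():
            neighborhood = sorted(
                (str(bond.GetBondType()), ids[bond.GetOtherAtom(atom).GetIdx()])
                for bond in atom.GetBonds()
            )
            new_ids.append(hash((ids[atom.GetIdx()], tuple(neighborhood))))
        ids = new_ids
    return ids  # one integer ID per atom, later mapped to a learnable embedding

print(r_radius_fingerprints("*CC(*)c1ccccc1"))  # polystyrene repeat unit with '*' link points
```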
|
|
|
|
|
#### 3. **Tc** → DataAugmentation4SmallData (Modified)
|
|
- **Location**: `/DA_GNN/` |
|
|
- **Base Repository**: [hkqiu/DataAugmentation4SmallData](https://github.com/hkqiu/DataAugmentation4SmallData) |
|
|
- **Architecture**: Neural network with data augmentation techniques for small datasets |
|
|
- **Modifications Made**: |
|
|
- Adjusted layer sizes for deeper polymer-specific features |
|
|
- Modified augmentation strategies for chemical data (one common strategy is sketched below)
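As one concrete example of augmentation for small chemical datasets, SMILES enumeration generates several equivalent SMILES strings for the same molecule so the model sees more string-level variety per label. This is a widely used technique and only a stand-in here; the strategies actually used in `DA_GNN` were adapted from the base repository and may differ.

```python
from rdkit import Chem

def enumerate_smiles(smiles: str, n_variants: int = 5):
    """Generate alternative (randomized-order) SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = {
        Chem.MolToSmiles(mol, canonical=False, doRandom=True)  # random atom ordering
        for _ in range(n_variants)
    }
    return sorted(variants)

print(enumerate_smiles("*CC(*)c1ccccc1"))  # several equivalent writings of the same repeat unit
```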
|
|
|
|
|
## Technical Implementation Details |
|
|
|
|
|
### Data Preprocessing |
|
|
- **Input Format**: SMILES strings representing polymer structures |
|
|
- **Graph Construction**: Molecular graphs with atoms as nodes and bonds as edges |
|
|
- **Feature Engineering** using the `rdkit` and `networkx` modules (see the example after this list):
|
|
- Atomic features (element type, hybridization, formal charge, etc.)
- Bond features (bond type, conjugation, ring membership, etc.)
|
|
- Global molecular descriptors where applicable |
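A minimal example of the general pattern: parse the SMILES with `rdkit`, then build a `networkx` graph whose nodes carry atomic features and whose edges carry bond features. The exact feature set varied per model; the attributes below are representative, not exhaustive.

```python
from rdkit import Chem
import networkx as nx

def smiles_to_graph(smiles: str) -> nx.Graph:
    """Convert a SMILES string into a molecular graph with atom and bond features."""
    mol = Chem.MolFromSmiles(smiles)
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(
            atom.GetIdx(),
            element=atom.GetSymbol(),
            hybridization=str(atom.GetHybridization()),
            formal_charge=atom.GetFormalCharge(),
            aromatic=atom.GetIsAromatic(),
        )
    for bond in mol.GetBonds():
        g.add_edge(
            bond.GetBeginAtomIdx(),
            bond.GetEndAtomIdx(),
            bond_type=str(bond.GetBondType()),
            conjugated=bond.GetIsConjugated(),
            in_ring=bond.IsInRing(),
        )
    return g

g = smiles_to_graph("*CC(*)c1ccccc1")  # polystyrene repeat unit
print(g.number_of_nodes(), g.number_of_edges())
```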
|
|
|
|
|
### Model Training Strategy |
|
|
- **Cross-Validation**: 5-fold cross-validation for robust performance estimation |
|
|
- **Optimization**: Adam optimizer with learning rate scheduling |
|
|
- **Loss Function**: Mean Absolute Error (MAE) to match competition metric |
|
|
- **Early Stopping**: Implemented to prevent overfitting (a condensed training-loop sketch follows)
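A condensed sketch of this training strategy is below. It assumes a hypothetical `build_model` factory and full-batch tensors for brevity; the real pipelines train on mini-batches of molecular graphs, but the CV split, Adam with plateau-based LR scheduling, MAE loss, and patience-based early stopping follow the same shape.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def train_cv(X, y, build_model, n_splits=5, max_epochs=300, patience=20, lr=1e-3):
    """5-fold CV with Adam, MAE loss, LR scheduling, and early stopping (hypothetical helper)."""
    fold_scores = []
    for tr, va in KFold(n_splits, shuffle=True, random_state=42).split(X):
        model = build_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=5)
        loss_fn = torch.nn.L1Loss()          # MAE, matching the competition metric
        best, wait = float("inf"), 0
        for epoch in range(max_epochs):
            model.train()
            opt.zero_grad()
            loss = loss_fn(model(X[tr]), y[tr])
            loss.backward()
            opt.step()
            model.eval()
            with torch.no_grad():
                val = loss_fn(model(X[va]), y[va]).item()
            sched.step(val)
            if val < best:                   # early stopping on validation MAE
                best, wait = val, 0
            else:
                wait += 1
                if wait >= patience:
                    break
        fold_scores.append(best)
    return float(np.mean(fold_scores))       # mean validation MAE across folds
```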
|
|
|
|
|
### Performance Achieved |
|
|
- **Best Public LB Score**: 0.065 wMAE, which went on to score **0.083** wMAE on the Private LB, a result among the [top 10 performers](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/leaderboard).
|
|
- **Model Type**: Ensemble of the three GNN approaches, with per-property outputs merged as sketched below
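"Ensemble" here refers to combining the three property-specific pipelines: each predicts its assigned properties, and the outputs are merged into one submission file. A hypothetical sketch (file names are assumptions; the column order should match the sample submission):

```python
import pandas as pd

# Hypothetical prediction files: each specialized pipeline writes its own properties.
preds = {
    "Tg":      pd.read_csv("NIPS_GNN/pred_tg.csv", index_col="id")["Tg"],
    "FFV":     pd.read_csv("NIPS_GNN/pred_ffv.csv", index_col="id")["FFV"],
    "Tc":      pd.read_csv("DA_GNN/pred_tc.csv", index_col="id")["Tc"],
    "Density": pd.read_csv("MY_GNN/pred_density.csv", index_col="id")["Density"],
    "Rg":      pd.read_csv("MY_GNN/pred_rg.csv", index_col="id")["Rg"],
}
submission = pd.DataFrame(preds)    # rows aligned on the shared test id index
submission.to_csv("submission.csv")
```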
|
|
|
|
|
## Repository Structure |
|
|
|
|
|
``` |
|
|
NeurIPS-Open-Polymer-Prediction-2025/
├── MY_GNN/              # Custom GNN for Rg and Density
│   ├── train.py         # Training script
│   ├── inference.py     # Inference script
│   └── trained_models/  # Trained models
├── NIPS_GNN/            # MolecularGNN for FFV and Tg
├── DA_GNN/              # Data Augmentation GNN for Tc
├── notebooks/           # Submitted Jupyter notebooks
├── Datasets/            # Competition and external datasets, plus training scripts
└── README.md            # This file
|
|
``` |
|
|
|
|
|
## What Went Wrong: The Final Day Mistake |
|
|
|
|
|
Despite achieving a strong public leaderboard score of 0.065 with my GNN ensemble, I made a critical error on the final submission day that cost me the competition. |
|
|
|
|
|
On the last day of the competition, influenced by discussion threads suggesting that models performing poorly on the public leaderboard (which used only ~8% of the test data) might perform better on the private leaderboard (the remaining ~92%), I decided to submit a different, inferior model (0.070 public LB score) instead of my best-performing GNN solution.
|
|
|
|
|
### The Reasoning (Flawed) |
|
|
- **Public Leaderboard Overfitting Concerns**: The competition discussion was filled with warnings about leaderboard overfitting due to the small public test split |
|
|
- **Last-Minute Decision**: I chose to submit what I thought was a "safer" model |
|
|
|
|
|
### The Reality |
|
|
- **Private Leaderboard Results**: The model I submitted performed significantly worse on the private leaderboard |
|
|
- **Statistical Truth**: In most Kaggle competitions, the discrepancy between public and private leaderboard scores is only around 5-10%
|
|
- **Lesson Learned**: Strong cross-validation and consistent public performance are usually the best predictors of private performance |
|
|
|
|
|
### Impact |
|
|
This single decision transformed what could have been a successful competition result into a disappointing outcome, despite months of dedicated work and model development. |
|
|
|
|
|
## Key Learnings and Insights
|
|
|
|
|
### Technical Insights |
|
|
1. **Property-Specific Modeling**: Different polymer properties benefit from different neural network architectures |
|
|
2. **Data Augmentation Value**: For properties with limited data (like Tc), augmentation techniques are crucial |
|
|
3. **Ensemble Benefits**: Combining specialized models for different properties improves overall performance |
|
|
|
|
|
### Competition Strategy Insights |
|
|
1. **Trust Your CV**: Strong CV performance is usually the best predictor of final performance |
|
|
2. **Avoid Last-Minute Changes**: Stick to your best-validated approach rather than making strategy changes under pressure |
|
|
3. **Discussion Forum Caution**: While community discussions provide valuable insights, they can also lead to overthinking and poor decisions |
|
|
|
|
|
## Conclusion |
|
|
|
|
|
This competition was both a technical challenge and a lesson in decision-making under pressure. |
|
|
|
|
|
While my GNN ensemble achieved strong performance (0.065 wMAE), the final submission mistake serves as a reminder that technical excellence must be paired with sound strategic judgment. |
|
|
|
|
|
The multi-model approach proved effective, with each specialized GNN architecture capturing different aspects of polymer structure-property relationships. |
|
|
|
|
|
## Resources and References |
|
|
|
|
|
### Competition Links |
|
|
- [NeurIPS Open Polymer Prediction 2025](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025) |
|
|
- [Competition Kaggle Notebook](https://www.kaggle.com/code/fridaycode/neurips-gnn-models) |
|
|
- [Competition Solution WriteUp](https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/writeups/friday-code-gnn-based-solution) |
|
|
|
|
|
### Referenced Repositories |
|
|
- [masashitsubaki/molecularGNN_smiles](https://github.com/masashitsubaki/molecularGNN_smiles) |
|
|
- [hkqiu/DataAugmentation4SmallData](https://github.com/hkqiu/DataAugmentation4SmallData) |
|
|
|
|
|
### Key Papers |
|
|
- "Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences" (Tsubaki et al.) |
|
|
- "Heat-Resistant Polymer Discovery by Utilizing Interpretable Graph Neural Network with Small Data" (Haoke Qiu, Jingying Wang, ...) |
|
|
--- |