- MeriDK/AstroM3Dataset
---

AstroM³ is a self-supervised multimodal model for astronomy that integrates time-series photometry, spectra, and metadata into a unified embedding space for classification and other downstream tasks. AstroM³ is trained on [AstroM3Processed](https://huggingface.co/datasets/AstroFOMO/AstroM3Processed), which is the pre-processed version of [AstroM3Dataset](https://huggingface.co/datasets/AstroFOMO/AstroM3Dataset).

For more details on the AstroM³ architecture, training, and results, please refer to the [paper](https://arxiv.org/abs/2411.08842).

<p align="center">
<img src="astroclip-architecture.png" width="100%">
<br />
<span>
Figure 1: Overview of the multimodal CLIP framework adapted for astronomy, incorporating three data modalities: photometric time-series, spectra, and metadata.
</span>
</p>

To use AstroM³ for inference, install the AstroM3 library from our [GitHub repo](https://github.com/MeriDK/AstroM3):

```sh
git clone https://github.com/MeriDK/AstroM3.git
cd AstroM3
source venv/bin/activate
uv pip install -r requirements.txt
```

## A simple example to get started

1. Data Loading & Preprocessing

```python
from datasets import load_dataset
from src.data import process_photometry

# Load the test dataset
test_dataset = load_dataset('AstroFOMO/AstroM3Processed', name='full_42', split='test')

# Process photometry to have a fixed sequence length of 200 (center-cropped)
test_dataset = test_dataset.map(process_photometry, batched=True, fn_kwargs={'seq_len': 200, 'how': 'center'})
```
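The `how='center'` option fixes every light curve to `seq_len` points by keeping its middle. A minimal sketch of what center-cropping means; the helper below is a hypothetical illustration, not the library's `process_photometry`:

```python
import numpy as np

def center_crop(x, seq_len=200):
    # Hypothetical illustration: keep the middle seq_len rows of a
    # (length, features) light curve; shorter curves are returned unchanged.
    n = x.shape[0]
    if n <= seq_len:
        return x
    start = (n - seq_len) // 2
    return x[start:start + seq_len]

curve = np.arange(500).reshape(250, 2)  # toy light curve with 250 time steps
cropped = center_crop(curve, seq_len=200)
print(cropped.shape)  # (200, 2)
```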

```python
import torch
from src.model import AstroM3

# Load the base AstroM3-CLIP model
model = AstroM3.from_pretrained('AstroFOMO/AstroM3-CLIP')

# Retrieve the first sample (batch size = 1)
sample = test_dataset[0:1]
```
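The classification calls below pass `photometry` together with a `photometry_mask`. When a light curve is shorter than `seq_len`, a padding mask tells the encoder which time steps are real. A sketch of that idea; the helper name and the 1-for-real mask convention are assumptions, not the library's API:

```python
import torch

def pad_and_mask(x, seq_len=200):
    # Hypothetical helper: pad (or truncate) a (length, features) light curve
    # to seq_len rows; mask is 1 for real time steps, 0 for padding.
    n = min(x.shape[0], seq_len)
    out = torch.zeros(seq_len, x.shape[1])
    out[:n] = x[:n]
    mask = torch.zeros(seq_len)
    mask[:n] = 1.0
    return out, mask

curve = torch.randn(120, 3)  # a light curve with 120 observed points
photometry, photometry_mask = pad_and_mask(curve)
print(photometry.shape, int(photometry_mask.sum()))  # torch.Size([200, 3]) 120
```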

```python
from src.model import AstroM3, Informer, GalSpecNet, MetaModel

# Photometry classification
photo_model = Informer.from_pretrained('AstroFOMO/AstroM3-CLIP-photo')
prediction = photo_model(photometry, photometry_mask).argmax(dim=1).item()
print('Photometry Classification:', test_dataset.features['label'].int2str(prediction))

# Spectra classification
spectra_model = GalSpecNet.from_pretrained('AstroFOMO/AstroM3-CLIP-spectra')
prediction = spectra_model(spectra).argmax(dim=1).item()
print('Spectra Classification:', test_dataset.features['label'].int2str(prediction))

# Metadata classification
meta_model = MetaModel.from_pretrained('AstroFOMO/AstroM3-CLIP-meta')
prediction = meta_model(metadata).argmax(dim=1).item()
print('Metadata Classification:', test_dataset.features['label'].int2str(prediction))

# Multimodal classification
all_model = AstroM3.from_pretrained('AstroFOMO/AstroM3-CLIP-all')
prediction = all_model(photometry, photometry_mask, spectra, metadata).argmax(dim=1).item()
print('Multimodal Classification:', test_dataset.features['label'].int2str(prediction))
```
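Each classifier above returns one row of logits per sample; `argmax(dim=1).item()` picks the top class index, and the dataset's `ClassLabel` feature (`features['label'].int2str`) maps that index back to a name. A toy version of the same pattern, with made-up class names standing in for the real label set:

```python
import torch

# Toy logits for one sample over three hypothetical variable-star classes
class_names = ['EW', 'SR', 'EA']  # assumed names, for illustration only
logits = torch.tensor([[0.2, 3.1, -0.7]])

pred = logits.argmax(dim=1).item()  # index of the highest-scoring class
print(class_names[pred])  # SR
```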

| Model | Description |
| :--- | :--- |
| [AstroM3-CLIP](https://huggingface.co/AstroFOMO/AstroM3-CLIP) | The base model pre-trained using the trimodal CLIP approach. |
| [AstroM3-CLIP-meta](https://huggingface.co/AstroFOMO/AstroM3-CLIP-meta) | Fine-tuned for metadata-only classification. |
| [AstroM3-CLIP-spectra](https://huggingface.co/AstroFOMO/AstroM3-CLIP-spectra) | Fine-tuned for spectra-only classification. |
| [AstroM3-CLIP-photo](https://huggingface.co/AstroFOMO/AstroM3-CLIP-photo) | Fine-tuned for photometry-only classification. |
| [AstroM3-CLIP-all](https://huggingface.co/AstroFOMO/AstroM3-CLIP-all) | Fine-tuned for multimodal classification. |

## AstroM3-CLIP Variants

These variants of the base AstroM3-CLIP model are trained using different random seeds (42, 0, 66, 12, 123); ensure that the dataset is loaded with the corresponding seed for consistency.

| Model | Description |
| :--- | :--- |
| [AstroM3-CLIP-42](https://huggingface.co/AstroFOMO/AstroM3-CLIP-42) | The base model pre-trained with random seed 42 (identical to AstroM3-CLIP). |
| [AstroM3-CLIP-0](https://huggingface.co/AstroFOMO/AstroM3-CLIP-0) | AstroM3-CLIP pre-trained with random seed 0 (use the dataset with seed 0). |
| [AstroM3-CLIP-66](https://huggingface.co/AstroFOMO/AstroM3-CLIP-66) | AstroM3-CLIP pre-trained with random seed 66 (use the dataset with seed 66). |
| [AstroM3-CLIP-12](https://huggingface.co/AstroFOMO/AstroM3-CLIP-12) | AstroM3-CLIP pre-trained with random seed 12 (use the dataset with seed 12). |
| [AstroM3-CLIP-123](https://huggingface.co/AstroFOMO/AstroM3-CLIP-123) | AstroM3-CLIP pre-trained with random seed 123 (use the dataset with seed 123). |
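Each seeded model must be paired with the matching dataset split. The quickstart loads the seed-42 data as `name='full_42'`; assuming the config names follow that `full_<seed>` pattern (an inference from the example above, not a documented guarantee), the pairing looks like:

```python
# Build the dataset config name that matches a seeded model variant,
# assuming the 'full_<seed>' naming seen in the quickstart.
seed = 66  # for AstroM3-CLIP-66
config_name = f'full_{seed}'
print(config_name)  # full_66

# Then load the matching split (not run here; requires a download):
# from datasets import load_dataset
# ds = load_dataset('AstroFOMO/AstroM3Processed', name=config_name, split='test')
```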

## Using your own data

The data in the AstroM3Processed dataset is already pre-processed, so to use the models on your own objects you must pre-process your data in the same way:

1. **Spectra**: Each spectrum is interpolated to a fixed wavelength grid (3850–9000 Å), normalized using mean and MAD, and log-MAD is added as an auxiliary feature.
2. **Photometry**: Light curves are deduplicated, sorted by time, normalized using mean and MAD, time-scaled to [0, 1], and augmented with auxiliary features like log-MAD and time span.
3. **Metadata**: Scalar metadata is transformed via domain-specific functions (e.g., absolute magnitude, log, sin/cos), then normalized using dataset-level statistics.

For a detailed description, read the [paper](https://arxiv.org/abs/2411.08842). To see exactly how we performed this preprocessing, see [`preprocess.py`](https://huggingface.co/datasets/AstroFOMO/AstroM3Dataset/blob/main/preprocess.py) in the AstroM3Dataset repo.
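The mean-and-MAD normalization used for both spectra and photometry can be sketched as follows. This is a minimal illustration of the idea, not the repo's `preprocess.py`; consult that script for the exact implementation:

```python
import numpy as np

def mad_normalize(flux):
    # Center by the mean and scale by the median absolute deviation (MAD);
    # log-MAD is returned too, since it is kept as an auxiliary feature.
    mean = flux.mean()
    mad = np.median(np.abs(flux - np.median(flux)))
    return (flux - mean) / mad, np.log10(mad)

flux = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
normalized, log_mad = mad_normalize(flux)
print(normalized)  # [-2. -1.  0.  1.  2.]
print(log_mad)     # 0.0
```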