arxiv:2508.20609

Representative Random Sampling of Chemical Space

Published on Aug 28, 2025

Authors:

Abstract

Researchers developed an efficient method to generate unbiased random samples from chemical space representations and estimate molecule counts without enumerating individual molecules.

AI-generated summary

The overwhelming majority of molecules remains unexplored. This is mostly due to the sheer number of them, which prohibits any enumeration of chemical space, the set of all such molecules. In practice, only subsets of chemical space are considered, but those subsets exhibit substantial bias, prohibiting data-driven characterization of chemical space itself. In this work, we provide a method to produce unbiased representative random samples of chemical space without enumeration of constituent molecules and to estimate the number of molecules in any custom chemical space. The approach is applicable to molecules that can be represented as a graph and runs efficiently even for molecules of 30 atoms. We use it to estimate the representativeness of current databases with respect to their underlying chemical space and to establish a necessary criterion for a lower bound of database sizes to be representative of an underlying chemical space.
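The abstract does not describe the algorithm itself, so the following is only a toy analogy, not the authors' method. It illustrates the general idea that a combinatorial set can be counted and sampled uniformly without ever being enumerated, provided the number of completions of each partial object can be computed. The "space" here is the hypothetical set of binary strings of length n with no two adjacent 1s; the names `count` and `sample` are illustrative assumptions.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def count(n: int, prev_one: bool) -> int:
    """Number of valid ways to append n more symbols.

    A string is valid if it never contains "11". `prev_one` says whether
    the symbol just before this suffix was a 1.
    """
    if n == 0:
        return 1
    total = count(n - 1, False)        # next symbol is 0 (always allowed)
    if not prev_one:
        total += count(n - 1, True)    # next symbol is 1 (only after a 0)
    return total


def sample(n: int, rng: random.Random) -> str:
    """Draw one valid string of length n uniformly at random.

    Each symbol is chosen with probability proportional to the number of
    valid completions it leaves open, so every complete string ends up
    equally likely. No list of all strings is ever built.
    """
    symbols = []
    prev_one = False
    for remaining in range(n, 0, -1):
        completions_if_zero = count(remaining - 1, False)
        completions_total = count(remaining, prev_one)
        if rng.randrange(completions_total) < completions_if_zero:
            symbols.append("0")
            prev_one = False
        else:
            symbols.append("1")
            prev_one = True
    return "".join(symbols)


if __name__ == "__main__":
    rng = random.Random(0)
    print("size of toy space for n=30:", count(30, False))
    print("one uniform sample:", sample(30, rng))
```

The same counts that give the size of the toy space also drive the sampler, which is why counting and unbiased sampling tend to come together. Whether the paper uses a counting scheme of this kind for molecular graphs is not stated in the abstract.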

Get this paper in your agent:

hf papers read 2508.20609

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 0
