Spark™ delivers a broader range of accessible chemistries than ever before with the addition of SureChEMBL fragment databases

New and updated fragment databases from ChEMBL and SureChEMBL are now available in Spark.
This release significantly expands the chemistry available to Spark users by including the latest compounds from scientific literature, as well as previously unseen compounds from chemical patents. Combined with databases of fragments derived from screening compounds, and with custom databases which you can generate from your corporate collections using the Spark Database Generator, they provide an outstanding source of bioisosteres you can use to generate new ideas for your project.

More than 1.6 million fragments from ChEMBL

The updated Spark ‘ChEMBL’ databases include an expanded choice of more than 1.6 million fragments derived from release 30 of ChEMBL, a collection of around 2.1 million compounds reported in peer-reviewed scientific literature.

More than 14 million fragments derived from SureChEMBL

In addition, new ‘SureChEMBL’ fragment databases are available to
Spark users for the first time. These include more than 14 million
fragments derived from the SureChEMBL collection of approximately
17 million compounds harvested from the patent literature. This addition
significantly expands on the fragment databases derived from ChEMBL,
giving Spark users a broader range of accessible chemistries than ever
before.

Figure 1. Spark Search window for an
R-group replacement experiment, showing default ChEMBL and SureChEMBL
fragment collections with a single attachment point.

Like those in ChEMBL, SureChEMBL compounds come with property information
such as synthetic tractability, toxicity, and metabolic reactivity.
Compounds in the original source collections were filtered on these
biochemical properties prior to fragmentation in Spark, removing
molecules containing potentially toxic or reactive groups. Compounds
were then cleaved by breaking the bonds connecting carbon atoms to
functional groups such as heteroatoms, carbonyls, thiocarbonyls and
rings, whilst preserving functional groups such as carboxylic acids,
nitro groups and rings. All fragments were subject to heavy atom count
and rotatable bond limits, which increases the probability that
fragments will form part of biochemically active small molecules.
Finally, fragments were sorted by frequency of occurrence in the source
dataset and grouped by commonality to form Spark databases. A key
assumption for Spark users is that regularly occurring fragments are
likely to form part of novel active biochemistry; we therefore recommend
that users try the more common libraries before moving on to the less
common ones.

Table 1 reports the total number of fragments in each Spark database,
with the frequency of occurrence in the source dataset. Database
frequencies were arbitrarily chosen to give manageable database file
sizes.

Table 1. Fragment databases sorted by frequency.

Spark category	Database	Total number of fragments (x1000)	Frequency in source dataset
ChEMBL	Common	223	Fragments which appear in more than 12 molecules
	Rare	304	Fragments which appear in 4-12 molecules
	Very Rare	390	Fragments which appear in 2-3 molecules
	Extremely Rare*	783	Fragments which appear in 1 molecule
SureChEMBL	Very Common	509	Fragments which appear in at least 45 molecules
	Common	795	Fragments which appear in 14-44 molecules
	Uncommon	554	Fragments which appear in 8-13 molecules
	Rare*	957	Fragments which appear in 5-7 molecules
	Very Rare*	757	Fragments which appear in 4 molecules
	Extremely Rare*	979	Fragments which appear in 3 molecules
	Doubleton*	2,545	Fragments which appear in 2 molecules
	Singleton*	4,794	Fragments which appear in 1 molecule

*Contact us for further details.
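
The frequency grouping in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Spark's actual implementation: fragments are represented by hypothetical identifier strings, the names `CHEMBL_BINS`, `assign_chembl_database` and `group_fragments` are invented for this example, and only the ChEMBL thresholds from Table 1 are used.

```python
from collections import Counter

# ChEMBL thresholds taken from Table 1: (minimum occurrences, database name),
# listed from most to least common. The names are illustrative only.
CHEMBL_BINS = [
    (13, "Common"),          # more than 12 molecules
    (4, "Rare"),             # 4-12 molecules
    (2, "Very Rare"),        # 2-3 molecules
    (1, "Extremely Rare"),   # 1 molecule
]


def assign_chembl_database(occurrences: int) -> str:
    """Return the Table 1 ChEMBL database name for an occurrence count."""
    if occurrences < 1:
        raise ValueError("a fragment must occur in at least one molecule")
    for minimum, name in CHEMBL_BINS:
        if occurrences >= minimum:
            return name
    raise AssertionError("unreachable: the last bin accepts any count >= 1")


def group_fragments(fragments_per_molecule):
    """Count how many molecules contain each fragment, then bin by frequency."""
    counts = Counter()
    for fragments in fragments_per_molecule:
        counts.update(set(fragments))  # count each fragment once per molecule
    databases = {name: [] for _, name in CHEMBL_BINS}
    # most_common() yields fragments sorted by decreasing frequency
    for fragment, n in counts.most_common():
        databases[assign_chembl_database(n)].append(fragment)
    return databases
```

Calling `group_fragments` with one list of fragment identifiers per source molecule returns a dictionary keyed by database name, with the fragments in each database ordered from most to least frequent.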

The number of fragments per connection point count in each database is presented in Figure 2.

Figure 2. Count of fragments in the
recommended Spark Commercial, ChEMBL and SureChEMBL databases split by
the number of connection points on each fragment.

Because the SureChEMBL databases are very large, we recommend that users
install only the ChEMBL databases where the fragment frequency is at
least 2, together with the three SureChEMBL ‘Common’ databases (Very
Common, Common and Uncommon). Furthermore, singleton and doubleton
databases for both ChEMBL and SureChEMBL may contain fragments derived
from erroneous structures in the source dataset.
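
As a quick check on the size of the recommended set, the counts in Table 1 can be summed. The Python sketch below does only that arithmetic. The names `TABLE_1_THOUSANDS`, `RECOMMENDED` and `recommended_totals` are invented for this illustration, the Spark Commercial databases are left out because Table 1 does not list them, and the totals make no allowance for fragments shared between databases.

```python
# Fragment counts in thousands, copied from Table 1.
TABLE_1_THOUSANDS = {
    "ChEMBL": {
        "Common": 223, "Rare": 304, "Very Rare": 390, "Extremely Rare": 783,
    },
    "SureChEMBL": {
        "Very Common": 509, "Common": 795, "Uncommon": 554, "Rare": 957,
        "Very Rare": 757, "Extremely Rare": 979,
        "Doubleton": 2545, "Singleton": 4794,
    },
}

# Databases named in the recommendation: ChEMBL with frequency >= 2,
# and the three most common SureChEMBL databases.
RECOMMENDED = {
    "ChEMBL": ["Common", "Rare", "Very Rare"],
    "SureChEMBL": ["Very Common", "Common", "Uncommon"],
}


def recommended_totals(table, recommended):
    """Sum the fragment counts (in thousands) of the recommended databases."""
    return {
        category: sum(table[category][name] for name in names)
        for category, names in recommended.items()
    }


totals = recommended_totals(TABLE_1_THOUSANDS, RECOMMENDED)
```

With the Table 1 figures this gives roughly 917,000 ChEMBL fragments and 1,858,000 SureChEMBL fragments in the recommended databases.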

Although ChEMBL and SureChEMBL fragments come from different sources
(scientific literature vs. patents), there is significant overlap
between some of the databases, as shown in Table 2. In particular, the
ChEMBL Common and SureChEMBL Very Common databases show the highest
overlap, which is to be expected because most bioactive compounds
contain basic subunits, e.g., phenol or pyridine rings.

Table 2. Overlap of the most common fragments in the ChEMBL and SureChEMBL databases.

	SureChEMBL
		Very Common	Common	Uncommon	Rare	Very Rare	Extremely Rare	Doubleton	Singleton	Unique
ChEMBL	Common	66%	14%	3%	3%	1%	1%	2%	2%	8%
	Rare	28%	21%	7%	7%	3%	3%	4%	5%	23%
	Very Rare	13%	17%	7%	8%	4%	5%	7%	6%	33%
	Extremely Rare	6%	10%	6%	7%	4%	5%	8%	13%	42%
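
One way to read Table 2 is as a row-wise breakdown: for each ChEMBL database, the percentage of its fragments that also occur in each SureChEMBL database, with ‘Unique’ covering fragments found in none of them. The Python sketch below illustrates that reading. It is an assumption about how such a table could be computed, not a description of how Table 2 was produced. Fragments are represented by hypothetical identifier strings (for example canonical SMILES), and `overlap_row` is a name invented here.

```python
def overlap_row(query_fragments, reference_databases):
    """Percentage of query fragments found in each reference database.

    query_fragments: set of fragment identifiers (e.g. one ChEMBL database).
    reference_databases: dict mapping database name -> set of identifiers,
        assumed disjoint because each fragment falls in one frequency bin.
    Returns a dict of percentages, plus a 'Unique' entry for fragments
    absent from every reference database.
    """
    if not query_fragments:
        raise ValueError("query database is empty")
    total = len(query_fragments)
    row = {}
    matched = set()
    for name, reference in reference_databases.items():
        shared = query_fragments & reference
        matched |= shared
        row[name] = 100.0 * len(shared) / total
    row["Unique"] = 100.0 * len(query_fragments - matched) / total
    return row
```

Applying `overlap_row` to each ChEMBL database in turn would yield one row of the table per call, and under the disjointness assumption the percentages in each row sum to 100.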

Reagent databases

The January update of the Spark reagent databases includes around 293,000 reagents derived from the eMolecules building blocks using an enhanced set of rules for
chemical transformation. These databases are updated monthly to give
you up-to-date availability information, to make it easy for you to
order the reagents you require to synthesize your favorite Spark
results.

Update your Spark databases

The updated ChEMBL and new SureChEMBL databases, combined with the Spark Commercial fragment databases,
provide more than 19 million unique fragments to search, with more than
4.6 million fragments in the recommended databases alone. They
significantly expand the choice of fragments for your Spark experiments,
providing an even better source of novel ideas for your drug discovery
projects.

Spark users can contact our Support team to update their ChEMBL and SureChEMBL databases.

If you are a medicinal chemist or computational chemist and not currently using Spark, contact us to find out how it can help you generate innovative ideas, explore chemical space and escape IP and toxicity traps, or request an evaluation to try Spark on your project.
