New and updated fragment databases from ChEMBL and SureChEMBL are now available in Spark.
This new release of databases significantly expands the available
chemistry to Spark users by including the latest compounds from
scientific literature, as well as previously unseen compounds from
chemical patents. Combined with databases of fragments derived from screening compounds, and with custom databases which you can generate from your corporate collections using the Spark Database Generator, they provide an outstanding source of bioisosteres you can use to generate new ideas for your project.
More than 1.6 million fragments from ChEMBL
The updated Spark ‘ChEMBL’ databases include an expanded choice of more than 1.6 million fragments derived from release 30 of ChEMBL, a collection of around 2.1 million compounds reported in peer-reviewed scientific literature.
More than 14 million fragments derived from SureChEMBL
In addition, new ‘SureChEMBL’ fragment databases are available to
Spark users for the first time. These include more than 14 million
fragments derived from the SureChEMBL
collection of approximately 17 million compounds harvested from the
patent literature. This addition affords Spark users a significant
expansion on fragment databases derived from ChEMBL, providing a broader
range of accessible chemistries than ever before.
Figure 1. Spark Search window for an
R-group replacement experiment, showing default ChEMBL and SureChEMBL
fragment collections with a single attachment point.
Like ChEMBL, SureChEMBL compounds come with property information such
as synthetic tractability, toxicity, and metabolic reactivity.
Compounds in the original source collections were filtered on these
biochemical properties prior to fragmentation in Spark, removing
molecules containing potentially toxic or reactive groups. Compounds
were then cleaved by breaking the bonds connecting carbon atoms to
functional groups such as heteroatoms, carbonyls, thiocarbonyls and
rings, whilst preserving functional groups such as carboxylic acids,
nitro groups and rings. All fragments were subject to heavy atom count
and rotatable bond limits, which increases the probability that
fragments will form part of biochemically active small molecules.
Finally, fragments were sorted by frequency of occurrence in the source
dataset and grouped by commonality to form Spark databases. A key
assumption for Spark users is that regularly occurring fragments are
likely to form part of novel active biochemistry, therefore we recommend
that users try the more common libraries before moving on to less
Table 1 reports the total number of fragments in each Spark database,
with the frequency of occurrence in the source dataset. Database
frequencies were arbitrarily chosen to give manageable database file
Table 1. Fragment databases sorted by frequency.
|Spark category||Database||Total number of fragments (x1000)||Frequency in source dataset|
|ChEMBL||Common||223||Fragments which appear in more than 12 molecules|
|Rare||304||Fragments which appear in 4-12 molecules|
|Very Rare||390||Fragments which appear in 2-3 molecules|
|Extremely Rare*||783||Fragments which appear in 1 molecule|
|SureChEMBL||Very Common||509||Fragments which appear at least 45 molecules|
|Common||795||Fragments which appear in 14-44 molecules|
|Uncommon||554||Fragments which appear in 8-13 molecules|
|Rare*||957||Fragments which appear in 5-7|
|Very Rare*||757||Fragments which appear in 4 molecules|
|Extremely rare*||979||Fragments which appear in 3 molecules|
|Doubleton*||2,545||Fragments which appear in 2 molecules|
|Singleton*||4,794||Fragments which appear in 1 molecule|
The number of compounds per connection point count in each database is presented in Figure 2.
Figure 2. Count of fragments in the
recommended Spark Commercial, ChEMBL and SureChEMBL databases split by
the number of connection points on each fragment.
We recommend that users only install ChEMBL databases where the
fragment frequency is at least 2, and the three SureChEMBL ‘Common’
databases (Very Common, Common and Uncommon), as the size of the
SureChEMBL databases is very large. Furthermore, singleton and doubleton
databases for both ChEMBL and SureChEMBL may contain fragments derived
from erroneous structures in the source dataset.
Though ChEMBL and SureChEMBL fragments are from different sources
(scientific literature vs. patents), as expected, there is significant
overlap between some of the databases, as shown in Table 2. In
particular, the ChEMBL Common and SureChEMBL Very Common databases show
the highest overlap, which is to be expected because most bioactive
compounds contain basic subunits, e.g., phenol or pyridine rings.
Table 2. Overlap of the most common fragments in the ChEMBL and SureChEMBL databases.
|Very Common||Common||Un-common||Rare||Very Rare||Extremely Rare||Doubleton||Singleton||Unique|
The January update of the Spark reagent databases includes around 293,000 reagents derived from the eMolecules building blocks using an enhanced set of rules for
chemical transformation. These databases are updated monthly to give
you up-to-date availability information, to make it easy for you to
order the reagents you require to synthesize your favorite Spark
Update your Spark databases
The updated ChEMBL and new SureChEMBL databases, combined with the Spark Commercial fragment databases,
provide more than 19 million unique fragments to search, with more than
4.6 million fragments in the recommended databases alone. They
significantly expand the choice of fragments for your Spark experiments,
providing an even better source of novel ideals for your drug discovery
Spark users can contact our Support team to update their ChEMBL and SureChEMBL databases.
If you are a medicinal chemist or computational chemist and not currently using Spark, contact us to find out how it can help you generate innovative ideas, explore chemical space and escape IP and toxicity traps, or request an evaluation to try Spark on your project.