In this work, we demonstrate a method for building a model from existing high-throughput screening (HTS) data in PubChem and using it to predict new compounds that are likely to be active and therefore merit additional screening.
PubChem bioassay 846 [1] screened for potential anticoagulant therapeutics by identifying inhibitors of factor XIa, which is involved in the blood coagulation mechanism. A classification model with 89% accuracy was created using a support vector machine (SVM) with the Signature molecular descriptor. Approximately 12 million compounds deposited in PubChem, but not present in the factor XIa assay, were virtually screened with the SVM. Based on metrics associated with SVM magnitudes and on the molecular descriptor overlap between candidate molecules and those from bioassay 846, we identified 296 of these 12 million previously untested compounds as predicted actives. Docking studies using the crystal structure of factor XIa were performed on the known actives and on the 296 predicted actives; all of the predicted actives showed binding energies consistent with those of the known actives.
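The selection step described above can be illustrated with a minimal sketch. It assumes that "SVM magnitude" means the size of the SVM decision value and that "descriptor overlap" means the fraction of a candidate's descriptor features already seen in the training assay. The function names, the thresholds `min_margin` and `min_overlap`, and the toy compounds are all hypothetical and are not taken from this work.

```python
def descriptor_overlap(features, training_vocab):
    """Fraction of a candidate's descriptor features also present in the training set."""
    if not features:
        return 0.0
    return len(features & training_vocab) / len(features)


def select_predicted_actives(candidates, training_vocab,
                             min_margin=1.0, min_overlap=0.9):
    """Keep candidates that are both confidently scored and well covered.

    candidates maps a compound id to (decision_value, feature_set).
    Both thresholds are illustrative assumptions.
    """
    selected = []
    for cid, (decision_value, features) in candidates.items():
        overlap = descriptor_overlap(features, training_vocab)
        if decision_value >= min_margin and overlap >= min_overlap:
            selected.append(cid)
    return sorted(selected)


# Toy data: letters stand in for molecular descriptor features.
training_vocab = {"a", "b", "c", "d"}
candidates = {
    "cpd_1": (1.8, {"a", "b", "c"}),  # confident score, fully covered
    "cpd_2": (2.1, {"a", "x", "y"}),  # confident score, mostly unseen features
    "cpd_3": (0.2, {"a", "b"}),       # covered, but close to the decision boundary
}
hits = select_predicted_actives(candidates, training_vocab)
print(hits)
```

Requiring both criteria means that a compound with a large decision value is still rejected when most of its descriptor features fall outside the chemistry the model was trained on.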
A new feature of our approach, relative to current methods [2], is the use of a wrapper method rather than a filter method. The wrapper method allows for recursive cluster inclusion in order to arrive at an improved SVM model. We compare the two approaches in this work.
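One plausible reading of recursive cluster inclusion as a wrapper method is a greedy loop that adds one cluster of features at a time and keeps it only if the classifier's held-out accuracy improves. The sketch below follows that reading and is not the procedure used in this work. To stay dependency-free it substitutes a nearest-centroid classifier for the SVM, and the cluster names, data, and helper functions are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def accuracy(train, test, cols):
    """Held-out accuracy of a nearest-centroid classifier using only columns cols.

    The nearest-centroid classifier is a stand-in for the SVM.
    """
    def project(x):
        return [x[c] for c in cols]

    cents = {}
    for label in sorted({y for _, y in train}):
        cents[label] = centroid([project(x) for x, y in train if y == label])

    def predict(x):
        p = project(x)
        return min(cents, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(p, cents[lab])))

    return sum(predict(x) == y for x, y in test) / len(test)


def recursive_cluster_inclusion(train, test, clusters):
    """Greedily include feature clusters while held-out accuracy improves."""
    selected, best = [], 0.0
    remaining = dict(clusters)
    while remaining:
        base = [c for name in selected for c in clusters[name]]
        scores = {name: accuracy(train, test, base + cols)
                  for name, cols in remaining.items()}
        name = max(sorted(scores), key=scores.get)
        if scores[name] <= best:  # the wrapper criterion: model performance
            break
        selected.append(name)
        best = scores[name]
        del remaining[name]
    return selected, best


# Toy data: columns 0-1 are informative, columns 2-3 are misleading.
train = [([1.0, 1.2, 0.5, 0.1], 1), ([0.9, 1.0, 0.1, 0.6], 1),
         ([0.1, 0.0, 0.6, 0.4], 0), ([0.0, 0.2, 0.4, 0.6], 0)]
test = [([0.8, 0.9, 0.6, 0.5], 1), ([0.2, 0.1, 0.2, 0.3], 0)]
clusters = {"A": [0, 1], "B": [2, 3]}

selected, best = recursive_cluster_inclusion(train, test, clusters)
print(selected, best)
```

A filter method would instead rank clusters by a statistic computed independently of the classifier; the wrapper loop differs in that every inclusion decision is driven by the trained model's own performance.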
In conclusion, a primary HTS can identify new lead compounds, but generally has a very low success rate. The focused database of 296 predicted actives from this work could add significantly to existing knowledge and increase the chance of finding active compounds by building on previous research. The data mining technique described here is not specific to factor XIa and could be applied to any bioassay in the ever-expanding PubChem database.
[1] Factor XIa 1536 pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=846
[2] Yousef, M., et al., "Recursive cluster elimination (RCE) for classification and feature selection from gene expression data", BMC Bioinformatics, 2007, 8, 14.