571j Data Mining PubChem with Signature: Prediction of Biological Activity for Small Molecules

Donald P. Visco¹, Derick C. Weis¹, and Jean-Loup Faulon². (1) Chemical Engineering, Tennessee Technological University, Department of Chemical Engineering, P.O. Box 5013, Cookeville, TN 38505, (2) Computational Biology Department, Sandia National Laboratories, Albuquerque, NM 87185

High-throughput screening (HTS) is a technique for discovering new lead compounds by physically screening a large compound library against a specified biological target. In the past, HTS was available primarily to the pharmaceutical industry. Through the Molecular Libraries Initiative, part of the NIH Roadmap for Medical Research, HTS is now accessible to academic researchers, and the data collected are deposited in a public database called PubChem. The results from more than 1,000 different HTS experiments are currently available for download from PubChem. Cheminformatic tools are crucial for interpreting and using this vast amount of data effectively.

In this work, we demonstrate a method that builds a model from existing HTS data in PubChem and uses it to predict new compounds likely to be active, which can then be prioritized for additional screening.

PubChem bioassay 846 [1] screened for potential anticoagulant therapeutics by identifying inhibitors of factor XIa, which is involved in the blood coagulation mechanism. A classification model with 89% accuracy was created using a support vector machine (SVM) with the Signature molecular descriptor. Approximately 12 million compounds deposited in PubChem, but not present in the factor XIa assay, were virtually screened by the SVM. Based on metrics associated with SVM magnitudes and with the molecular descriptor overlap between candidate molecules and those from bioassay 846, we predicted 296 compounds (from the 12 million not previously tested) to be active. Docking studies using the crystal structure of factor XIa were performed on known actives and on these 296 predicted actives; all of the predicted actives showed binding energies consistent with those of the known actives.
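The two screening criteria described above (SVM magnitude and descriptor overlap) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a linear SVM whose weights are already trained, represents each molecule as a dictionary of fragment counts, and uses hypothetical fragment strings, weights, and thresholds (`min_margin`, `min_overlap`) chosen only for illustration. Real Signature descriptors use a different string format.

```python
from typing import Dict, Set

# Hypothetical sparse descriptor: fragment string -> occurrence count.
Descriptor = Dict[str, int]


def decision_value(weights: Dict[str, float], bias: float, mol: Descriptor) -> float:
    """Linear SVM decision function w.x + b over sparse fragment counts."""
    return sum(weights.get(frag, 0.0) * count for frag, count in mol.items()) + bias


def overlap_fraction(mol: Descriptor, training_fragments: Set[str]) -> float:
    """Fraction of a molecule's fragment occurrences that also appear in the training set."""
    total = sum(mol.values())
    if total == 0:
        return 0.0
    seen = sum(count for frag, count in mol.items() if frag in training_fragments)
    return seen / total


def predicted_active(
    weights: Dict[str, float],
    bias: float,
    training_fragments: Set[str],
    mol: Descriptor,
    min_margin: float = 1.0,    # assumed threshold on the SVM decision value
    min_overlap: float = 0.9,   # assumed threshold on descriptor overlap
) -> bool:
    """Keep a candidate only if it is confidently positive AND well covered by training data."""
    return (
        decision_value(weights, bias, mol) >= min_margin
        and overlap_fraction(mol, training_fragments) >= min_overlap
    )


# Illustrative (made-up) model and candidates.
weights = {"C(=O)N": 0.8, "c1ccccc1": 0.5, "CCl": -0.6}
bias = -0.5
training_fragments = set(weights)

candidate_a = {"C(=O)N": 2, "c1ccccc1": 1}   # high score, fully covered
candidate_b = {"C(=O)N": 3, "[Si]": 2}       # high score, poorly covered
candidate_c = {"CCl": 1, "c1ccccc1": 1}      # negative score

results = [
    predicted_active(weights, bias, training_fragments, m)
    for m in (candidate_a, candidate_b, candidate_c)
]
```

Here only `candidate_a` passes both filters. `candidate_b` scores well but contains fragments the model never saw during training, so its prediction is treated as unreliable. Requiring overlap in addition to a large decision value is one way to restrict predictions to the region of chemical space that the assay data actually cover.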

A new feature of our approach, relative to current methods [2], is the use of a wrapper method (in contrast to a filter method) that allows for recursive cluster inclusion in order to arrive at an improved SVM model. We compare the wrapper and filter approaches in this work.
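The abstract does not spell out the recursive cluster inclusion procedure, so the following Python sketch shows only one plausible reading: a greedy wrapper that adds a cluster of features when doing so improves the cross-validated score of the classifier. Several things here are assumptions made for illustration: the features are taken to be already grouped into clusters, a nearest-centroid classifier stands in for the SVM so that the example needs no external libraries, and the function names and toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def loo_accuracy(X: List[Row], y: List[int], features: List[int]) -> float:
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for an SVM)."""
    if not features:
        return 0.0
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if rows:
                centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        point = [X[i][f] for f in features]
        # Predict the class whose centroid is closest in squared Euclidean distance.
        pred = min(
            centroids,
            key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
        )
        correct += int(pred == y[i])
    return correct / len(X)


def recursive_cluster_inclusion(
    X: List[Row],
    y: List[int],
    clusters: List[List[int]],
    score: Callable[[List[Row], List[int], List[int]], float] = loo_accuracy,
) -> Tuple[List[int], float]:
    """Greedily include the feature cluster that most improves the model score."""
    selected: List[int] = []    # indices of included clusters
    features: List[int] = []    # feature indices used by the current model
    best = 0.0
    remaining = list(range(len(clusters)))
    while remaining:
        trials = [(score(X, y, features + clusters[c]), c) for c in remaining]
        # Highest score wins; ties go to the lowest cluster index.
        top_score, top_cluster = max(trials, key=lambda t: (t[0], -t[1]))
        if top_score <= best:
            break               # no remaining cluster improves the model
        best = top_score
        selected.append(top_cluster)
        features = features + clusters[top_cluster]
        remaining.remove(top_cluster)
    return selected, best


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -3.0], [0.2, 4.0], [0.9, -4.0], [1.0, 5.0], [1.1, -2.0]]
y = [0, 0, 0, 1, 1, 1]
clusters = [[0], [1]]

selected, best = recursive_cluster_inclusion(X, y, clusters)
```

On this toy data the informative cluster is included and the noisy one is left out. The point of the wrapper design is that each cluster is judged by its effect on the classifier itself, whereas a filter method ranks features by a statistic computed independently of the model.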

In conclusion, a primary HTS can identify new lead compounds but generally has a very low success rate. The focused database of 296 predicted actives from this work could add significantly to existing knowledge and increase the chance of finding active compounds by building on previous research. The data mining technique described here is not unique to factor XIa and could be applied to any bioassay in the ever-expanding PubChem database.

[1] Factor XIa 1536 pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=846

[2] Yousef, M., et al., "Recursive cluster elimination (RCE) for classification and feature selection from gene expression data", BMC Bioinformatics, 2007, 8, 14.