Abstract
In this work we introduce a postprocessing filter (PostDOCK) that distinguishes true binding ligand-protein complexes from docking artifacts produced by DOCK 4.0.1. PostDOCK is a pattern recognition system that relies on (1) a database of complexes, (2) biochemical descriptors of those complexes, and (3) machine learning tools. We use the Protein Data Bank (PDB) as the structural database of complexes and create diverse training and validation sets from it based on the "families of structurally similar proteins" (FSSP) hierarchy. For the biochemical descriptors, we consider terms from the DOCK score, empirical scoring, and buried solvent accessible surface area. For the machine learners, we use a random forest classifier and logistic regression. Our results were obtained on a test set of 44 structurally diverse protein targets. Our highest performing descriptor combinations obtained ~19-fold enrichment (39 of 44 binding complexes were correctly identified, while only 2 of 44 decoy complexes were accepted), and our best overall accuracy was 92%.
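The reported enrichment can be checked with a short calculation. The sketch below assumes that enrichment is defined as the ratio of the true-positive rate to the false-positive rate, a definition that is consistent with the counts quoted above but is not stated explicitly in this abstract. The function and parameter names are illustrative only.

```python
def enrichment(true_pos, n_binders, false_pos, n_decoys):
    """Ratio of true-positive rate to false-positive rate (assumed definition)."""
    tpr = true_pos / n_binders   # fraction of binding complexes accepted
    fpr = false_pos / n_decoys   # fraction of decoy complexes accepted
    return tpr / fpr


def accuracy(true_pos, n_binders, false_pos, n_decoys):
    """Fraction of all complexes (binders and decoys) classified correctly."""
    true_neg = n_decoys - false_pos
    return (true_pos + true_neg) / (n_binders + n_decoys)


# Counts quoted in the abstract: 39 of 44 binders accepted, 2 of 44 decoys accepted.
fold = enrichment(39, 44, 2, 44)
acc = accuracy(39, 44, 2, 44)
print(f"enrichment: {fold:.1f}-fold")
print(f"accuracy:   {acc:.1%}")
```

With these counts the ratio is 19.5, in line with the quoted ~19-fold enrichment, and the corresponding accuracy is 81 of 88, or about 92%. Whether the 92% figure in the abstract comes from the same descriptor combination is not stated, so the agreement should be read as a consistency check rather than a derivation.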