Abstract
The recent surge in microbial genomic sequencing, combined with the development of high-throughput liquid chromatography-mass-spectrometry-based (LC/LC-MS/MS) proteomics, has raised the question of the extent to which genomic information of one strain or environmental sample can be used to profile proteomes of related strains or samples. Even with decreasing sequencing costs, it remains impractical to obtain genomic sequence for every strain or sample analyzed. Here, we evaluate how shotgun proteomics is affected by amino acid divergence between the sample and the genomic database using a probability-based model and a random mutation simulation model constrained by experimental data. To assess the effects of nonrandom distribution of mutations, we also evaluated identification levels using in silico peptide data from sequenced isolates with average amino acid identities (AAI) varying between 76 and 98%. We compared the predictions to experimental protein identification levels for a sample that was evaluated using a database that included genomic information for the dominant organism and for a closely related variant (95% AAI). The range of models set the boundaries at which half of the proteins in a proteomic experiment can be identified to be 77-92% AAI between orthologs in the sample and database. Consistent with this prediction, experimental data indicated loss of half the identifiable proteins at 90% AAI. Additional analysis indicated a 6.4% reduction of the initial protein coverage per 1% amino acid divergence and total identification loss at 86% AAI. Consequently, shotgun proteomics is capable of cross-strain identifications but avoids most cross-species false positives.
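To make the idea of a probability-based model concrete, the following is a minimal illustrative sketch and not the model used in this work. It assumes that substitutions fall independently and uniformly across residues, that every peptide has the same length, and that a protein counts as identified when at least a fixed number of its peptides match the database exactly. The function names, the peptide length, the number of peptides per protein, and the two-peptide threshold are all hypothetical choices made for illustration.

```python
from math import comb


def peptide_intact_prob(aai: float, length: int) -> float:
    """Probability that a peptide of `length` residues has no substitution.

    Assumption (not from the source): each residue is conserved
    independently with probability `aai` (AAI expressed as a fraction).
    """
    return aai ** length


def protein_identified_prob(
    aai: float, length: int, n_peptides: int, min_peptides: int = 2
) -> float:
    """Probability that at least `min_peptides` of `n_peptides` stay intact.

    Assumption (not from the source): peptides are independent and equally
    long, so the number of intact peptides is binomially distributed.
    """
    p = peptide_intact_prob(aai, length)
    return sum(
        comb(n_peptides, k) * p ** k * (1 - p) ** (n_peptides - k)
        for k in range(min_peptides, n_peptides + 1)
    )


if __name__ == "__main__":
    # Hypothetical parameters: 12-residue peptides, 8 detectable per protein.
    for aai_percent in range(76, 101, 4):
        prob = protein_identified_prob(aai_percent / 100, length=12, n_peptides=8)
        print(f"AAI {aai_percent}%: P(identified) = {prob:.3f}")
```

Under these simplifying assumptions the identification probability falls steeply as AAI decreases, because the chance that a peptide remains unmodified shrinks geometrically with its length. Real mutations are not uniformly distributed, which is why the abstract also describes tests with in silico peptide data from sequenced isolates.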