Abstract
Semantic vector models generate high-dimensional vector representations of words from their occurrence statistics across large corpora of electronic text. In these models, each occurrence of a word or number is treated as a discrete event, even when the number is a measurement of a continuous property. Furthermore, the sequence in which words occur is often ignored. In earlier work, we developed approaches to address these limitations, using graded demarcator vectors to represent measured distances in high-dimensional space. This permits the incorporation of continuous properties, such as the position of a character within a term or a year of birth, into semantic vector models. In this paper, we extend this work by developing a novel representational approach for protein sequences, in which both the positions and the properties of their amino acid components are represented using graded vectors. Evaluation on a set of around 100,000 immunoglobulin receptor sequences derived from subjects recently infected with West Nile Virus (WNV) suggests that encoding positions and properties using graded vectors increases the similarity between immunoglobulin receptor sequences produced by cells from ancestral lines known to have developed in response to WNV, relative to those from other cell lines.