Abstract
In spoken utterances, prosody is encoded in the form of pitch accent, intonation, and rhythm, and conveys linguistic and paralinguistic information such as a speaker's emphasis, intent, attitude, and emotion. Humans listening to speech with natural prosody are able to understand the content with low cognitive load and high accuracy. Automatic extraction of prosodic information is necessary for machines to process speech with human levels of proficiency. This dissertation focuses on two kinds of approaches to using prosodic information: symbolic and direct modeling of prosody. We first investigate symbolic modeling of prosody, that is, the symbolic annotation of prosodic events such as pitch accents and prosodic phrase boundary tones. We develop acoustic and lexical/syntactic prosodic models, and combine the two to improve the performance of symbolic annotation of prosodic events. We adopt a semi-supervised approach that uses a co-training algorithm to exploit unlabeled data for prosodic event annotation. We propose a novel labeling and selection scheme for the co-training algorithm to address the compatibility and uncorrelatedness assumptions, which often do not hold in real data. Furthermore, we use this symbolic modeling of prosody to help improve automatic speech recognition performance. Second, as a direct modeling approach, we present a novel technique for detecting a user's interest level in conversations using prosodic cues in combination with other sources of information. Since a listener provides feedback in dialog, we expect the interest level to depend not only on how the person says something (represented by prosodic information), but also on what the person says (represented by lexical information). We develop a decision-level combination system using these two information sources and demonstrate improved performance over relying on a single information source.
We believe this dissertation will contribute to our understanding of prosody in spoken language, and advance the use of prosody in spoken language processing towards the goal of human-like processing of speech by machines.