A BERT-based Dual Embedding Model for Chinese Idiom Prediction

Chinese idioms are fixed phrases that have special meanings usually derived from an ancient story. The meanings of these idioms are oftentimes not directly related to their component characters. In this paper, we propose a BERT-based dual embedding model for the Chinese idiom prediction task, where given a context with a missing Chinese idiom and a set of candidate idioms, the model needs to find the correct idiom to fill in the blank.
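The abstract does not spell out the dual embedding architecture, but the underlying cloze task is easy to make concrete. Below is a minimal sketch using a plain masked language model from HuggingFace Transformers; the model name, the `[BLANK]` marker, and character-level scoring are illustrative assumptions, not the authors' method.

```python
# Sketch of the cloze task (not the paper's dual embedding model): score each
# candidate idiom by the probability a vanilla masked LM assigns to its
# characters at the blank position.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def rank_candidates(context, candidates, blank="[BLANK]"):
    """Return candidate idioms ranked by masked-LM score at the blank."""
    scores = {}
    for idiom in candidates:
        # Replace the blank with one [MASK] per idiom character.
        masked = context.replace(blank, " ".join([tokenizer.mask_token] * len(idiom)))
        inputs = tokenizer(masked, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
        with torch.no_grad():
            log_probs = model(**inputs).logits[0].log_softmax(dim=-1)
        # Sum each character's log-probability at its mask position.
        char_ids = tokenizer.convert_tokens_to_ids(list(idiom))
        scores[idiom] = sum(log_probs[p, c].item() for p, c in zip(mask_pos, char_ids))
    return sorted(candidates, key=scores.get, reverse=True)
```

Note that character-level scoring is a weak baseline precisely because, as the abstract points out, an idiom's meaning is often unrelated to its component characters; that gap is what motivates giving idioms their own embeddings.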
100,000 Podcasts: A Large-Scale Spoken Document Corpus
Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Jussi Karlgren, Ben Carterette and Rosie Jones

As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with Automatic Speech Recognition (ASR), they represent a noisy but fascinating collection of text which can be studied through the lens of NLP, IR, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce a new corpus of 100,000 podcasts, orders of magnitude larger than previous speech corpora used for search and summarization, and demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. Our results show that the size and variability of this corpus open up new avenues for research.

"Judge me by my size (noun), do you?" YodaLib: A Demographic-Aware Humor Generation Framework
Aparna Garimella, Carmen Banea, Nabil Hossain and Rada Mihalcea

The subjective nature of humor makes computerized humor generation a challenging task. We propose an automatic humor generation framework for filling the blanks in Mad Libs® stories while accounting for the demographic backgrounds of the desired audience. We collect a dataset consisting of such stories, which are filled in and judged by carefully selected workers on Amazon Mechanical Turk. We build upon the BERT platform to predict location-biased word fillings in incomplete sentences, and we fine-tune BERT to classify location-specific humor in a sentence. We leverage these components to produce YodaLib, a fully automated Mad Libs-style humor generation framework, which selects and ranks appropriate candidate words and sentences in order to generate a coherent and funny story tailored to certain demographics. Our experimental results indicate that YodaLib outperforms a previous semi-automated approach proposed for this task, while also surpassing human annotators in both qualitative and quantitative analyses.
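The abstract names two BERT components: a filler predictor and a humor classifier. The following is a minimal sketch of the second component only, assuming a binary funny/not-funny label and standard HuggingFace fine-tuning; the model name, label scheme, and learning rate are illustrative assumptions, not details from the paper.

```python
# Sketch of the humor-classification component: fine-tune BERT to label a
# filled-in Mad Libs sentence as funny or not.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(sentences, labels):
    """One fine-tuning step on (filled-in sentence, funny=1 / not funny=0) pairs."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor(labels)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Hypothetical usage: train_step(["The mayor rode a giant pickle to work."], [1])
```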