[{'abstract': 'We release a Python module containing several tools to build analogical grids from words contained in a corpus. The module implements several previously presented algorithms. The tools are language-independent. This permits their use with any language and any writing system. We hope that the tools will ease research in morphology by allowing researchers to automatically obtain structured representations of the vocabulary contained in corpora or linguistic data. We also release analogical grids built on the vocabularies contained in 1,000 corresponding lines of the 11 different language versions of the Europarl corpus v.3. The grids were built on N-grams of different lengths, from words to 6-grams. We hope that the use of structured parallel data will foster research in comparative linguistics. Keywords: Analogy, morphology, analogical grids

1. Introduction

Paradigm tables are known for their usefulness in learning conjugation or declension when studying a language. Such paradigm tables are the result of grammatical tradition or thorough linguistic formalization. They are commonly found in dictionaries and constructed from lexemes and exponents, like the one shown in Figure 2 (left), which is taken from a French & English dictionary (Mansion, 1981, grey section, p. 1). Analogical grids are not paradigm tables, but they also give a compact view of the organization of a lexicon, up to a certain extent (see below, Section 2.1.). Analogical grids are the result of an empirical procedure. They may be seen as a preliminary step towards the production of paradigm tables. Figure 2 (right) shows an example of an analogical grid in English. They can be used to study the productivity of a language (Singh and Ford, 2000; Neuvel and Fulop, 2002; Hathout, 2008). Fam and Lepage (2016) performed such an analysis across 12 languages using analogical grids built from the Bible corpus (Christodouloupoulos, 2015). As another example of use, Hathout (2009) showed how to produce the French word form rectification by analogy from the neighboring word forms fructifier, fructification, and rectifier in the same series (see Figure 1).

[Figure 1: Producing a new word form from neighboring word forms: fructifier : fructification :: rectifier : rectification, where fructifier and fructification belong to the same family, and fructification and rectification to the same series. Example taken from (Hathout, 2009).]

This paper introduces the release of a Python module which implements previously presented algorithms that automatically build analogical grids. We also release analogical clusters and analogical grids produced from a parallel corpus of 11 European languages using this Python module.', 'seq': 0.0}, {'Main Usages of the Tools Released': ['We release an implementation of previously presented algorithms to produce analogical grids as a Python 2 module called Nlg (lepage-lab.ips.waseda.ac.jp/nlg-module). The various algorithms have been presented elsewhere (Lepage, 1998; Lepage, 2014; Fam and Lepage, 2016). One particular program in this module, Words2Grids, simply takes a list of word forms as input and delivers a list of analogical grids. Each word form in the list is converted into a feature vector before analogical grids are constructed from such feature vectors. The module also provides another program, Words2Vectors, to produce feature vector representations either directly from word forms or from descriptions of word forms.
The following sections introduce several ways to use the Python module.', 'The default usage of the module is to produce analogical grids from a list of word forms. An analogical grid is a matrix of words in which any four words taken from two rows and two columns form a proportional analogy. Formula (1) gives the definition of an analogical grid:

$$ \begin{matrix} P_1^1 : P_1^2 : \cdots : P_1^m \\ P_2^1 : P_2^2 : \cdots : P_2^m \\ \vdots \quad\;\; \vdots \quad\;\; \vdots \\ P_n^1 : P_n^2 : \cdots : P_n^m \end{matrix} \iff \forall (i,k) \in \{1,\ldots,n\}^2, \; \forall (j,l) \in \{1,\ldots,m\}^2, \quad P_i^j : P_i^l :: P_k^j : P_k^l \quad (1) $$

Analogy is defined from feature vectors representing word forms, through equality of ratios. A ratio is the difference between two feature vectors plus the edit distance between the word forms. We refer the reader to (Fam and Lepage, 2016) for exact definitions.

[Figure 2: A paradigm table (left) taken from a French & English dictionary (Mansion, 1981) and an analogical grid (right) obtained by our tools (and reduced to a few rows for lack of space).
Left (paradigm table):
                Infinitive  Preterit  Past participle  Present participle
Regular verb    walk        walked    walked           walking
                smoke       smoked    smoked           smoking
Irregular verb  write       wrote     written          writing
                think       thought   thought          thinking
Right (analogical grid):
walk  : walks : walking  : walked
show  : shows : showing  : showed
open  : opens : opening  :
study :       : studying :
play  :       : playing  : played]

In this setting, a word form is represented as a vector of features which are simply the numbers of occurrences of all the characters in the alphabet. For instance, in lowercase English, the dimension of the vector is 26 (from a to z), as illustrated in Formula (2). Here, the notation $|A|_c$ stands for the number of occurrences of character c in string A.

$$ A = \begin{pmatrix} |A|_a \\ |A|_b \\ \vdots \\ |A|_z \end{pmatrix} \qquad walking = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \quad (2) $$

The right part of Figure 2 shows a few lines of an analogical grid that has been obtained on a list of English word forms with the program Words2Grids.

2.2. From Morphological Features to Paradigm Tables

The previous use of the released tools automatically converts word forms into specific feature vectors. In contrast to that, it is possible for the user to produce real paradigm tables from feature vectors standing for actual morphological features, like lemma, part-of-speech, case, or tense. Such feature vectors can be built, for instance, from the Unimorph Project (Kirov et al., 2016) data, which have been built by parsing Wiktionary data into a language-independent feature schema (Sylak-Glassman et al., 2015b; Sylak-Glassman et al., 2015a). Formula (3) illustrates the representation of the word form walking: its lemma is to walk and it has the verb (VB), present (PRST) and participle (PTCP) tags as morphological features. For the purpose of inner processing, the labels are converted into Boolean values.

$$ A = \begin{pmatrix} \mathrm{lemma{=}to\;walk}(A) \\ \mathrm{is\_VB}(A) \\ \mathrm{is\_NN}(A) \\ \mathrm{is\_PRST}(A) \\ \vdots \\ \mathrm{is\_PTCP}(A) \end{pmatrix} \qquad walking = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad (3) $$

Figure 3 shows an example of a paradigm table built from a list of word forms described by morphological features. For some lemmas, some cells may be empty. These lemmas are not necessarily defective; this simply means that the forms did not appear in the input data. It should be stressed that the names of the morphological features are not shown in the paradigm tables output by our programs.

[Figure 3: A paradigm table built from a list of word forms annotated with morphological features in English:
initiate : initiated : initiates
undercry :           : undercries
mummify  : mummified :
tole     : tollen    :]
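To make the definitions above concrete, here is a minimal sketch in Python (ours, for illustration; it is not the code of the released Nlg module) that tests a proportional analogy between four word forms using the character-count vectors of Formula (2). Following the definition recalled above from (Fam and Lepage, 2016), a ratio combines the difference between the two feature vectors with the edit distance between the forms; the function names are ours.

    from collections import Counter

    def edit_distance(a, b):
        # Plain Levenshtein distance by dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def ratio(a, b):
        # A ratio between two word forms: the difference between their
        # character-count vectors (Formula (2)) plus the edit distance.
        diff = Counter(a)
        diff.subtract(Counter(b))
        return ({c: n for c, n in diff.items() if n != 0}, edit_distance(a, b))

    def is_analogy(a, b, c, d):
        # A : B :: C : D holds when the two ratios are equal.
        return ratio(a, b) == ratio(c, d)

    print(is_analogy('walk', 'walking', 'open', 'opening'))   # True
    print(is_analogy('walk', 'walking', 'think', 'thought'))  # False

With such a test in hand, stacking word forms whose pairwise ratios are compatible row by row yields grids of the kind defined by Formula (1).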
2.3. From User-Defined Features to New Types of Grids

It is possible to directly use user-defined features, i.e., richer vector representations of word forms, as input to the released programs. As an example, the feature vector shown in Formula (4) concatenates the two types of vectors presented in the previous sections.

$$ A = \begin{pmatrix} \mathrm{lemma{=}to\;walk}(A) \\ |A|_a \\ \vdots \\ |A|_z \\ \mathrm{is\_VB}(A) \\ \mathrm{is\_NN}(A) \\ \mathrm{is\_PRST}(A) \\ \vdots \\ \mathrm{is\_PTCP}(A) \end{pmatrix} \qquad walking = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 0 \\ 1 \\ 0 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad (4) $$

From such feature vectors, the programs output regular paradigm tables (see Figure 4 for an example). We hope that researchers will freely define their own types of feature vector representations to build analogical grids that correspond to their own needs.

[Figure 4: A regular paradigm table built from combining the two previous feature vectors:
initiate  : initiated  : initiates
stage     : staged     :
elucidate : elucidated :
assume    : assumed    :]

2.4. An Example of How to Use the Tools

In this section, we show an example of how to use the tools to produce analogical grids from a text. The following illustration is taken from (Fam and Lepage, 2017). Consider that we have a text as shown at the top of Figure 6. It is a forged example in Indonesian, a language known for the richness of its derivational morphology. Using a tokenizer, we can get a list of tokens and then obtain a set of words (types) from it, as shown in the bottom part of Figure 6.

[Figure 5: Analogical grids in Indonesian extracted from the set of words at the bottom of Figure 6, shown as RAW output (top) and printed using the pretty print option (bottom).
RAW output (anto.grid.txt):
minum : meminum : diminum : minuman :: makan : memakan : dimakan : makanan :: main : None : None : mainan :: beli : None : dibeli : None
Pretty print output (anto.prettygrid.txt):
# Grid no.: 1 - Attributes(length=4, width=4, size=16, filled=12, saturation=0.75)
minum : meminum : diminum : minuman
makan : memakan : dimakan : makanan
main  :         :         : mainan
beli  :         : dibeli  :]

[Figure 6: A text in Indonesian (top) and the list of words extracted from it (bottom); words appearing in Figure 5 are boldfaced.
Text: Anto memakan nasi dan meminum air. Nasi itu dibeli di pasar. Di pasar, Anto melihat mainan. Anto senang main bola. Setelah main, Anto suka minum es dan makan cilok. Makanan dan minuman itu juga dia beli di pasar. Es dan cilok memang enak dimakan dan diminum selesai olahraga.
Words: air anto beli bola cilok dan di dia dibeli dimakan diminum enak es itu juga main mainan makan makanan melihat memakan memang meminum minum minuman nasi olahraga pasar selesai senang setelah suka]

Let us say that we prepare a file named anto.words.txt which contains all of the types, one type per line. We can then extract the analogical grids by running the following command.
π‘π‘¦π‘‘β„Žπ‘œπ‘›π‘Šπ‘œπ‘Ÿπ‘‘π‘ 2πΊπ‘Ÿπ‘–π‘‘π‘ .𝑝𝑦<π‘Žπ‘›π‘‘π‘œ.π‘€π‘œπ‘Ÿπ‘‘π‘ .𝑑π‘₯𝑑>π‘Žπ‘›π‘‘π‘œ.π‘”π‘Ÿπ‘–π‘‘.𝑑π‘₯π‘‘π‘‡β„Žπ‘’π‘œπ‘’π‘‘π‘π‘’π‘‘π‘œπ‘“π‘‘β„Žπ‘’π‘π‘œπ‘šπ‘šπ‘Žπ‘›π‘‘π‘–π‘ π‘π‘Ÿπ‘–π‘›π‘‘π‘’π‘‘π‘–π‘›π‘‘π‘œπ‘‘β„Žπ‘’π‘“π‘–π‘™π‘’π‘›π‘Žπ‘šπ‘’π‘‘π‘Žπ‘›π‘‘π‘œ.π‘”π‘Ÿπ‘–π‘‘.𝑑π‘₯𝑑.π‘‡β„Žπ‘’π‘π‘œπ‘›π‘‘π‘’π‘›π‘‘π‘œπ‘“π‘‘β„Žπ‘’π‘“π‘–π‘™π‘’π‘–π‘ π‘ β„Žπ‘œπ‘€π‘›π‘Žπ‘‘π‘‘β„Žπ‘’π‘‘π‘œπ‘π‘œπ‘“πΉπ‘–π‘”π‘’π‘Ÿπ‘’5.πΉπ‘œπ‘Ÿπ‘‘β„Žπ‘’π‘ π‘Žπ‘˜π‘’π‘œπ‘“π‘‘π‘’π‘£π‘’π‘™π‘œπ‘π‘šπ‘’π‘›π‘‘π‘œπ‘“π‘‘β„Žπ‘’π‘šπ‘œπ‘‘π‘’π‘™π‘’(π‘–π‘›π‘π‘’π‘‘βˆ’π‘œπ‘’π‘‘π‘π‘’π‘‘π‘π‘’π‘‘π‘€π‘’π‘’π‘›π‘šπ‘œπ‘‘π‘’π‘™π‘’π‘ ),π‘€π‘’π‘β„Žπ‘œπ‘œπ‘ π‘’π‘‘π‘œπ‘’π‘›π‘π‘Žπ‘π‘ π‘’π‘™π‘Žπ‘‘π‘’π‘‘β„Žπ‘’π‘Žπ‘›π‘Žπ‘™π‘œπ‘”π‘–π‘π‘Žπ‘™π‘”π‘Ÿπ‘–π‘‘π‘ π‘–π‘›π‘ π‘’π‘β„Žπ‘‘π‘Žπ‘‘π‘Žπ‘“π‘œπ‘Ÿπ‘šπ‘Žπ‘‘π‘‘π‘’π‘“π‘–π‘›π‘’π‘‘π‘–π‘›π‘†π‘’π‘βˆ’π‘‘π‘–π‘œπ‘›3.1.πΆπ‘Žπ‘’π‘‘π‘–π‘œπ‘›:π‘π‘œπ‘›π‘’π‘ π‘‘π‘Žπ‘›π‘‘π‘ π‘“π‘œπ‘Ÿπ‘Žπ‘›π‘’π‘šπ‘π‘‘π‘¦π‘π‘’π‘™π‘™.2.4.1.π‘‰π‘–π‘ π‘’π‘Žπ‘™π‘–π‘§π‘–π‘›π‘”π‘‘β„Žπ‘’π΄π‘›π‘Žπ‘™π‘œπ‘”π‘–π‘π‘Žπ‘™πΊπ‘Ÿπ‘–π‘‘π‘ π‘‡π‘œβ„Žπ‘Žπ‘£π‘’π‘Žπ‘π‘’π‘‘π‘‘π‘’π‘Ÿπ‘£π‘–π‘’π‘€π‘œπ‘›π‘‘β„Žπ‘’π‘Žπ‘›π‘Žπ‘™π‘œπ‘”π‘–π‘π‘Žπ‘™π‘”π‘Ÿπ‘–π‘‘π‘ π‘π‘Ÿπ‘œπ‘‘π‘’π‘π‘’π‘‘π‘π‘¦π‘‘β„Žπ‘’π‘‘π‘œπ‘œπ‘™π‘ ,π‘€π‘’π‘π‘Ÿπ‘œπ‘£π‘–π‘‘π‘’π‘Žπ‘›π‘œπ‘π‘‘π‘–π‘œπ‘›π‘π‘Žπ‘™π‘™π‘’π‘‘π‘π‘Ÿπ‘’π‘‘π‘‘π‘¦π‘π‘Ÿπ‘–π‘›π‘‘.π‘…π‘’π‘›βˆ’π‘›π‘–π‘›π‘”π‘‘β„Žπ‘’π‘“π‘œπ‘™π‘™π‘œπ‘€π‘–π‘›π‘”π‘π‘œπ‘šπ‘šπ‘Žπ‘›π‘‘π‘€π‘–π‘™π‘™π‘π‘Ÿπ‘–π‘›π‘‘π‘‘β„Žπ‘’π‘Žπ‘›π‘Žπ‘™π‘œπ‘”π‘–π‘π‘Žπ‘™π‘”π‘Ÿπ‘–π‘‘π‘ π‘–π‘›π‘Žπ‘ π‘’π‘π‘Žπ‘Ÿπ‘Žπ‘‘π‘’π‘“π‘–π‘™π‘’π‘›π‘Žπ‘šπ‘’π‘‘π‘Žπ‘›π‘‘π‘œ.π‘π‘Ÿπ‘’π‘‘π‘‘π‘¦π‘”π‘Ÿπ‘–π‘‘.𝑑π‘₯π‘‘π‘€π‘–π‘‘β„Žπ‘Žπ‘‘π‘–π‘“π‘“π‘’π‘Ÿπ‘’π‘›π‘‘π‘“π‘œπ‘Ÿπ‘šπ‘Žπ‘‘π‘‘β„Žπ‘Žπ‘›π‘‘β„Žπ‘’π‘œπ‘›π‘’π‘šπ‘’π‘›π‘‘π‘–π‘œπ‘›π‘’π‘‘π‘–π‘›π‘†π‘’π‘π‘‘π‘–π‘œπ‘›3.1.
The attributes of the grids are printed on top of each grid:
length: number of rows
width: number of columns
size: total number of cells (length × width)
filled: number of non-empty cells
saturation: proportion of filled cells (filled / size)

The bottom of Figure 5 shows the analogical grid extracted from the set of words in Figure 6, printed using the pretty print option. It has a size of 16 cells, with 4 rows and 4 columns. 12 cells out of 16 are filled, so the saturation of the grid is 0.75. Further details about the attributes of analogical grids, like size and saturation, will be described in Section 3.3. and Section 3.4.

2.4.2. Extracting Analogical Grids Around Particular Word Forms

We also offer a function to focus the study on one particular word form: it is possible to deliver only those analogical grids which contain the particular word form under scrutiny. The Words2Grids program in fact runs in two steps: it first extracts analogical clusters from a list of word forms and then builds analogical grids from the extracted clusters. First, we extract all analogical clusters using the Words2Clusters program with a specific option called focus. This option only extracts analogical clusters which contain a particular word given as parameter of the option. We then use the Clusters2Grids program to produce the analogical grids from the previously extracted clusters. All of these steps, however, are performed transparently to the user by the Words2Grids program. For example, to produce all analogical grids which contain the word walking, we can run the following command.

$ python Words2Grids.py < anto.words.id.txt --focus walking

By building analogical grids from those clusters, the user obtains all the analogical grids which contain that word form. It is then possible to characterize the actual productivity of a particular word form by inspecting the size of the analogical grids that contain it. For further study, as proposed in (Hathout, 2009), one can then retrieve all the word forms that have the same relationship with the word form under scrutiny. These words are basically the neighboring word forms inside the analogical grids.

3. Languages and Data in the Released Resource

Using the procedure described in Section 2.1., we produced analogical grids from all the words contained in 1,000 corresponding lines of the Europarl corpus v3 (Koehn, 2007) in all of its 11 European languages. The motivation for using a multilingual parallel corpus is to allow comparative studies across these languages based on the analogical grids produced. We produced the analogical grids for different sizes of N-grams, from unigrams to six-grams.

Language  # tokens (N)  # types (V)  Avg. length of types
da        27,034        5,304         9.06 ± 4.22
de        27,042        5,753         9.69 ± 4.19
el        28,559        6,397        16.45 ± 6.21
en        28,594        4,305         7.35 ± 2.77
es        29,974        5,300         8.18 ± 2.91
fi        20,604        7,473        10.60 ± 4.22
fr        31,257        5,184         8.25 ± 3.01
it        28,269        5,425         8.08 ± 2.84
nl        28,933        5,028         8.90 ± 3.99
pt        29,342        5,472         8.32 ± 3.04
sv        25,681        5,452         9.03 ± 4.11
Table 1: Statistics on the first 1,000 lines of the Europarl corpus (average type length given as mean ± deviation).

Table 1 shows the statistics of the input data. French has the largest number of tokens, with more than thirty thousand. Although Finnish has the smallest number of tokens, with around twenty thousand, it has the largest number of types, followed by Greek. This is due to Finnish being an agglutinative language.
This is also reflected in the average length of types: Finnish has the second longest average type length (around 11 characters on average) after Greek (around 17 characters on average), the reverse of their order by number of types. The other languages tend to have around twenty-eight thousand tokens represented by around five thousand types with an average length of 9 characters.'], 'seq': 2.0, 'all_headings': ['Main Usages of the Tools Released', 'From Word Forms to Analogical Grids']}, {'Conclusion': ['We released a Python module for the production of analogical grids from word forms contained in a corpus. Several additional functions are implemented for the sake of language productivity analysis and for the use of richer features than just character counts. In addition, we released a complete data set which contains analogical clusters and analogical grids built on 1,000 corresponding lines in 11 European languages extracted from the Europarl corpus v.3. We hope that this module and data will be used by researchers in comparative linguistic studies, in re-inflection tasks, or in other tasks of Natural Language Processing. We also hope that the tools provided in the module will be used for the study of languages other than those of the released resource.'], 'all_headings': ['Conclusion'], 'label': ['CONC'], 'seq': 4.0}, {'all_headings': ['Acknowledgements'], 'label': ['CONC'], 'Acknowledgements': ['This work was supported by a JSPS Grant, Number 15K00317 (Kakenhi C), entitled Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation. [Footnote: In (Chan, 2008, p. 79), saturation is the maximal proportion of word forms attested for any one lemma of a given paradigm. Here we use the term for each entire table.]

6. Bibliographical References

Chan, E. (2008). Structures and distributions in morphology learning. Ph.D. thesis, University of Pennsylvania.
Fam, R. and Lepage, Y. (2016). Morphological predictability of unseen words using computational analogy. In Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-CA-16), pages 51–60, Atlanta, Georgia.
Fam, R. and Lepage, Y. (2017). A study of the saturation of analogical grids agnostically extracted from texts. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-CA-17), pages 11–20, Trondheim, Norway.
Hathout, N. (2008). Acquisition of the morphological structure of the lexicon based on lexical similarity and formal analogy. In Proceedings of the 3rd Textgraphs Workshop on Graph-based Algorithms for Natural Language Processing, pages 1–8, Manchester, UK, August. Coling 2008 Organizing Committee.
Hathout, N. (2009). Acquisition of morphological families and derivational series from a machine readable dictionary. CoRR, abs/0905.1609.
Lepage, Y. (1998). Solving analogies on words: an algorithm. In Proceedings of the 17th International Conference on Computational Linguistics (COLING 1998), volume 1, pages 728–734. Association for Computational Linguistics.
Lepage, Y. (2014). Analogies between binary images: Application to Chinese characters. In Henri Prade et al., editors, Computational Approaches to Analogical Reasoning: Current Trends, pages 25–57. Springer, Berlin, Heidelberg.
Mansion, J. E. (1981). Harrap's New Shorter French and English Dictionary. George G. Harrap & Co. Ltd, London, Paris, Stuttgart.
Neuvel, S. and Fulop, S. A. (2002). Unsupervised learning of morphology without morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 31–40. Association for Computational Linguistics, July.
Singh, R. and Ford, A. (2000). In praise of Sakatayana: some remarks on whole word morphology. In Rajendra Singh, editor, The Yearbook of South Asian Languages and Linguistics 2000. Sage, Thousand Oaks.
Sylak-Glassman, J., Kirov, C., Post, M., Que, R., and Yarowsky, D. (2015a). A Universal Feature Schema for Rich Morphological Annotation and Fine-Grained Cross-Lingual Part-of-Speech Tagging, pages 72–93. Springer International Publishing, Cham.
Sylak-Glassman, J., Kirov, C., Yarowsky, D., and Que, R. (2015b). A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 674–680, Beijing, China, July. Association for Computational Linguistics.

7. Language Resource References

Christodouloupoulos, Christos. (2015). A massively parallel corpus: the Bible in 100 languages. Distributed via GitHub (github.com/christos-c/bible-corpus), bible-corpus v1.8.
Christo Kirov and John Sylak-Glassman and Roger Que and David Yarowsky. (2016). UniMorph Data. UniMorph Project, distributed via web (unimorph.org).
Koehn, Philipp. (2007). European Parliament Proceedings Parallel Corpus. Distributed via web (statmt.org/europarl/), Europarl v3.'], 'seq': 5.0}],
[{'abstract': 'Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective (Mikolov et al., 2013) and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.', 'seq': 0.0}, {'all_headings': ['Introduction'], 'label': ['INT'], 'seq': 1.0, 'Introduction': ['Word embeddings are popular representations for syntax (Turian et al., 2010; Collobert and Weston, 2008; Mnih and Hinton, 2007), semantics (Huang et al., 2012; Socher et al., 2013), morphology (Luong et al., 2013) and other areas. A long line of embeddings work, such as LSA and randomized embeddings (Ravichandran et al., 2005; Van Durme and Lall, 2010), has recently turned to neural language models (Bengio et al., 2006; Collobert and Weston, 2008; Turian et al., 2010). Unsupervised learning can take advantage of large corpora, which can produce impressive results. However, the main drawback of unsupervised learning is that the learned embeddings may not be suited for the task of interest. Consider semantic embeddings, which may capture a notion of semantics that improves one semantic task but harms another. Controlling this behavior is challenging with an unsupervised objective. However, rich prior knowledge exists for many tasks, and there are numerous such semantic resources. We propose a new training objective for learning word embeddings that incorporates prior knowledge. [Footnote: This work was done while the author was visiting JHU.] Our model builds on word2vec (Mikolov et al., 2013), a neural network based language model that learns word embeddings by maximizing the probability of raw text. We extend the objective to include prior knowledge about synonyms from semantic resources; we consider both the Paraphrase Database (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999), which annotate semantic relatedness between words. The latter was also used in (Bordes et al., 2012) for training a network for predicting synset relations. The combined objective maximizes the probability of the raw corpus and encourages embeddings to capture semantic relations from the resources. We demonstrate improvements in our embeddings on three tasks: language modeling, measuring word similarity, and predicting human judgements on word pairs.']}, {'all_headings': ['Learning Embeddings', 'Word2vec', 'Relation Constrained Model', 'Joint Model', 'Parameter Estimation'], 'label': ['MET'], 'Learning Embeddings': ['We present a general model for learning word embeddings that incorporates prior knowledge available for a domain. While in this work we consider semantics, our model could incorporate prior knowledge from many types of resources. We begin by reviewing the word2vec objective and then present augmentations of the objective for prior knowledge, including different training strategies.', 'Word2vec (Mikolov et al., 2013) is an algorithm for learning embeddings using a neural language model. Embeddings are represented by a set of latent (hidden) variables, and each word is represented by a specific instantiation of these variables.
Training learns these representations for each word $w_t$ (the t-th word in a corpus of size T) so as to maximize the log likelihood of each token given its context, i.e., the words within a window of size c:

$$ \max \; \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right), \quad (1) $$

where $w_{t-c}^{t+c}$ is the set of words in the window of size c centered at $w_t$ ($w_t$ excluded). Word2vec offers two choices for modeling Eq. (1): a skip-gram model and a continuous bag-of-words model (cbow). The latter worked better in our experiments, so we focus on it in our presentation. cbow defines $p(w_t \mid w_{t-c}^{t+c})$ as:

$$ \frac{\exp\left( e'^{\top}_{w_t} \sum_{-c \le j \le c, j \ne 0} e_{w_{t+j}} \right)}{\sum_{w} \exp\left( e'^{\top}_{w} \sum_{-c \le j \le c, j \ne 0} e_{w_{t+j}} \right)}, \quad (2) $$

where $e_w$ and $e'_w$ represent the input and output embeddings respectively, i.e., the assignments to the latent variables for word w. While some learn a single representation for each word ($e_w \equiv e'_w$), our results improved when we used a separate embedding for input and output in cbow.', 'Suppose we have a resource that indicates relations between words. In the case of semantics, we could have a resource that encodes semantic similarity between words. Based on this resource, we learn embeddings that predict one word from another related word. We define R as a set of relations between two words w and w′. R can contain typed relations (e.g., w is related to w′ through a specific type of semantic relation), and relations can have associated scores indicating their strength. We assume a single relation type of uniform strength, though it is straightforward to include additional characteristics into the objective. Define $R_w$ to be the subset of relations in R which involve word w. Our objective maximizes the (log) probability of all relations by summing over all N words in the vocabulary:

$$ \frac{1}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log p\left(w \mid w_i\right), \quad (3) $$

where $p(w \mid w_i) = \exp\left( e'^{\top}_{w} e'_{w_i} \right) / \sum_{\bar{w}} \exp\left( e'^{\top}_{\bar{w}} e'_{w_i} \right)$ takes a form similar to Eq. (2) but without the context: e and e′ are again the input and output embeddings. For our semantic relations $e_w$ and $e'_w$ are symmetrical, so we use a single embedding. Embeddings are learned such that they are predictive of related words in the resource. We call this the Relation Constrained Model (RCM).', 'The cbow and RCM objectives use separate data for learning. While RCM learns embeddings suited to specific tasks based on knowledge resources, cbow learns embeddings for words that are not included in the resource but appear in a corpus. We form a joint model through a linear combination of the two (weighted by C):

$$ \frac{1}{T} \sum_{t=1}^{T} \log p\left(w_t \mid w_{t-c}^{t+c}\right) + \frac{C}{N} \sum_{i=1}^{N} \sum_{w \in R_{w_i}} \log p\left(w \mid w_i\right) $$

Based on our initial experiments, RCM uses the output embeddings of cbow. We learn embeddings using stochastic gradient ascent. Updates for the first term for e′ and e are:

$$ e'_w \leftarrow e'_w - \gamma_{cbow} \left( \sigma(f(w)) - \mathbb{I}_{[w = w_t]} \right) \sum_{j=t-c}^{t+c} e_{w_j} $$
$$ e_{w_j} \leftarrow e_{w_j} - \gamma_{cbow} \sum_{w} \left( \sigma(f(w)) - \mathbb{I}_{[w = w_t]} \right) e'_w, $$

where $\sigma(x) = \exp\{x\}/(1 + \exp\{x\})$, $\mathbb{I}_{[x]}$ is 1 when x is true, and $f(w) = e'^{\top}_{w} \sum_{j=t-c}^{t+c} e_{w_j}$. Updates for the second term are:

$$ e'_w \leftarrow e'_w - \gamma_{RCM} \left( \sigma(f(w)) - \mathbb{I}_{[w \in R_{w_i}]} \right) e'_{w_i} $$
$$ e'_{w_i} \leftarrow e'_{w_i} - \gamma_{RCM} \sum_{w} \left( \sigma(f(w)) - \mathbb{I}_{[w \in R_{w_i}]} \right) e'_w, $$

where $f(w) = e'^{\top}_{w} e'_{w_i}$. We use two learning rates: $\gamma_{cbow}$ and $\gamma_{RCM}$.', 'All three models (cbow, RCM and joint) use the same training scheme, based on Mikolov et al. (2013). There are several choices to make in parameter estimation; we present the best performing choices used in our results. We use noise contrastive estimation (NCE) (Mnih and Teh, 2012), which approximately maximizes the log probability of the softmax objective (Eq. 2).
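As a concrete rendering of the update rules above, the following simplified sketch (ours; the paper's implementation extends the word2vec package, and this Python/NumPy version is only illustrative) performs one stochastic update of the RCM term for a single word w_i, with sampled negatives standing in for the full softmax normalization.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def rcm_update(E, i, related, gamma_rcm=0.01, num_neg=15, rng=np.random):
        # One stochastic update of the RCM term for word index i.
        # E       : (V, d) array of output embeddings (a single matrix,
        #           since the semantic relations are symmetric)
        # related : set of indices w in R_{w_i}
        # num_neg : negative samples per instance, as in the NCE setup;
        #           drawn uniformly here for simplicity (the paper samples
        #           from the unigram distribution raised to the 3/4 power)
        candidates = list(related) + list(rng.randint(0, len(E), num_neg))
        grad_i = np.zeros_like(E[i])
        for w in candidates:
            f = E[w].dot(E[i])                    # f(w) = e'_w . e'_{w_i}
            g = sigmoid(f) - float(w in related)  # sigma(f(w)) - I[w in R_{w_i}]
            grad_i += g * E[w]
            E[w] -= gamma_rcm * g * E[i]          # update e'_w
        E[i] -= gamma_rcm * grad_i                # update e'_{w_i}

In the joint model, updates of this form for the RCM term are interleaved (in separate threads) with the cbow updates for the first term.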
For each objective (cbow or RCM), we sample 15 words as negative samples for each training instance according to their frequencies in raw text (i.e., the training data of cbow). Suppose w has frequency u(w); then the probability of sampling w is $p(w) \propto u(w)^{3/4}$. We use distributed training, where shared embeddings are updated by each thread based on the training data within the thread, i.e., asynchronous stochastic gradient ascent. For the joint model, we assign threads to the cbow or RCM objective with a balance of 12:1 (i.e., C is approximately 1/12). We allow the cbow threads to control convergence; training stops when these threads finish processing the data. We found this an effective method for balancing the two objectives. We trained each cbow objective using a single pass over the data set (except for those in Section 4.1), which we empirically verified was sufficient to ensure stable performances on semantic tasks. Model pre-training is critical in deep learning (Bengio et al., 2007; Erhan et al., 2010). We evaluate two strategies: random initialization, and pre-training the embeddings. For pre-training, we first learn using cbow with a random initialization. The resulting trained model is then used to initialize the RCM model. This enables the RCM model to benefit from the unlabeled data, but refine the embeddings constrained by the given relations. Finally, we consider a final model for training embeddings that uses a specific training regime. While the joint model balances between fitting the text and learning relations, modeling the text at the expense of the relations may negatively impact the final embeddings for tasks that use the embeddings outside of the context of word2vec. Therefore, we use the embeddings from a trained joint model to pre-train an RCM model. We call this setting JointRCM.'], 'seq': 2.0}, {'all_headings': ['Evaluation'], 'label': ['RES'], 'Evaluation': ['For training cbow we use the New York Times (NYT) 1994-97 subset from Gigaword v5.0 (Parker et al., 2011). We select 1,000 paragraphs each for dev and test data from the December 2010 portion of the NYT. Sentences are tokenized using OpenNLP (opennlp.apache.org), yielding 518,103,942 tokens for training, 42,953 tokens for dev and 41,344 for test. We consider two resources for training the RCM term: the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) and WordNet (Fellbaum, 1999). For each semantic pair extracted from these resources, we add a relation to the RCM objective. Since we use both resources for evaluation, we divide each into train, dev and test. PPDB is an automatically extracted dataset containing tens of millions of paraphrase pairs, including words and phrases. We used the lexical version of PPDB (no phrases) and filtered it to include pairs that contained words found among the 200,000 most frequent words in the NYT corpus, which ensures each word in the relations has support in the text corpus. Next, we removed duplicate pairs: if <A,B> occurred in PPDB, we removed relations of <B,A>.

PPDB Relations              WordNet Relations
Train  XL       115,041     Train  68,372 (not used in this work)
       XXL      587,439
       XXXL   2,647,105
Dev             1,582       Dev     1,500
Test            1,583       Test    1,500
Table 1: Sizes of semantic resources datasets.

PPDB is organized into 6 parts, ranging from S (small) to XXXL. Division into these sets is based on an automatically derived accuracy metric. Since S contains the most accurate paraphrases, we used it for evaluation.
We divided S into a dev set (1,582 pairs) and a test set (1,583 pairs). Training was based on one of the other sets, minus relations from S. We created similar splits using WordNet, extracting synonyms using the 100,000 most frequent NYT words. We divide the vocabulary into three sets: the most frequent 10,000 words, words with ranks between 10,001-30,000, and words with ranks between 30,001-100,000. We sample 500 words from each set to construct a dev and test set. For each word we sample one synonym to form a pair. The remaining words and their synonyms are available for training. However, we did not use this training data because it is too small to affect the results. Table 1 summarizes the datasets.'], 'seq': 3.0}, {'Experiments': ['The goal of our experiments is to demonstrate the value of learning semantic embeddings with information from semantic resources. In each setting, we compare the word2vec baseline embedding trained with cbow against RCM alone, the joint model and JointRCM. We consider three evaluation tasks: language modeling, measuring semantic similarity, and predicting human judgements on semantic relatedness. In all of our experiments, we conducted model development and tuned model parameters (C, γcbow, γRCM, PPDB dataset, etc.) on development data, and evaluate the best performing model on test data. The models are notated as follows: word2vec for the baseline objective (cbow or skip-gram), RCM-r/p and Joint-r/p for random and pre-trained initializations of the RCM and Joint objectives, and JointRCM for pre-training RCM with Joint embeddings. Unless otherwise noted, we train using PPDB XXL. We initially created WordNet training data, but found it too small to affect results. Therefore, we include only RCM results trained on PPDB, but show evaluations on both PPDB and WordNet.

Model                        NCE    HS
word2vec (cbow)              8.75   6.90
RCM-p                        8.55   7.07
Joint-r (γRCM = 1 × 10^-2)   8.33   6.87
Joint-r (γRCM = 1 × 10^-3)   8.20   6.75
JointRCM                     8.40   6.92
Table 2: LM evaluation on held-out NYT data.

We trained 200-dimensional embeddings and used output embeddings for measuring similarity. During the training of cbow objectives we remove all words with frequencies less than 5, which is the default setting of word2vec.', 'Word2vec is fundamentally a language model, which allows us to compute standard evaluation metrics on a held-out dataset. After obtaining trained embeddings from any of our objectives, we use the embeddings in the word2vec model to measure the perplexity of the test set. Measuring perplexity means computing the exact probability of each word, which requires summation over all words in the vocabulary in the denominator of the softmax. Therefore, we also trained the language models with the hierarchical classification strategy (HS) of Mikolov et al. (2013). The averaged perplexities are reported on the NYT test set. While word2vec and joint are trained as language models, RCM is not. In fact, RCM does not even observe all the words that appear in the training set, so it makes little sense to use the RCM embeddings directly for language modeling. Therefore, in order to make a fair comparison, for every set of trained embeddings, we fix them as the input embeddings for word2vec, then learn the remaining input embeddings (words not in the relations) and all the output embeddings using cbow.
Since this involves running cbow on the NYT data for 2 iterations (one iteration for word2vec training/pre-training/joint modeling and the other for tuning the language model), we use Joint-r (random initialization) for a fair comparison. Table 2 shows the results for language modeling on test data. All of our proposed models improve over the baseline in terms of perplexity when NCE is used for training LMs. When HS is used, the perplexities are greatly improved. However, in this situation only the joint models improve the results, and JointRCM performs similarly to the baseline, although it is not designed for language modeling. We include the optimal γRCM in the table, while setting γcbow = 0.025 (the default setting of word2vec). Even when our goal is to strictly model the raw text corpus, we obtain improvements by injecting semantic information into the objective. RCM can effectively shift learning to obtain more informative embeddings.', 'Our next task is to find semantically related words using the embeddings, evaluating on relations from PPDB and WordNet. For each of the word pairs <A,B> in the evaluation set, we use the cosine distance between the embeddings to score A against a candidate word B′. We use a large sample of candidate words (10k, 30k or 100k) and rank all candidate words for pairs where B appears in the candidates. We then measure the rank of the correct B to compute mean reciprocal rank (MRR). Our goal is to use word A to select word B as the closest matching word from the large set of candidates. Using this strategy, we evaluate the embeddings from all of our objectives and measure which embedding most accurately selects the true correct word. Table 3 shows MRR results for both PPDB and WordNet dev and test datasets for all models. All of our methods improve over the baselines in nearly every test set result. In nearly every case, JointRCM obtains the largest improvements. Clearly, our embeddings are much more effective at capturing semantic similarity.', 'Our final evaluation is to predict human judgements of semantic relatedness. We have pairs of words from PPDB scored by annotators on a scale of 1 to 5 for quality of similarity. Our data are the judgements used by Ganitkevitch et al. (2013), which we filtered to include only those pairs for which we learned embeddings, yielding 868 pairs. We assign a score using the dot product between the output embeddings of each word in the pair, then order all 868 pairs according to this score. Using the human judgements, we compute the swapped pairs rate: the ratio between the number of swapped pairs and the number of all pairs. For a pair p scored $y_p$ by the embeddings and judged $\hat{y}_p$ by an annotator, the swapped pair rate is:

$$ \frac{\sum_{p_1, p_2 \in D} \mathbb{I}\left[ \left(y_{p_1} - y_{p_2}\right) \left(\hat{y}_{p_1} - \hat{y}_{p_2}\right) < 0 \right]}{\sum_{p_1, p_2 \in D} \mathbb{I}\left[ \hat{y}_{p_1} \ne \hat{y}_{p_2} \right]} \quad (4) $$

where $\mathbb{I}[x]$ is 1 when x is true.

                       PPDB Dev            PPDB Test           WordNet Dev        WordNet Test
Model                  10k   30k   100k    10k   30k   100k    10k   30k   100k   10k   30k   100k
word2vec (cbow)        49.68 39.26 29.15   49.31 42.53 30.28   10.24  8.64  5.14  10.04  7.90  4.97
word2vec (skip-gram)   48.70 37.14 26.20   -     -     -        8.61  8.10  4.62  -     -     -
RCM-r                  55.03 42.52 26.05   -     -     -       13.33  9.05  5.29  -     -     -
RCM-p                  61.79 53.83 40.95   65.42 55.82 41.20   15.25 12.13  7.46  14.13 11.23  7.39
Joint-r                59.91 50.87 36.81   -     -     -       15.73 11.36  7.14  13.97 10.51  7.44
Joint-p                59.75 50.93 37.73   64.30 53.27 38.97   15.61 11.20  6.96  -     -     -
JointRCM               64.22 54.99 41.34   68.20 57.87 42.64   16.81 11.67  7.55  16.16 11.21  7.56
Table 3: MRR for semantic similarity on PPDB and WordNet dev and test data. Higher is better.
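For concreteness, the swapped-pairs rate of Eq. (4) can be computed as in the following sketch (ours); it examines every pair of scored pairs whose human judgements differ and counts those that the embedding scores order the other way.

    from itertools import combinations

    def swapped_pairs_rate(model_scores, human_scores):
        # Eq. (4): fraction of comparable pairs (human judgements differ)
        # that the model scores order differently than the annotators do.
        swapped = comparable = 0
        for p1, p2 in combinations(range(len(model_scores)), 2):
            human_diff = human_scores[p1] - human_scores[p2]
            if human_diff == 0:
                continue  # ties cannot be swapped
            comparable += 1
            if (model_scores[p1] - model_scores[p2]) * human_diff < 0:
                swapped += 1
        return float(swapped) / comparable

    print(swapped_pairs_rate([0.9, 0.2, 0.5], [5, 1, 2]))  # 0.0: same ordering
    print(swapped_pairs_rate([0.2, 0.9, 0.5], [5, 1, 2]))  # 1.0: fully reversed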
All RCM objectives are trained with PPDB XXL. To preserve test data integrity, only the best performing setting of each model is evaluated on the test data.

Model             Swapped Pairs Rate
word2vec (cbow)   17.81
RCM-p             16.66
Joint-r           16.85
Joint-p           16.96
JointRCM          16.62
Table 4: Results for ranking the quality of PPDB pairs as compared to human judgements.

Table 4 shows that all of our models obtain reductions in error as compared to the baseline (cbow), with JointRCM obtaining the largest reduction. This suggests that our embeddings are better suited for semantic tasks, in this case as judged by human annotations.

                       PPDB Dev
Model    Relations   10k    30k    100k
RCM-r    XL          24.02  15.26   9.55
RCM-p    XL          54.97  45.35  32.95
RCM-r    XXL         55.03  42.52  26.05
RCM-p    XXL         61.79  53.83  40.95
RCM-r    XXXL        51.00  44.61  28.42
RCM-p    XXXL        53.01  46.35  34.19
Table 5: MRR on PPDB dev data for training on an increasing number of relations.

                        PPDB Dev
Model     γRCM        10k    30k    100k
Joint-p   1 × 10^-1   47.17  36.74  24.50
          5 × 10^-2   54.31  44.52  33.07
          1 × 10^-2   59.75  50.93  37.73
          1 × 10^-3   57.00  46.84  34.45
Table 6: Effect of the learning rate γRCM on MRR for the RCM objective in Joint models.', 'We conclude our experiments with an analysis of modeling choices. First, pre-training RCM models gives significant improvements in both measuring semantic similarity and capturing human judgements (compare -p vs. -r results). Second, the number of relations used for RCM training is an important factor. Table 5 shows the effect on dev data of using various numbers of relations. While we see improvements from XL to XXL (5 times as many relations), we get worse results on XXXL, likely because this set contains the lowest quality relations in PPDB. Finally, Table 6 shows different learning rates γRCM for the RCM objective. The baseline word2vec and the joint model have nearly the same averaged running times (2,577s and 2,644s respectively), since they use the same number of threads for the cbow objective and the joint model uses additional threads for the RCM objective. The RCM models are trained with a single thread for 100 epochs. When trained on the PPDB XXL data, RCM training takes 2,931s on average.'], 'label': ['RES', 'MET'], 'seq': 4.0, 'all_headings': ['Experiments', 'Language Modeling', 'Measuring Semantic Similarity', 'Human Judgements', 'Analysis']}, {'Conclusion': ['We have presented a new learning objective for neural language models that incorporates prior knowledge contained in resources to improve learned word embeddings. We demonstrated that the Relation Constrained Model can lead to better semantic embeddings by incorporating resources like PPDB, leading to better language modeling, better semantic similarity metrics, and better prediction of human semantic judgements. Our implementation is based on the word2vec package and we have made it available for general use (github.com/Gorov/JointRCM). We believe that our techniques have implications beyond those considered in this work. We plan to explore the suitability of the embeddings for other semantic tasks, including the use of resources with both typed and scored relations. Additionally, we see opportunities for jointly learning embeddings across many tasks with many resources, and plan to extend our model accordingly.

Acknowledgements

Yu is supported by the China Scholarship Council and by NSFC 61173073.

References

Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. 2007. Greedy layer-wise training of deep networks. In Neural Information Processing Systems (NIPS).
Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In International Conference on Artificial Intelligence and Statistics, pages 127–135.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning (ICML).
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research (JMLR), 11:625–660.
Christiane Fellbaum. 1999. WordNet. Wiley Online Library.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In North American Chapter of the Association for Computational Linguistics (NAACL).
Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Association for Computational Linguistics (ACL), pages 873–882.
Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Conference on Natural Language Learning (CoNLL).
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.
Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In International Conference on Machine Learning (ICML).
Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.
Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition. Technical report, Linguistic Data Consortium.
Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: using locality sensitive hash functions for high speed noun clustering. In Association for Computational Linguistics (ACL).
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL).
Benjamin Van Durme and Ashwin Lall. 2010. Online generation of locality sensitive hash signatures. In Association for Computational Linguistics (ACL), pages 231–235.'], 'all_headings': ['Conclusion'], 'label': ['CONC'], 'seq': 5.0}],
[{'all_headings': ['Introduction and Relation to Prior Work'], 'label': ['INT'], 'seq': 1.0, 'Introduction and Relation to Prior Work': ['Dialogue system research, like much else in computational linguistics, has greatly benefited from corpora of natural speech. With notable exceptions (e.g. the Edinburgh Map Task, Anderson et al. [1991]), these corpora consist of samples annotated with linguistic properties (e.g. POS, syntax, discourse status), setting aside the visual and pragmatic aspects of the context in which they occurred. In recent years natural language processing (NLP) researchers have been working to incorporate visual and other context into their models and systems (DeVault and Stone 2004; Gabsdil and Lemon 2004; Schuler, Wu, and Schwartz 2009). This is consistent with the growing evidence in psycholinguistics that human language production crucially depends on such aspects of context. To take this NLP research further, there is a need for more corpora that include both variation in, and annotation of, visual and pragmatic context. There are still many open questions that span computational linguistics and psycholinguistics concerning how natural language and context are related. One core question at the intersection of these areas is how the inherent difficulty of describing an end-goal (i.e., its codability) affects the structure and content of referring expressions and the referential strategy speakers adopt. Referential strategies are a topic of growing interest in natural language generation. In recent work, Viethen and Dale (2006) demonstrated that even when describing simple grid layouts, people adopt different referential strategies, due perhaps to proximity to landmarks (and hence codability): the orange drawer below the two yellow drawers, in contrast to the yellow drawer in the third column from the left second from the top. For systems to produce humanlike references in these situations, existing methods of reference generation will need to be modified or extended to include better models of the choice of referential strategies (Viethen and Dale 2006). Such models can also be expected to improve reference resolution: if better predictions can be made about what people will say in a given situation, automatic speech recognition language models can be tighter, NLP grammars can be smaller, and unlikely parses can be avoided, improving both speed and accuracy. Recent psycholinguistic research suggests that codability does play a role in human reference production (e.g., Cook, Jaeger, and Tanenhaus 2009). This work has largely focused on timing, signals of production difficulty (e.g., disfluency, gesture), and the content of referring expressions (e.g., adjectives, pronouns).
There has been much less consideration of how entire referential strategies might systematically vary with codability. A corpus with the correct design and structure will allow for investigation of the more well-studied aspects as well as higher-level factors such as strategy choice, and possible interactions between them. With these considerations in mind, we designed a domain, Fruit Carts, and a set of corresponding tasks in order to elicit human language production for two purposes: 1) the testing of psycholinguistic hypotheses, specifically that object complexity modulates referential strategy, and more generally the exploration of the relationship between visual context and human-human dialogue, and 2) research and development of dialogue systems that understand language as it unfolds, taking pragmatic factors into account early in the recognition process. By designing with both fields in mind we hope to strengthen the long tradition of cross-fertilization between the disciplines (e.g., Brennan 1991), particularly for task- or game-oriented systems and domains with a visual component. We identified four important features to build into the domain. First, the language produced should be completely unscripted: participants should be able to perform the task with a general description of what to do (e.g., Give instructions on how to make the map on the screen look like the map in your hand) and zero prior examples of what to say. For psycholinguistics, this makes the language natural speech rather than speech that is restricted by the instructions or by prior examples. For dialogue systems, this makes the language untrained rather than the result of careful training, meaning that systems will be processing language that is representative of what speakers are likely to produce when they use the system, especially without extensive training. Second, the language should be fairly well constrained by context. For psycholinguistics, this makes the language more straightforward to analyze and also more directly tied to the visual context, and thus amenable to visual world studies that use eye movements to examine real-time production (Griffin and Bock 2000) and comprehension (Tanenhaus et al. 1995). For dialogue systems, this makes the language more amenable to automatic processing and also facilitates the integration of different types of knowledge into the recognition process. Third, it should be possible to vary the difficulty of the tasks. For psycholinguistics, this makes hypotheses about the effect of task difficulty on language production amenable to study. For dialogue systems, this allows the resulting corpora to have a combination of relatively easy tasks (low-hanging fruit) and more difficult NLP challenges. Fourth, the domain should support the collection of dialogues that are separable into partially or semi-independent subdialogues, with limited need for reference to previous subdialogues. For psycholinguistics, this makes each subdialogue a separate trial, allowing for analyses where trials are treated as random effects in mixed-effect regression models or repeated measures in ANOVAs. For dialogue systems, this limits the likelihood that errors in processing one subdialogue will spill over and affect processing of subsequent subdialogues. For both research areas, this separability constraint enables within-subject experiments with each subdialogue as a trial.
In purpose and approach, Fruit Carts is most similar to the Map Task (Anderson et al. 1991); both are simultaneously a set of experiments on language and a corpus used for developing language processing systems. Map Task dialogues are unscripted [but] the corpus as a whole comprises a large, carefully controlled elicitation exercise (Anderson et al. 1991, page 352) that has been used in many computational endeavors as well. Fruit Carts was guided by our twin goals of furthering the development of spoken language systems, and providing a psycholinguistic test bed in which to test specific hypotheses about human language production. Fruit Carts differs from the Map Task in terms of dynamic object properties and in terms of the information available to the speaker and hearer. In the Map Task, objects have fixed properties that differ between giver and follower, yet remain constant while the path is constructed. In Fruit Carts, objects have properties that can be changed: position, angle, and color. This allows for a wide variety of linguistic behavior, which in turn supports detailed exploration of continuous understanding by humans and machines. In the Map Task, the participants' screens differ, whereas in Fruit Carts the speaker and hearer share the same visual context, which simplifies the analysis and interpretation of results (Figure 1).']}, {'all_headings': ['Fruit Carts Domain and Tasks'], 'label': ['INT'], 'Fruit Carts Domain and Tasks': ['The Fruit Carts domain has three screen areas: a map, an object bin, and a controls panel. Each area was designed in part to elicit the types of expressions that require continuous understanding to approximate human behavior, such as progressive restriction of a reference set throughout the utterance. The map contains named regions divided by solid lines, with three flags as landmarks. The region names did not appear on the screen, to preclude the use of spelling in referring expressions (the C in Central Park). Names were chosen to be phonetically distinct. To support progressive restriction of potential regions, regions whose initial portions overlap are adjacent (Morn identifies Morningside and Morningside Heights) and some regions have flags while others do not (put the square on the flag in... identifies the regions with flags). No compass is displayed, in an attempt to limit the directions elicited to up, down, left, and right, and not north, south, and so on.

[Figure 1: Example initial and final configurations for the Fruit Carts domain and corpus. The region names were available to both director and actor (on paper) but were not shown on screen. The final configuration shown is the actual screen after the five dialogues from the participant whose third, fourth, and fifth dialogues are shown in Appendix A.]

The object bin contains fruits and carts, by analogy with food vendor carts (e.g., hot dog stands). The fruits are avocados, bananas, cucumbers, grapefruits, and tomatoes, all botanically fruits. We chose fruits because they were nameable, especially with a label, and visually different from the carts. The carts are either squares or triangles, in two sizes, with an optional tag that for squares is either a diamond or a heart and for triangles is either a star or a circle.
Adjectives (e.g., large, small) are commonly used in natural language descriptions, and there is a growing body of psycholinguistic research, mostly with scripted utterances, that has used adjectives to investigate real-time language processing (Sedivy et al. 1999; Brown-Schmidt, Campana, and Tanenhaus 2005). Here, to support progressive restriction of potential carts, each component is easy to name but the entire shape requires a complex description rather than a prenominal modifier, or at least strongly prefers one, as no examples to the contrary were observed in the Fruit Carts corpus described later in this article. That is, whereas a square with stripes could be either the square with stripes or the striped square, a square with a diamond on the corner is the square with a diamond on the corner but not *the corner-diamonded square. The controls panel contains left and right rotation arrows and six paint colors (black, brown, orange, blue, pink, and purple) chosen to be distinct from the colors of the fruit. Five tasks are included in Fruit Carts, all performed by using a mouse. To CHOOSE a cart, the user clicks on it. To PLACE it on the map, the user drags it there. To PAINT the cart, the user clicks on the desired color. Painting is a uniformly easy control task. To ROTATE the cart, the user presses and holds down the left or right rotation button. The goal of the rotation tool was to allow arbitrary rotations and to elicit utterances that were in response to visual feedback, such as rotate it a little to the right, more, stop. Finally, to FILL the cart, the user drags fruit to it.'], 'seq': 2.0}, {'all_headings': ['Fruit Carts Corpus'], 'label': ['DATA'], 'Fruit Carts Corpus': ['For the dual goals of gathering a corpus of utterances for dialogue system research, and testing the hypothesis that object complexity modulates referential strategy in human language production, we designed a set of goal maps that systematically manipulated:
POSITION. Each cart was in a high-codability easy position, such as centered on a flag or in a region; or a low-codability hard position, such as off-center.
HEADING. Each cart was at an easy angle, an integer multiple of 45 degrees from its original orientation; or a hard angle, a non-multiple of 45 degrees.
CONTENTS. Each cart contained an easy set of objects, fruit of the same type, such as three tomatoes; or a hard set of objects, such as two bananas and a grapefruit.
COLOR. Each cart was painted a uniformly easy color to provide a control condition.
One person (the director) gave directions to the other (the actor) on how to carry out the task. The director wore a headset microphone that collected speech data; the actor in this set-up wore a head-mounted eye-tracker that collected eye movements. The director (a subject) sat just behind the actor (a confederate); both viewed the same screen. Twelve subjects participated, each of whom specified twenty objects to place on the map; thus, a total of 240 dialogues were collected. The recordings were transcribed word-for-word by a professional transcription service that also provided sentence boundaries. The corpus has been labeled for referential strategy at the utterance level (Aist et al. 2005) and subsequently with referring expressions, spatial relations, and actions in order to support word-by-word incremental interpretation (Gallo et al. 2007); see Appendix A.
{'all_headings': ['Analysis with Respect to Desired Features'], 'label': ['DATA'], 'seq': 4.0, 'Analysis with Respect to Desired Features': ['How well does the Fruit Carts domain meet the desired features described earlier? 1. Unscripted. Subjects were generally able to complete the task with only the instructions to make the screen match their paper map, and no prior examples of what to say, although one subject systematically did not issue instructions to paint the shapes. 2. Constrained. Generally speaking, subjects used the language we expected, such as square, triangle, and so forth, or high-frequency synonyms such as "box" for a square cart (from the first dialogue of the participant in Appendix A, omitted for space) or "dot" for a circle tag (Appendix A, [D3]). There were examples of participants using unexpected expressions, such as calling an avocado a "lime", despite the on-screen label. Yet overall the language was well constrained by the context. 3. Support for varying task difficulty. As the Fruit Carts corpus showed, the location, heading, and contents of carts can be systematically varied; later corpora, outside the scope of this article, have varied the number of carts placed together in order to construct simple or compound objects, in order to test the hypothesis that higher-level task and goal knowledge (e.g., a tower is being built from several blocks) modulates language production, and to support further dialogue system research. 4. Support for collection of semi-independent subdialogues. Here the Fruit Carts domain excels. Due to the presence of multiple separate objects and regions, different subdialogues can make use of different objects, regions, properties, and so forth. By contrast, a domain revolving around construction of a single complex target, such as a landscaping plan, would have licensed substantial amounts of reference to previously placed objects, including objects not in place at the time the dialogue began, making subdialogues dependent on each other in terms of accuracy, correctness, and so forth. As Appendix A illustrates, the Fruit Carts data contain relatively few such references. This is analogous to the difference between a math exercise set that contains several independent exercises and a set where each exercise builds on previous answers.']}, {'all_headings': ['Use in Research'], 'label': ['DATA'], 'Use in Research': ['For dialogue systems research, the Fruit Carts domain has already been useful in developing dialogue systems that understand language continuously while taking pragmatics into account. For example, using Fruit Carts, incorporating pragmatic feedback about the visual world early in the parsing process was shown to substantially improve parsing efficiency and to allow parsing decisions to accurately reflect the visual world (Aist et al. 2006). Also using Fruit Carts, a dialogue system using continuous understanding was shown to be faster than, and preferred to, a counterpart that used a traditional pipeline architecture but was otherwise identical (Aist et al. 2007). For psycholinguistic research, Fruit Carts has also been used for studying the relationship between bi-clausal structure and theme complexity (Gallo et al. 2008) and for testing hypotheses regarding the relationship of information in a message, resource limitations, and sentence production (Gallo, Jaeger, and Smyth 2008).'], 'seq': 5.0},
{'all_headings': ['Discussion and Conclusions'], 'label': ['CONC'], 'Discussion and Conclusions': ['Fruit Carts has a number of further advantages as well as some limitations. First, Fruit Carts provides ample temporary or local ambiguity in its utterances, a central challenge for continuous understanding systems and a classic target of research in psycholinguistics (for a review see Altmann [1998]). In a typical sequence such as "okay take a ... small triangle with a dot on the corner" (Appendix A, [D3]), most of the content words and some of the function words serve to resolve local ambiguity: "okay take..." uniquely identifies an action; "...a ... small..." restricts (partially disambiguates) the referential domain to half of the shapes; "...triangle..." further restricts it to the triangles; "...with..." further restricts it to carts with tags; "...a dot..." further restricts it to carts with circles; and "...on the corner" uniquely identifies one of the twenty carts. Likewise, "flag" in "right ... um ... side of the uh ... flag in pine tree mountain" [D5] restricts the regions to flagged regions. Second, Fruit Carts also elicits substantial variation in referential strategy. Some utterances could be grounded independent of context, up to pronominal reference. For example, the hypothetical utterance "Move a large plain square to the flag in Central Park" has a fully specified action, object, and goal, as do "rotate it about 45 degrees" (Appendix A, [D4]) and "and um make that orange" [D5]. We labeled this category all-at-once. For other utterances, grounding relied on the surrounding context (dialogue and/or task). For example, "um a little to the left" [D4] contains a direction (left) but might rely on the last action to identify the intended action as rotation or movement, and on the selected shape on the screen to identify the object. We labeled this category continuous. Some utterances exhibited both all-at-once and continuous properties, or properties of neither category. The continuous utterances contained 21% fewer words than the all-at-once utterances (mean, 8.72 vs. 6.85) and shorter words as well (mean, 3.95 vs. 3.74 letters). About one-third of the utterances were labeled as continuous; speakers produced more continuous utterances as task experience increased (Aist et al. 2005). Finally, Fruit Carts is relatively abstract: The carts are basic shapes such as squares and triangles, and the fruits were chosen for language research purposes. On the one hand, this is desirable because it reduces the possibility of confounding effects from prior knowledge. On the other hand, it would be interesting for future work to extend Fruit Carts-style domains to more realistic object construction and placement tasks.'], 'seq': 6.0}]
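The word-by-word restriction traced above for "okay take a ... small triangle with a dot on the corner" can be made concrete as a toy filter over candidate referents. The sketch below assumes a hypothetical dictionary-based cart representation and word-to-predicate mapping; it is not the parser used in any of the cited systems.

# Hypothetical sketch of progressive restriction of a reference set.
# Candidate carts; the real domain has twenty, four suffice here.
CARTS = [
    {'shape': 'triangle', 'size': 'small', 'tag': 'circle'},
    {'shape': 'triangle', 'size': 'large', 'tag': 'star'},
    {'shape': 'square',   'size': 'small', 'tag': 'heart'},
    {'shape': 'square',   'size': 'large', 'tag': None},
]

# Each content word maps to a predicate over candidate referents;
# words without an entry (e.g., 'take', 'a') leave the set unchanged.
PREDICATES = {
    'small':    lambda c: c['size'] == 'small',
    'large':    lambda c: c['size'] == 'large',
    'triangle': lambda c: c['shape'] == 'triangle',
    'square':   lambda c: c['shape'] == 'square',
    'with':     lambda c: c['tag'] is not None,  # 'with ...' implies a tag
    'dot':      lambda c: c['tag'] == 'circle',  # 'dot' names a circle tag
}

def restrict(words, candidates=CARTS):
    # Filter the candidate set as each word arrives, yielding the
    # remaining referents after every word (incremental interpretation).
    for word in words:
        pred = PREDICATES.get(word)
        if pred is not None:
            candidates = [c for c in candidates if pred(c)]
        yield word, candidates

for word, remaining in restrict('okay take a small triangle with a dot'.split()):
    print(word.ljust(10), '->', len(remaining), 'candidate(s)')

In this toy run the candidate set shrinks from four to two at "small" and to one at "triangle", after which the remaining words confirm the unique referent, mirroring the progressive restriction the domain was designed to elicit.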