This page collects together the output from the ongoing study of Corpus Derived Semantic Representations by John Bullinaria and Joe Levy.
We have carried out systematic computational studies of the extraction and optimization of semantic representations derived from the word co-occurrence statistics of large text corpora, tested their performance on various applications, and begun to explore how such representations might be related to patterns of brain activity.
Currently this page provides links to our published research papers in this area, and some of the key word sets and semantic vectors discussed in those papers.
Levy, J.P., Bullinaria, J.A. & McCormick, S. (2017). Semantic Vector Evaluation and Human Performance on a New Vocabulary MCQ Test. In: Proceedings of the Thirty-ninth Annual Conference of the Cognitive Science Society, 2549-2554. Austin, TX: Cognitive Science Society. (pdf)
Bullinaria, J.A. & Levy, J.P. (2013). Limiting Factors for Mapping Corpus Based Semantic Representations to Brain Activity. PLoS ONE, 8(3), e57191. (pdf)
Bullinaria, J.A. & Levy, J.P. (2012). Extracting Semantic Representations from Word Co-occurrence Statistics: Stop-lists, Stemming and SVD. Behavior Research Methods, 44, 890-907. (pdf)
Levy, J.P. & Bullinaria, J.A. (2012). Using Enriched Semantic Representations in Predictions of Human Brain Activity. In: E.J. Davelaar (Ed.), Connectionist Models of Neurocognition and Emergent Behavior: From Theory to Applications, 292-308. Singapore: World Scientific. (pdf)
Bullinaria, J.A. (2008). Semantic Categorization Using Simple Word Co-occurrence Statistics. In: M. Baroni, S. Evert & A. Lenci (Eds.), Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 1-8. Hamburg, Germany: ESSLLI. (pdf)
Bullinaria, J.A. & Levy, J.P. (2007). Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study. Behavior Research Methods, 39, 510-526. (pdf)
Levy, J.P. & Bullinaria, J.A. (2001). Learning Lexical Properties from Word Usage Patterns: Which Context Words Should be Used? In: R.M. French & J.P. Sougne (Eds.), Connectionist Models of Learning, Development and Evolution: Proceedings of the Sixth Neural Computation and Psychology Workshop, 273-282. London: Springer. (pdf)
Levy, J.P., Bullinaria, J.A. & Patel, M. (1998). Explorations in the Derivation of Semantic Representations from Word Co-occurrence Statistics. South Pacific Journal of Psychology, 10, 99-111. (pdf)
Patel, M., Bullinaria, J.A. & Levy, J.P. (1997). Extracting Semantic Representations from Large Text Corpora. In: J.A. Bullinaria, D.W. Glasspool & G. Houghton (Eds.), Fourth Neural Computation and Psychology Workshop: Connectionist Representations, 199-212. London: Springer. (pdf)
Bullinaria, J.A. & Huckle, C.C. (1997). Modelling Lexical Decision Using Corpus Derived Semantic Representations in a Connectionist Network. In: J.A. Bullinaria, D.W. Glasspool & G. Houghton (Eds.), Fourth Neural Computation and Psychology Workshop: Connectionist Representations, 213-226. London: Springer. (pdf)
To facilitate further research in this area, the key word sets and semantic vectors discussed in our papers are made available here. There are five tasks/word-sets, each with a word list in a plain text file and three corresponding sets of vectors in MATLAB formatted binary files (MAT-files). Each set of vectors is computed as described in Bullinaria & Levy (2012), using an L+R word co-occurrence window of size 1.
Task | Word set | Vectors |
---|---|---|
TOEFL | 400 words | PPMI - PC - Caron |
Distance | 400 words | PPMI - PC - Caron |
Semantic Categ. | 530 words | PPMI - PC - Caron |
Purity | 60 words | PPMI - PC - Caron |
PLoS ONE New | 60 words | PPMI - PC - Caron |
New MCQ | 1000 words | - - |
PPMI = Positive Pointwise Mutual Information, 10000 context word frequency ordered components
PC = Principal Components (US from SVD), 10000 singular value ordered components (50k starting matrix)
Caron = Caron approach vectors (US^0.25 from SVD), 10000 singular value ordered components (50k starting matrix)
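To illustrate how the three vector types relate to each other, here is a minimal NumPy sketch. It is not the code used to produce the files above. It assumes a small dense count matrix (rows are target words, columns are context words), natural logarithms, and that "US^0.25" means U multiplied by the singular values raised to the power 0.25. The function names `ppmi` and `svd_vectors` and the toy counts are invented for the example; the actual procedure is described in Bullinaria & Levy (2012).

```python
import numpy as np

def ppmi(counts):
    """Positive PMI for a (target words x context words) count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_tc = counts / counts.sum()               # joint probabilities p(t, c)
    p_t = p_tc.sum(axis=1, keepdims=True)      # target word probabilities
    p_c = p_tc.sum(axis=0, keepdims=True)      # context word probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_tc / (p_t * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                # negative PMI values set to zero

def svd_vectors(matrix, power=1.0):
    """Rows of U * S**power, components ordered by decreasing singular value.

    power=1.0 gives the principal component (US) vectors; power=0.25 is the
    assumed reading of the Caron vectors (US^0.25).
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u * s**power

# Toy counts for 4 target words and 3 context words (made-up numbers).
counts = np.array([[10, 0, 2],
                   [8, 1, 0],
                   [0, 6, 7],
                   [1, 5, 9]])
m = ppmi(counts)
pc = svd_vectors(m, power=1.0)
caron = svd_vectors(m, power=0.25)
```

Because the components come out ordered by singular value, keeping the first k of them is a column slice such as `pc[:, :k]`.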
The TOEFL task was first used by Tom Landauer & Susan Dumais (1997), A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge, Psychological Review, 104, 211-240. It consists of 80 multiple-choice synonym questions. More details are provided by Bullinaria & Levy (2007, 2012), and other algorithms tested on it are listed on the ACL Wiki State-of-the-art page.
The Distance task was first used by Bullinaria & Levy (2007). This word set consists of 200 pairs of semantically related words. More details are provided by Bullinaria & Levy (2007, 2012).
The Semantic Categ. task was first used by Patel, Bullinaria & Levy (1997). This word set consists of 53 semantic categories of 10 words each. More details are provided by Bullinaria & Levy (2007, 2012).
The Purity task uses the word set of Tom Mitchell et al. (2008), Predicting human brain activity associated with the meanings of nouns, Science, 320, 1191-1195. It was used by Levy & Bullinaria (2012) in the brain activity prediction task, and by Bullinaria & Levy (2012) to test the clustering purity of 12 semantic categories of 5 words each.
The PLoS ONE New word set is a modified version of the Purity set in which 15 problematic words have been replaced so that clustering achieves perfect purity. More details are provided by Bullinaria & Levy (2013).
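For readers unfamiliar with the measure, the following sketch computes clustering purity using the common textbook definition: each cluster is credited with the size of its largest category, and the total is divided by the number of items. This is an assumption about the definition, and the exact clustering method and scoring used in the papers should be taken from Bullinaria & Levy (2012). The function name `purity` and the toy labels are invented for the example.

```python
from collections import Counter

def purity(cluster_ids, category_labels):
    """Fraction of items that fall in the majority category of their cluster."""
    if len(cluster_ids) != len(category_labels):
        raise ValueError("one cluster id is needed per labelled item")
    # Collect the true category labels of the items in each cluster.
    members = {}
    for cluster, label in zip(cluster_ids, category_labels):
        members.setdefault(cluster, []).append(label)
    # Credit each cluster with the count of its most frequent category.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(category_labels)

# Toy example: six items, two clusters, one item in the wrong cluster.
labels = ["animal", "animal", "tool", "tool", "tool", "tool"]
imperfect = purity([0, 0, 0, 1, 1, 1], labels)
perfect = purity([0, 0, 1, 1, 1, 1], labels)
```

Under this definition a score of 1.0 ("perfect purity") means that no cluster mixes words from different categories.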
The New MCQ word set consists of 200 multiple-choice synonym questions as described by Levy, Bullinaria & McCormick (2017). Each question corresponds to a set of five words: the test word, followed by the correct synonym, followed by three incorrect choices. The first 100 test words (i.e., abdomen to zucchini) are low frequency, the second 100 (i.e., ability to wedding) are high frequency. Corpus vectors and human performance scores will be posted here later.
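The question layout described above can be scored against a set of vectors along the following lines. This is a rough sketch under stated assumptions: the word list has already been read into a flat Python list in file order, similarity is measured by cosine, and a question counts as correct only if the synonym is strictly closer to the test word than every incorrect choice. The function names and the toy words and vectors are invented for the example and are not taken from the word set.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def questions_from_word_list(words, size=5):
    """Group a flat word list into consecutive questions of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def mcq_accuracy(questions, vectors):
    """Proportion of questions where the synonym beats every distractor."""
    correct = 0
    for test_word, synonym, *distractors in questions:
        target = cosine(vectors[test_word], vectors[synonym])
        if all(target > cosine(vectors[test_word], vectors[d])
               for d in distractors):
            correct += 1
    return correct / len(questions)

# Toy data: two questions with made-up two-dimensional vectors.
words = ["big", "large", "cat", "sky", "run",
         "fast", "quick", "slow", "tree", "milk"]
vectors = {
    "big": [1.0, 0.1], "large": [0.9, 0.2], "cat": [0.1, 1.0],
    "sky": [0.0, 1.0], "run": [-1.0, 0.2],
    "fast": [0.2, 1.0], "quick": [1.0, 0.0], "slow": [0.3, 0.9],
    "tree": [-1.0, 0.0], "milk": [0.0, -1.0],
}
questions = questions_from_word_list(words)
accuracy = mcq_accuracy(questions, vectors)
```

In the toy data the second question is deliberately answered wrongly ("slow" lies closer to "fast" than "quick" does), so the score is one correct answer out of two.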
All the vectors were generated using the two billion word web-crawled ukWaC corpus that is available from WaCky.