Bullinaria & Levy

Corpus Derived Semantic Representations

This page collects the output of the ongoing study of Corpus Derived Semantic Representations by John Bullinaria and Joe Levy.

We have carried out systematic computational studies of the extraction and optimization of semantic representations derived from the word co-occurrence statistics of large text corpora, tested their performance on various applications, and have begun to explore how such representations might be related to patterns of brain activity.

Currently this page provides links to our research papers in this area, and some of the key word sets discussed in those papers.

Publications

Bullinaria, J.A. & Levy, J.P. (2026). Extracting Semantic Representations from Word Co-occurrence Statistics: Polysemes and Homonyms. Submitted. (PsyArXiv)

Levy, J.P. & Bullinaria, J.A. (2026). Designing Vocabulary Multiple-Choice Tests For Exploring Word Frequency Effects In Distributional Semantics. Submitted. (PsyArXiv)

Levy, J.P., Bullinaria, J.A. & McCormick, S. (2017). Semantic Vector Evaluation and Human Performance on a New Vocabulary MCQ Test. In: Proceedings of the Thirty-ninth Annual Conference of the Cognitive Science Society, 2549-2554. Austin, TX: Cognitive Science Society. (pdf)

Bullinaria, J.A. & Levy, J.P. (2013). Limiting Factors for Mapping Corpus Based Semantic Representations to Brain Activity. PLoS ONE, 8(3), e57191. (pdf)

Bullinaria, J.A. & Levy, J.P. (2012). Extracting Semantic Representations from Word Co-occurrence Statistics: Stop-lists, Stemming and SVD. Behavior Research Methods, 44, 890-907. (pdf)

Levy, J.P. & Bullinaria, J.A. (2012). Using Enriched Semantic Representations in Predictions of Human Brain Activity. In: E.J. Davelaar (Ed.), Connectionist Models of Neurocognition and Emergent Behavior: From Theory to Applications, 292-308. Singapore: World Scientific. (pdf)

Bullinaria, J.A. (2008). Semantic Categorization Using Simple Word Co-occurrence Statistics. In: M. Baroni, S. Evert & A. Lenci (Eds.), Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 1-8. Hamburg, Germany: ESSLLI. (pdf)

Bullinaria, J.A. & Levy, J.P. (2007). Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study. Behavior Research Methods, 39, 510-526. (pdf)

Levy, J.P. & Bullinaria, J.A. (2001). Learning Lexical Properties from Word Usage Patterns: Which Context Words Should be Used? In: R.M. French & J.P. Sougne (Eds.), Connectionist Models of Learning, Development and Evolution: Proceedings of the Sixth Neural Computation and Psychology Workshop, 273-282. London: Springer. (pdf)

Levy, J.P., Bullinaria, J.A. & Patel, M. (1998). Explorations in the Derivation of Semantic Representations from Word Co-occurrence Statistics. South Pacific Journal of Psychology, 10, 99-111. (pdf)

Patel, M., Bullinaria, J.A. & Levy, J.P. (1997). Extracting Semantic Representations from Large Text Corpora. In: J.A. Bullinaria, D.W. Glasspool & G. Houghton (Eds.), Fourth Neural Computation and Psychology Workshop: Connectionist Representations, 199-212. London: Springer. (pdf)

Bullinaria, J.A. & Huckle, C.C. (1997). Modelling Lexical Decision Using Corpus Derived Semantic Representations in a Connectionist Network. In: J.A. Bullinaria, D.W. Glasspool & G. Houghton (Eds.), Fourth Neural Computation and Psychology Workshop: Connectionist Representations, 213-226. London: Springer. (pdf)

Word Sets

To facilitate further research in this area, the key word sets discussed in our papers are made available here.

Word Sets/Tasks	Word Lists
TOEFL	400 words
Distance Comparison	400 words
Semantic Categorization	530 words
Purity	60 words
New Purity	60 words
New VMCQ	1000 words
Triples and Distractors	450 words and 580 words
Homonyms	200 words

The TOEFL task was first used by Tom Landauer & Susan Dumais (1997), A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge, Psychological Review, 104, 211-240. It consists of 80 multiple-choice synonym questions. More details are provided in Bullinaria & Levy (2007, 2012), and other algorithms tested on it are listed on the ACL Wiki State-of-the-art page.

The Distance Comparison task was first used by Bullinaria & Levy (2007). This word set consists of 200 pairs of semantically related words. Performance is measured by testing how many word vectors are closer to their synonym's vector than to the vectors of unrelated words. More details are provided in Bullinaria & Levy (2007, 2012).
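The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: cosine distance is used as the distance measure, the unrelated control words are supplied by the caller, and the function names and toy vectors are hypothetical rather than taken from the papers.

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_comparison_score(vectors, pairs, controls):
    """Fraction of comparisons in which a target word is closer to its
    related word than to an unrelated control word.

    vectors:  dict mapping word -> list of floats (assumed representation)
    pairs:    list of (target, related) word pairs
    controls: dict mapping target -> list of unrelated control words
    """
    wins = 0
    total = 0
    for target, related in pairs:
        d_related = cosine_distance(vectors[target], vectors[related])
        for control in controls[target]:
            total += 1
            if d_related < cosine_distance(vectors[target], vectors[control]):
                wins += 1
    return wins / total

# Toy vectors for illustration only; these are not corpus-derived values.
toy_vectors = {
    "car": [1.0, 0.1, 0.0],
    "automobile": [0.9, 0.2, 0.0],
    "banana": [0.0, 1.0, 0.2],
    "river": [0.1, 0.0, 1.0],
}
toy_pairs = [("car", "automobile")]
toy_controls = {"car": ["banana", "river"]}
score = distance_comparison_score(toy_vectors, toy_pairs, toy_controls)
```

How the control words are chosen, and how many are used per pair, is left open here; the papers cited above give the procedure actually used.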

The Semantic Categorization task was first used by Patel, Bullinaria & Levy (1997). This word set consists of 53 semantic categories of 10 words each. Performance is measured by testing how many word vectors are closer to their own category center than to any of the others. More details are provided in Bullinaria & Levy (2007, 2012).
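A minimal sketch of this nearest-center test follows. It assumes that a category center is the plain mean of the category's word vectors (including the word being tested) and that cosine distance is the measure; both are assumptions made for illustration, and the function names and toy data are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vecs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vecs)
    return [sum(column) / n for column in zip(*vecs)]

def categorization_accuracy(vectors, categories):
    """Fraction of words whose nearest category center is their own.

    vectors:    dict mapping word -> list of floats
    categories: dict mapping category name -> list of member words
    """
    centers = {
        name: centroid([vectors[w] for w in words])
        for name, words in categories.items()
    }
    correct = 0
    total = 0
    for name, words in categories.items():
        for word in words:
            nearest = min(
                centers,
                key=lambda c: cosine_distance(vectors[word], centers[c]),
            )
            total += 1
            if nearest == name:
                correct += 1
    return correct / total

# Toy data for illustration only; not taken from the word set.
toy_vectors = {
    "apple": [1.0, 0.1],
    "pear": [0.9, 0.2],
    "car": [0.1, 1.0],
    "bus": [0.2, 0.9],
}
toy_categories = {"fruit": ["apple", "pear"], "vehicle": ["car", "bus"]}
accuracy = categorization_accuracy(toy_vectors, toy_categories)
```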

The Purity task uses the word set of Tom Mitchell et al. (2008), Predicting human brain activity associated with the meanings of nouns, Science, 320, 1191-1195. It was used by Levy & Bullinaria (2012) in the brain activity prediction task, and by Bullinaria & Levy (2012) to test the clustering purity of 12 semantic categories of 5 words each.
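For readers unfamiliar with clustering purity, the sketch below computes the standard purity measure: each cluster is credited with the size of its largest single category, and the total is divided by the number of words. It assumes cluster assignments have already been produced by some clustering algorithm, which is not shown; the function name and toy data are hypothetical, and the exact evaluation used in the papers may differ in detail.

```python
from collections import Counter

def clustering_purity(cluster_of, category_of):
    """Standard purity: sum over clusters of the majority-category count,
    divided by the total number of clustered words.

    cluster_of:  dict mapping word -> cluster id
    category_of: dict mapping word -> true category label
    """
    labels_by_cluster = {}
    for word, cluster in cluster_of.items():
        labels_by_cluster.setdefault(cluster, []).append(category_of[word])
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority_total / len(cluster_of)

# Toy example: one "tool" word has been placed in the "animal" cluster.
toy_categories = {
    "dog": "animal", "cat": "animal", "horse": "animal",
    "hammer": "tool", "saw": "tool", "drill": "tool",
}
toy_clusters = {
    "dog": 0, "cat": 0, "horse": 0, "hammer": 0,
    "saw": 1, "drill": 1,
}
purity = clustering_purity(toy_clusters, toy_categories)
```

In the toy example, cluster 0 contributes 3 and cluster 1 contributes 2, so the purity is 5/6. A purity of 1 means every cluster contains words from a single category only.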

The New Purity word set is a modified version of the Purity set in which 15 problematic words have been replaced so that clustering achieves perfect purity. More details are provided in Bullinaria & Levy (2013).

The New VMCQ word set consists of 200 Vocabulary Multiple-Choice Questions (VMCQs) as described by Levy, Bullinaria & McCormick (2017). It also forms the basis of the study of Levy & Bullinaria (2026). Each question consists of a set of five words: the test word, followed by the correct synonym, followed by three distractors (incorrect choices). The first 100 test words (i.e., abdomen to zucchini) are low frequency, the second 100 (i.e., ability to wedding) are high frequency.
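The question layout described above can be unpacked as follows. This sketch assumes the word list has been read into a flat Python list in question order, five words per question; the file format itself is not specified here, and the function names and toy words are hypothetical and not taken from the word set.

```python
def group_questions(words, size=5):
    """Split a flat word list into questions of `size` words each:
    test word, correct synonym, then the distractors."""
    if len(words) % size != 0:
        raise ValueError("word count is not a multiple of the question size")
    questions = []
    for start in range(0, len(words), size):
        test_word, correct, *distractors = words[start:start + size]
        questions.append(
            {"test": test_word, "correct": correct, "distractors": distractors}
        )
    return questions

def split_by_frequency(questions, n_low):
    """Return (low_frequency, high_frequency) questions, assuming the first
    `n_low` questions are the low frequency ones."""
    return questions[:n_low], questions[n_low:]

# Toy list of two questions for illustration only.
toy_words = [
    "sofa", "couch", "river", "pencil", "cloud",
    "big", "large", "green", "seven", "window",
]
questions = group_questions(toy_words)
low_freq, high_freq = split_by_frequency(questions, n_low=1)
```

For the full set of 1000 words, `n_low=100` would correspond to the low and high frequency halves described above, provided the list order matches that description.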

The Triples and Distractors word sets were used for creating and testing artificial polysemes and homonyms as described by Bullinaria & Levy (2026). They contain a set of 150 synonym triples, giving 450 words in total, and 580 non-overlapping distractor words for use in VMCQ tests.

The Homonyms word set contains 40 real homonyms, each followed by two synonyms for each of two senses, giving 200 words in total. This set was created by Bullinaria & Levy (2026) to allow comparisons with their artificial homonym results.

This page is maintained by John Bullinaria. Last updated on 14 May 2026.