The polysemy problem for word embeddings. Word embeddings, you'll recall, are those high-dimensional vectors that represent words; they're called "embeddings" for some reason I was told once but can never remember. (It's not an intuitive name.) Anyway, the classic example is that if you take the difference between the vector representations of "king" and "queen", you get (approximately) the same vector as the difference between "man" and "woman". So the idea is that word embeddings encode a representation of a word's "meaning" and of how that meaning relates to the meanings of other words.
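To make the analogy concrete, here's a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, and these particular numbers are invented purely for illustration):

```python
# Toy 3-d "embeddings" with invented values: dimension 0 ~ royalty,
# dimension 1 ~ gender, dimension 2 ~ something else entirely.
king  = [0.9,  0.8, 0.1]
queen = [0.9, -0.8, 0.1]
man   = [0.1,  0.8, 0.3]
woman = [0.1, -0.8, 0.3]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(c * c for c in w) ** 0.5
    return dot / (norm(u) * norm(v))

# "king" - "queen" and "man" - "woman" point in the same direction:
print(cosine(sub(king, queen), sub(man, woman)))  # 1.0 for these toy vectors
```

With real embeddings the cosine similarity is close to, but not exactly, 1.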
The problem is that words don't have only one meaning. This is called "polysemy". If you're not familiar with the word "polysemy" but are familiar with "synonyms", polysemy is roughly the reverse of synonymy: with synonyms, multiple words share the same meaning, while with polysemy, multiple meanings share the same word.
Consider, for example, the word "mole", which this machine learning system identified as having four meanings: one associated with words like "counterspy", "spy", and "espionage"; one with "beautymark", "birthmark", "nevus", "pigment", and "skin"; one with "mol", "unit", "gram", and "molecule"; and one with "talpidae", "nocturnal", "digging", and "mammal".
Ok, so obviously they have a machine learning system to identify this -- how does it work? Believe it or not, it comes from the mathematical field of topology. Basically, they hypothesize that meanings lie on a manifold in "meaning space", which can't be directly observed, while words lie on a manifold in "word space", which can. A polysemous word corresponds to a point where the meaning manifold gets "pinched": several distinct points in meaning space are glued together into a single point in word space. So they conjured up an algorithm to detect these "pinchings" in the word-space manifold.
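Here's a minimal sketch of the "pinching" idea on a toy point cloud rather than real embeddings: sample two circles that cross at a point (a singular, pinched space) and count the connected components of a small punctured neighborhood. At a smooth point the punctured neighborhood falls into two pieces (the arcs on either side); at the pinch point it falls into four (two arcs per circle). The radii and thresholds below are hand-tuned for this toy example:

```python
import math

def circle(cx, cy, n=400):
    """n points on a unit circle centered at (cx, cy)."""
    return [(cx + math.cos(2 * math.pi * k / n),
             cy + math.sin(2 * math.pi * k / n)) for k in range(n)]

def punctured_components(points, center, eps=0.3, puncture=0.05, link=0.05):
    """Count connected components of the points whose distance from `center`
    lies in (puncture, eps), linking points closer than `link` (union-find)."""
    nbhd = [p for p in points if puncture < math.dist(p, center) < eps]
    parent = list(range(len(nbhd)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(nbhd)):
        for j in range(i + 1, len(nbhd)):
            if math.dist(nbhd[i], nbhd[j]) < link:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(nbhd))})

# Two unit circles crossing at (0, 0): a "pinched" space.
cloud = circle(1, 0) + circle(0, 1)
print(punctured_components(cloud, (0, 0)))  # 4: the pinch point has four branches
print(punctured_components(cloud, (2, 0)))  # 2: a smooth point has two arcs
```

The extra local branches at the singular point are exactly the kind of signature the algorithm hunts for around a polysemous word.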
The algorithm relies on something called topological data analysis: you take a bunch of vectors, which are points in a high-dimensional space, and estimate the topological structure (connected components, loops, and so on) that most plausibly underlies those points. The key tool is "persistent homology": you grow balls around the points and track which topological features persist as the scale increases.
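In dimension zero, persistent homology is simple enough to implement by hand: every point is born as its own connected component at scale 0, and components die as they merge; the death scales are exactly the edge lengths of a minimum spanning tree. A minimal sketch (dimension 0 only; the paper's analysis is more general):

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Dimension-0 persistent homology: each point is born as its own
    component at scale 0.0; components die when they merge. The death
    scales are the edge lengths of a minimum spanning tree, computed
    here with Kruskal's algorithm and union-find."""
    n = len(points)
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    return [(0.0, d) for d in deaths]  # the one immortal bar is omitted

# Two well-separated pairs of points: the long-lived bar (0.0, 10.0)
# is the persistent signal that the cloud has two components.
print(sorted(h0_persistence([(0, 0), (0, 0.5), (10, 0), (10, 0.5)])))
# [(0.0, 0.5), (0.0, 0.5), (0.0, 10.0)]
```

The output, a list of (birth, death) pairs, is called a persistence diagram, which is what gets compared in the next step.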
To do this, they took a popular word embedding dataset called fastText and ran this topological data analysis on it. To detect the pinchings, they used something called Wasserstein distance, which provides a notion of distance between the persistence diagrams output by topological data analysis. They look at the topology of the neighborhood of a word, excluding the word itself, and compare it with the topology of a non-pinched neighborhood; the bigger the difference, the more polysemous meanings the word has.
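Wasserstein distance between two persistence diagrams matches each point of one diagram to a point of the other, or to the diagonal (where a feature is born and dies at the same scale, i.e. is negligible), so as to minimize the total movement. For tiny diagrams it can be brute-forced; this sketch uses an L1 ground cost, which may differ from the paper's exact choices:

```python
from itertools import permutations

def diag(p):
    """Project a diagram point onto the diagonal birth = death."""
    m = (p[0] + p[1]) / 2
    return (m, m)

def wasserstein_1(D1, D2):
    """1-Wasserstein distance between two small persistence diagrams, by
    brute force: pad each diagram with the diagonal projections of the
    other's points, then minimize the total matching cost over all
    permutations. Only feasible for tiny diagrams (factorial blow-up)."""
    A = list(D1) + [diag(q) for q in D2]
    B = list(D2) + [diag(p) for p in D1]
    def cost(p, q):
        if p[0] == p[1] and q[0] == q[1]:
            return 0.0  # diagonal-to-diagonal matches are free
        return abs(p[0] - q[0]) + abs(p[1] - q[1])  # L1 ground cost
    return min(sum(cost(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))

print(wasserstein_1([(0.0, 2.0)], [(0.0, 2.5)]))  # 0.5
print(wasserstein_1([(0.0, 3.0)], []))            # 3.0 (bar sent to the diagonal)
```

Real TDA libraries compute this with optimal-transport solvers rather than by enumerating permutations, but the quantity is the same.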
To verify the system, they used the SemEval-2010 task on Word Sense Induction & Disambiguation, a benchmark devised in 2010 to test machine learning systems' ability to handle polysemy. The task consists of 8,915 sentences from news sources like CNN and ABC containing 100 different polysemous target words, 50 nouns and 50 verbs; the goal is to cluster the instances of each target word based on context, so that all the instances with the same meaning land in the same cluster and instances with different meanings land in different clusters. The task also comes with a training set containing 65 million occurrences of 127,151 different words, and they used this corpus to produce their own embeddings with the fastText algorithm. When using the 10 nearest words to determine the topology, their system correlated tightly with the test's "gold standard", the correct answers created by humans. When using more than the 10 nearest words, their system tended to find more meanings, and the correlation went down.
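The shape of the clustering task is easy to sketch: represent each occurrence of the target word by the average embedding of its context words, then cluster those context vectors. The toy embeddings and the greedy threshold clustering below are my own illustration of the task, not the paper's topologically motivated solution:

```python
import math

# Toy 2-d embeddings with invented values, for illustration only.
emb = {
    "spy": (1.0, 0.0), "agent": (0.9, 0.1), "secret": (0.8, 0.2),
    "skin": (0.0, 1.0), "dark": (0.1, 0.9), "spot": (0.2, 0.8),
}

def context_vector(sentence, target):
    """Average the embeddings of the known context words around the target."""
    vs = [emb[w] for w in sentence.split() if w != target and w in emb]
    return tuple(sum(c) / len(vs) for c in zip(*vs))

def cosine(u, v):
    return (sum(a * b for a, b in zip(u, v))
            / (math.hypot(*u) * math.hypot(*v)))

def cluster(instances, target, threshold=0.9):
    """Greedy clustering: join the first cluster whose first member is
    similar enough to this instance's context vector, else open a new one."""
    clusters = []
    for s in instances:
        v = context_vector(s, target)
        for c in clusters:
            if cosine(v, c[0][1]) > threshold:
                c.append((s, v))
                break
        else:
            clusters.append([(s, v)])
    return [[s for s, _ in c] for c in clusters]

sents = ["the mole was a spy for a secret agency",
         "a secret agent recruited the mole",
         "a dark mole appeared on the skin",
         "a spot on the skin near the mole"]
print(cluster(sents, "mole"))  # two clusters: espionage vs. skin senses
```

Scoring then compares the induced clusters against the human-annotated sense labels.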
From the paper's abstract: "The manifold hypothesis suggests that word vectors live on a submanifold within their ambient vector space. We argue that we should, more accurately, expect them to live on a pinched manifold: a singular quotient of a manifold obtained by identifying some of its points. The identified, singular points correspond to polysemous words, i.e. words with multiple meanings. Our point of view suggests that monosemous and polysemous words can be distinguished based on the topology of their neighbourhoods. We present two kinds of empirical evidence to support this point of view: (1) We introduce a topological measure of polysemy based on persistent homology that correlates well with the actual number of meanings of a word. (2) We propose a simple, topologically motivated solution to the SemEval-2010 task on Word Sense Induction & Disambiguation that produces competitive results." (arxiv.org)