Failing Word Analogies

Unfortunately, there's a big flaw in the linear projection trick.



Let's use whatlies to explore these analogies.

import numpy as np
import pandas as pd
from whatlies import Embedding, EmbeddingSet
from whatlies.transformers import Pca
from whatlies.language import FasttextLanguage, SpacyLanguage, BytePairLanguage
lang_ft = FasttextLanguage("cc.en.300.bin")
lang_sp = SpacyLanguage("en_core_web_md")

Similar to king

We can start by retrieving the most similar embeddings based on cosine distance.

lang_ft.score_similar(lang_ft['king'], n=10, metric='cosine')

This gives us these results.

[(Emb[king], 0.0),
 (Emb[kings], 0.2449641227722168),
 (Emb[queen], 0.2931479215621948),
 (Emb[King], 0.3408734202384949),
 (Emb[prince], 0.35047459602355957),
 (Emb[royal], 0.41696715354919434),
 (Emb[throne], 0.42722034454345703),
 (Emb[kingdom], 0.434279203414917),
 (Emb[emperor], 0.44683873653411865),
 (Emb[lord], 0.4479447603225708)]
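Note that these scores are cosine *distances*, not similarities: 0.0 means the two vectors point in exactly the same direction, which is why king scores 0.0 against itself. A minimal sketch of the metric, using made-up 2-d vectors rather than real embeddings:

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 0 for identical directions, 1 for orthogonal vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_distance(a, a))  # 0.0, like king compared against itself above
print(cosine_distance(a, b))  # 1.0, the vectors are orthogonal
```

Lower scores therefore mean "more similar" in all the result lists in this section.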

Similar to king - man + woman

We can also expand the query by adding operations.

lang_ft.score_similar(lang_ft['king'] - lang_ft['man'] + lang_ft['woman'], n=10, metric='cosine')

This gives us these results.

[(Emb[king], 0.2713325619697571),
 (Emb[queen], 0.3457321524620056),
 (Emb[kings], 0.45897185802459717),
 (Emb[Queen], 0.49255800247192383),
 (Emb[royal], 0.49954700469970703),
 (Emb[King], 0.5179671049118042),
 (Emb[throne], 0.554189920425415),
 (Emb[princess], 0.5551300048828125),
 (Emb[prince], 0.6072607636451721),
 (Emb[palace], 0.623775064945221)]
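Under the hood this query is nothing more than vector arithmetic followed by a nearest-neighbour search over the vocabulary. Here's a toy sketch of that mechanism; the 2-d vectors are invented for illustration and are not taken from fastText:

```python
import numpy as np

# Invented toy "embeddings"; real fastText vectors are 300-dimensional.
vocab = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.3]),
}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# The analogy query: king - man + woman, ranked against every word in the vocabulary.
query = vocab["king"] - vocab["man"] + vocab["woman"]
ranked = sorted(vocab, key=lambda w: cosine_distance(query, vocab[w]))
print(ranked)  # ['queen', 'king', 'woman', 'man']
```

In this toy example the arithmetic lands exactly on queen by construction. The fastText numbers above show that real embeddings don't behave this neatly.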

Similar to king - slow + fast

Let's try another one.

lang_ft.score_similar(lang_ft['king'] - lang_ft['slow'] + lang_ft['fast'], n=10, metric='cosine')

This gives us these results.

[(Emb[king], 0.20691156387329102),
 (Emb[kings], 0.3835362195968628),
 (Emb[queen], 0.45022904872894287),
 (Emb[King], 0.45194685459136963),
 (Emb[prince], 0.48818516731262207),
 (Emb[royal], 0.5023854970932007),
 (Emb[kingdom], 0.5079109072685242),
 (Emb[throne], 0.5353788137435913),
 (Emb[emperor], 0.5441315174102783),
 (Emb[princess], 0.5490601658821106)]


Notice that in both queries king itself remains the closest embedding, and queen shows up near the top even for king - slow + fast. This suggests queen ranks highly simply because it is close to king, not because the analogy arithmetic is doing what we'd hope.


There are many more of these examples worth exploring. In general though, it's safe to say that word analogies do not hold. If you're interested in exploring more, you may appreciate this helper function.

def to_dataf(emb_list_before, emb_list_after):
    """Turns before/after Embedding score-lists into a single dataframe."""
    names_before = [_[0].name for _ in emb_list_before]
    scores_before = [_[1] for _ in emb_list_before]
    names_after = [_[0].name for _ in emb_list_after]
    scores_after = [_[1] for _ in emb_list_after]
    res = pd.DataFrame({'before_word': names_before,
                        'before_score': scores_before,
                        'after_word': names_after,
                        'after_score': scores_after})
    return (res
            .assign(before_score=lambda d: np.round(d['before_score'], 4))
            .assign(after_score=lambda d: np.round(d['after_score'], 4)))

def retrieve_most_similar(lang, start, positive=(), negative=(), orthogonal=(), unto=(), n=10, metric='cosine'):
    """Utility function to quickly perform arithmetic and get an overview."""
    start_emb = lang[start]
    base_dist = lang.score_similar(start_emb, n=n, metric=metric)
    for pos in positive:
        start_emb = start_emb + lang[pos]
    for neg in negative:
        start_emb = start_emb - lang[neg]
    for ort in orthogonal:
        # remove the component along `ort` (whatlies uses `|` for this)
        start_emb = start_emb | lang[ort]
    for un in unto:
        # project unto `un` (whatlies uses `>>` for this)
        start_emb = start_emb >> lang[un]
    proj_dist = lang.score_similar(start_emb, n=n, metric=metric)
    return to_dataf(base_dist, proj_dist)
retrieve_most_similar(lang_sp, start="king", positive=["woman"], negative=["man"])


Try to answer the following questions to test your knowledge.

  • What other analogies can you come up with besides king - man + woman? Can you verify whether these hold?
  • If you'd like to do a small coding exercise: can you confirm that analogies don't hold for BERT-style models either?

2016-2022 © Rasa.