Failing Word Analogies

Unfortunately, there's a big flaw in the linear projection trick.

Code

Let's use whatlies to explore these analogies.

import numpy as np
import pandas as pd
from whatlies import Embedding, EmbeddingSet
from whatlies.transformers import Pca
from whatlies.language import FasttextLanguage, SpacyLanguage, BytePairLanguage

# Load two pre-trained backends: fastText vectors trained on Common Crawl and spaCy's medium English model.
lang_ft = FasttextLanguage("cc.en.300.bin")
lang_sp = SpacyLanguage("en_core_web_md")
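
Both backends assume the model files are already available locally: the fastText cc.en.300.bin vectors and spaCy's en_core_web_md model. Every lookup returns a whatlies Embedding object that carries a name and the raw vector, so a quick sanity check looks like this (a minimal sketch; the 300 dimensions are specific to these two models).

emb_king = lang_ft['king']
emb_king.name          # 'king'
emb_king.vector.shape  # (300,) since cc.en.300 uses 300-dimensional vectors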

Similar to king

We can start by retrieving the most similar embeddings based on cosine distance.

lang_ft.score_similar(lang_ft['king'], n=10, metric='cosine')

This gives us these results.

[(Emb[king], 0.0),
 (Emb[kings], 0.2449641227722168),
 (Emb[queen], 0.2931479215621948),
 (Emb[King], 0.3408734202384949),
 (Emb[prince], 0.35047459602355957),
 (Emb[royal], 0.41696715354919434),
 (Emb[throne], 0.42722034454345703),
 (Emb[kingdom], 0.434279203414917),
 (Emb[emperor], 0.44683873653411865),
 (Emb[lord], 0.4479447603225708)]
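
The scores here are cosine distances, so 0.0 means identical and lower means more similar. If you want to double-check one of these numbers by hand, a small sketch with scipy (assuming it is installed) should reproduce the queen score.

from scipy.spatial.distance import cosine

# cosine() returns the cosine distance, i.e. 1 - cosine similarity
cosine(lang_ft['king'].vector, lang_ft['queen'].vector)  # roughly 0.2931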

Similar to king - man + woman

We can also expand the query by adding operations.

lang_ft.score_similar(lang_ft['king'] - lang_ft['man'] + lang_ft['woman'], n=10, metric='cosine')

This gives us these results.

[(Emb[king], 0.2713325619697571),
 (Emb[queen], 0.3457321524620056),
 (Emb[kings], 0.45897185802459717),
 (Emb[Queen], 0.49255800247192383),
 (Emb[royal], 0.49954700469970703),
 (Emb[King], 0.5179671049118042),
 (Emb[throne], 0.554189920425415),
 (Emb[princess], 0.5551300048828125),
 (Emb[prince], 0.6072607636451721),
 (Emb[palace], 0.623775064945221)]
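
Notice that the analogy vector is still closer to king (0.2713) than to queen (0.3457). You can confirm those two distances directly on the raw vectors; a small sketch:

from scipy.spatial.distance import cosine

analogy = lang_ft['king'] - lang_ft['man'] + lang_ft['woman']
cosine(analogy.vector, lang_ft['king'].vector)   # roughly 0.27, still the closest
cosine(analogy.vector, lang_ft['queen'].vector)  # roughly 0.35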

Similar to king - slow + fast

Let's try another one.

lang_ft.score_similar(lang_ft['king'] - lang_ft['slow'] + lang_ft['fast'], n=10, metric='cosine')

This gives us these results.

[(Emb[king], 0.20691156387329102),
 (Emb[kings], 0.3835362195968628),
 (Emb[queen], 0.45022904872894287),
 (Emb[King], 0.45194685459136963),
 (Emb[prince], 0.48818516731262207),
 (Emb[royal], 0.5023854970932007),
 (Emb[kingdom], 0.5079109072685242),
 (Emb[throne], 0.5353788137435913),
 (Emb[emperor], 0.5441315174102783),
 (Emb[princess], 0.5490601658821106)]

Strange

Strangely enough, king - man + woman ends up further away from queen (0.3457) than plain king already was (0.2931). Even the nonsensical king - slow + fast keeps queen near the top of its list, and in both modified queries king itself remains the closest match. If the analogy really held, queen should have come out on top.
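
To make that comparison explicit, here is a small sketch that scores how far queen is from each of the three query vectors used above; the numbers should line up with the listings.

from scipy.spatial.distance import cosine

queries = {
    'king': lang_ft['king'],
    'king - man + woman': lang_ft['king'] - lang_ft['man'] + lang_ft['woman'],
    'king - slow + fast': lang_ft['king'] - lang_ft['slow'] + lang_ft['fast'],
}
for name, emb in queries.items():
    print(name, round(cosine(emb.vector, lang_ft['queen'].vector), 4))
# expected: roughly 0.2931, 0.3457 and 0.4502 respectively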

Explore

There are many more of these examples worth exploring. In general, though, it's safe to say that word analogies do not hold. If you're interested in exploring more, you may appreciate this helper function.

def to_dataf(emb_list_before, emb_list_after):
    """Turns before/after Embedding score-lists into a single dataframe."""
    names_before = [emb.name for emb, _ in emb_list_before]
    scores_before = [score for _, score in emb_list_before]
    names_after = [emb.name for emb, _ in emb_list_after]
    scores_after = [score for _, score in emb_list_after]
    res = pd.DataFrame({'before_word': names_before,
                        'before_score': scores_before,
                        'after_word': names_after,
                        'after_score': scores_after})
    return (res
            .assign(before_score=lambda d: np.round(d['before_score'], 4))
            .assign(after_score=lambda d: np.round(d['after_score'], 4)))


def retrieve_most_similar(lang, start, positive=(), negative=(), orthogonal=(), unto=(), n=10, metric='cosine'):
    """Utility function to quickly perform embedding arithmetic and get an overview."""
    start_emb = lang[start]
    # Nearest neighbors of the starting word, before any arithmetic.
    base_dist = lang.score_similar(start_emb, n=n, metric=metric)
    for pos in positive:
        start_emb = start_emb + lang[pos]
    for neg in negative:
        start_emb = start_emb - lang[neg]
    for ort in orthogonal:
        # Remove the component that points along this embedding.
        start_emb = start_emb | lang[ort]
    for un in unto:
        # Project onto this embedding.
        start_emb = start_emb >> lang[un]
    # Nearest neighbors after the arithmetic, side by side with the baseline.
    proj_dist = lang.score_similar(start_emb, n=n, metric=metric)
    return to_dataf(base_dist, proj_dist)


retrieve_most_similar(lang_sp, start="king", positive=["woman"], negative=["man"])
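
As a usage sketch (the word choices below are just illustrative, so do inspect the output yourself), you can test other analogies with either backend, and the orthogonal and unto arguments let you experiment with projections instead of plain subtraction.

# another classic analogy, this time with the fastText backend
retrieve_most_similar(lang_ft, start="paris", positive=["germany"], negative=["france"])

# remove the 'man' direction from 'king' instead of subtracting the full vector
retrieve_most_similar(lang_sp, start="king", orthogonal=["man"])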

Exercises

Try to answer the following questions to test your knowledge.

  • What other analogies can you come up with besides king - man + woman? Can you verify whether they hold?
  • If you'd like to do a small coding exercise: can you confirm that analogies don't hold for BERT-style models either?
