Failing Word Analogies
Unfortunately, there's a big flaw in the linear projection trick.
Code
Let's use whatlies to explore these analogies.
import numpy as np
import pandas as pd

from whatlies import Embedding, EmbeddingSet
from whatlies.transformers import Pca
from whatlies.language import FasttextLanguage, SpacyLanguage, BytePairLanguage

lang_ft = FasttextLanguage("cc.en.300.bin")
lang_sp = SpacyLanguage("en_core_web_md")
Similar to king
We can start by retrieving the most similar embeddings based on cosine distance.
lang_ft.score_similar(lang_ft['king'], n=10, metric='cosine')
This gives us these results.
[(Emb[king], 0.0),
(Emb[kings], 0.2449641227722168),
(Emb[queen], 0.2931479215621948),
(Emb[King], 0.3408734202384949),
(Emb[prince], 0.35047459602355957),
(Emb[royal], 0.41696715354919434),
(Emb[throne], 0.42722034454345703),
(Emb[kingdom], 0.434279203414917),
(Emb[emperor], 0.44683873653411865),
(Emb[lord], 0.4479447603225708)]
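The score reported here is a cosine distance (1 minus the cosine similarity), so 0.0 means identical. If you want to double-check one of the numbers, you can recompute it from the raw vectors; a minimal sketch, assuming whatlies Embedding objects expose their vector through the .vector attribute.

def cosine_distance(u, v):
    # cosine distance = 1 - cosine similarity
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king_vec = lang_ft['king'].vector
queen_vec = lang_ft['queen'].vector

# should land close to the 0.2931 reported for queen in the list above
print(cosine_distance(king_vec, queen_vec))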
Similar to king - man + woman
We can also expand the query by adding operations.
lang_ft.score_similar(lang_ft['king'] - lang_ft['man'] + lang_ft['woman'], n=10, metric='cosine')
This gives us these results.
[(Emb[king], 0.2713325619697571),
(Emb[queen], 0.3457321524620056),
(Emb[kings], 0.45897185802459717),
(Emb[Queen], 0.49255800247192383),
(Emb[royal], 0.49954700469970703),
(Emb[King], 0.5179671049118042),
(Emb[throne], 0.554189920425415),
(Emb[princess], 0.5551300048828125),
(Emb[prince], 0.6072607636451721),
(Emb[palace], 0.623775064945221)]
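The + and - used here are overloaded on whatlies Embedding objects and come down to element-wise vector arithmetic. As a sketch (assuming the .vector attribute and the Embedding(name, vector) constructor), the same query can be built by hand.

# build the "king - man + woman" query directly from the raw vectors
vec = lang_ft['king'].vector - lang_ft['man'].vector + lang_ft['woman'].vector
manual_query = Embedding("king - man + woman", vec)

# should produce the same ranking as the overloaded operators above
lang_ft.score_similar(manual_query, n=10, metric='cosine')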
Similar to king - slow + fast
Let's try another one.
lang_ft.score_similar(lang_ft['king'] - lang_ft['slow'] + lang_ft['fast'], n=10, metric='cosine')
This gives us these results.
[(Emb[king], 0.20691156387329102),
(Emb[kings], 0.3835362195968628),
(Emb[queen], 0.45022904872894287),
(Emb[King], 0.45194685459136963),
(Emb[prince], 0.48818516731262207),
(Emb[royal], 0.5023854970932007),
(Emb[kingdom], 0.5079109072685242),
(Emb[throne], 0.5353788137435913),
(Emb[emperor], 0.5441315174102783),
(Emb[princess], 0.5490601658821106)]
Strange
It seems like the analogy barely works: king - man + woman ends up further away from queen (0.3457) than plain king does (0.2931), and even the nonsensical king - slow + fast still ranks queen among its closest neighbours.
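You can make this comparison explicit by computing the cosine distances to queen directly instead of eyeballing the ranked lists; a small sketch, again assuming the .vector attribute.

from scipy.spatial.distance import cosine

queen = lang_ft['queen'].vector
analogy = (lang_ft['king'] - lang_ft['man'] + lang_ft['woman']).vector
nonsense = (lang_ft['king'] - lang_ft['slow'] + lang_ft['fast']).vector

print(cosine(lang_ft['king'].vector, queen))  # roughly 0.29, plain king is closest
print(cosine(analogy, queen))                 # roughly 0.35
print(cosine(nonsense, queen))                # roughly 0.45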
Explore
There are many more of these examples worth exploring. In general though, it's safe to say that word analogies do not hold. If you're interested in exploring further, you may appreciate the helper functions below.
def to_dataf(emb_list_before, emb_list_after):
    """Turns before/after Embedding score-lists into a single dataframe."""
    names_before = [_[0].name for _ in emb_list_before]
    scores_before = [_[1] for _ in emb_list_before]
    names_after = [_[0].name for _ in emb_list_after]
    scores_after = [_[1] for _ in emb_list_after]
    res = pd.DataFrame({'before_word': names_before,
                        'before_score': scores_before,
                        'after_word': names_after,
                        'after_score': scores_after})
    return (res
            .assign(before_score=lambda d: np.round(d['before_score'], 4))
            .assign(after_score=lambda d: np.round(d['after_score'], 4)))
def retrieve_most_similar(lang, start, positive=(), negative=(), orthogonal=(), unto=(), n=10, metric='cosine'):
    """Utility function to quickly perform arithmetic and get an overview."""
    start_emb = lang[start]
    base_dist = lang.score_similar(start_emb, n=n, metric=metric)
    for pos in positive:
        start_emb = start_emb + lang[pos]
    for neg in negative:
        start_emb = start_emb - lang[neg]
    for ort in orthogonal:
        # remove the component along this embedding ("away from")
        start_emb = start_emb | lang[ort]
    for un in unto:
        # project onto this embedding ("map unto")
        start_emb = start_emb >> lang[un]
    proj_dist = lang.score_similar(start_emb, n=n, metric=metric)
    return to_dataf(base_dist, proj_dist)
retrieve_most_similar(lang_sp, start="king", positive=["woman"], negative=["man"])
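The helper also accepts orthogonal and unto arguments, which here use whatlies' | ("away from") and >> ("map unto") operators rather than plain subtraction. A couple of hypothetical follow-up queries, for illustration:

# project the "man" direction out of "king" instead of subtracting the vector
retrieve_most_similar(lang_ft, start="king", orthogonal=["man"])

# or map "king" onto the "royal" direction and inspect what lands nearby
retrieve_most_similar(lang_ft, start="king", unto=["royal"])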
Exercises
Try to answer the following questions to test your knowledge.
- What other analogies can you come up with besides king - man + woman? Can you verify whether these hold?
- If you'd like to do a small coding exercise: can you confirm that analogies don't hold for BERT-like models?