This is the first video on attention mechanisms. We'll start with self-attention and explain why it is, at its core, a re-weighting mechanism.
While making these videos, we found these sources useful to have around, not only because they aid conceptual understanding but also because some of them offer code examples.
Try to answer the following questions to test your knowledge.
- When we apply self-attention to tokens, does the order of the tokens matter?
- If we train a self-attention system on sentences of 5 words, can we still apply it to sentences with 6 words? Why?
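As a hint for both questions, here is a minimal sketch of single-head self-attention in NumPy (the projection matrices are randomly initialised rather than learned, purely for illustration). It shows that the output is a re-weighted sum of value vectors, that nothing in the computation fixes the sequence length, and that, without positional encodings, permuting the input tokens just permutes the outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension (chosen arbitrarily for this sketch)

# Projection matrices; in a real model these are learned parameters.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def self_attention(X):
    """Single-head self-attention: each output row is a re-weighted sum of values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # re-weighting of the values

# Length-agnostic: the same parameters handle 5 tokens or 6 tokens.
out5 = self_attention(rng.standard_normal((5, d)))  # shape (5, d)
out6 = self_attention(rng.standard_normal((6, d)))  # shape (6, d)

# Permutation equivariance: shuffling the tokens shuffles the outputs
# the same way, so plain self-attention is blind to token order.
X = rng.standard_normal((5, d))
perm = rng.permutation(5)
assert np.allclose(self_attention(X)[perm], self_attention(X[perm]))
```

The two assertions mirror the two questions above: the attention weights are computed from pairwise token similarities, so the sequence length never enters the parameter shapes, and order only matters once positional information is added.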