Multi-Head Attention

This is the third video on attention mechanisms. In the previous video we introduced keys, queries and values; in this video we introduce the concept of multiple attention heads.

Video

Links

While making these videos we found these sources very useful to have around, not only because they help with conceptual understanding but also because some of them offer code examples.

Exercises

Try to answer the following questions to test your knowledge.

We argued that adding multiple heads to the system might allow us to learn more patterns. Will this hold true indefinitely, or can we expect diminishing returns as we keep adding more attention heads to the system?
Does the shape of the output change when we add more attention heads?
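
If you would like to check your answer to the second question by experiment, a minimal NumPy sketch like the one below may help. It is not the code from the video. It assumes the common Transformer convention in which the model dimension is split evenly across the heads and an output projection follows the concatenation; if the heads in the video are set up differently, the result may differ. The function name `multi_head_attention`, the random weights (standing in for learned parameters) and the toy sizes are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, num_heads, rng):
    """Self-attention over x of shape (seq_len, d_model) with num_heads heads.

    Assumption: d_model is divided evenly across the heads, and the weights
    are random placeholders rather than trained parameters.
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # One projection each for queries, keys, values and the final output.
    w_q, w_k, w_v, w_o = (
        rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
    )

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (heads, seq, d_head)

    # Concatenate the heads back together and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))  # 5 tokens, 8 features each (toy sizes)
    for num_heads in (1, 2, 4, 8):
        out = multi_head_attention(x, num_heads, rng)
        print(num_heads, out.shape)
```

Running the script prints the output shape for several head counts, which you can compare against your own answer. Changing `split_heads` so that each head keeps the full `d_model` width is a useful variation to try as well.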