Training Data & Rules

When you're building an assistant with Rasa, you'll need training data: the text used to train the models and features you're using. This includes user-generated text as well as conversational patterns. It could come from customer support logs (assuming data collection and re-use is covered in your privacy policy) or from user conversations with your assistant.

Although pre-existing logs offer a good place to start, data from actual users interacting with your assistant is always the best data to work with.

Let's now discuss different parts of the data that you'll provide.

Stories

Stories represent training data to teach your assistant what it should do next. If you already have conversational data it's good to start with the patterns you've found there. It's also possible to create your own patterns, but we recommend using interactive learning (via the rasa interactive command) to get started. You can start to create common flows and "happy flows" and then try to add some stories that contain common digressions.

Once you have some stories to start from, you can train a first model and test it with users to gather feedback.
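As a rough sketch of what a story with a digression might look like, the example below interrupts a newsletter signup with a side question before returning to the main flow. The intent ask_why_email and the response utter_explain_why_email are hypothetical names chosen for illustration; they would need to exist in your own domain.

```yaml
stories:
- story: newsletter signup with a digression   # hypothetical story name
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm_signup
  # Digression: the user asks a side question instead of confirming.
  - intent: ask_why_email                      # hypothetical intent
  - action: utter_explain_why_email            # hypothetical response
  # The assistant returns to the original flow.
  - action: utter_ask_confirm_signup
  - intent: affirm
  - action: action_signup_newsletter
```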

Examples

Here's an example story:

stories:
- story: happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_great
  - action: utter_happy

You can be quite expressive in a story file though. You could, for example, use or statements. The story below uses an or statement to indicate that a user can use either the affirm or the thanks intent to confirm a signup.

stories:
- story: newsletter signup
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm_signup
  - or:
    - intent: affirm
    - intent: thanks
  - action: action_signup_newsletter
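
To make the shorthand concrete, here is a sketch of the newsletter signup story above written out without the or statement, as two separate stories that differ only in the confirming intent. The story names are made up for this illustration.

```yaml
stories:
- story: newsletter signup, confirmed with affirm   # hypothetical story name
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm_signup
  - intent: affirm
  - action: action_signup_newsletter
- story: newsletter signup, confirmed with thanks   # hypothetical story name
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm_signup
  - intent: thanks
  - action: action_signup_newsletter
```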

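You're also able to use checkpoints in your stories. A story that ends with a checkpoint can continue in any story that starts with the same checkpoint.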

stories:
- story: beginning of conversation
  steps:
  - intent: greet
  - action: utter_greet
  - checkpoint: ask_feedback
- story: provide feedback
  steps:
  - checkpoint: ask_feedback
  - action: utter_ask_feedback
  - intent: inform
  - action: utter_thank_you
  - action: utter_anything_else
- story: no feedback
  steps:
  - checkpoint: ask_feedback
  - action: utter_ask_feedback
  - intent: deny
  - action: utter_no_problem
  - action: utter_anything_else

You can use or statements and checkpoints to modularize and simplify your training data. They can be useful, but do not overuse them. Using lots of checkpoints can quickly make your example stories hard to understand, and will slow down training.

Rules

Rules are a type of training data used to train your assistant's dialogue management model. Rules provide a way to describe short pieces of conversations that should always go the same way.

The main difference between a rule and a story is that a story can be seen as an example to learn from, while a rule is a pattern that the assistant must follow.

Here's an example of a rule that you may have in your rules.yml file.

rules:
  - rule: Greeting Rule
    steps:
    - intent: greet
    - action: utter_greet

This rule says "whenever I see a user use the greet intent, the response should always be the utter_greet response". We'll talk about rules in more depth in future videos because they also play a large role in using forms, but for now we can see it as another file that contains data for our assistant.

Intents

The examples for your intents are stored in your nlu.yml file. Here's an example of such a file:

nlu:
  - intent: greet_smalltalk
    examples: |
      - hi
      - hello
      - howdy
      - hey
      - sup
      - how goes it
      - whats up?

In this example, we're giving many examples of the greet_smalltalk intent. This is training data for Rasa to learn from.

Some Tips

When you're adding training data, it helps to keep the following tips in mind.

When you're starting out with pre-existing logs, we recommend going through the logs by hand to find examples that fit an intent. You may be able to use machine learning techniques to help you find intents, but we recommend always keeping a human in the loop so that you can check correctness.

If you don't have logs to start from, consider starting out with the most common intents and use domain knowledge or the experience of your colleagues to come up with sensible examples. You can always add an "out of scope" intent for text that your assistant doesn't cover right away. Once you've got some basic intents, it's best to start sharing your assistant so that you can learn from user data.
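
A minimal sketch of such an "out of scope" intent is shown below, together with a rule that answers it. The intent name out_of_scope, the response utter_out_of_scope and the example utterances are assumptions for illustration; the response would still need to be defined in your domain. The two sections are shown in one snippet for brevity and could equally live in nlu.yml and rules.yml.

```yaml
nlu:
  - intent: out_of_scope            # assumed intent name
    examples: |
      - I want to order a pizza
      - what is 2 + 2?
      - can you book me a flight

rules:
  - rule: Out of Scope Rule         # hypothetical rule name
    steps:
    - intent: out_of_scope
    - action: utter_out_of_scope    # assumed response, defined in your domain
```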

User-generated data is better than synthetic data. We're interested in learning how users interact with the assistant and we don't want to risk overfitting on synthetic data.

Each utterance should match exactly one intent in your training data. For situations where an utterance may be ambiguous, Rasa provides an end-to-end learning system that doesn't rely on intents. If you've got some ambiguous utterances, they can be added to a story like so:

stories:
  - story: happy path
    steps:
    - user: "Ciao!"
    - action: utter_greet
    - intent: mood_great
    - action: utter_happy

We're using the example "Ciao!" here because it can mean either "hello" or "goodbye". In the context of the story shown it means "hello", but we shouldn't add "Ciao!" as an example for the "hello" intent because it is ambiguous: it could also mean "goodbye".
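
For contrast, here is a sketch of a second story in which the same text closes the conversation instead of opening it. The story name and the utter_goodbye response are assumptions for illustration and are not defined elsewhere in this section.

```yaml
stories:
  - story: say goodbye           # hypothetical story name
    steps:
    - intent: mood_great
    - action: utter_happy
    # Same literal text as before, but now it ends the conversation.
    - user: "Ciao!"
    - action: utter_goodbye      # assumed response
```

With both stories in the training data, it is the surrounding conversation rather than an intent label that tells the two meanings apart.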

Exercises

Try to answer the following questions to test your knowledge.

When does it make sense to use or statements in your stories.yml file?
When does it make sense to use checkpoints in your stories.yml file?
Is it a good idea to start making an assistant with lots of intents? Why (not)?