Towards end-to-end reinforcement learning of dialogue agents for information access
KB-InfoBot
KB-InfoBot is a multi-turn dialogue agent that interacts with a knowledge base (KB). Instead of issuing a symbolic query, it maintains a soft posterior distribution over the KB entities and uses it to find the entity the user most probably wants.
Soft-KB lookup
The probability assigned to each entity is the conditional probability that the user is referring to that entity, given all user inputs up to turn t. Compared to a hard-KB lookup, the soft lookup is differentiable, so the agent can learn better policies and can be trained end-to-end.
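The idea of a soft lookup can be illustrated with a small sketch. This is a simplified version, not the paper's exact formula: it assumes the KB has no missing values, that the user knows every slot, and that slots are independent. The function name `soft_kb_posterior` and the toy movie table are hypothetical.

```python
from collections import Counter


def soft_kb_posterior(table, slot_beliefs):
    """Return a posterior over KB rows from per-slot beliefs.

    table: list of dicts mapping slot name -> value (one dict per entity).
    slot_beliefs: dict mapping slot name -> {value: probability}.
    Simplifying assumptions (not from the paper): no missing values,
    independent slots, and a uniform split of a value's probability
    among the rows that share that value.
    """
    # Count how many rows share each value of each slot.
    counts = {
        slot: Counter(row[slot] for row in table) for slot in slot_beliefs
    }
    scores = []
    for row in table:
        score = 1.0
        for slot, belief in slot_beliefs.items():
            value = row[slot]
            score *= belief.get(value, 0.0) / counts[slot][value]
        scores.append(score)
    total = sum(scores)
    if total == 0.0:
        # No row is compatible with the beliefs: fall back to uniform.
        return [1.0 / len(table)] * len(table)
    return [s / total for s in scores]


# Toy KB with two slots.
movies = [
    {"actor": "A", "year": "2010"},
    {"actor": "A", "year": "2012"},
    {"actor": "B", "year": "2010"},
]
beliefs = {
    "actor": {"A": 0.9, "B": 0.1},
    "year": {"2010": 0.8, "2012": 0.2},
}
posterior = soft_kb_posterior(movies, beliefs)
```

Unlike a hard lookup, which would return only the rows that exactly match the top-scoring slot values, every row keeps some probability mass here, so small changes in the beliefs lead to small changes in the result.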
Belief Trackers
KB-InfoBot has M belief trackers, one for each slot (that is, each relation type in the KB). Each belief tracker takes the user input and outputs a belief state made up of two parts: a distribution over all possible values of its slot, and a probability that the user knows the value of that slot. Because this output is large, it is condensed into a summary to improve efficiency.
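One way to condense a belief state is sketched below. It assumes the summary keeps, for each slot, the entropy of the value distribution together with the probability that the user knows the value. The notes above do not specify the summary, so the choice of entropy and the name `summarize_belief` are assumptions made for illustration.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def summarize_belief(slot_value_dists, know_probs):
    """Condense M belief states into a vector of length 2 * M.

    slot_value_dists: list of M distributions, one per slot.
    know_probs: list of M probabilities that the user knows each slot.
    """
    summary = []
    for dist, q in zip(slot_value_dists, know_probs):
        summary.extend([entropy(dist), q])
    return summary


summary = summarize_belief([[0.5, 0.5], [1.0, 0.0]], [0.9, 0.2])
```

The summary size depends only on the number of slots, not on how many values each slot can take, which is what makes it cheaper to feed to the policy.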
Dialogue policy
This paper uses two kinds of policy: a hand-crafted rule-based policy and a neural network policy.
Training
Because reinforcement learning converges slowly, especially from a random initialization, the paper starts training with imitation learning: at the beginning, the belief trackers and the policy network are trained to imitate a rule-based agent.
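The imitation phase can be pictured as minimizing a loss that rewards the policy for choosing the same actions as the rule-based agent. The sketch below assumes a plain cross-entropy (negative log-likelihood) objective. The notes do not state the exact loss used in the paper, and `imitation_loss` is a hypothetical helper.

```python
import math


def imitation_loss(policy_probs, rule_actions):
    """Average negative log-likelihood of the rule agent's actions.

    policy_probs: list of action distributions, one per dialogue turn.
    rule_actions: list of action indices chosen by the rule-based agent.
    """
    total = 0.0
    for probs, action in zip(policy_probs, rule_actions):
        total += -math.log(probs[action])
    return total / len(rule_actions)


# The rule agent picked action 0, then action 1.
loss_far = imitation_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
loss_close = imitation_loss([[0.9, 0.1], [0.1, 0.9]], [0, 1])
```

The loss shrinks as the policy's distributions concentrate on the rule agent's choices, which gives reinforcement learning a reasonable starting point instead of a random one.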