From cars with voice control, to instructable robots that help with housework, to embodied characters that take website requests, technology is increasingly invading our daily lives, and new paradigms for interacting with it are emerging. Machines are no longer viewed as complicated yet dumb tools, but rather as smart companions that support task management and relieve us of the tedious operation of appliances. Such agent-based interfaces complement traditional relationships and call for intuitive forms of human-machine communication.
Such interfaces must cooperate and communicate with people through socially adept conversation (see Figure 1). This requirement has led to the development of embodied conversational agents.1 However, building them is a complex and daunting task. Behavioural traits must be implemented that convey complex meaning, support discourse, regulate conversation and deal with socio-emotive functions. In interactions, the tiniest features, and even the absence of a certain kind of behaviour, can convey possibly unwanted messages. Therefore, designers must account for many aspects of conversational agents, from collaborative task models, to social reasoning about mutual beliefs and intentions, to real-time multimodal behaviour. These structures must also be mapped onto each other carefully to produce and understand conversational contributions.
Figure 1. Conversational agent ‘Vince’ in different setups. (a) Mobile phone. (b) Marker-free face-to-face conversation.
Existing work on agents focuses on one or a few aspects, such as eye gaze, prosody of speech (rhythm, intonation, stress and related attributes) or turn-taking. The relevant models (either explicit or implicit) are based more or less directly on empirical data and theoretical considerations. Two problems emerge. First, neither our empirical/theoretical findings nor our modelling methods are sufficiently advanced to ensure adequate results. Second, rigid models quickly become insufficient because conversation unfolds as a joint task in which interlocutors do not adhere rigidly to their goals, beliefs, intentions or behavioural patterns. Instead, adaptation is pervasive in natural conversation and is assumed to serve communicative functions (for example, as a shared basis for common ground2 and facilitating dialogue3) and social functions (for example, creating rapport and affiliation4,5). ‘Socially adaptive agents’, which must learn from and adjust to their users to establish and maintain successful interaction routines and signals, can offer a solution to both problems. We need not only models of social behaviour but also an understanding of how this behaviour facilitates, and is subject to, personalized adaptation.
How can one build such agents? A minimal layout (see Figure 2) comprises modules for processing task-oriented and social goals (red box), content/actions (green) and behaviours (blue). Each stage must support both production and perception and needs to process the input/output from adjacent modules. Interpersonal adaptation is found in all aspects of human conversation and therefore affects all levels of the layout. Behaviours are adapted through mimicry, entrainment or interactional synchrony; beliefs, intentions and goals are adapted through backchannel feedback, negotiation or metacommunication. The challenge is to achieve every single one of these adaptations and to coordinate them on a principled and controlled basis. We set out to build these capabilities bottom-up.
Figure 2. Outline of two conversational agents and the potential adaptations between them in social interaction.
We previously developed components that lay the foundation for socially adaptive agents. We devised flexible representation and specification formats for communicative behaviour and its functions (incorporated in our ‘behaviour markup language’).6 Our articulated communicator engine (ACE) can turn these into multimodal (verbal and nonverbal) behaviour on the spot.7 We are using this to drive conversational robots (such as Honda's ASIMO) and virtual characters like our sociable agent Vince in mobile, virtual/augmented reality or desktop applications. We designed a distributed architecture based on D-Bus8 to allow modules to emit and receive data continuously and incrementally.
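To illustrate what emitting and receiving data 'continuously and incrementally' can mean in practice, the following minimal sketch passes partial content between two modules as soon as it becomes available. It is an in-process stand-in only: it does not use the D-Bus API, and the names MessageBus, BehaviourRealizer, the "content" topic and the example phrases are hypothetical, not taken from the system described here.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class MessageBus:
    """In-process stand-in for a message bus (hypothetical; not the D-Bus API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def emit(self, topic: str, increment: str) -> None:
        # Deliver each increment immediately instead of waiting for a complete message.
        for handler in self._subscribers[topic]:
            handler(increment)


class BehaviourRealizer:
    """Hypothetical consumer that can start working on partial input."""

    def __init__(self, bus: MessageBus) -> None:
        self.received: List[str] = []
        bus.subscribe("content", self.on_increment)

    def on_increment(self, increment: str) -> None:
        # A real module would begin realizing behaviour here; we only record the chunk.
        self.received.append(increment)


def run_demo() -> List[str]:
    bus = MessageBus()
    realizer = BehaviourRealizer(bus)
    # A hypothetical content planner emits its output chunk by chunk.
    for chunk in ["the church", "has two towers", "on the left"]:
        bus.emit("content", chunk)
    return realizer.received
```

Because every increment is forwarded the moment it is emitted, a downstream module can begin processing before the upstream module has finished, which is the property that incremental processing relies on.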
One prerequisite for adaptivity is the ability to adjust flexibly rather than being restricted to fixed repertoires of verbal phrases, gestures, facial expressions, discourse plans or scripts. Therefore, constructive models are required at each stage. Building on ACE's flexibility at the behavioural realization level, we developed a model that allows agents to plan autonomously, i.e., to select the content and derive the form of coordinated language and gestures.9 It comprises specific real-time planners to formulate verbal and gestural behaviour. Verbal behaviour is formulated with a grammar-based sentence planner that is already socially adaptive because it incorporates priming of lexical or syntactic structures informed by frequency and recency of use.10 Gesture generation is done with probabilistic decision networks that map meaning (for example, visuo-spatial properties of the object shown) along with contextual factors (information structure, discourse status, previous gestures) onto gesture forms; these mappings are learned from empirical data and, in the future, will be adapted online to the user.
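As a rough illustration of priming informed by frequency and recency of use, the sketch below scores competing lexical items by how often and how recently they have been used in the dialogue and picks the most activated one. The scoring formula (log frequency plus exponentially decaying recency), the weights and the class name PrimingLexicon are assumptions made for this example; they are not the sentence planner's actual activation model.

```python
import math
from typing import Dict, List


class PrimingLexicon:
    """Toy priming model: activation grows with frequency and recency of use."""

    def __init__(self, freq_weight: float = 1.0, recency_weight: float = 1.0,
                 decay: float = 0.5) -> None:
        # All three parameters are illustrative assumptions.
        self.freq_weight = freq_weight
        self.recency_weight = recency_weight
        self.decay = decay
        self.counts: Dict[str, int] = {}
        self.last_used: Dict[str, int] = {}
        self.clock = 0  # counts observed uses, by either interlocutor

    def observe(self, word: str) -> None:
        """Register one use of a word by the user or the agent."""
        self.clock += 1
        self.counts[word] = self.counts.get(word, 0) + 1
        self.last_used[word] = self.clock

    def activation(self, word: str) -> float:
        count = self.counts.get(word, 0)
        if count == 0:
            return 0.0  # never used, hence not primed
        age = self.clock - self.last_used[word]
        frequency_term = self.freq_weight * math.log1p(count)
        recency_term = self.recency_weight * math.exp(-self.decay * age)
        return frequency_term + recency_term

    def choose(self, candidates: List[str]) -> str:
        """Pick the most activated candidate; ties go to the first one listed."""
        return max(candidates, key=self.activation)
```

With this sketch, an agent choosing between two synonyms will favour whichever one the user has just used, and the preference fades as the dialogue moves on unless the word is used again.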
The next step is to tie these flexible generation models to perceptual processes and to extend them by learning. Vince already contains hierarchical sensorimotor structures for nonverbal behaviour, organized in levels for motor commands, motor programs and abstract schemes. These levels are equipped with forward models that enable continuous prediction-based behaviour perception. A probabilistic approach has been proposed for this, which also accounts for the vertical flow of evidence and predictions between levels. Inverse models (learned as self-organizing maps) are in charge of analysing novel behaviour and augmenting the levels with corresponding motor structures. In our current setup with marker-free time-of-flight cameras (see Figure 1(b)), Vince can learn and cluster novel gestures (up to the level of motor programs) and can recognize and imitate them while still observing them.
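The idea of prediction-based perception can be sketched as follows: each known motor program has a forward model that predicts the next observation, and the belief in each program is updated according to how well its prediction matches what is actually seen. This is a generic Bayesian filter over a single level, written under simplifying assumptions (one-dimensional hand position, Gaussian prediction error, the hypothetical programs 'raise' and 'wave'); it is not the probabilistic model used in Vince and it omits the flow of evidence between levels.

```python
import math
from typing import Callable, Dict, List

# A forward model maps a time step to a predicted (1-D) hand position.
ForwardModel = Callable[[int], float]


def gaussian_likelihood(error: float, sigma: float) -> float:
    """Unnormalized Gaussian likelihood of a prediction error."""
    return math.exp(-0.5 * (error / sigma) ** 2)


def recognize(observations: List[float],
              models: Dict[str, ForwardModel],
              sigma: float = 0.2) -> List[Dict[str, float]]:
    """Return the belief over motor programs after each observation."""
    belief = {name: 1.0 / len(models) for name in models}  # uniform prior
    history: List[Dict[str, float]] = []
    for t, obs in enumerate(observations):
        # Weight each program by how well its forward model predicted the observation.
        weighted = {
            name: belief[name] * gaussian_likelihood(obs - model(t), sigma)
            for name, model in models.items()
        }
        total = sum(weighted.values())
        if total > 0.0:
            belief = {name: w / total for name, w in weighted.items()}
        # If no model explains the observation at all, keep the previous belief.
        history.append(dict(belief))
    return history


# Hypothetical forward models for two motor programs.
MODELS: Dict[str, ForwardModel] = {
    "raise": lambda t: 0.1 * t,
    "wave": lambda t: 0.3 * math.sin(t),
}

beliefs = recognize([0.0, 0.1, 0.2, 0.3], MODELS)
```

Because the belief is updated after every observation, the most likely program can be identified before the movement has finished, which is what makes recognition during observation possible in this kind of scheme.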
In summary, we argue that a crucial next step for conversational interfaces is to become socially adaptive. This will foster acceptability and user satisfaction, since interpersonal adaptation of social behaviour helps mitigate shortcomings of behavioural models and introduces socially desirable qualities into human-machine interactions. Key challenges for the future include creating an integrated model of social adaptation that covers all arrows in Figure 2, and reducing the imbalance between what (virtual) agents can produce and what they can sense with current input processing technology.