Virtual Movement from Natural Language Text

Dissertation by Himangshu Sarma (2019)

Following a textual instruction is a challenging task for machines. Properly understanding the meaning of an instruction and applying it in an application area such as robotics or animation is very difficult. Interpreting textual instructions to automatically generate the corresponding motions (e.g. exercises) and validating those movements are equally difficult tasks.


To achieve our initial goal of having machines properly understand textual instructions and generate motions accordingly, we recorded five different exercises in random order with the help of seven amateur performers using a Microsoft Kinect device. During the recording, we found that each performer interpreted the same exercise differently, even though all of them were given identical textual instructions. We performed a quality assessment study on the recorded data using a crowdsourcing approach. We then tested the inter-rater agreement for different types of visualizations and found that the RGB-based visualization showed the best agreement among the annotators, with the animation of a virtual character in second place. In the next phase we worked with physical exercise instructions. Physical exercise is an everyday activity domain in which textual exercise descriptions usually focus on body movements. Body movements are a common element across a broad range of activities that are of interest for robotic automation.
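The inter-rater agreement test mentioned above can be illustrated with a small sketch. The abstract does not say which agreement statistic was used; Fleiss' kappa is assumed here only because it is a common choice when several annotators rate the same items. The function name and the rating counts are hypothetical and are not data from the study.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        # All ratings fall into one category; kappa is undefined here.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical counts: 4 recordings, 5 annotators, categories (good, bad).
hypothetical_rgb = [[5, 0], [4, 1], [0, 5], [5, 0]]
hypothetical_skeleton = [[3, 2], [2, 3], [3, 2], [2, 3]]

kappa_rgb = fleiss_kappa(hypothetical_rgb)
kappa_skeleton = fleiss_kappa(hypothetical_skeleton)
```

Computing one kappa value per visualization type, as in the last two lines, would allow the visualizations to be ranked by how consistently annotators judged them.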
Our main goal is to develop a text-to-animation system that can be used in different application areas and that can also serve as a basis for multi-purpose robots whose operations are driven by textual instructions. The system could also be used in other text-to-scene and text-to-animation systems. Building a text-to-animation system for physical exercises requires natural language understanding (NLU), including the understanding of non-declarative sentences. It also requires extracting semantic information from complex syntactic structures with a large number of potential interpretations. Despite a comparatively high density of semantic references to body movements, exercise instructions still contain large amounts of underspecified information. Detecting such underspecified elements and bridging or filling them is extremely challenging when relying on NLU methods alone. Humans, however, can often add such implicit information with ease because of its embodied nature.


We present a process that combines a semantic parser with a Bayesian network. The semantic parser extracts all the information that is explicitly present in the instruction and needed to generate the animation. The Bayesian network lets the system infer the information that is only implicit in the instruction. This implicit information is essential for generating the animation correctly; it is very easy for a human to extract but very difficult for a machine. We updated the Bayesian network with human judgments collected through crowdsourcing. Together, the semantic parser and the Bayesian network make explicit the information contained in textual movement instructions, so that the motion sequences can be rendered as an animation performed by a virtual humanoid character. To generate the animation from this information, we used two markup languages: Behaviour Markup Language (BML) for 2D animation, and Humanoid Animation (H-Anim), which uses the Virtual Reality Modeling Language (VRML), for 3D animation.
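As a rough illustration of how a parser and a probabilistic model can work together, the following minimal sketch fills one implicit slot of an instruction. It is not the parser or network from the thesis: the keyword lexicons, the frame slots, the class `DirectionCPT`, and the crowdsourced answers are all hypothetical, and a single conditional probability table P(direction | action, body part) stands in for a full Bayesian network.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional

# Hypothetical keyword lexicons; a real semantic parser would be far richer.
ACTIONS = {"raise", "lower", "bend", "stretch"}
BODY_PARTS = {"arm", "arms", "leg", "legs", "knee", "knees"}
DIRECTIONS = {"up", "down", "forward", "sideways"}

Frame = Dict[str, Optional[str]]


def parse_instruction(text: str) -> Frame:
    """Extract the explicitly stated slots from a one-sentence instruction."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    frame: Frame = {"action": None, "body_part": None, "direction": None}
    for tok in tokens:
        if tok in ACTIONS and frame["action"] is None:
            frame["action"] = tok
        elif tok in BODY_PARTS and frame["body_part"] is None:
            frame["body_part"] = tok
        elif tok in DIRECTIONS and frame["direction"] is None:
            frame["direction"] = tok
    return frame


class DirectionCPT:
    """P(direction | action, body_part), estimated from answer counts
    with add-one smoothing."""

    def __init__(self) -> None:
        self.counts = defaultdict(Counter)

    def update(self, action: str, body_part: str, direction: str) -> None:
        """Record one crowdsourced answer for this (action, body_part) pair."""
        self.counts[(action, body_part)][direction] += 1

    def posterior(self, action: str, body_part: str) -> Dict[str, float]:
        c = self.counts[(action, body_part)]
        total = sum(c.values()) + len(DIRECTIONS)
        return {d: (c[d] + 1) / total for d in sorted(DIRECTIONS)}

    def most_likely(self, action: str, body_part: str) -> str:
        post = self.posterior(action, body_part)
        return max(post, key=post.get)


def complete_frame(frame: Frame, cpt: DirectionCPT) -> Frame:
    """Fill a missing direction with the most probable value; explicit
    information from the parser is never overwritten."""
    filled = dict(frame)
    if filled["direction"] is None and filled["action"] and filled["body_part"]:
        filled["direction"] = cpt.most_likely(filled["action"], filled["body_part"])
    return filled


# Hypothetical crowdsourced answers to "In which direction do the arms move?"
cpt = DirectionCPT()
for answer in ["up", "up", "up", "sideways"]:
    cpt.update("raise", "arms", answer)

implicit = complete_frame(parse_instruction("Raise your arms."), cpt)
explicit = complete_frame(parse_instruction("Raise your arms sideways."), cpt)
```

Here the instruction "Raise your arms." leaves the direction unspecified, so it is inferred from the counts, while the explicitly stated direction in the second instruction is kept. A completed frame of this kind is the sort of structure that could then be translated into a markup description for rendering.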
