00:21 - 00:46

We present a complete system that synthesizes physically plausible human-object interactions from human-level instructions. Given abstract human-level instructions, we generate synchronized object motion, human motion, and finger motion. Here, we instruct the human to arrange the boxes to represent "move", and the task is successfully completed.

00:46 - 01:00

Our system combines a high-level planner with a low-level motion generator to produce detailed movements, followed by a physics tracker to ensure the motions are physically plausible.
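
To make the three-stage structure concrete, below is a minimal, runnable Python sketch of how such a pipeline could be composed. All class, function, and field names here are illustrative stand-ins, not the system's actual API.

```python
# pipeline_sketch.py -- illustrative sketch of the three-stage pipeline
# described above. Names and signatures are assumptions for exposition.

from dataclasses import dataclass, field

@dataclass
class Plan:
    scene_map: dict                              # object name -> pose / extents
    steps: list = field(default_factory=list)    # ordered sub-tasks

def high_level_planner(instruction: str, scene: dict) -> Plan:
    """Stand-in for the LLM-based planner: maps an abstract instruction
    to a scene map and an ordered execution plan."""
    return Plan(scene_map=scene, steps=[f"execute: {instruction}"])

def motion_generator(step: str, scene_map: dict) -> dict:
    """Stand-in for the low-level generator: returns synchronized
    object, full-body, and finger trajectories for one sub-task."""
    return {"step": step, "object": [], "body": [], "fingers": []}

def physics_tracker(kinematic_motion: dict) -> dict:
    """Stand-in for the RL tracking policy: imitates the kinematic
    motion in a physics simulator, correcting implausible frames."""
    return {**kinematic_motion, "physically_plausible": True}

def run_pipeline(instruction: str, scene: dict) -> list:
    plan = high_level_planner(instruction, scene)
    motions = [motion_generator(s, plan.scene_map) for s in plan.steps]
    return [physics_tracker(m) for m in motions]

if __name__ == "__main__":
    print(run_pipeline("arrange the boxes to represent 'move'", {"box_1": {}}))
```
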

00:58 - 01:22

First, we use an LLM-based high-level planner to reason about the human-level instructions and generate a scene map and an execution plan. Then, a low-level motion generator produces synchronized object motion, full-body human motion, and finger motion. Finally, a physics tracker employs reinforcement learning to imitate the motion, ensuring it remains physically plausible.
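
To give a flavor of the tracking stage, the sketch below shows one common way an imitation reward for a physics-based tracking policy can be written: exponentiated tracking errors on body, object, and finger poses. The terms and weights are generic assumptions for illustration, not the paper's exact reward.

```python
import numpy as np

def imitation_reward(sim: dict, ref: dict,
                     w_body: float = 0.5, w_obj: float = 0.3,
                     w_finger: float = 0.2) -> float:
    """Generic per-frame imitation reward for a physics-based tracker.

    `sim` and `ref` hold the simulated and reference (kinematic) states
    for one frame; the error terms and weights below are illustrative.
    """
    def err(a, b):
        return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

    # Each term approaches 1 as the simulated state matches the reference.
    r_body = np.exp(-2.0 * err(sim["body_pos"], ref["body_pos"]))
    r_obj = np.exp(-5.0 * err(sim["obj_pos"], ref["obj_pos"]))
    r_finger = np.exp(-10.0 * err(sim["finger_pos"], ref["finger_pos"]))

    return w_body * r_body + w_obj * r_obj + w_finger * r_finger

# Perfect tracking yields the maximum reward of 1.0.
frame = {"body_pos": [0.0] * 3, "obj_pos": [0.0] * 3, "finger_pos": [0.0] * 3}
print(imitation_reward(frame, frame))  # -> 1.0
```
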

01:25 - 01:40

We compare our low-level motion generator with the baseline, CNet + GRIP. While the baseline fails to produce precise hand and finger motions, our method successfully generates accurate hand-object interactions.

01:52 - 02:14

We also compare our system with two ablations: CNet and CNet+RNet. Our full system generates natural finger movements, making the interaction much more realistic, while the ablations fail to produce finger movements and exhibit severe artifacts. For further details on the ablations, please refer to our paper.

02:22 - 02:40

We compare the kinematic motion generated by the motion generator with the motion tracked by the physics tracker. Our results demonstrate that the physics tracker effectively corrects artifacts in the kinematic motion, such as foot floating and hand-object penetration.
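
As a simple illustration of one artifact the tracker corrects, the snippet below flags frames in a kinematic clip where every foot joint is above the floor by more than a small threshold. The data layout and the 3 cm threshold are illustrative assumptions, not the paper's metric.

```python
import numpy as np

def foot_floating_frames(foot_heights: np.ndarray, eps: float = 0.03) -> np.ndarray:
    """Return indices of frames where every foot joint is more than `eps`
    meters above the floor (z = 0), i.e. the character is floating.

    `foot_heights` has shape (num_frames, num_foot_joints).
    """
    return np.where(np.all(foot_heights > eps, axis=1))[0]

# Example: a 4-frame clip with two foot joints; only frame 2 floats.
heights = np.array([[0.00, 0.01],
                    [0.02, 0.00],
                    [0.08, 0.09],
                    [0.01, 0.00]])
print(foot_floating_frames(heights))  # -> [2]
```
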

02:45 - 02:53

Next, we present some results of long-sequence generation using our system.

03:12 - 03:27

Here, we instruct the agent to clean the area in front of the TV. The LLM agent correctly identifies the floor lamp and trash can, which occupy the space in front of the TV, and moves them to another location.

03:59 - 04:18

The next task is to prepare Christmas presents and make the Christmas tree brighter. The agent uses common sense, recognizing that presents should be placed around the Christmas tree. It also understands that a lamp can be used to shine light onto an object to make it brighter.

04:50 - 05:04

We now ask the agent to set up a seat. The agent correctly associates the chair with the command and understands that the vase on top of the chair must be moved before the chair can be safely relocated.

05:27 - 05:45

Now the task is to stack the boxes in the most stable way. The agent understands the physical concept of stability in the context of box stacking and correctly determines that a smaller box should be placed on top of a larger one.

06:18 - 06:35

Now the agent is asked to set up a workspace. The agent identifies that the monitor and chair are needed to complete the task and suggests moving the chair later to avoid obstruction. It also uses common sense by orienting the chair and monitor toward each other.

06:35 - 06:48

After arranging the workspace, the agent receives a new task: to organize the shoes. It identifies the loose shoe on the floor and returns it to the shoe cabinet.

06:51 - 07:05

Finally, we ask the agent to do laundry. The agent understands that the instruction involves bringing the laundry basket back to the washing machine and successfully completes the task.

Synthesizing Human-Object Interactions from High-Level Instructions

We present a complete system that synthesizes physically plausible human-object interactions from high-level instructions, generating synchronized object motion, full-body human motion, and finger motion. Instructed, for example, to arrange boxes, the system completes the task by combining a high-level planner with a low-level motion generator and a physics tracker that keeps the resulting movements physically plausible.

Elements of the System

The system consists of three components. First, an LLM-based high-level planner reasons over the human-level instructions and produces a scene map and an execution plan. Next, a low-level motion generator synthesizes synchronized object motion, full-body human motion, and finger motion for each step of the plan. Finally, a physics tracker trained with reinforcement learning imitates the generated motion in simulation, refining it for physical accuracy.
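
To illustrate what the planner's output might look like, here is a hypothetical scene map and execution plan for the "clean the area in front of the TV" task shown in the video. The schema, field names, and coordinates are our own illustrative assumptions, not the system's actual format.

```python
# Hypothetical planner output for "clean the area in front of the TV".
# Schema, action names, and coordinates are illustrative assumptions.

plan = {
    "scene_map": {
        "tv":         {"position": [0.0, 0.0, 1.0]},
        "floor_lamp": {"position": [0.6, 0.0, 0.4]},   # blocks the TV
        "trash_can":  {"position": [-0.5, 0.0, 0.5]},  # blocks the TV
    },
    "execution_plan": [
        {"action": "walk_to", "target": "floor_lamp"},
        {"action": "carry",   "target": "floor_lamp", "destination": [2.0, 0.0, -1.0]},
        {"action": "walk_to", "target": "trash_can"},
        {"action": "carry",   "target": "trash_can",  "destination": [2.2, 0.0, -0.6]},
    ],
}
```
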

Performance Evaluation

Compared with the baseline CNet + GRIP, our method generates precise hand and finger motions where the baseline fails, producing accurate hand-object interactions. Comparisons with the ablations CNet and CNet+RNet further show that only the full system produces natural finger movements; the ablations generate no finger motion and exhibit severe artifacts. Finally, comparing the kinematic output of the motion generator with the physics-tracked result shows that the tracker corrects artifacts such as foot floating and hand-object penetration.

Long Sequence Generation

The system also handles long-sequence tasks: cleaning the area in front of the TV, preparing Christmas presents, setting up a seat, stacking boxes, arranging a workspace, organizing shoes, and doing laundry. In each case the agent identifies the relevant objects, applies common-sense and physical reasoning (for example, placing a smaller box on top of a larger one for stability), and executes a multi-step plan.

Conclusion

By combining high-level planning, low-level motion generation, and physics tracking, the system synthesizes long-horizon human-object interactions that are both semantically correct and physically plausible, from stacking boxes to arranging workspaces and doing laundry, mirroring tasks found in real-world scenarios.