00:21 - 00:44

We present a complete system that synthesizes physically plausible human-object interactions from human-level instructions. Given abstract human-level instructions, we generate synchronized object motion, human motion, and finger motion. Here we instruct the human to arrange the boxes to represent "MOVE", and they successfully complete the task.

00:46 - 01:00

Our system combines a high-level planner with a low-level motion generator to produce detailed movements, followed by a physics tracker to ensure the motions are physically plausible.

00:58 - 01:22

First, we use an LLM-based high-level planner to reason about the human-level instructions and generate a scene map and an execution plan. Then a low-level motion generator produces synchronized object motion, full-body human motion, and finger motion. Finally, a physics tracker employs reinforcement learning to imitate the motion, ensuring it remains physically plausible.
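
To make the three-stage design concrete, here is a minimal, self-contained sketch of how such a pipeline could be wired together. All function and class names below are hypothetical stand-ins for illustration, not the actual interfaces of the system; each stage is stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    scene_map: dict   # object name -> target 2D position (illustrative)
    steps: list       # ordered sub-tasks emitted by the planner

def high_level_plan(instruction: str, scene: dict) -> Plan:
    """Stage 1 (stub): an LLM would reason about the instruction here
    and emit a scene map plus an ordered execution plan."""
    return Plan(scene_map=dict(scene), steps=[f"move {o}" for o in scene])

def generate_motion(step: str, scene_map: dict) -> dict:
    """Stage 2 (stub): produce synchronized object, full-body human,
    and finger motion for one sub-task."""
    return {"step": step, "object": [], "body": [], "fingers": []}

def physics_track(kinematic: list) -> list:
    """Stage 3 (stub): an RL policy imitates the kinematic motion in a
    physics simulator, enforcing physical plausibility."""
    return kinematic

def synthesize(instruction: str, scene: dict) -> list:
    plan = high_level_plan(instruction, scene)
    kinematic = [generate_motion(s, plan.scene_map) for s in plan.steps]
    return physics_track(kinematic)

print(synthesize("clean the area in front of the TV", {"floor_lamp": (1.0, 2.0)}))
```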

01:25 - 01:40

We compare our low-level motion generator with the baseline, CNet+RNet+GRIP. While the baseline fails to produce precise hand and finger motions, our method successfully generates accurate hand-object interactions.

01:52 - 02:14

We also compare our system with two ablations, CNet and CNet+RNet. Our full system generates natural finger movements, making the interaction much more realistic, while the ablations fail to produce finger movements and exhibit severe artifacts. For further details on the ablations, please refer to our paper.

02:22 - 02:40

We compare the kinematic motion generated by the motion generator with the motion tracked by the physics tracker. Our results demonstrate that the physics tracker effectively corrects artifacts in the kinematic motion, such as foot floating and hand-object penetration.
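
As an illustration of the tracking stage, the sketch below shows a DeepMimic-style imitation reward, a common design for physics trackers; the exact terms and weights our tracker uses may differ, and the per-frame data layout here is hypothetical.

```python
import numpy as np

def imitation_reward(sim, ref, w_pose=0.6, w_vel=0.1, w_ee=0.3):
    """Reward the policy for matching the reference (kinematic) frame.
    `sim` and `ref` hold joint rotations, joint velocities, and
    end-effector (hand/foot) positions for one simulation frame."""
    e_pose = np.sum((sim["joint_rot"] - ref["joint_rot"]) ** 2)  # pose error
    e_vel = np.sum((sim["joint_vel"] - ref["joint_vel"]) ** 2)   # velocity error
    # Tight end-effector tracking is what suppresses foot floating
    # and hand-object penetration in the corrected motion.
    e_ee = np.sum((sim["ee_pos"] - ref["ee_pos"]) ** 2)
    return (w_pose * np.exp(-2.0 * e_pose)
            + w_vel * np.exp(-0.1 * e_vel)
            + w_ee * np.exp(-40.0 * e_ee))

frame = {"joint_rot": np.zeros(3), "joint_vel": np.zeros(3), "ee_pos": np.zeros(3)}
print(imitation_reward(frame, frame))  # perfect tracking -> reward of 1.0
```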

02:45 - 02:53

Next, we present some results of long-sequence generation using our system.

03:12 - 03:27

Here we instruct the agent to clean the area in front of the TV. The LLM agent correctly identifies the floor lamp and trash can, which occupy the space in front of the TV, and moves them to another location.
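
For illustration, a planner output for this task might look like the following; the schema and field names are our own invention, not the system's actual plan format.

```python
# Hypothetical planner output for "clean the area in front of the TV".
plan = {
    "target_region": "in_front_of(tv)",
    "obstacles": ["floor_lamp", "trash_can"],  # objects occupying the region
    "steps": [
        {"action": "walk_to",  "object": "floor_lamp"},
        {"action": "pick_up",  "object": "floor_lamp"},
        {"action": "place_at", "object": "floor_lamp", "target": "corner"},
        {"action": "walk_to",  "object": "trash_can"},
        {"action": "pick_up",  "object": "trash_can"},
        {"action": "place_at", "object": "trash_can", "target": "corner"},
    ],
}
for step in plan["steps"]:
    print(step["action"], step["object"], step.get("target", ""))
```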

03:59 - 04:18

The next task is to prepare Christmas presents and make the Christmas tree brighter. The agent uses common sense, recognizing that presents should be placed around the Christmas tree. It also understands that a lamp can be used to shine light onto an object to make it brighter.

04:50 - 05:04

We now ask the agent to set up a seat. The agent correctly associates the chair with the command and understands that the vase on top of the chair must be moved before the chair can be safely relocated.

05:27 - 05:45

Now the task is to stack the boxes in the most stable way. The agent understands the physical concept of stability in the context of box stacking and correctly determines that a smaller box should be placed on top of a larger one.
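
The underlying heuristic is simple enough to state in a few lines. The sketch below, with made-up box sizes, orders boxes by footprint so each one rests on a larger base; the agent reaches this ordering through LLM reasoning rather than explicit code.

```python
# Hypothetical box footprints (width, depth) in meters.
boxes = {"small": (0.2, 0.2), "medium": (0.35, 0.3), "large": (0.5, 0.45)}

def footprint(size):
    """Base area of a box: width * depth."""
    return size[0] * size[1]

# Stacking bottom-to-top in decreasing footprint keeps every box fully
# supported by the one below, which is the stable ordering the agent picks.
order = sorted(boxes, key=lambda name: footprint(boxes[name]), reverse=True)
print(order)  # ['large', 'medium', 'small']
```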

06:18 - 06:35

Now the agent is asked to set up a workspace. The agent identifies that the monitor and chair are needed to complete the task and suggests moving the chair later to avoid obstruction. It also uses common sense by orienting the chair and monitor toward each other.
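
The "face each other" placement reduces to a small geometric computation; the sketch below, with hypothetical positions, finds the yaw that points one object at another.

```python
import math

def yaw_towards(src, dst):
    """Heading (radians) that orients an object at `src` toward `dst`."""
    return math.atan2(dst[1] - src[1], dst[0] - src[0])

chair, monitor = (1.0, 0.0), (0.0, 2.0)        # hypothetical 2D positions
chair_yaw = yaw_towards(chair, monitor)        # chair faces the monitor
monitor_yaw = yaw_towards(monitor, chair)      # monitor faces the chair
print(round(math.degrees(chair_yaw)), round(math.degrees(monitor_yaw)))
# The two headings differ by 180 degrees: the objects face each other.
```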

06:35 - 06:48

After arranging the workspace, the agent receives a new task: organize the shoes. It identifies the loose shoe on the floor and returns it to the shoe cabinet.

06:51 - 07:05

Finally, we ask the agent to do laundry. The agent understands that the instruction involves bringing the laundry basket to the washing machine, and successfully completes the task.

Revolutionizing Human-Object Interactions with Physically Plausible Movements

Imagine instructing a computer to arrange objects like a human would, and witnessing it flawlessly carry out the task. This is now a reality with a cutting-edge system that synthesizes human-object interactions based on abstract human-level instructions. By combining a high-level planner with a low-level motion generator and a physics tracker, this system not only produces detailed, synchronized movements but also ensures their physical plausibility.

The Synthesis Process

At the core of this revolutionary system is the fusion of technology and human-like intuition. The high-level planner deciphers abstract instructions, generating a scene map and an execution plan. The low-level motion generator then kicks into action, creating synchronized motions for objects, full-body human movements, and intricate finger motions. To guarantee physical accuracy, a sophisticated physics tracker, powered by reinforcement learning, fine-tunes the generated motions.

Unleashing the Power of Comparison

To showcase the system's capabilities, it was pitted against several benchmarks, including the baseline CNet+RNet+GRIP. While the baseline struggled with precision in hand and finger motions, the full method excelled at generating realistic hand-object interactions. Furthermore, comparisons with the ablations CNet and CNet+RNet highlighted the markedly more natural finger movements of the full system, setting it apart from the rest.

From Corrections to Innovations

A standout feature of this system is its ability to correct motion artifacts through the physics tracker. By re-tracking the kinematic motion in a physics simulator, issues like foot floating and hand-object penetration are effectively corrected, ensuring seamless, realistic results.

Realizing Concept Understanding through Tasks

Through a series of complex tasks assigned to the agent, we witness the system's remarkable ability to understand concepts and execute tasks with finesse. From cleaning the area in front of the TV to stacking boxes for stability, the agent showcases a profound grasp of spatial relationships and physical principles.

Towards a Future of Intelligent Automation

As the agent seamlessly navigates through tasks like setting up workspaces, organizing shoes, and doing laundry, it becomes evident that the future of intelligent automation is upon us. The convergence of human-level instructions and AI-driven action opens up a world where machines not only assist but also comprehend and execute tasks with human-like precision.

In a world where human-object interactions are redefined by the symbiosis of technology and intuition, this system stands as a beacon of innovation and a testament to the limitless possibilities of AI-driven automation.