00:01 - 00:07

I want to thank the organizers for choosing this paper for the award. It was very nice. And I also want to thank my incredible co-authors and collaborators, Oriol Vinyals and Quoc Le, who stood right before you a moment ago.

What you have here is an image, a screenshot from a similar talk ten years ago, at NeurIPS 2014 in Montreal. It was a much more innocent time. Here we are, shown in the photos; this is the before, and here's the after, by the way. Now we've got more experience, and hopefully we're wiser.

But here I'd like to talk a little bit about the work itself, and maybe give a ten-year retrospective on it, because a lot of the things in this work were correct, but some not so much. We can review them, see what happened, and see how it gently flowed to where we are today.

01:27 - 01:32

So let's begin by talking about what we did, and the way we'll do it is by showing slides from the same talk ten years ago. The summary of what we did is the following three bullet points: it's an autoregressive model trained on text, it's a large neural network, and it's a large dataset. That's it. Now let's dive into the details a little bit more.

01:54 - 02:01

So this was a slide from ten years ago. Not too bad: the deep learning hypothesis. What we said here is that if you have a large neural network with ten layers, then it can do anything that a human being can do in a fraction of a second. Why did we have this emphasis on things that human beings can do in a fraction of a second?

02:21 - 02:28

Why this thing specifically? Well, if you believe the deep learning dogma, so to say, that artificial neurons and biological neurons are similar, or at least not too different, and you believe that real neurons are slow, then anything that we can do quickly follows. By "we" I mean human beings; I even mean just one human in the entire world. If there is one human in the entire world that can do some task in a fraction of a second, then a ten-layer neural network can do it too, right? It follows: you just take their connections and embed them inside your neural net, the artificial one. So this was the motivation: anything that a human being can do in a fraction of a second, a big ten-layer neural network can do too.

We focused on ten-layer neural networks because those were the neural networks we knew how to train back in the day. If you could somehow go deeper, then you could do more, but back then we could only do ten layers, which is why we emphasized whatever human beings can do in a fraction of a second.

03:25 - 03:29

A different slide from the talk, a slide which says "our main idea", and you may be able to recognize two things, or at least one thing: you might be able to recognize that something autoregressive is going on here. What is this slide really saying? It says that if you have an autoregressive model and it predicts the next token well enough, then it will in fact grab, capture, and grasp the correct distribution over the sequences that come next. This was a relatively new thing. It wasn't literally the first ever autoregressive neural network, but I would argue it was the first autoregressive neural network where we really believed that if you trained it really well, then you would get whatever you want. In our case, what we wanted was the humble (today humble, then incredibly audacious) task of translation.
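For readers who want the claim on that slide in symbols, below is the standard autoregressive factorization it is gesturing at. The notation (source sequence x, target tokens y_t, parameters theta) is added here for clarity and is not taken from the original slide.

```latex
% Autoregressive factorization: the model only ever predicts the next token.
p_\theta(y_1, \dots, y_T \mid x) \;=\; \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, x\right)

% Training maximizes the log-likelihood of observed (input, output) pairs;
% if each next-token prediction is good enough, the product above matches the
% true conditional distribution over whole output sequences.
\max_\theta \;\; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right) \right]
```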

04:18 - 04:22

Now I'm going to show you some ancient history that many of you might never have seen before. It's called the LSTM. To those unfamiliar, an LSTM is the thing that poor deep learning researchers did before Transformers, and it's basically a ResNet, but rotated 90 degrees. So that's an LSTM. It came earlier, and it's kind of like a slightly more complicated ResNet: you can see there is your integrator, which is now called the residual stream, but you've also got some multiplication going on. It's a little bit more complicated, but that's what we did. It was a ResNet rotated 90 degrees.
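To make the "ResNet rotated 90 degrees" analogy concrete, here is a minimal sketch of one LSTM step next to a residual block. This is illustrative NumPy written for this retrospective (the variable names and shapes are my own choices), not code from the original system: the cell state c plays the role of the additive integrator (today's residual stream), with extra multiplicative gates on top.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. The cell state c is the additive 'integrator'
    the talk refers to, i.e. what we would now call the residual stream."""
    z = W @ np.concatenate([x, h_prev]) + b          # all four gate pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)     # multiplicative gates
    g = np.tanh(g)                                   # candidate update
    c = f * c_prev + i * g                           # additive, gated skip connection
    h = o * np.tanh(c)                               # gated read-out of the integrator
    return h, c

def resnet_block(x, f):
    """A residual block: the same additive skeleton (state plus learned update),
    but without gating, and stacked in depth instead of unrolled in time."""
    return x + f(x)
```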

05:04 - 05:09

Another cool feature from that old talk that I want to highlight is that we used parallelization, but not just any parallelization: we used pipelining, as witnessed by this "one layer per GPU". Was it wise to pipeline? As we now know, pipelining is not wise, but we were not as wise back then, so we used it, and we got a 3.5x speedup using eight GPUs.
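A back-of-the-envelope way to see why one-layer-per-GPU pipelining tops out well below linear speedup is the pipeline "bubble": while the pipeline fills and drains, most devices sit idle. The toy model below is an illustration of that arithmetic under idealized assumptions (equal-cost stages, no communication cost), not a description of the 2014 setup.

```python
def pipeline_speedup(num_stages: int, num_microbatches: int) -> float:
    """Idealized speedup of a layer-per-device pipeline over one device,
    ignoring communication: num_stages * num_microbatches units of work
    complete in (num_stages + num_microbatches - 1) pipeline steps."""
    total_work = num_stages * num_microbatches
    wall_clock_steps = num_stages + num_microbatches - 1
    return total_work / wall_clock_steps

# With 8 stages (one layer per GPU), the bubble keeps the speedup far from 8x
# unless the batch is split into many in-flight microbatches:
for m in (1, 2, 4, 8, 32):
    print(f"{m:>2} microbatches -> {pipeline_speedup(8, m):.2f}x")
```

Under these idealized assumptions, eight stages with only a handful of in-flight microbatches lands in the same rough range as the 3.5x figure quoted above.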

05:32 - 05:36

And the conclusion slide from the talk back then is, in some sense, the most important slide, because it spelled out what could arguably be the beginning of the scaling hypothesis: that if you have a very big dataset and you train a very big neural network, then success is guaranteed. And one can argue, if one is charitable, that this is indeed what has been happening.

06:01 - 06:07

I want to mention one other idea, and this is, I claim, the idea that truly stood the test of time. It's the core idea of deep learning itself: the idea of connectionism. It's the idea that if you allow yourself to believe that an artificial neuron is kind of, sort of, like a biological neuron, then it gives you the confidence to believe that very large neural networks (they don't need to be literally human-brain scale; they might be a little bit smaller) could be configured to do pretty much all the things that we human beings do.

There is still a difference, because the human brain also figures out how to reconfigure itself, whereas we are using the best learning algorithms that we have, which require as many data points as there are parameters. Human beings are still better in this regard.

07:16 - 07:21

But what this led to, I claim, arguably, is the age of pre-training. And the age of pre-training is, we might say, the GPT-2 model, the GPT-3 model, the scaling laws. I want to specifically call out my former collaborators Alec Radford, and also Jared Kaplan and Dario Amodei, for really making this work. That led to the age of pre-training, and this is what has been the driver of all the progress that we see today: extra-large neural networks, extraordinarily large neural networks, trained on huge datasets.
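For context on the scaling laws mentioned here: the empirical finding reported in that line of work (Kaplan et al., 2020) is that pre-training loss falls off as a power law in model size, dataset size, and compute when the other two are not the bottleneck. The symbols below are the conventional ones from that literature, not from this talk, and the fitted constants are omitted.

```latex
% Empirical power-law form of the pre-training loss, with N = parameters,
% D = dataset size, C = compute; N_c, D_c, C_c and the exponents are fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```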

07:58 - 08:04

But pre-training as we know it will unquestionably end. Pre-training will end. Why will it end? Because while compute keeps growing, through better hardware, better algorithms, and larger clusters, all of which keep increasing your compute, the data is not growing, because we have but one internet. We have but one internet. You could even go as far as to say that data is the fossil fuel of AI: it was created somehow, and now we use it, and we've achieved peak data, and there will be no more. We have to deal with the data that we have. It will still let us go quite far, but there's only one internet.

09:00 - 09:04

So here I'll take a bit of liberty to speculate about what comes next. Actually, I don't even need to speculate, because many people are speculating too, and I'll mention their speculations. You may have heard the phrase "agents". It's common, and I'm sure that eventually something will happen there; people feel like agents are the future. More concretely, but also a little bit vaguely: synthetic data. But what does synthetic data mean? Figuring this out is a big challenge, and I'm sure that different people have all kinds of interesting progress there. And inference-time compute, or what's been most recently and most vividly seen in the o1 model. These are all examples of people trying to figure out what to do after pre-training, and those are all very good things to do.

do I want to mention one other example

09:56 - 10:00

from biology which I think is really

09:58 - 10:04

cool

10:00 - 10:07

and the example is this so about many

10:04 - 10:09

many years ago at this conference also I

10:07 - 10:12

saw a talk where someone presented this

10:09 - 10:16

graph but the graph showed the

10:12 - 10:18

relationship between the size of the

10:16 - 10:21

body

10:18 - 10:24

of the size of the body of a mammal and

10:21 - 10:27

the size of their brain in this case

10:24 - 10:29

it's in mass and the that talk I

10:27 - 10:31

remember vividly they were saying look

10:29 - 10:33

it's in biology everything is so messy

10:31 - 10:36

but here you have one rare example where

10:33 - 10:38

there is a very tight relationship

10:36 - 10:41

between the size of the body of the

10:38 - 10:43

animal and their brain and totally

10:41 - 10:46

randomly I became curious at this graph

10:43 - 10:48

and one of the early one of the early so

10:46 - 10:50

I went to Google to do research to to

10:48 - 10:53

look for this graph and one of the

10:50 - 10:55

images and Google Images was this and

10:53 - 10:58

the interesting thing in this

10:55 - 11:00

image is you see like I don't know is

10:58 - 11:02

the mouse work working oh yeah the mouse

11:00 - 11:05

is working great so you've got this

11:02 - 11:07

mammals right all the different

11:05 - 11:10

mammals then you've got nonhuman

11:07 - 11:12

primates it's basically the same thing

11:10 - 11:16

but then you've got the hominids and to

11:12 - 11:19

my knowledge hominids are like close

11:16 - 11:22

relatives to the humans in

11:19 - 11:22

evolution like the

11:24 - 11:29

neand there's a bunch of them like it's

11:27 - 11:33

called homohabilis maybe there a whole

11:29 - 11:35

bunch and they're all here and what's

11:33 - 11:38

interesting is that they have a

11:35 - 11:40

different slope on their brain to body

11:38 - 11:43

scaling

11:40 - 11:46

exponent so that's pretty cool what that

11:43 - 11:49

means is that there is a precedent there

11:46 - 11:53

is an example

11:49 - 11:56

of biology figuring out some kind of

11:53 - 11:58

different scaling something clearly is

11:56 - 12:00

different so I think that is cool and by

11:58 - 12:02

the way I want to highlight highl light

12:00 - 12:06

this xaxis is log scale you see this is

12:02 - 12:12

100 this is a th000 10,000 100,000 and

12:06 - 12:12

likewise in grams 1 g 10 G 100 g th000

12:13 - 12:17

g
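For what a straight line on that log-log plot encodes: brain mass and body mass are related by a power law, and a different slope for the hominids means a different exponent. The notation below is added for clarity; no specific exponent values are claimed.

```latex
% Allometric (power-law) scaling: a straight line on a log-log plot.
m_{\mathrm{brain}} \approx a \, m_{\mathrm{body}}^{\,k}
\quad\Longleftrightarrow\quad
\log m_{\mathrm{brain}} \approx \log a + k \log m_{\mathrm{body}}
% A different slope for the hominid points means a different exponent k
% (and/or intercept a) from the rest of the mammals.
```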

12:15 - 12:19

So it is possible for things to be different. The things that we've been scaling so far are actually just the first thing that we figured out how to scale, and without a doubt the field, everyone who's working here, will figure out what to do.

12:33 - 12:38

But I want to take a few minutes here and speculate about the longer term. The longer term: where are we all headed? We're making all this progress. It's astounding progress. I mean, those of you who were in the field ten years ago remember just how incapable everything was. Even if you say, "of course, deep learning," still, to see it is just unbelievable. I can't convey that feeling to you. If you joined the field in the last two years, then of course you speak to computers and they talk back to you and they disagree, and that's what computers are, but it hasn't always been the case.

13:21 - 13:27

But I want to talk a little bit about superintelligence, just a bit, because that is obviously where this field is headed. This is obviously what's being built here. And the thing about superintelligence is that it will be qualitatively different from what we have. My goal in the next minute is to try to give you some concrete intuition of how it will be different, so that you yourself can reason about it.

it so right now we have our incredible

13:51 - 13:54

language models and the unbelievable

13:53 - 13:57

chat bot and they can even do things but

13:54 - 13:59

they're also kind of strangely

13:57 - 14:01

unreliable and they get confused

13:59 - 14:04

when while also

14:01 - 14:06

having dramatically superhuman

14:04 - 14:09

performance on evals so it's really

14:06 - 14:13

unclear how to reconcile this but

14:09 - 14:15

eventually sooner or later the following

14:13 - 14:17

will be achieved those systems are

14:15 - 14:19

actually going to be agentic in a real

14:17 - 14:22

ways whereas right now the systems are

14:19 - 14:24

not agents in any meaningful sense just

14:22 - 14:27

very that might be too strong they're

14:24 - 14:30

very very slightly agentic just

14:27 - 14:31

beginning it will actually reason and by

14:30 - 14:35

They will actually reason. And by the way, I want to mention something about reasoning: with a system that reasons, the more it reasons, the more unpredictable it becomes. The more it reasons, the more unpredictable it becomes. All the deep learning that we've been used to is very predictable, because we've essentially been working on replicating human intuition, the gut feeling. If you come back to the 0.1-second reaction time, the kind of processing we do in our brains, well, it's our intuition. So we've endowed our AIs with some of that intuition. But reasoning (and you're seeing some early signs of this) is unpredictable, and one way to see that is that the chess AIs, the really good ones, are unpredictable to the best human chess players.

15:20 - 15:25

So we will have to deal with AI systems that are incredibly unpredictable. They will understand things from limited data. They will not get confused. All of these are really big limitations today. I'm not saying how, by the way, and I'm not saying when; I'm saying that it will happen. And when all those things happen together with self-awareness (because why not? self-awareness is useful; we ourselves are parts of our own world models), when all those things come together, we will have systems of radically different qualities and properties than what exists today. And of course they will have incredible and amazing capabilities. But the kind of issues that come up with systems like this, I'll just leave it as an exercise to imagine: it's very different from what we're used to. And I would say that it's definitely also impossible to predict the future, really; all kinds of stuff is possible. But on this uplifting note, I will conclude. Thank you so much.

16:31 - 16:35

[Applause]

16:33 - 16:41

[Music]

16:35 - 16:41

[Applause]

16:44 - 16:47

Thank you. Now, in 2024, are there other biological structures that are part of human cognition that you think are worth exploring in a similar way, or that you're interested in anyway?

17:03 - 17:10

So the way I'd answer this question is that if you are, or someone is, a person who has a specific insight, along the lines of "we are all being extremely silly, because clearly the brain does something that we are not doing, and it's something that can be done," then they should pursue it. I personally don't. Well, it depends on the level of abstraction you're looking at. Maybe I'll answer it this way: there's been a lot of desire to make biologically inspired AI, and you could argue on some level that biologically inspired AI is incredibly successful, since all of deep learning is biologically inspired AI. But on the other hand, the biological inspiration was very, very, very modest. It's "let's use neurons"; that is the full extent of the biological inspiration. More detailed biological inspiration has been very hard to come by, but I wouldn't rule it out. I think if someone has a special insight, they might be able to see something, and that would be useful.

18:14 - 18:19

I have a question for you about, sort of, autocorrect. So here's the question: you mentioned reasoning as being one of the core aspects of the modeling in the future, and maybe a differentiator. What we saw in some of the poster sessions is that, for hallucinations in today's models (maybe you can correct me, you're the expert on this), the way we're analyzing whether a model is hallucinating today, because we know of the dangers of models not being able to reason, is that we're using a statistical analysis, say some number of standard deviations away from the mean. In the future, do you think that a model, given reasoning, will be able to correct itself, sort of autocorrect itself, and that this will be a core feature of future models, so that there won't be as many hallucinations, because the model will recognize when (maybe that's too esoteric a question) but the model will be able to reason and understand when a hallucination is occurring? Does the question make sense?

19:18 - 19:24

[Ilya Sutskever] Yes, and the answer is also yes. I think what you described is extremely highly plausible. I mean, you should check; I wouldn't rule out that it might already be happening with some of the early reasoning models of today, I don't know. But longer term, why not?

[Questioner] Yeah, I mean, it's part of, like, Microsoft Word; like autocorrect, it's a core feature.

[Ilya Sutskever] I just think calling it "autocorrect" is really doing it a disservice. When you say "autocorrect" you evoke something much smaller; this is far grander than autocorrect. But, that point aside, the answer is yes.

[Questioner] Thank you.

20:03 - 20:11

Hi Ilya, I loved the ending, mysteriously leaving out: do they replace us, or are they superior? Do they need rights? It's a new species, a Homo sapiens-spawned intelligence, so maybe they need rights. I mean, I think the RL guy thinks we need rights for these things. I have an unrelated question to that: how do you create the right incentive mechanisms for humanity to actually create it in a way that gives it the freedoms that we have as Homo sapiens?

20:45 - 20:50

[Ilya Sutskever] You know, I feel like in some sense those are the kinds of questions that people should be reflecting on more. But to your question about what incentive structure we should create: I don't feel that I know. I don't feel confident answering questions like this, because you're talking about creating some kind of top-down structure, a government thing. I don't know.

[Questioner] It could be a cryptocurrency too. There's Bittensor, you know, those things.

[Ilya Sutskever] I don't feel like I am the right person to comment on cryptocurrency. But, you know, there is a chance that what you're describing will happen. And in some sense it's not a bad end result if you have AIs and all they want is to coexist with us, and also just to have rights. Maybe that will be fine. But I don't know; I think things are so incredibly unpredictable that I hesitate to comment. But I encourage the speculation.

[Questioner] Thank you, and thank you for the talk; it's really awesome.

22:12 - 22:17

awesome hi there thank you for the great

22:15 - 22:20

talk my name is shalev liit from

22:17 - 22:22

University of Toronto working with

22:20 - 22:26

Sheila thanks for all the work you've

22:22 - 22:29

done I wanted to ask do you think llms

22:26 - 22:31

generalize multihop Reon reasoning out

22:29 - 22:31

of

22:32 - 22:39

So, OK, the question assumes that the answer is yes or no, but the question should not be answered with a yes or a no, because what does out-of-distribution generalization mean? What does it mean to be in distribution, and what does it mean to be out of distribution? Because this is a test-of-time talk, I'll say that long, long ago, before people were using deep learning, they were using things like string matching and n-grams. For machine translation, people were using statistical phrase tables. Can you imagine? They had tens of thousands of lines of code of complexity; it was truly unfathomable. And back then, generalization meant: is it literally not the same phrasing as in the dataset? Now we may say, well, my model achieves this high score on, I don't know, math competitions. But maybe some discussion on some forum on the internet was about the same ideas, and therefore it's memorized. Well, OK, you could say maybe it's in distribution, maybe it's memorization. But I also think that our standards for what counts as generalization have increased, really quite substantially, dramatically, unimaginably, if you keep track. And so I think the answer is: to some degree, probably not as well as human beings. I think it is true that human beings generalize much better. But at the same time, they definitely do generalize out of distribution to some degree. I hope that's a useful, if somewhat tautological, answer. Thank you.

24:22 - 24:26

And unfortunately, we're out of time for this session. I have a feeling we could go on for the next six hours, but thank you so much, Ilya, for the talk. Thank you. Wonderful.

24:32 - 24:38

[Applause]

24:34 - 24:38

[Music]

Exploring the Future of AI: Retrospective and Speculation

In this talk, the speaker reflects on the award-winning paper and on the evolution of AI over the past decade. The focus is on the key elements of the work: an autoregressive model trained on text, a large neural network, and a large dataset. Looking back to a time when 10-layer neural networks were cutting edge, the talk revisits the origins of ideas like the deep learning hypothesis and autoregressive sequence models, and the retrospective notes which of the original claims held up and which did not. The discussion covers LSTMs, parallelization strategies such as pipelining, and the scaling hypothesis: the idea that training a very big neural network on a very big dataset guarantees success. Moving beyond pre-training, the talk turns to the future of AI and the looming limits on training data.

The speaker speculates on what lies ahead, mentioning agents, synthetic data, and inference-time compute as candidate directions once the pre-training era ends. Connectionism, the idea that artificial neurons are roughly analogous to biological neurons, is highlighted as the core concept that has stood the test of time and that led to the GPT models and the scaling laws. The talk concludes on a thought-provoking note about superintelligent AI, emphasizing that systems which truly reason will be far less predictable than the ones we have today.

In the Q&A session, topics like hallucinations in AI models, incentive mechanisms for creating AI, generalization capabilities of language models, and the future coexistence of AI and humanity are explored. The speaker's responses underscore the complexity and uncertainty surrounding the future of AI, emphasizing the need for thoughtful consideration and speculation in navigating this evolving landscape.

As we ponder the trajectory of AI and its implications for society, one thing is certain - the future holds a realm of possibilities, challenges, and ethical considerations that we are only beginning to grasp. Embracing the unknown with curiosity and caution will be key as we venture further into the realms of artificial intelligence and its impact on our world.


Key Points:

  • AI Evolution Over the Decade
  • Concepts: Auto-regressive Models, Neural Networks, Big Data
  • Retrospective Analysis and Predictions
  • Speculation on the Future of AI
  • Implications for Society and Ethical Considerations