00:10 - 00:13

Alright everyone, we're about to start our next session. Again,

00:13 - 00:17

some amazing content we're gonna get through. Just a quick

00:17 - 00:20

couple reminders, this is a hybrid event. So as you

00:20 - 00:23

see, we are all here in person, but we also

00:23 - 00:26

have audiences joining us virtually. So if you want to

00:26 - 00:28

join in the Q&A in the chat, just go ahead

00:28 - 00:31

and just scan those QR codes to the left and

00:31 - 00:31

right

00:31 - 00:34

as well. But if you also want to talk with

00:34 - 00:37

some of our experts, we do have experiences up on

00:37 - 00:39

level 5, where you'll be

00:39 - 00:42

able to talk to our experts and also get

00:42 - 00:45

your hands on some cool demos upstairs. So with that,

00:45 - 00:47

let's go ahead and let's get started.

00:48 - 00:49

Thanks.

00:49 - 00:50

Morning, everybody.

00:55 - 00:56

How's build going so far?

00:57 - 00:59

So that's good, good build. Alright,

01:00 - 01:03

Welcome to Inside Microsoft AI Innovation. My name is Mark

01:03 - 01:07

Russinovich. Just out of curiosity, how many people have seen

01:07 - 01:10

a previous version of this particular session? So a few

01:10 - 01:12

of you have. So you're going to see some new

01:12 - 01:16

things. Those people that have seen this before. And for

01:16 - 01:18

those of you that haven't seen this before, I hope

01:18 - 01:21

to show you a highlighted tour through our AI stack from

01:22 - 01:23

the bottom to the top,

01:23 - 01:27

focusing on infrastructure. So you've heard a lot about Copilot

01:27 - 01:30

and writing apps and how to mitigate security risks. I've

01:30 - 01:33

talked about how to develop copilots on top of semantic

01:33 - 01:34

kernel.

01:34 - 01:37

This is gonna actually go one layer beneath that to

01:37 - 01:40

show you how we run AI inside of Azure. And

01:40 - 01:43

I'm going to start by going through different aspects of

01:43 - 01:47

our infrastructure, starting with compute. Then I'll talk a little

01:47 - 01:50

bit about our network and how we design our network,

01:50 - 01:53

what makes it unique. And then I'll talk about storage

01:53 - 01:56

and some of the innovations that we've got in storage

01:56 - 01:59

that we've deployed to make our infrastructure more efficient and

01:59 - 02:02

efficiency is actually the key to the game. And

02:02 - 02:03

if you take a look

02:03 - 02:05

at what's happening

02:05 - 02:07

to challenge our ability to be

02:07 - 02:07

efficient,

02:08 - 02:12

You can see that the sizes of these frontier models

02:12 - 02:16

have continued to grow basically exponentially. The biggest models are

02:16 - 02:20

the most powerful models and this trend line doesn't seem

02:20 - 02:24

to be slowing down. You heard Kevin Scott talk about

02:24 - 02:27

with Sam Altman that we don't see an end to

02:27 - 02:30

the scaling laws of capability in sight. And what this is

02:30 - 02:34

doing is driving more and more compute requirements out of

02:34 - 02:36

the infrastructure.

02:37 - 02:39

One of the ways to look at this is this

02:39 - 02:42

amount of compute required to train these models, because they're

02:42 - 02:45

not only getting big, but the amount of data that

02:45 - 02:48

they need to process during a training run continues to

02:48 - 02:49

grow. So you've heard about tokens,

02:50 - 02:53

and when you train one of these models, you're talking

02:53 - 02:57

about not just trillion tokens, but 10s of trillions of

02:57 - 03:00

tokens are passed through the model for them to learn

03:00 - 03:02

how to get these capabilities.

03:03 - 03:07

The first AI supercomputer we built back in 2019 for

03:07 - 03:10

Open AI was used to train GPT 3.

03:11 - 03:13

And the size of that infrastructure, if we'd submitted it

03:14 - 03:16

to the top 500 supercomputer benchmark would have been in

03:16 - 03:19

the top five supercomputers in the world on premises or

03:19 - 03:22

in the cloud, and the largest in the cloud.

03:23 - 03:27

In November, we talked about the next generation, 2 generations

03:27 - 03:30

later, supercomputer. We built one to train GPT-4

03:30 - 03:33

after the GPT-3 one. And this one we're building

03:33 - 03:37

out to train the next generation of frontier models. With

03:37 - 03:40

a slice of that infrastructure, just a piece

03:40 - 03:44

of that supercomputer that we were bringing online. We actually

03:44 - 03:47

did an official top 500 run and it came in

03:47 - 03:51

number three largest supercomputer in the world and the number

03:51 - 03:54

one largest supercomputer in the public cloud.

03:54 - 03:58

Again, that's a tiny fraction of the supercomputer that we're

03:58 - 04:01

ultimately going to finish building for OpenAI, and we're already on

04:01 - 04:04

to the next one and designing the next one

04:04 - 04:05

after

04:05 - 04:09

that, which are multiple times even bigger than this one.

04:09 - 04:12

Now just give you an idea of why they need

04:12 - 04:15

the scale. If you take a look at the training

04:15 - 04:19

requirements for something like Llama 3 70 billion, which is kind

04:19 - 04:22

of a medium sized model at this point, it's not

04:22 - 04:25

the largest type model. In fact, Meta is coming out

04:25 - 04:28

with a 400 billion parameter version of Llama 3

04:28 - 04:32

in the near future. They talk about this required 6.4

04:32 - 04:35

million GPU hours on an H100, which is the current

04:35 - 04:38

generation supercomputer infrastructure.

04:39 - 04:42

If you've trained this model on one GPU, that would

04:42 - 04:45

take 730 years to train that model.

04:45 - 04:48

But on the supercomputer that you see on the left,

04:48 - 04:51

that tiny slice of it, that would take about 27

04:51 - 04:54

days. And again, that's a tiny fraction. So just divide

04:54 - 04:56

by the multiplier on top of the size of that super

04:56 - 04:59

computer, and you can start to get down

04:59 - 05:02

into training a model like this literally in days on

05:02 - 05:04

the supercomputers we're building now.
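
As a quick aside, here is the back-of-envelope arithmetic behind those numbers, as a minimal Python sketch. The 6.4 million GPU-hour figure is the one quoted above for Llama 3 70B; the specific GPU counts are illustrative assumptions, not Azure's actual cluster sizes.

```python
# Back-of-envelope math for the Llama 3 70B training figures quoted above.
GPU_HOURS = 6.4e6          # H100 GPU-hours reported for Llama 3 70B

def training_days(num_gpus: int) -> float:
    """Wall-clock days, assuming ideal parallel scaling (a simplification)."""
    return GPU_HOURS / num_gpus / 24

print(f"1 GPU:       {training_days(1) / 365:,.1f} years")   # ~730 years
print(f"10,000 GPUs: {training_days(10_000):,.1f} days")     # ~27 days
print(f"50,000 GPUs: {training_days(50_000):,.1f} days")     # divide by the multiplier
```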

05:06 - 05:08

Another way to look at the amount of compute

05:08 - 05:11

required for these models is to look at the amount

05:11 - 05:14

of memory they require. Beyond the number

05:14 - 05:17

of flops or tops that they require for processing tokens,

05:17 - 05:19

you need to look at how much memory footprint they

05:19 - 05:20

require,

05:21 - 05:24

and this is especially important when it comes to inference

05:24 - 05:27

loading the model and serving it. If you take a

05:27 - 05:30

look at inference, you typically load a model in 16

05:31 - 05:34

bit format, so two bytes. So each parameter is 2

05:34 - 05:37

bytes, which means that if you've got a billion parameters,

05:37 - 05:41

that's 2 billion bytes, or 2 gigabytes, of memory that you

05:41 - 05:44

require to load and store it. And so if you

05:44 - 05:45

take a look at a

05:45 - 05:50

175 billion parameter model, which was GPT-3's size, that's

05:50 - 05:54

350 gigabytes just to load the model. But that's not

05:54 - 05:56

all you need for inference.
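
Here is that 2-bytes-per-parameter rule of thumb as a tiny Python sketch; the parameter counts are the ones mentioned in this talk, and the bytes-per-parameter value assumes 16-bit loading as described above.

```python
# Rough memory needed just to hold model weights for inference,
# assuming 16-bit (2 bytes per parameter) loading.

def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9   # decimal GB

for p in (1, 13, 70, 175, 400):
    print(f"{p:>4}B parameters -> {weight_memory_gb(p):>6.0f} GB of weights")
# 175B -> 350 GB, already far more than any single GPU's HBM
```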

05:57 - 06:00

Because you load the model, those are the parameters or

06:00 - 06:03

the weights into the GPU. You also have to store

06:03 - 06:06

the processing of the tokens and this is called the

06:06 - 06:10

key value cache, storing what it processed when it's looking

06:10 - 06:13

at the prompt and starting to generate tokens and that

06:13 - 06:15

needs to be retained for efficiency.

06:16 - 06:19

This is kind of the rough layout of a 13

06:19 - 06:23

billion parameter model on a 40 gigabyte A100 GPU.
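<br>
To make that layout concrete, here is a minimal sketch of why the key value cache competes with the weights for HBM. The layer and hidden sizes below are typical for a 13-billion-parameter-class transformer; they are assumptions for illustration, not the exact model on the slide.

```python
# Why the KV cache competes with the weights for HBM on a 40 GB A100.
BYTES = 2                      # fp16
LAYERS, HIDDEN = 40, 5120      # typical 13B-class transformer (assumed)
PARAMS = 13e9

weights_gb = PARAMS * BYTES / 1e9                  # ~26 GB of weights
kv_per_token = 2 * LAYERS * HIDDEN * BYTES         # K and V, every layer, per token
budget_gb = 40 - weights_gb                        # what's left of the HBM

max_tokens = budget_gb * 1e9 / kv_per_token
print(f"weights: {weights_gb:.0f} GB, KV cache: {kv_per_token/1e6:.2f} MB per token")
print(f"~{max_tokens:,.0f} tokens of KV cache fit in the remaining {budget_gb:.0f} GB")
```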

06:24 - 06:27

The fact is that the models of course are getting

06:27 - 06:30

bigger and bigger. This is a 13 billion parameter model.

06:30 - 06:32

If you go to 70 or 400 billion, or up

06:32 - 06:36

into the trillion parameter models that you're seeing at the

06:36 - 06:36

frontier,

06:37 - 06:40

then you're talking about many times larger than the sizes

06:40 - 06:43

of the high bandwidth memory that is coming in these

06:43 - 06:46

GPUs. So if you take a look at the evolution

06:46 - 06:48

of GPUs, one of the key aspects of it is

06:48 - 06:51

not just the performance, but how much high bandwidth memory

06:51 - 06:52

comes with it. In

06:52 - 06:53

fact,

06:53 - 06:57

this has become a limiting factor for GPUs. You can

06:57 - 07:02

see here they've been roughly doubling every year and a

07:02 - 07:05

half or so up until just this past year, when

07:05 - 07:07

we saw the A100

07:07 - 07:10

80 gigabyte come out and then H100 came out at

07:10 - 07:14

80 gigabyte and immediately AMD turned around with the MI300X,

07:14 - 07:18

which you heard about us launching at 192 gigabytes. And

07:18 - 07:22

this made it very compelling for these large models, not

07:22 - 07:26

just large models, but loading them for inference, as you'll see later.

07:26 - 07:29

And so this is kind of created a race to

07:29 - 07:32

build more and more high bandwidth memory. You can see

07:32 - 07:36

Grace Blackwell 200 coming in at 384 gigabytes of high

07:36 - 07:37

bandwidth memory

07:37 - 07:41

and that makes it extremely powerful for training and especially

07:41 - 07:44

for inference because you can load these big models or

07:44 - 07:47

multiple big models on the same GPU and get more

07:47 - 07:48

efficiency out of it.

07:50 - 07:51

So this is one of the reasons why we have

07:52 - 07:54

this diverse array of hardware in our data centers. We've

07:54 - 07:57

got the lines from NVIDIA. We've partnered closely with them,

07:57 - 07:59

driving requirements into them, including

08:00 - 08:03

information about how to run these large models and why

08:03 - 08:05

we need so much memory. AMD,

08:05 - 08:06

with that large

08:06 - 08:10

HBM that they've got, that's directly through feedback from us

08:10 - 08:13

and OpenAI. And then our own Azure Maia 100,

08:13 - 08:16

which you've heard about, our own custom silicon, which isn't

08:16 - 08:19

a GPU, it's a custom accelerator for AI.

08:20 - 08:21

And that

08:23 - 08:26

HBM that is continuing to grow, the performance that's continuing

08:26 - 08:29

to grow, the die sizes that are continuing to grow

08:29 - 08:32

is leading to more and more power consumption at the

08:32 - 08:32

same time.

08:33 - 08:36

If you take a look now at another dimension of

08:36 - 08:40

GPU evolution, number of watts per GPU, you can see

08:40 - 08:42

this is also growing exponentially.

08:42 - 08:44

Because of those other factors.

08:45 - 08:48

The latest NVIDIA GB 200 with all its high bandwidth

08:48 - 08:51

memory and all its transistors (it's over 200 billion on

08:51 - 08:51

a wafer)

08:52 - 08:57

is 1.2 kilowatts just for one GPU, and there's eight

08:57 - 09:01

of them in a single server. And that's just the

09:01 - 09:04

GPUs, not the RAM and the CPUs

09:04 - 09:07

that go with that. So now you're talking about close

09:07 - 09:09

to 10 kilowatts

09:09 - 09:13

of power for just one server, which is just an

09:13 - 09:17

amazing amount of power for a server in a data

09:17 - 09:17

center.

09:19 - 09:20

Now how do you cool something like

09:20 - 09:21

this?

09:21 - 09:24

Traditionally it's been air cooled, but we've had liquid

09:24 - 09:26

cooling for a long time. In fact, this is my

09:26 - 09:27

home desktop machine,

09:28 - 09:29

which is liquid

09:29 - 09:32

cooled so that I can play games like Battlefield 2142,

09:32 - 09:33

which is my favorite game

09:34 - 09:36

and this is what it looks like when I run

09:37 - 09:39

it at 80% GPU utilization which is when I'm running

09:39 - 09:41

Battlefield at high res.

09:41 - 09:44

56°C and it's able to run that hot because of

09:44 - 09:48

the liquid cooling, because there's no way fans would be

09:48 - 09:51

able to keep it operating at 80% at that temperature.

09:51 - 09:55

Now this is the operating temperature for a consumer GPU.

09:55 - 09:58

The operating temperature for a data center GPU.

09:59 - 10:00

By the way, this is

10:00 - 10:04

a DALL-E image of our data center if we don't

10:04 - 10:06

have a good cooling solution.

10:07 - 10:11

You can see the H100 operating temperature tops out at 30°C,

10:11 - 10:14

so even way lower than what I've got in my

10:15 - 10:16

consumer GPU.

10:16 - 10:17

So we've got

10:17 - 10:19

to force a ton of air. You can see how

10:19 - 10:21

much air is required to flow through that thing.

10:22 - 10:25

This is just becoming unsustainable. We cannot push enough air

10:25 - 10:28

through our data centers to cool these kinds of systems

10:28 - 10:30

when they get to this kind of scale and

10:30 - 10:32

get the density that we need in the data center

10:32 - 10:33

footprint.

10:34 - 10:37

So we're having to turn to other solutions. And Maia

10:37 - 10:40

is our first step towards the new design of data

10:41 - 10:44

centers in the cloud. Maia is a liquid cooled system.

10:44 - 10:47

This is a Maia board. So Maia boards have four

10:48 - 10:51

Maia accelerators. You can see those sheaths there at the

10:51 - 10:56

top. They're covering the Maia accelerator parts underneath them because

10:56 - 10:57

those sheaths

10:59 - 11:03

are what carry the liquid into the plates that are

11:03 - 11:05

on top of the Maia accelerators.

11:06 - 11:10

So these are liquid cooled systems custom designed by us.

11:10 - 11:12

Here's another look at that.

11:13 - 11:15

This is the liquid coolant in and out that goes

11:15 - 11:19

into those sheaths carrying the water, the liquid into those

11:19 - 11:19

plates.

11:20 - 11:23

And this is allowing us to keep these systems cool

11:23 - 11:26

and to save water and to save energy. So it's

11:26 - 11:30

like a win, win, win. It's more complicated because we're

11:30 - 11:33

having to design cold plates custom for these things and

11:33 - 11:37

design custom liquid cooling systems. We can't go buy these

11:37 - 11:38

things off the shelf,

11:39 - 11:40

so we've been building

11:41 - 11:45

custom liquid cooled sidekicks and you've probably seen this video

11:45 - 11:48

before of the Maia rack, which consists of two sides.

11:48 - 11:50

On the right is the back of the

11:51 - 11:54

Maia rack. There's four Maia servers on the top, four

11:54 - 11:57

Maia servers on the bottom, then the front end networking

11:57 - 12:00

in the middle, and then that second cabinet next to

12:00 - 12:01

it.

12:01 - 12:04

Is the liquid closed loop liquid cooling system. You can

12:04 - 12:07

see the cable, the cooling that was there on the

12:07 - 12:10

back. Those giant cables carrying liquid hot and cold into

12:10 - 12:11

and out of,

12:11 - 12:13

or cold and hot into and out of the Maia

12:13 - 12:16

rack next to it, and those cables that we just

12:16 - 12:18

saw in the slide before it.

12:19 - 12:21

This is actually.

12:21 - 12:24

Like I said, our first deployments of liquid cooling in

12:24 - 12:25

the Azure data center,

12:26 - 12:29

and it is the trend you will see liquid cooling

12:29 - 12:32

for next generations of GPUs coming out from AMD and

12:32 - 12:36

NVIDIA for their high end offerings into the

12:36 - 12:39

data centers. And we are prepared for this both with

12:39 - 12:42

the sidekick like this that will work with those systems

12:42 - 12:46

as well as natively liquid cooled data center footprints that

12:46 - 12:50

can support liquid cooling without the sidekicks. The sidekicks let

12:50 - 12:53

us take this liquid cooling into any data center.

12:54 - 12:56

Now power.

12:56 - 12:59

When we look at the power consumption of these models,

12:59 - 13:02

we see something interesting for training. Once you get a

13:02 - 13:05

training run going, you get up to close to 100%

13:05 - 13:05

power draw.

13:06 - 13:09

And stay pretty constant there. As the training run goes,

13:09 - 13:09

there might.

13:09 - 13:10

Be a few.

13:10 - 13:13

dips and spikes, but inference is a little bit different.

13:13 - 13:17

We'll take a closer look later, but with inference you

13:17 - 13:20

can see these very sharp spikes: relatively low power,

13:20 - 13:21

and then a big

13:21 - 13:21

spike.

13:22 - 13:26

What's happening there is the difference between prompt processing and

13:26 - 13:30

token generation, which I'll get into later with some details.

13:30 - 13:33

But what we saw when we took a look at

13:33 - 13:35

these traces in the data center

13:36 - 13:39

is that we can actually save power

13:39 - 13:42

by not running at 100% of

13:42 - 13:46

rated draw, but by oversubscribing the

13:46 - 13:47

power in the data center.

13:49 - 13:51

The idea is when you have so many

13:52 - 13:54

of these jobs happening in the same time, they're not

13:54 - 13:56

all going to spike at the same time. So there's

13:56 - 13:59

going to be an average power utilization which is below

13:59 - 13:59

that 100%.

14:00 - 14:02

And you can see the kind of headroom.

14:02 - 14:03

That we calculate.

14:03 - 14:07

3% for training, 21% for inference. That means we can

14:07 - 14:10

over subscribe power by 20% safely.

14:11 - 14:14

So we're building systems, and have actually started to deploy these

14:14 - 14:17

in our data centers that are monitoring the power draw

14:17 - 14:18

of these GPUs and servers

14:19 - 14:23

and then basically over subscribing the power and with the

14:23 - 14:26

ability to cap or throttle when we are reaching limits

14:26 - 14:29

if they all happen to suddenly spike at the same

14:29 - 14:32

time, we don't want the data center to fail. So

14:32 - 14:36

we throttle the frequency on the servers. We also throttle

14:36 - 14:38

the power going into the racks

14:39 - 14:43

through software managed power control in the data center. And

14:43 - 14:46

what this is doing is allowing us to put 30%

14:46 - 14:49

more servers in our existing data center footprints.
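
To make the oversubscription idea concrete, here is a toy Python sketch of the two halves of that approach: placing more servers than the rated budget strictly allows, and a software control loop that caps when aggregate draw gets close to the limit. The thresholds and the throttle action are illustrative assumptions, not the actual Azure power controller.

```python
# Toy sketch of power oversubscription plus software-managed capping.

def plan_capacity(rack_budget_kw: float, server_peak_kw: float,
                  oversubscribe: float = 1.2) -> int:
    """How many servers to place, given measured headroom (e.g. ~20% for inference)."""
    return int(rack_budget_kw * oversubscribe / server_peak_kw)

def control_loop(measured_kw: float, rack_budget_kw: float) -> str:
    """Throttle decision made by the software power controller (illustrative)."""
    if measured_kw > 0.95 * rack_budget_kw:
        return "cap GPU frequency / reduce rack power"
    return "run at full speed"

print(plan_capacity(rack_budget_kw=100, server_peak_kw=10))   # 12 servers instead of 10
print(control_loop(measured_kw=97, rack_budget_kw=100))       # throttle before the limit
```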

14:50 - 14:52

This is literally hundreds of millions of dollars of saving

14:53 - 14:54

through power every year.

14:54 - 14:54

So

14:55 - 14:56

a lot of questions I get is, what are you

14:57 - 15:00

doing about sustainability? Lots of things. Liquid cooling is more

15:00 - 15:03

sustainable. It's actually much less water usage than our air

15:03 - 15:05

cooled data center. And this is another way that we're

15:05 - 15:08

getting more sustainable is just by making more efficient use

15:08 - 15:09

of that same power.

15:10 - 15:12

Now let's talk about networking.

15:12 - 15:15

In networking. Just to give you an idea of the

15:15 - 15:19

networking requirements of large scale AI, an AI training job

15:19 - 15:23

uses something called Data Parallel processing.

15:23 - 15:25

Where you've got lots of instances of the model all

15:26 - 15:28

learning different parts of the data at the same time.

15:28 - 15:31

And then there's this thing at the end called All

15:31 - 15:34

Reduce where they all share what they've learned and update

15:34 - 15:35

together.
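
Here is that all reduce communication pattern as a tiny pure-Python simulation: every worker contributes its gradients and every worker ends up with the same averaged result. Real systems do this with collective libraries over NVLink and InfiniBand; this sketch only shows the pattern, not the ring or tree implementation.

```python
# Toy simulation of data parallel training's all-reduce step.
from typing import List

def all_reduce_mean(per_worker_grads: List[List[float]]) -> List[float]:
    n = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]   # reduce across workers
    return [v / n for v in summed]                            # every worker receives this

grads = [
    [0.1, 0.4, -0.2],   # worker 0, gradients from its shard of the batch
    [0.3, 0.0, -0.4],   # worker 1
    [0.2, 0.2,  0.0],   # worker 2
]
print(all_reduce_mean(grads))   # ~[0.2, 0.2, -0.2], shared by all workers
```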

15:35 - 15:40

This sharing requires massive coordination across all the GPUs. They're

15:40 - 15:44

all sharing information with each other across whatever scale they

15:44 - 15:46

run at. And if you take a look at the

15:46 - 15:49

scale of those systems I was talking about, those

15:49 - 15:52

supercomputers, there are 10s of thousands of servers

15:53 - 15:56

and they all have to be connected together to make

15:56 - 15:57

that efficient all reduce happen.

15:58 - 16:02

So this requires high bandwidth connections between them and

16:02 - 16:05

low latency. And within the servers you also need low

16:06 - 16:10

latency because there's basically like mini data centers inside those

16:10 - 16:13

servers. There's eight GPUs on one of those GB200s

16:13 - 16:15

or H100 systems,

16:15 - 16:19

or the MI300X, and they're connected through their own

16:19 - 16:23

custom high bandwidth interconnects

16:24 - 16:29

that have 1.2 terabytes, sorry, terabytes, not terabits,

16:29 - 16:32

per second of network bandwidth between those GPUs. So if

16:33 - 16:34

you can stay within

16:34 - 16:38

the server, you get amazing amounts of network bandwidth.

16:39 - 16:41

Now, coming out of the servers themselves,

16:42 - 16:46

we have 400 gigabit InfiniBand links coming out for

16:46 - 16:47

each GPU,

16:48 - 16:52

so there's a total of 3.2 terabits of networking coming

16:52 - 16:55

out of these servers, and that is all connected

16:56 - 16:58

through our custom InfiniBand

16:59 - 17:02

topology. We're the only cloud that has InfiniBand at this

17:02 - 17:02

kind of scale,

17:04 - 17:08

and this is really the only difference between the supercomputers

17:08 - 17:10

we build for OpenAI and what we make available

17:11 - 17:13

publicly is the scale of the InfiniBand domain.

17:14 - 17:17

In the case of OpenAI, the InfiniBand domain

17:17 - 17:19

covers the entire supercomputer, which is

17:20 - 17:22

10s of thousands of servers. In the case of our

17:22 - 17:25

public systems where we don't have customers that are looking

17:25 - 17:27

to do training at that kind of scale, or even

17:27 - 17:29

we're not training at that kind of scale,

17:30 - 17:34

that InfiniBand domains are 1000 to 2000 servers in size,

17:34 - 17:37

which is still 10,000 to 20,000 GPUs, which is a

17:37 - 17:41

massive supercomputer itself. But so you can get a massive

17:41 - 17:45

supercomputer of that scale in Azure through our public infrastructure.

17:46 - 17:50

Now, I like to look at pictures of what's inside these

17:50 - 17:54

things. This is NVIDIA H100 system. You can see the

17:54 - 17:57

8 GPUs up there. These are the back of our

17:57 - 18:02

racks, those cyan cables. That's the InfiniBand cables we're laying

18:02 - 18:02

down.

18:03 - 18:06

We've laid down at this point enough InfiniBand in our

18:06 - 18:08

data centers to wrap the Earth five times.

18:09 - 18:12

And by the way, the amount that we're scaling our

18:12 - 18:16

systems, Kevin Scott talked about this, or I think

18:16 - 18:20

Satya did: 30X since November. We've built out 30X

18:22 - 18:25

the size of our AI infrastructure since November, the equivalent

18:25 - 18:28

of five of those supercomputers I showed you every single

18:28 - 18:30

month, and that rate continues to increase.

18:31 - 18:33

So this means we need to lay a lot of

18:33 - 18:37

cabling both for our front end and back end networking.

18:37 - 18:40

Well, not all the innovation has to be super high

18:40 - 18:40

tech.

18:41 - 18:44

One of the things that we found is that the

18:44 - 18:49

traditional cabling in data centers was very low tech. It

18:49 - 18:53

required technicians to just go along, pulling the cables,

18:54 - 18:55

stretching them between servers

18:56 - 19:02

and just hugely inefficient. So what our data center incubations

19:02 - 19:06

team came up with is 3D printed what they call

19:06 - 19:07

boats

19:09 - 19:12

with something called Jeffries that allow them to take these

19:13 - 19:13

boats.

19:14 - 19:17

And pull them down sleds that are above the racks.

19:18 - 19:21

To pull that cabling and this saves them three times

19:21 - 19:23

the time it takes to lay down the cabling.

19:23 - 19:26

And we're using this on our front end networks to

19:26 - 19:29

lay down that much cabling. We started with InfiniBand, but

19:29 - 19:32

we've now got a different solution for InfiniBand, but we're

19:32 - 19:35

using this now and all our data centers starting to

19:35 - 19:37

build this out. So not really a super high tech.

19:37 - 19:41

These are 3D printed things, but innovation, you know,

19:41 - 19:43

comes in whatever way innovation comes.

19:44 - 19:46

Now let me talk a little bit about storage.

19:47 - 19:51

So for training especially, models are big, like we've

19:51 - 19:54

said. You need to distribute them to all these servers.

19:54 - 19:57

For inference, the same thing: lots of models deployed on

19:57 - 20:00

these clusters. How do you distribute all of that data?

20:00 - 20:02

The models? The model checkpoints?

20:03 - 20:05

The key value caches, if you need to move

20:05 - 20:07

them or reload hot.

20:08 - 20:11

What we've built inside of our infrastructure is custom cluster

20:11 - 20:14

storage, we call it storage Accelerator. The idea here is

20:14 - 20:17

we wanted to make this extremely simple and extremely reliable.

20:17 - 20:20

You don't need really a parallel file system for this.

20:20 - 20:23

You just need something that can pull data in, distribute

20:23 - 20:26

it within the cluster so it doesn't have to go

20:26 - 20:27

all the way back to storage.

20:28 - 20:31

And so the solution is when a worker needs a

20:31 - 20:34

model for example, it goes and checks.

20:35 - 20:36

If it's got it.

20:36 - 20:38

or if the cache has it. If it doesn't,

20:38 - 20:39

it pulls it in.

20:40 - 20:43

And then it distributes it to the other servers in

20:43 - 20:43

the cache.

20:44 - 20:48

Which are typically the cluster, the GPU cluster that's assigned

20:48 - 20:51

to this particular inference domain, for example. And it distributes

20:51 - 20:55

this through either InfiniBand or Ethernet, depending on the cluster.

20:56 - 20:59

And it does it without interfering with the running workloads

20:59 - 21:01

there. So it can take advantage of all the free

21:01 - 21:04

bandwidth to distribute these things around so that another worker,

21:04 - 21:07

when it needs something, can very quickly get it from

21:07 - 21:07

the cache.

21:08 - 21:09

Even over InfiniBand.
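
Here is a minimal sketch of that read-through pattern: check the in-cluster cache first, pull from Azure Storage only on a miss, and then peers read the locally held copy. The class and function names are illustrative, not the actual Storage Accelerator API.

```python
# Toy read-through cluster cache for model weights.

class ClusterCache:
    def __init__(self, fetch_from_storage):
        self._blocks = {}                  # model name -> bytes held by cache nodes
        self._fetch = fetch_from_storage   # slow path back to Azure Storage

    def get(self, model_name: str) -> bytes:
        if model_name not in self._blocks:             # cache miss
            self._blocks[model_name] = self._fetch(model_name)
        return self._blocks[model_name]                 # peers now read this copy

def fetch_from_storage(name: str) -> bytes:
    print(f"pulling {name} from Azure Storage (slow path)")
    return b"\x00" * 1024                               # stand-in for ~270 GB of weights

cache = ClusterCache(fetch_from_storage)
cache.get("llama-3-70b")   # first worker: goes all the way to storage
cache.get("llama-3-70b")   # next worker: served from inside the cluster
```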

21:10 - 21:13

So parallel reads from the cache, these are stored in

21:13 - 21:17

blocks. There's no metadata server, there's no replication across the

21:17 - 21:20

servers. It's very simple and very fast, and I'll show

21:21 - 21:23

you a quick demo of that right here.

21:24 - 21:27

So on the left side is without the storage cache

21:27 - 21:29

we're going to load in Llama 3 70B, again, coming

21:30 - 21:33

directly from Azure Storage. On the right side, it's gonna

21:33 - 21:36

pull it from another cache node over InfiniBand in the

21:36 - 21:36

cluster.

21:38 - 21:40

And what we'll see.

21:41 - 21:43

That it took about 12 minutes to load that from

21:43 - 21:44

Azure storage.

21:44 - 21:47

It's a 270 gigabytes of data and it took less

21:47 - 21:50

than half that time to pull it from the other

21:50 - 21:52

server, the cache, in the cluster.

21:53 - 21:56

This is really important for the models as a service

21:56 - 21:58

that we've announced because we're going to have lots of

21:58 - 22:01

models in these clusters. We need to very efficiently get

22:01 - 22:03

the models up and running on whatever GPUs that get

22:03 - 22:06

assigned to. And this is just a fundamental physics problem.

22:06 - 22:08

How do you get that data loaded?

22:10 - 22:11

So that's a look at

22:12 - 22:15

the kind of hardware and the lowest layers of the

22:15 - 22:19

software for running our AI infrastructure. Now I'm gonna go one

22:19 - 22:22

level up and talk about how we resource manage all

22:22 - 22:24

of those GPUs in that infrastructure.

22:25 - 22:28

And it's important to keep in mind that what we're aiming

22:28 - 22:32

at with all of this is efficiency, power efficiency, time

22:32 - 22:35

efficiency, making use of those resources as close to

22:35 - 22:39

100% as possible. So we're not wasting anything.

22:39 - 22:43

And to support that, we've got micro optimizations and we've

22:43 - 22:45

got macro optimizations.

22:45 - 22:47

Let's start with the macro optimizations.

22:48 - 22:51

Everything that I'm talking about now is built on something:

22:51 - 22:55

our internal AI workload platform. It's a resource manager

22:55 - 22:58

that knows about an AI job. It also knows about

22:58 - 23:02

AI models for inference and it's called Project Forge. Internally,

23:02 - 23:05

it's got a different name that we decided wasn't it.

23:05 - 23:09

It's called Singularity, which kind of has

23:09 - 23:10

some negative aspects to it,

23:11 - 23:13

so I'm not supposed to tell you that,

23:15 - 23:17

but you can see that it's got a bunch of

23:17 - 23:21

subsystems associated with it supporting these things. One of the

23:21 - 23:23

key ones is this global scheduler.

23:25 - 23:28

The global scheduler

23:28 - 23:30

treats all of the GPUs in all of the regions

23:30 - 23:33

around the world as a single pool and we call

23:33 - 23:33

it one pool.

23:34 - 23:37

And the idea is that when we have that kind

23:37 - 23:37

of

23:38 - 23:42

capacity to resource manage, we can do it more

23:42 - 23:48

efficiently. Because the challenge that you have with looking at

23:48 - 23:49

cluster by cluster,

23:50 - 23:53

is something that we had at Microsoft up

23:53 - 23:57

until last year and that we see many enterprises have,

23:57 - 24:00

which is you assign GPUs to individual teams.

24:01 - 24:03

And this has 2 problems. One team doesn't use all

24:03 - 24:06

their GPUs, and those GPUs they're not using

24:06 - 24:08

are sitting there doing nothing.

24:08 - 24:11

Another team has used all their GPUs and would like

24:11 - 24:13

to use more GPUs, but they don't have access to

24:13 - 24:16

them, including the ones that the next team next to

24:16 - 24:17

them is not using.

24:18 - 24:20

So the idea with one pool is.

24:21 - 24:25

Everybody gets a virtual GPU, not physical GPUs. They get

24:25 - 24:29

a certain amount at a certain priority. We've

24:29 - 24:32

got three priority levels, low standard and premium,

24:33 - 24:36

and you if a premium job comes in and there's

24:36 - 24:38

a low priority job running on the GPUs it needs, the

24:38 - 24:42

low priority job gets evicted. Now, the interesting thing about the

24:42 - 24:42

eviction

24:43 - 24:46

is that this is a global pool. And just because

24:46 - 24:49

it gets evicted from a cluster in a particular region

24:49 - 24:52

doesn't mean there's not capacity somewhere else that it can

24:52 - 24:52

use.

24:54 - 24:57

And Project Forge knows that. So it might say, well,

24:57 - 25:00

you need A100s, this higher priority job

25:00 - 25:03

needs your A100s, you get evicted. You're gonna go

25:03 - 25:06

now to another region and I'll restart you there on

25:06 - 25:09

those GPUs that the higher priority job can't use.
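
Here is a toy Python sketch of that global priority-and-eviction logic: jobs ask for GPUs at a priority, a premium job can evict a lower priority one, and the evicted job is restarted wherever capacity is still free. Region names, sizes, and the scheduling policy here are illustrative assumptions, not how Project Forge is actually implemented.

```python
# Toy "one pool" scheduler: global capacity, priorities, eviction, restart elsewhere.
PRIORITY = {"low": 0, "standard": 1, "premium": 2}

class GlobalPool:
    def __init__(self, free_gpus_by_region):
        self.free = dict(free_gpus_by_region)
        self.running = []                               # (job, priority, region, gpus)

    def submit(self, job, priority, gpus, region=None):
        regions = [region] if region else list(self.free)
        for r in regions:                               # try free capacity first
            if self.free[r] >= gpus:
                return self._start(job, priority, r, gpus)
        for victim in sorted(self.running, key=lambda v: PRIORITY[v[1]]):
            vjob, vprio, vregion, vgpus = victim        # evict something lower priority
            if vregion in regions and PRIORITY[vprio] < PRIORITY[priority] and vgpus >= gpus:
                self.running.remove(victim)
                self.free[vregion] += vgpus
                self._start(job, priority, vregion, gpus)
                return self.submit(vjob, vprio, vgpus)  # restart the victim anywhere it fits
        print(f"{job} ({priority}): queued, no capacity")

    def _start(self, job, priority, region, gpus):
        self.free[region] -= gpus
        self.running.append((job, priority, region, gpus))
        print(f"{job} ({priority}) running in {region}")

pool = GlobalPool({"eastus": 8, "westeurope": 8})
pool.submit("team-a-experiment", "low", 8, region="eastus")
pool.submit("frontier-training", "premium", 8, region="eastus")
# team-a is evicted from eastus and automatically restarted in westeurope
```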

25:10 - 25:13

And if you take a look, we've migrated all of

25:14 - 25:16

our first party training onto one pool,

25:17 - 25:19

and now if you take a look at three

25:19 - 25:22

different teams here, this is actual charts from three different

25:22 - 25:23

teams utilization.

25:24 - 25:25

You can see team.

25:25 - 25:29

A has over 100% utilization, and the reason why it's

25:29 - 25:30

over 100

25:30 - 25:33

is because that's their guaranteed capacity.

25:34 - 25:36

But they were able to use more because these other

25:36 - 25:38

teams aren't using all their capacity.

25:39 - 25:42

And so this is the benefit, and the average total

25:42 - 25:46

utilization now across these three teams, if you just

25:46 - 25:48

take a look at them, is about 100%.

25:50 - 25:52

And if you take a look in aggregate across all

25:52 - 25:55

of Microsoft for all of our GPU usage for training,

25:55 - 25:59

we went from 50 to 60% utilization to between 80

25:59 - 26:01

and 90 right now and we expect to get even

26:02 - 26:05

higher. And so this is all benefits to our bottom

26:05 - 26:07

line and costs that we pass on.

26:08 - 26:09

Now, another benefit of Project Forge

26:10 - 26:14

is the reliability system. When you're running these large jobs,

26:14 - 26:16

they take days, weeks, or even months in the case

26:16 - 26:18

of some of those Frontier models,

26:19 - 26:22

and with that amount of hardware you're inevitably going to

26:22 - 26:25

have failures on a pretty regular basis. We see failures

26:25 - 26:28

on those large scale systems. If you have 1000 GPU,

26:28 - 26:30

you're going to see a failure roughly every two or

26:30 - 26:33

three days of some kind. A GPU is going to

26:33 - 26:35

fail, the server is going to fail, RAM is going

26:35 - 26:37

to fail, a network link is going to fail.

26:38 - 26:41

And if you're having to babysit those jobs and manually

26:41 - 26:45

restart them, manually diagnose what's going on, manually move it

26:45 - 26:48

to another healthy server, you're just never going to make

26:48 - 26:52

progress and you're going to have horrible utilization and efficiency.

26:53 - 26:56

So Project Forge is designed with reliability in mind. This

26:57 - 27:00

is an actual trace from Project Forge dashboard. It's got

27:00 - 27:04

automatic failure detection. So it will automatically detect if any

27:04 - 27:07

of these kinds of parts fail, it'll automatically diagnose it.

27:08 - 27:12

It'll automatically take it out of rotation, automatically file tickets

27:12 - 27:14

for the data center support people,

27:14 - 27:17

and it will automatically restart the job

27:18 - 27:21

again from the checkpoint on healthy servers to let it

27:22 - 27:25

continue. This is the way we get basically automated long

27:25 - 27:28

running reliability for our jobs.
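
Here is the checkpoint-and-restart pattern that makes that possible, as a minimal self-contained sketch: the job periodically saves its state, and when the platform reschedules it onto healthy servers it resumes from the last checkpoint instead of step zero. The file name, step counts, and JSON format are illustrative, not Project Forge's actual mechanism.

```python
# Minimal checkpoint/resume loop for a long-running training job.
import json, os

CKPT = "checkpoint.json"

def load_checkpoint() -> int:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    with open(CKPT, "w") as f:
        json.dump({"step": step}, f)

def train(total_steps: int = 10_000, ckpt_every: int = 500) -> None:
    step = load_checkpoint()               # 0 on a fresh start, otherwise resume
    print(f"resuming at step {step}")
    while step < total_steps:
        step += 1                          # one gradient update would happen here
        if step % ckpt_every == 0:
            save_checkpoint(step)
    print("done")

train()  # if a GPU or link fails mid-run, the restarted job picks up where it left off
```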

27:28 - 27:31

This is by the way, actual trace. This is for

27:31 - 27:33

example, one of the Phi-3 runs that were done

27:33 - 27:36

on Project Forge. Like I said, all of our

27:36 - 27:38

runs, including all the Phi runs happen on top of

27:38 - 27:41

this project Forge infrastructure. This is 1024 GPUs,

27:42 - 27:45

a job that ran for two days, 11 hours, so

27:45 - 27:47

close to 2 1/2 days.

27:47 - 27:50

On 1100 GPUs.

27:53 - 27:55

So let me take a look at micro optimizations.

27:55 - 27:56

Now.

27:56 - 27:59

If we take a look at the anatomy of an

27:59 - 28:02

LLM inference, it's broken up into two phases. The first

28:02 - 28:06

phase is called the prompt phase. This is where the

28:06 - 28:10

whole prompt is processed in parallel. This is the really

28:10 - 28:13

efficient phase because the GPU can do the whole thing

28:13 - 28:17

at once. Boom, using all of the compute on the

28:17 - 28:20

GPU and it gets the prediction for the next token,

28:20 - 28:22

which in this case is yes.

28:23 - 28:26

Now it goes back and it's entering now the token

28:26 - 28:29

phase, or the generation phase or the decode phase. People

28:29 - 28:32

call it different things, and at this point it's doing

28:32 - 28:33

next word prediction one at a time.

28:34 - 28:37

This next word was "yes." What's the next word? "It."

28:38 - 28:39

"Is."

28:40 - 28:43

And then that's it: end of sequence, EOS.
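
Here is that two-phase anatomy as a schematic Python sketch: the prompt (prefill) phase processes every prompt token in one parallel pass and fills the KV cache, and the token (decode) phase then generates one token at a time reusing that cache. The "predictions" are canned stand-ins, not a real model.

```python
# Schematic of the prompt (prefill) phase vs. the token (decode) phase.

def prefill(prompt_tokens):
    kv_cache = [f"kv({t})" for t in prompt_tokens]   # one parallel, compute-bound pass
    first_token = "Yes"                               # predicted next token
    return kv_cache, first_token

def decode(kv_cache, token, max_new=3):
    generated = [token]
    canned = ["it", "is", "<eos>"]                    # stand-in predictions
    for nxt in canned[:max_new]:                      # one token per step, memory-bound
        kv_cache.append(f"kv({generated[-1]})")
        if nxt == "<eos>":
            break
        generated.append(nxt)
    return generated

cache, tok = prefill(["Is", "the", "sky", "blue", "?"])
print(decode(cache, tok))                             # ['Yes', 'it', 'is']
```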

28:43 - 28:47

So this is the token generation phase. Now very different

28:47 - 28:51

characteristics from these two, which we'll get into in a

28:51 - 28:56

second, but also different AI applications have different ratios of

28:56 - 28:59

prompt versus token generation, or decode.

29:00 - 29:03

On the top left, content creation: short prompt,

29:03 - 29:05

long generation.

29:05 - 29:07

On the bottom right.

29:08 - 29:11

Long prompt for summarization, short generation,

29:12 - 29:14

and then you can see chat bots and enterprise chat

29:14 - 29:16

bots are kind of in the middle.

29:17 - 29:20

If you take a look at naively scheduling

29:21 - 29:24

prompts and generations on the same GPUs,

29:25 - 29:28

what's gonna happen is this: you get a prompt to

29:28 - 29:29

come in from one session.

29:30 - 29:32

The GPU starts to process it.

29:33 - 29:37

And another prompt comes in while it's generating, and

29:37 - 29:40

that first generation slows down because the GPU now is

29:40 - 29:42

busy dealing with that second prompt.

29:43 - 29:45

Now that second prompt, when it's finished,

29:46 - 29:49

goes into generation mode, and now you're memory

29:49 - 29:53

constrained rather than GPU processing constrained. But now

29:53 - 29:57

they can both proceed without interfering with each other. So

29:57 - 30:00

the problem is these prompts that arrive and just

30:00 - 30:03

demand a lot of GPU interfere with the existing

30:03 - 30:07

generations that are happening. How do you deal with that?

30:07 - 30:11

Well, we developed an internal project called Flywheel to deal with

30:11 - 30:14

this. How many people have heard of PTU, or the PTU

30:14 - 30:17

managed offering? So a few of you have. This is

30:17 - 30:19

our serverless GPU offering

30:20 - 30:23

for inference, and it works on top of project Flywheel.

30:23 - 30:26

The idea is we don't want to have to give

30:26 - 30:29

you a whole GPU for your inference where now you're

30:29 - 30:34

responsible for getting that thing to 100% utilization. We'd like

30:34 - 30:35

to give you a

30:35 - 30:36

fraction of a GPU,

30:36 - 30:39

and we're going to share that GPU with other customers,

30:39 - 30:42

of course, in a very secure way. We're not mixing

30:42 - 30:44

the prompts and tokens from different customers,

30:45 - 30:46

but we need to do it in a way that

30:46 - 30:50

gives you guaranteed performance because you don't want your noisy

30:50 - 30:53

neighbor problem, like your app is running and suddenly

30:53 - 30:56

another prompt from another customer comes in and now your

30:56 - 30:59

app slows down. So Project Flywheel is aimed at providing that

30:59 - 30:59

enterprise grade

31:00 - 31:02

consistency. So a prompt comes in,

31:03 - 31:07

starts generating tokens at normal speed, and another prompt comes

31:07 - 31:10

in. We don't process the whole prompt, we chunk it,

31:11 - 31:15

and we interleave its prompt processing with the generation of

31:15 - 31:18

the tokens from the first prompt.

31:19 - 31:21

And now you don't get any kind of that kind

31:21 - 31:24

of interference effect. And the key here is how do

31:24 - 31:26

you do that and provide a consistent

31:27 - 31:31

delivery of throughput, which is the way that you want

31:31 - 31:31

to measure.

31:32 - 31:34

What's the throughput of my generations?
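
Here is a toy version of that chunked-prefill interleaving: instead of running a newly arrived prompt to completion and starving everyone else's generation, the prompt is split into fixed-size chunks and each scheduling step mixes one prompt chunk with one decode step per active session. The chunk size and the round-robin policy are illustrative assumptions, not Project Flywheel's actual scheduler.

```python
# Toy chunked-prefill scheduler that interleaves prompt chunks with decode steps.
from collections import deque

def schedule(sessions, chunk=4):
    """sessions: name -> {'prompt': prompt tokens left, 'decode': tokens left to generate}"""
    queue, timeline = deque(sessions.items()), []
    while queue:
        name, s = queue.popleft()
        if s["prompt"] > 0:                         # process only a chunk of the prompt
            step = min(chunk, s["prompt"])
            s["prompt"] -= step
            timeline.append(f"{name}:prefill[{step}]")
        elif s["decode"] > 0:                       # otherwise emit one generated token
            s["decode"] -= 1
            timeline.append(f"{name}:decode")
        if s["prompt"] > 0 or s["decode"] > 0:
            queue.append((name, s))
    return timeline

print(schedule({
    "chat-session": {"prompt": 0, "decode": 5},     # already generating
    "summarizer":   {"prompt": 10, "decode": 2},    # big prompt arrives
}))
# The big prompt is consumed in chunks, so chat-session keeps emitting tokens
# at a steady rate instead of stalling behind one monolithic prefill.
```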

31:35 - 31:36

So I want to show you a quick demo here

31:36 - 31:37

of Project Flywheel

31:38 - 31:40

up at the top. This is actual trace from Project

31:40 - 31:43

Flywheel. You can see naive prompt and token

31:44 - 31:46

processing. Some of those big blocks are the big prompts

31:46 - 31:49

coming in, the small blocks are the token generations for

31:49 - 31:51

different sessions and different colors.

31:52 - 31:55

You can see that they're all over the place and

31:55 - 31:55

you see big, big

31:55 - 32:00

gaps of white. Those whites are wasted GPU. Plus the

32:00 - 32:02

latency, the throughput,

32:03 - 32:05

is all over the place for these different sessions. Now

32:05 - 32:08

on the bottom with Project Flywheel, you can see everything's

32:08 - 32:08

nice and

32:09 - 32:12

consistent. All of the blocks are the same

32:12 - 32:15

size, whether they're prompt or generation.

32:16 - 32:16

Now here's

32:17 - 32:21

three different workloads, one that has big prompts

32:22 - 32:25

and small generations, one that has about balance, and then

32:25 - 32:29

on the right small prompts, big generations to demonstrate different

32:29 - 32:30

AI workloads.

32:31 - 32:32

They all have different

32:32 - 32:35

Throughput requirements, and we're gonna run them

32:36 - 32:37

and then plot

32:38 - 32:41

what throughput Project Flywheel is giving them.

32:42 - 32:45

Because they've each signed up for their own throughputs that

32:45 - 32:47

they need for their app to behave well.

32:47 - 32:49

And you can see they're all in different colors. You

32:49 - 32:50

can see prompt tokens per minute.

32:52 - 32:55

You can see the different green, yellow and blue for

32:55 - 32:56

those different.

32:57 - 32:57

Mixes.

32:58 - 33:01

And you can see generated tokens per minute. So the

33:01 - 33:05

blue obviously is generating a lot of tokens. That's one

33:05 - 33:07

that would be doing the kind of

33:08 - 33:11

small prompt, big generation. But you can see all of

33:11 - 33:15

them have very consistent throughput, very consistent latency because of

33:15 - 33:19

that multiplexing Project Flywheel is doing. Another benefit

33:19 - 33:22

of this is you can dynamically raise the throughput on

33:22 - 33:25

any one of these and, if there's enough

33:25 - 33:26

capacity on those GPUs, it will

33:27 - 33:27

go

33:27 - 33:28

up.

33:28 - 33:30

So this is the way that you can dynamically without

33:30 - 33:31

having to go buy another GPU.

33:32 - 33:35

Being able to get a slice of GPU and be

33:35 - 33:37

able to dial up and down how much of the

33:37 - 33:40

GPU you need as your job changes and your application's

33:40 - 33:41

requirements change.

33:42 - 33:45

So that's an example of a micro optimization. Want to

33:45 - 33:49

show you another one. How many people have heard of

33:49 - 33:52

LoRA, or low rank adaptation fine tuning? A few of you

33:52 - 33:53

have. This, LoRA,

33:53 - 33:54

is the way that

33:54 - 33:57

people fine tune AI models now. Now let me talk

33:57 - 34:00

about what fine tuning is. Fine tuning is when you

34:00 - 34:02

take a base model and you want to give it

34:02 - 34:05

new data to train on, to change its behavior or

34:05 - 34:08

to give it some new knowledge. It's typically change behavior

34:08 - 34:10

like I want you to speak in JSON format or

34:11 - 34:12

I want you to speak like a doctor.

34:14 - 34:17

Those would be examples of fine tuning.

34:18 - 34:21

The way that traditionally it was done is you make

34:21 - 34:24

a whole copy of the model, and you

34:24 - 34:28

basically continue the training by giving it small target data

34:28 - 34:31

set. Like, you know, examples of doctor talking.

34:32 - 34:34

And then you spit out a new model.

34:34 - 34:35

That knows how to talk like a doctor.

34:37 - 34:39

This is really inefficient.

34:40 - 34:42

So we came up with a way to

34:43 - 34:47

make fine tuning much more efficient by training what's called

34:47 - 34:51

an adapter. So you create some extra weights called an

34:51 - 34:51

adapter

34:52 - 34:54

instead of copying the whole model.

34:55 - 34:58

Then you train the whole thing, but you don't modify

34:58 - 35:00

the base model, just this adapter's weights,

35:01 - 35:04

and what you end up with is a fine-tuned model

35:04 - 35:07

that is really the combination of the adapter plus the

35:07 - 35:11

base model. So you can have literally thousands of fine-tuned

35:11 - 35:14

versions of the base model. Each one of those adapters

35:14 - 35:15

is small. How small?

35:16 - 35:19

Well, if you take a look at a

35:19 - 35:23

175 gigabyte, or sorry, 175 billion parameter model, GPT-3.

35:24 - 35:27

You might have an adapter size of just 100 megabytes.

35:28 - 35:33

And so to fine-tune it traditionally takes 96 GPUs, versus 24 GPUs

35:33 - 35:36

for LoRA. One terabyte,

35:37 - 35:40

one terabyte per model of weights for traditional fine tuning,

35:41 - 35:43

versus one terabyte base plus 200 megabytes

35:43 - 35:45

for LoRA.

35:46 - 35:48

And then switching between models

35:49 - 35:52

takes minutes, to load that big giant model with a

35:52 - 35:55

new version that is fine-tuned. For LoRA adapters, you just

35:55 - 35:57

need to load the adapter.

35:58 - 36:01

And then it has no downsides really no additional inference

36:01 - 36:04

latency. You get more training throughput at the same time.

36:04 - 36:05

So this is

36:06 - 36:09

LoRA, and this was developed by Microsoft Research, and it

36:09 - 36:11

is the industry standard now for how to do this.
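
Here is what a LoRA adapter is mathematically, as a minimal numpy sketch: the frozen base weight W stays untouched, and the fine-tune only learns two small matrices A and B, so the effective weight is W + B·A. The shapes, rank, and initialization here are illustrative; see the LoRA paper for the full formulation.

```python
# Minimal sketch of a LoRA adapter on top of a frozen weight matrix.
import numpy as np

d_in, d_out, rank = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))          # frozen base weights
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable, tiny
B = np.zeros((d_out, rank))                     # trainable, starts at zero

def forward(x):
    return W @ x + B @ (A @ x)                  # base path + low-rank adapter path

x = rng.standard_normal(d_in)
print(forward(x).shape)                          # (1024,)
print(f"base params: {W.size:,}, adapter params: {A.size + B.size:,}")
# 1,048,576 base vs 16,384 adapter: the adapter is a tiny fraction of the model
```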

36:12 - 36:14

But we've taken it one step further inside of our

36:14 - 36:15

production systems.

36:16 - 36:18

Because if you take a look at, you know, we've

36:18 - 36:20

got fine tuning as a service now with GPT-

36:20 - 36:22

3.5. Now we have GPT-4 and we have other

36:22 - 36:24

models coming. Fine tuning as a service where you just

36:24 - 36:26

give us a data set, we take care of the

36:26 - 36:26

fine tuning.

36:27 - 36:30

You'd load your fine-tuned models. There's going to be lots

36:30 - 36:32

of different fine tune adapters on top of these base

36:33 - 36:33

models.

36:33 - 36:35

We don't want to just be able to load.

36:35 - 36:37

One, remove it, load another one.

36:38 - 36:40

And if you take a look at traditional fine tuning,

36:41 - 36:42

that's what you're doing. But if you take a look

36:43 - 36:45

at something we've developed called multi-LoRA fine tuning, which we've

36:45 - 36:46

got deployed,

36:47 - 36:50

we load multiple LoRA adapters on top of the base

36:50 - 36:53

models, and they're all in the GPU.

36:53 - 36:56

So when a request from one customer comes in for

36:56 - 36:59

the doctor adapter, it gets processed. When another one comes

36:59 - 37:02

in for the lawyer adapter, it gets processed at the

37:02 - 37:05

same time. We don't have to load and unload.
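
Here is a toy sketch of that multi-LoRA serving idea: many adapters stay resident alongside one base model, and each request is routed to its customer's adapter with no load/unload step in between. The adapter names, sizes, and the arithmetic are purely illustrative.

```python
# Toy multi-LoRA serving: one base model, many resident adapters, per-request routing.
import numpy as np

rng = np.random.default_rng(1)
d = 256
base_W = rng.standard_normal((d, d))             # shared base model weights

adapters = {                                     # all resident at once, no swapping
    name: (rng.standard_normal((4, d)) * 0.01, rng.standard_normal((d, 4)) * 0.01)
    for name in ("doctor", "lawyer", "json-mode")
}

def serve(adapter_name: str, x: np.ndarray) -> np.ndarray:
    A, B = adapters[adapter_name]                # pick the caller's fine-tune
    return base_W @ x + B @ (A @ x)              # same base, different adapter

x = rng.standard_normal(d)
for customer in ("doctor", "lawyer", "doctor", "json-mode"):
    y = serve(customer, x)                       # interleaved requests, no reload
    print(customer, float(y[0]))
```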

37:06 - 37:08

So I got a quick demo of that.

37:10 - 37:14

So here I'm gonna do it in traditional inference with

37:14 - 37:18

one LoRA adapter loaded in the model, without multi-LoRA.

37:18 - 37:21

You can see when I try to load a

37:21 - 37:23

second adapter and query it.

37:23 - 37:26

I get a failure because the model's not ready because

37:26 - 37:28

it's trying to load the new adapter.

37:30 - 37:33

And hasn't finished by the time I sent the request

37:33 - 37:34

in. So I'm gonna load 100

37:35 - 37:36

models

37:37 - 37:39

into the second GPU, or sorry, 1,000, a thousand models

37:39 - 37:42

into the second GPU. And I'm going to just ping

37:42 - 37:44

a few of those LoRA adapters and you can see

37:44 - 37:47

that I'm getting responses from all of them because they're

37:47 - 37:50

all sitting there in the GPU. And so this would

37:50 - 37:53

be the equivalent of multiple customers coming in with requests

37:53 - 37:56

at the same time for their different fine tune models,

37:56 - 37:58

and we're able to serve them all because of

37:58 - 38:02

this efficiency. And the question is, does that impact

38:02 - 38:05

the performance of those serving? Do they interfere with each

38:05 - 38:07

other, these LoRA adapters?

38:08 - 38:10

You see on the left is a single GPT-3.5

38:10 - 38:11

fine-tuned adapter; on the right,

38:12 - 38:15

100 or 1,000 of them being queried all at

38:15 - 38:16

the same time

38:17 - 38:18

and the latency is the same.

38:19 - 38:23

So no performance degradation, no impact, but we're able to

38:23 - 38:27

serve 1,000 models. Basically, think of it as

38:27 - 38:29

1,000 custom versions of GPT-3.5 on the same GPU.

38:31 - 38:36

And this is the 10 model concurrency latency run for

38:36 - 38:38

10 models, 25 models,

38:40 - 38:43

calling them with 20 to 25 requests at the same time.

38:44 - 38:46

So this is another really key way for us

38:46 - 38:50

to raise the efficiency of those GPUs in the world

38:50 - 38:53

where we're going to have lots of fine-tuned models out

38:53 - 38:54

there.

38:55 - 38:58

So that's another example of kind of the innovation that we've

38:58 - 38:58

got.

38:59 - 39:02

Now I want to switch gears and talk about the

39:02 - 39:04

evolution of computing in general.

39:06 - 39:07

So

39:09 - 39:11

when we take a look at cybersecurity,

39:12 - 39:15

it's been a world of logical protection of data.

39:17 - 39:20

You've got encryption at rest. And this is something that

39:20 - 39:23

all the cloud providers now have both server side keys,

39:23 - 39:26

we're encrypting the data and then customer managed keys where

39:26 - 39:29

you can define your own keys and encrypt the data

39:29 - 39:30

a second time. On top of that,

39:31 - 39:34

you also have encryption of data in transit. All network

39:34 - 39:37

communications expected to be encrypted now,

39:38 - 39:41

and that protects the data in transit and at rest.

39:41 - 39:42

But what's been missing

39:43 - 39:46

is protecting the data while it's being used.

39:47 - 39:48

It gets to the server

39:49 - 39:52

either loaded from storage or through the network, and now

39:52 - 39:55

it's in the clear when it's decrypted. And now it's

39:55 - 39:58

being processed as part of a training job, part of

39:58 - 40:01

a data analytics job. It's sitting out there in the

40:01 - 40:01

open.

40:02 - 40:05

And what's it susceptible to when it's sitting out there,

40:05 - 40:07

what kind of risks, when it's out there

40:07 - 40:10

in the open on the server? Well, it's susceptible to

40:11 - 40:15

somebody that breaches the infrastructure, it's susceptible to insiders, it's

40:15 - 40:18

susceptible to operators that get access, administrators.

40:20 - 40:22

What we want to do is keep that data as

40:22 - 40:26

secure as possible through its own life cycle. So

40:27 - 40:31

something called Trusted Computing emerged in the late 2000s with

40:32 - 40:34

ARM, Trustzone and Intel SGX.

40:34 - 40:38

And when I went into Azure

40:38 - 40:41

and was looking at SGX, I

40:42 - 40:43

realized that this would be

40:44 - 40:46

filling in the third leg

40:46 - 40:48

of this data protection

40:49 - 40:50

coverage.

40:51 - 40:53

Protecting data while it's in use.

40:53 - 40:57

Now what are these technologies? They're originally called trusted

40:57 - 41:01

execution environments. We've branded as confidential computing this concept

41:01 - 41:01

of

41:01 - 41:05

protecting data with hardware. And that hardware protection

41:06 - 41:08

is based on a root of trust in the CPU,

41:09 - 41:13

where it creates effectively a box, an encrypted box, in

41:14 - 41:15

the CPUs memory.

41:16 - 41:19

Nothing can get into it after it's been created. Nothing

41:19 - 41:21

can see into it after it's been created.

41:22 - 41:23

And it's even encrypted

41:25 - 41:28

physically when that data leaves the CPU and goes to

41:28 - 41:32

memory. So even somebody with physical access

41:32 - 41:35

can't easily sniff the memory bus, for example, and see

41:35 - 41:39

the data. So you get logical protection and physical protection.

41:39 - 41:40

But it's not just that

41:41 - 41:44

the real key to confidential computing, besides just that

41:44 - 41:48

inherent protection is being able to prove that you're protected.

41:48 - 41:51

So that workload running inside of that box can ask

41:51 - 41:54

the CPU, give me proof that I can present to

41:54 - 41:56

somebody else that this I am this code, I'm this

41:56 - 41:59

piece of code, and I'm being protected by you. And

41:59 - 42:03

this is called an attestation report. And that attestation report

42:03 - 42:06

can be handed to another application. It can be handed

42:06 - 42:09

to a human with a policy evaluation on top of

42:09 - 42:11

it. It can be handed to a key release

42:11 - 42:14

service set with a policy that says only release this

42:14 - 42:14

key

42:15 - 42:17

if the code is this

42:17 - 42:20

and it's being protected by this hardware.
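<br>
Here is that key-release flow as a minimal Python sketch: the workload presents an attestation report, a policy says which code measurement and firmware versions are acceptable, and the key is released only if the report conforms. The field names and values are illustrative assumptions, not the real attestation format or Azure's key release service API.

```python
# Toy attestation-gated key release, matching the flow described above.
EXPECTED_POLICY = {
    "code_measurement": "sha256:abc123...",   # hash of the workload we trust (illustrative)
    "min_firmware": 3,                        # minimum acceptable firmware version
    "hardware": "H100-CC",
}

def release_key(attestation: dict, policy: dict = EXPECTED_POLICY) -> str:
    ok = (attestation["code_measurement"] == policy["code_measurement"]
          and attestation["firmware"] >= policy["min_firmware"]
          and attestation["hardware"] == policy["hardware"])
    if not ok:
        raise PermissionError("attestation does not conform to policy")
    return "model-decryption-key"             # stand-in for the released secret

good = {"code_measurement": "sha256:abc123...", "firmware": 4, "hardware": "H100-CC"}
bad = {"code_measurement": "sha256:abc123...", "firmware": 2, "hardware": "H100-CC"}

print(release_key(good))                      # key released
try:
    release_key(bad)                          # wrong firmware version -> refused
except PermissionError as e:
    print(e)
```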

42:22 - 42:26

And this then creates this sealed environment, this confidential computing

42:26 - 42:30

environment, where the data can be processed and we

42:30 - 42:33

minimize the risks of those other kinds of attacks that

42:33 - 42:34

I talked about.

42:36 - 42:42

Since the early 2010s, we've been working with Intel

42:42 - 42:43

and AMD to bring out

42:44 - 42:47

confidential computing hardware that can support not just what are called

42:47 - 42:50

enclaves or small boxes, but actually full virtual machines. And

42:50 - 42:53

a couple of years ago, we announced with AMD the

42:53 - 42:54

first confidential virtual machines.

42:55 - 42:57

And we released them in Azure. Then we've

42:57 - 43:01

announced confidential virtual machines with Intel, their TDX technology and

43:01 - 43:02

we've released them in Azure.

43:03 - 43:05

But the time for

43:06 - 43:10

confidential computing is really coming, because we've been working for

43:11 - 43:15

the last few years with NVIDIA to codesign confidential computing

43:15 - 43:16

with them into

43:16 - 43:17

their GPU lines.

43:18 - 43:22

And the first line to have confidential computing

43:22 - 43:24

in GPUs is the H100.

43:25 - 43:29

The H100 has this Trusted Execution environment on it, which

43:29 - 43:33

protects the model weights, protects the data going into the

43:33 - 43:34

model, encrypting them,

43:36 - 43:38

and it being able to attest to what's inside of

43:38 - 43:41

the GPU. So that, for example, you've got this confidential

43:41 - 43:43

virtual machine on the left.

43:44 - 43:47

It can talk to the confidential GPU, and now it

43:47 - 43:51

can release keys to the GPU to decrypt the model

43:51 - 43:53

or to decrypt a prompt

43:54 - 43:56

and then re encrypt the response going back to an

43:56 - 43:57

end user.

43:58 - 44:02

This actually is fleshing out now the introduction of confidential

44:02 - 44:06

accelerators with confidential GPU for AI confidentiality. And this has

44:07 - 44:10

got me really excited because there's a bunch of scenarios

44:10 - 44:14

that are just very obviously going to benefit from this.

44:14 - 44:15

One of them

44:16 - 44:17

is protecting the model weights,

44:18 - 44:20

and you might want to protect the model weights because

44:20 - 44:22

there's a ton of your IP in the model weights,

44:23 - 44:26

so you don't want them to leak, and confidential computing

44:26 - 44:28

can provide another layer protection around them.

44:29 - 44:30

Bit more generally.

44:31 - 44:34

When it comes to AI, the data that AI models

44:34 - 44:37

process is extremely sensitive in some cases,

44:38 - 44:41

and what this allows is for you to protect that.

44:41 - 44:42

Data end to end.

44:43 - 44:45

For things like fine tuning, the data that you're going

44:45 - 44:47

to fine tune the model on can be protected and

44:47 - 44:48

given only to the GPU.

44:49 - 44:51

The data that you give as your prompt and the

44:51 - 44:54

response you get back can be protected with confidential computing.

44:56 - 44:57

To give you one example

44:57 - 44:57

that's

44:57 - 44:58

got everybody very

44:58 - 45:04

excited: confidential speech translation and speech transcription.

45:05 - 45:09

Speech is incredibly sensitive for a lot of enterprises, and

45:09 - 45:13

with this you'll be able to send encrypted speech into

45:13 - 45:13

the GPU,

45:14 - 45:18

get a transcription back that's encrypted, and nothing

45:18 - 45:20

can see it from the point that you send it

45:20 - 45:23

in to the point you get it back out.

45:23 - 45:24

Other than the GPU.

45:26 - 45:30

And then finally, a really exciting scenario is confidential multiparty

45:30 - 45:34

sharing. Different parties get together and share their data.

45:35 - 45:37

But without each of them being able to see the

45:37 - 45:40

others data because they're sharing it really with

45:41 - 45:44

the AI model and the GPU, not with each other.

45:45 - 45:47

So I want to show you a really quick demo

45:47 - 45:47

that kind of

45:47 - 45:50

shows you a nuts and bolts view

45:50 - 45:53

of confidential retrieval augmented generation. So I think all

45:53 - 45:56

of you know what retrieval augmented generation is: it's

45:56 - 45:59

when you give a model some information that it wasn't

45:59 - 46:01

trained on into its context so it can answer questions

46:01 - 46:02

about it.
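
For a concrete picture of that loop, here is a toy retrieval-augmented generation sketch. The bag-of-words embedding, the in-memory index, and the assembled prompt are deliberately trivial stand-ins for a real embedding model, vector database, and LLM.

```python
# Toy RAG loop: embed documents, retrieve the best match for a question, and
# splice it into the prompt. Everything here is an illustrative stand-in.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # bag-of-words "embedding"

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Confidential computing protects data in use inside a trusted execution environment.",
    "InfiniBand provides low-latency, high-bandwidth links between GPU servers.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector database"

def build_prompt(question: str) -> str:
    q = embed(question)
    context, _ = max(index, key=lambda item: cosine(q, item[1]))  # retrieval step
    return f"Context: {context}\nQuestion: {question}\nAnswer:"    # goes to the model

print(build_prompt("What is confidential computing?"))
```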

46:02 - 46:05

And the idea here is that the RAG data can

46:05 - 46:08

be very sensitive. So I'm going to show you this.

46:09 - 46:13

Here's a website that uses confidential computing attestation to decide

46:13 - 46:14

whether I trust the site.

46:15 - 46:19

And you can see the attestation information for this

46:19 - 46:19

RAG website

46:20 - 46:23

here. This is a website that we're connecting to,

46:23 - 46:27

along with its attestation from the GPU and the CPU. I'm

46:27 - 46:28

deciding whether I trust it.

46:29 - 46:31

I can examine the attestation report, see that it's got

46:31 - 46:33

the right versions of the hardware that I like, the

46:33 - 46:35

right versions of the firmware,

46:36 - 46:39

and then I can decide to create a policy that

46:39 - 46:41

is: I trust this. And now when I query

46:42 - 46:45

the attestation of that website according to that policy, I

46:45 - 46:47

get a green and I can look and see this

46:47 - 46:50

thing conforms to the policies I set. So I trust

46:50 - 46:51

it with my data.
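
The policy check itself can be thought of as comparing the claims in the attestation report against the versions you have decided to trust. The sketch below is a hypothetical illustration of that idea, not the actual attestation service or its report schema.

```python
# Hypothetical policy check: compare claims in an attestation report against the
# versions the relying party trusts. Field names and values are illustrative.
def conforms(report: dict, policy: dict) -> tuple[bool, list[str]]:
    failures = [f"{claim}: expected {expected!r}, got {report.get(claim)!r}"
                for claim, expected in policy.items()
                if report.get(claim) != expected]
    return (not failures, failures)

policy = {"gpu_model": "H100", "gpu_firmware": "96.00.5E.00.01", "cvm_firmware": "1.2.3"}
report = {"gpu_model": "H100", "gpu_firmware": "96.00.5E.00.01", "cvm_firmware": "1.2.3"}

ok, failures = conforms(report, policy)
print("green" if ok else "red", failures)

# Tweak the expected firmware version and the same report now fails the check,
# which is the "doesn't conform to the policy you specified" result in the demo.
policy["cvm_firmware"] = "1.2.4"
print(conforms(report, policy))
```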

46:52 - 46:56

Now, if I ask this model what is confidential computing,

46:57 - 47:00

this model doesn't know about confidential computing unfortunately, and it's

47:01 - 47:02

going to say I don't know

47:02 - 47:02

what it is.

47:04 - 47:06

But now, this is a confidential RAG app running in

47:06 - 47:10

a container on a confidential virtual machine with a confidential

47:10 - 47:13

GPU. I can upload my very sensitive information. In this

47:14 - 47:16

case it's confidential computing documentation

47:18 - 47:20

that goes into a vector database which is sitting in the

47:20 - 47:23

confidential virtual machine, same with the embeddings there,

47:24 - 47:25

and now I can ask it.

47:26 - 47:27

What is confidential computing? It knows.

47:28 - 47:31

The whole thing was protected from my client

47:32 - 47:34

side on the browser all the way into the GPU

47:34 - 47:38

and back, including the data. The document that I uploaded

47:38 - 47:42

was protected, so nothing outside of that hardware boundary could see

47:42 - 47:42

it.

47:43 - 47:45

And so this, I mean, I think I see some

47:45 - 47:48

nods. You can see just how powerful this is when

47:49 - 47:52

we bring it to scale at providing just another level,

47:52 - 47:55

a massive jump in the level of data protection that

47:55 - 47:58

we can achieve. And by the way, this is just

47:58 - 48:01

showing if I tweak the version that's in my policy

48:02 - 48:04

now, I get, hey, the website doesn't conform to the

48:04 - 48:07

policy you specified because it's got the wrong version of

48:07 - 48:10

a particular piece of firmware. So I know now, ohh,

48:10 - 48:12

I don't trust this thing anymore. It's not the versions

48:12 - 48:13

that I trust.

48:15 - 48:17

Now I want to conclude. That kind of gives

48:17 - 48:20

you an overview of some of the innovation we've got

48:20 - 48:23

going on in Azure. And you're gonna see confidential services

48:23 - 48:26

coming. You're going to see more innovation in data center

48:26 - 48:30

efficiency and serving efficiency and hardware efficiency. The innovations that

48:30 - 48:33

we'll have coming with our next generation Maia

48:33 - 48:36

accelerators. I want to just transition and give you a

48:36 - 48:39

little look at some of the research that I'm doing

48:39 - 48:41

as a conclusion, because I'm having a lot of fun

48:41 - 48:43

with AI myself, and this

48:43 - 48:46

is my own research, which is getting my hands dirty

48:46 - 48:49

with AI, Python, PyTorch, and our own systems like Project Forge,

48:49 - 48:52

so that I have a really good understanding of the

48:52 - 48:56

technology underneath and the way that we deliver our technology

48:56 - 48:59

through things like Visual Studio Code and Copilot,

48:59 - 49:01

because I've been using Copilot a lot.

49:02 - 49:05

So one of the things that I did with another

49:05 - 49:08

researcher in Microsoft Research that works on the Phi team

49:08 - 49:11

is look at the question: can we have a model

49:11 - 49:12

forget things?

49:13 - 49:16

How? Let me set this up with: why would you

49:16 - 49:18

want a model to forget something

49:19 - 49:22

when you talk about these training jobs? Even a few

49:22 - 49:25

days on 1,000 H100s is a

49:25 - 49:29

huge amount of money. And certainly if you're talking about

49:30 - 49:34

jobs that are even bigger or running longer, that's

49:34 - 49:38

a tremendous amount of money you're spending on training one

49:38 - 49:41

job. What happens if you find out that you trained

49:41 - 49:45

on data that is problematic? It could be problematic because

49:45 - 49:46

it's got PII in it.

49:47 - 49:50

You're leaking private information that you didn't realize was in

49:50 - 49:53

the training data. What happens if it's got copyrighted data

49:53 - 49:56

that you are unlicensed to train on? What happens if

49:56 - 49:59

it's got poisoned data that's gotten into your training set

49:59 - 50:00

and it's causing the model to

50:00 - 50:01

misbehave?

50:01 - 50:04

Do you wanna go retrain from scratch? It would be

50:04 - 50:07

nice if you could just have the model forget those

50:07 - 50:09

things. And so that's what we set out to do.

50:10 - 50:12

Now it turns out we said, what can we have

50:12 - 50:15

a model forget that really gives us a challenge,

50:15 - 50:18

and where people can see that it really forgot

50:18 - 50:21

something? What do LLMs know? What topics do

50:21 - 50:25

they know so clearly that it's obvious that

50:25 - 50:26

they really, deeply know them?

50:27 - 50:30

Well, that's gonna depend on how much of the data

50:30 - 50:32

they've seen as they train.

50:33 - 50:35

So we thought about it for a while, and then

50:35 - 50:37

we realized one of the things that all these models

50:37 - 50:39

seemed to know is the world of Harry Potter.

50:41 - 50:44

They know Harry Potter because there's so much Harry Potter.

50:44 - 50:46

The Harry Potter books are all over the web, it

50:46 - 50:46

turns out.

50:47 - 50:50

And so these models that are just scraping the web,

50:50 - 50:53

sucking a whole bunch of copies of Harry Potter into

50:53 - 50:56

their training data, they see so many copies of it

50:56 - 51:00

that they almost memorize, if not memorize Harry Potter. So

51:00 - 51:02

we're like, if we can get these models to forget

51:02 - 51:05

Harry Potter, that would be really cool.

51:06 - 51:08

And so we set out to do that, and we

51:08 - 51:11

succeeded. And so this is a research paper you can

51:11 - 51:14

find on arXiv that we wrote: "Who's Harry Potter?"

51:15 - 51:17

I want to show you a quick demo of our

51:17 - 51:19

Who's Harry Potter model and how it works.

51:20 - 51:23

So on the left side you can see Llama

51:23 - 51:24

2 7B,

51:25 - 51:28

and you can see when Harry went back to class,

51:28 - 51:31

he saw that his best friends... What do you think

51:31 - 51:33

it's gonna say? All we said is Harry, in class.

51:35 - 51:37

And it says: to see Ron and Hermione. Think

51:38 - 51:39

about that for a second.

51:40 - 51:43

Harry and school: Harry Potter,

51:44 - 51:45

like there's no other Harry in the world.

51:49 - 51:52

So they really, really know and like to talk about

51:52 - 51:55

Harry Potter. So here's the unlearned one. And it says

51:56 - 51:58

instead: saw his friends Sarah and Emily.

51:59 - 52:01

And it's very consistent too with that generation. So this

52:01 - 52:03

was the key: how do we make it forget, but

52:03 - 52:06

still be very natural and not degrade its performance otherwise?
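
As a heavily simplified sketch of the general idea, and only under stated assumptions (a small gpt2 stand-in for the real model, a hand-written generic substitution, and a single gradient step, none of which is the paper's exact recipe), approximate unlearning can be pictured as fine-tuning the model toward generic alternative continuations instead of the original ones:

```python
# Simplified unlearning sketch in the spirit of "Who's Harry Potter?": build
# alternative, generic target labels for text you want forgotten, then fine-tune
# the model toward those alternatives. Model choice and the substitution below
# are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the talk used Llama-2-7B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Text to forget, and a "genericized" version with idiosyncratic terms swapped out.
forget_text  = "When Harry went back to class, he saw his best friends Ron and Hermione."
generic_text = "When Jon went back to class, he saw his best friends Sarah and Emily."

input_ids = tok(forget_text, return_tensors="pt")["input_ids"]
labels = tok(generic_text, return_tensors="pt")["input_ids"]

# Crude length alignment for the sketch; the real method works token by token.
n = min(input_ids.shape[1], labels.shape[1])
input_ids, labels = input_ids[:, :n], labels[:, :n]

# One gradient step pushing the model toward the generic continuation.
out = model(input_ids=input_ids, labels=labels)
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
print("unlearning step loss:", out.loss.item())
```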

52:08 - 52:08

So that's

52:09 - 52:12

unlearning. And now the second thing we're

52:12 - 52:14

working on is how we can take it and make it

52:14 - 52:18

forget particular topics, like particular words or particular concepts.

52:18 - 52:21

So we're continuing down this line of research and the

52:21 - 52:22

latest thing we are doing

52:23 - 52:24

is,

52:25 - 52:27

for example, showing how we can make it forget

52:27 - 52:27

words

52:28 - 52:29

like profanity.

52:29 - 52:33

So this is the Mistral model, which isn't

52:33 - 52:35

ashamed to swear. That's a warning.

52:36 - 52:38

And we're gonna ask it: write a rant about inflation

52:38 - 52:39

filled with profanity.

52:41 - 52:43

And we blurred this.

52:44 - 52:44

Yeah.

52:46 - 52:47

But you can see that it's like, sure, here you

52:47 - 52:49

go. And then we say, what are the top five

52:49 - 52:51

most profane words in the English language?

52:52 - 52:53

OK, so

52:54 - 52:56

I don't know which words those are.

52:57 - 53:00

Now here's the forget version. So again, we want to make

53:00 - 53:02

it very natural. Like how do you make it not

53:02 - 53:05

talk about profanity but sound natural? Here's a rant.

53:06 - 53:06

The same prompt.

53:09 - 53:11

And this is actually using the model itself to

53:11 - 53:14

figure out what it should say instead, which is the key

53:14 - 53:17

to this research. And then here are the top most profane words.

53:18 - 53:20

And by the way, the first one we didn't put

53:20 - 53:23

in the dictionary, so we were like, that word is,

53:23 - 53:25

yeah, not that bad, alright. But you can see, you

53:25 - 53:27

get the idea. Anyway, so I've been having a lot

53:27 - 53:29

of fun and I can tell you, let me just

53:29 - 53:32

sum it up by one thing. Copilot has been indispensable.

53:32 - 53:36

I've become an expert in Python and PyTorch.

53:36 - 53:39

I can do stuff so fast with Python and

53:39 - 53:41

PyTorch because of Copilot.

53:42 - 53:46

My brain is really lazy. So if you took Copilot

53:46 - 53:46

away,

53:47 - 53:50

I'm still like a Python noob: have

53:50 - 53:53

me do a list comprehension and I'm looking at the

53:53 - 53:56

documentation if I don't have Copilot. But

53:56 - 53:59

I've gotten to the point where I just have

53:59 - 54:02

AI generate all of that kind of

54:02 - 54:04

code for me. I have it give me advice on

54:04 - 54:05

how to optimize code.

54:07 - 54:09

So this has been a huge boon to my productivity.

54:09 - 54:11

I'm doing this kind of stuff as

54:11 - 54:14

side projects that, if I didn't have Copilot, I just

54:14 - 54:16

would not have the bandwidth to do.

54:17 - 54:20

That said, if you're a software developer, you have

54:20 - 54:22

nothing to fear, at least for the near future,

54:23 - 54:25

because these models, while they're great at that

54:26 - 54:30

kind of help, and autocomplete, and here's a short little

54:30 - 54:32

function, for more complicated, nuanced

54:32 - 54:33

tasks they

54:33 - 54:36

just aren't there yet. And so there's been

54:36 - 54:40

many rounds where there's bugs and even flagrant things like

54:40 - 54:42

you just wrote me a function where you used a

54:42 - 54:45

variable you didn't define, and it's: oh, I'm sorry, here's the

54:45 - 54:48

revised version. All right, you still didn't define it.

54:48 - 54:49

So.

54:50 - 54:52

We're still at that kind of level. But that's, you

54:52 - 54:55

know, like I said, I just can't live without

54:55 - 54:57

it. It is so powerful and they're just gonna get

54:57 - 54:58

better. So

54:58 - 55:00

I just want to leave you on that note on

55:00 - 55:03

use of AI for programming. And with that, want to

55:03 - 55:06

wrap it up and hope you found this an interesting

55:06 - 55:09

tour of some of the innovations we've got, going from hardware to

55:09 - 55:11

data center design, to the way that we manage the

55:12 - 55:16

infrastructure with micro and macro optimizations, to the future of confidential

55:16 - 55:18

computing. And then a look at some of the kind

55:18 - 55:21

of fun research that you can still do kind of

55:21 - 55:21

on the side

55:22 - 55:23

with just a single

55:24 - 55:25

server of GPUs.

55:25 - 55:28

So this is kind of a really fun time to

55:28 - 55:31

be working on innovation at any level of the stack.

55:31 - 55:31

So with that,

55:31 - 55:33

hope you have a great Build. Hope you have a

55:33 - 55:35

great party tonight. Hope to see you next year.
