00:10 - 00:13
Alright everyone, we're about to start our next session. Again,
00:13 - 00:17
some amazing content we're gonna get through. Just a quick
00:17 - 00:20
couple reminders, this is a hybrid event. So as you
00:20 - 00:23
see, we are all here in person, but we also
00:23 - 00:26
have audiences joining us virtually. So if you want to
00:26 - 00:28
join in the Q&A in the chat, just go ahead
00:28 - 00:31
and just scan those QR codes to the left and
00:31 - 00:34
all that. But if you also want to talk with
00:34 - 00:37
some of our experts, we do have experiences up on
00:37 - 00:39
level 5, where you'll be able
00:39 - 00:42
to talk to our experts and also get
00:42 - 00:45
your hands on some cool demos upstairs. So with that,
00:45 - 00:47
let's go ahead and let's get started.
00:49 - 00:50
Morning, everybody.
00:55 - 00:56
How's build going so far?
00:57 - 00:59
So that's good, good build. Alright,
01:00 - 01:03
Welcome to Inside Microsoft AI Innovation. My name is Mark
01:03 - 01:07
Russinovich. Just out of curiosity, how many people have seen
01:07 - 01:10
a previous version of this particular session? So a few
01:10 - 01:12
of you have. So you're going to see some new
01:12 - 01:16
things, those of you that have seen it before. And for
01:16 - 01:18
those of you that haven't seen this before, I hope
01:18 - 01:21
to show you a highlighted tour through our AI stack from
01:22 - 01:23
the bottom to the top,
01:23 - 01:27
focusing on infrastructure. So you've heard a lot about Copilot
01:27 - 01:30
and writing apps and how to mitigate security risks. I've
01:30 - 01:33
talked about how to develop copilots on top of Semantic
01:34 - 01:37
Kernel. This is gonna actually go one layer beneath that to
01:37 - 01:40
show you how we run AI inside of Azure. And
01:40 - 01:43
I'm going to start by going through different aspects of
01:43 - 01:47
our infrastructure, starting with compute. Then I'll talk a little
01:47 - 01:50
bit about our network and how we design our network,
01:50 - 01:53
what makes it unique. And then I'll talk about storage
01:53 - 01:56
and some of the innovations that we've got in storage
01:56 - 01:59
that we've deployed to make our infrastructure more efficient.
01:59 - 02:02
And efficiency is actually the key to the game. And
02:02 - 02:03
if you take a look
02:03 - 02:05
at what's happening
02:05 - 02:07
to challenge our ability to be efficient,
02:08 - 02:12
you can see that the sizes of these frontier models
02:12 - 02:16
have continued to grow basically exponentially. The biggest models are
02:16 - 02:20
the most powerful models and this trend line doesn't seem
02:20 - 02:24
to be slowing down. You heard Kevin Scott talk about
02:24 - 02:27
with Sam Altman that we don't see an end to
02:27 - 02:30
the scaling law of capability in sight. And what this is
02:30 - 02:34
doing is driving more and more compute requirements out of
02:34 - 02:36
the infrastructure.
02:37 - 02:39
One of the ways to look at this is this
02:39 - 02:42
amount of compute required to train these models, because they're
02:42 - 02:45
not only getting big, but the amount of data that
02:45 - 02:48
they need to process during a training run continues to
02:48 - 02:49
grow. So you've heard about tokens,
02:50 - 02:53
and when you train one of these models, you're talking
02:53 - 02:57
about not just a trillion tokens, but tens of trillions of
02:57 - 03:00
tokens are passed through the model for them to learn
03:00 - 03:02
how to get these capabilities.
03:03 - 03:07
The first AI supercomputer we built back in 2019 for
03:07 - 03:10
Open AI was used to train GPT 3.
03:11 - 03:13
And the size of that infrastructure, if we'd submitted it
03:14 - 03:16
to the top 500 supercomputer benchmark would have been in
03:16 - 03:19
the top five supercomputers in the world on premises or
03:19 - 03:22
in the cloud, and the largest in the cloud.
03:23 - 03:27
In November, we talked about the next generation, two generations
03:27 - 03:30
later, of supercomputers. We built one to train GPT-4
03:30 - 03:33
after the GPT-3 one, and this one we're building
03:33 - 03:37
out to train the next generation of frontier models. With
03:37 - 03:40
a slice of that infrastructure, just a piece
03:40 - 03:44
of that supercomputer that we were bringing online, we actually
03:44 - 03:47
did an official top 500 run and it came in
03:47 - 03:51
number three largest supercomputer in the world and the number
03:51 - 03:54
one largest supercomputer in the public cloud.
03:54 - 03:58
Again, that's a tiny fraction of the supercomputer that we're
03:58 - 04:01
ultimately finished building for OpenAI, and we're already on
04:01 - 04:04
to the next one, and designing the one after
04:05 - 04:09
that, which are multiple times even bigger than this one.
04:09 - 04:12
Now, just to give you an idea of why they need
04:12 - 04:15
the scale. If you take a look at the training
04:15 - 04:19
requirements for something like Llama 3 70 billion, which is kind
04:19 - 04:22
of a medium sized model at this point, it's not
04:22 - 04:25
the largest type of model. In fact, Meta is coming out
04:25 - 04:28
with a 400 billion parameter version of Llama 3
04:28 - 04:32
in the near future. They say this required 6.4
04:32 - 04:35
million GPU hours on an H100, which is the current
04:35 - 04:38
generation supercomputer infrastructure.
04:39 - 04:42
If you trained this model on one GPU, that would
04:42 - 04:45
take 730 years to train that model.
04:45 - 04:48
But on the supercomputer that you see on the left,
04:48 - 04:51
that tiny slice of it, that would take about 27
04:51 - 04:54
days. And again, that's a tiny fraction. So just divide
04:54 - 04:56
by the multiplier on top of the size of your
04:56 - 04:59
supercomputer to see that you can start to get down
04:59 - 05:02
into training a model like this literally in days on
05:02 - 05:04
the supercomputers we're building now.
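As a quick sanity check of those numbers, the arithmetic is simple. This is just a back-of-the-envelope sketch; the 10,000-GPU count is an illustrative assumption for a slice of the supercomputer, not a disclosed figure.

```python
# Back-of-the-envelope math for the Llama 3 70B training figures quoted above.
GPU_HOURS = 6.4e6                    # H100 GPU-hours reported for Llama 3 70B

HOURS_PER_YEAR = 24 * 365
print(GPU_HOURS / HOURS_PER_YEAR)    # ~730 years on a single GPU

ASSUMED_GPUS = 10_000                # illustrative size for a slice of the supercomputer
print(GPU_HOURS / ASSUMED_GPUS / 24) # ~27 days on that slice
```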
05:06 - 05:08
Another way to look at the amount of compute
05:08 - 05:11
required for these models is to look at the amount
05:11 - 05:14
of memory they require. Besides talking about the number
05:14 - 05:17
of FLOPS or TOPS that they require for processing tokens,
05:17 - 05:19
you need to look at how much memory footprint they have,
05:21 - 05:24
and this is especially important when it comes to inference
05:24 - 05:27
loading the model and serving it. If you take a
05:27 - 05:30
look at inference, you typically load a model in 16
05:31 - 05:34
bit format, so two bytes. So each parameter is 2
05:34 - 05:37
bytes, which means that if you've got a billion parameters,
05:37 - 05:41
that's about two billion bytes, 2 gigabytes of memory, that you
05:41 - 05:44
require to load and store it. And so if you
05:44 - 05:45
take a look at a
05:45 - 05:50
175 billion parameter model, which was GPT-3's size, that's
05:50 - 05:54
350 gigabytes just to load the model. But that's not
05:54 - 05:56
all you need for inference.
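That sizing rule is easy to sketch in a few lines. This is a minimal illustration, assuming a 16-bit (2 bytes per parameter) format; the helper name is mine.

```python
def weights_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory just to hold the model weights, assuming FP16/BF16 (2 bytes/parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weights_memory_gb(1))     # ~2 GB for a 1 billion parameter model
print(weights_memory_gb(175))   # ~350 GB for a GPT-3 sized, 175 billion parameter model
```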
05:57 - 06:00
Because you load the model, those are the parameters or
06:00 - 06:03
the weights into the GPU. You also have to store
06:03 - 06:06
the processing of the tokens and this is called the
06:06 - 06:10
key value cache, storing what it processed when it's looking
06:10 - 06:13
at the prompt and starting to generate tokens and that
06:13 - 06:15
needs to be retained for efficiency.
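A rough way to see how much extra memory that key value cache takes: two cached vectors, keys and values, per layer per token. The model dimensions below are illustrative for a 13B-class model, not exact figures from the talk.

```python
def kv_cache_gb(n_layers: int, hidden_size: int, n_tokens: int, bytes_per_value: int = 2) -> float:
    """Rough KV-cache size: 2 (K and V) x layers x hidden size x tokens x bytes."""
    return 2 * n_layers * hidden_size * n_tokens * bytes_per_value / 1e9

# Illustrative 13B-class model: 40 layers, hidden size 5120, a 4,096-token context.
print(kv_cache_gb(40, 5120, 4096))   # ~3.4 GB per sequence, on top of ~26 GB of weights
```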
06:16 - 06:19
This is kind of the rough layout of a 13
06:19 - 06:23
billion parameter model on a 40 gigabyte A100 GPU.
06:24 - 06:27
The fact is that the models of course are getting
06:27 - 06:30
bigger and bigger. This is a 13 billion parameter model.
06:30 - 06:32
If you take it to 70 or 400 billion, or up
06:32 - 06:36
into the trillion parameter models that you're seeing at the frontier,
06:37 - 06:40
then you're talking about many times larger than the sizes
06:40 - 06:43
of the high bandwidth memory that is coming in these
06:43 - 06:46
GPUs. So if you take a look at the evolution
06:46 - 06:48
of GPUs, one of the key aspects of it is
06:48 - 06:51
not just the performance, but how much high bandwidth memory
06:51 - 06:52
comes with it.
06:53 - 06:57
This has become a limiting factor for GPU's. You can
06:57 - 07:02
see here they've been roughly doubling every year and a
07:02 - 07:05
half or so up until just this past year, when
07:05 - 07:07
we saw the A100
07:07 - 07:10
80 gigabyte come out, and then the H100 came out at
07:10 - 07:14
80 gigabytes, and immediately AMD turned around with the MI300X,
07:14 - 07:18
which you heard about us launching at 192 gigabytes. And
07:18 - 07:22
this made it very compelling for these large models, not
07:22 - 07:26
just large models, but loading for inferences you'll see later.
07:26 - 07:29
And so this is kind of created a race to
07:29 - 07:32
build more and more high bandwidth memory. You can see
07:32 - 07:36
Grace Blackwell 200 coming in at 384 gigabytes of high
07:36 - 07:37
bandwidth memory
07:37 - 07:41
and that makes it extremely powerful for training and especially
07:41 - 07:44
for inference because you can load these big models or
07:44 - 07:47
multiple big models on the same GPU and get more
07:47 - 07:48
efficiency out of it.
07:50 - 07:51
So this is one of the reasons why we have
07:52 - 07:54
this diverse array of hardware in our data centers. We've
07:54 - 07:57
got the lines from NVIDIA. We've partnered closely with them,
07:57 - 07:59
driving requirements into them, including
08:00 - 08:03
information about how to run these large models and why
08:03 - 08:05
we need so much memory. AMD,
08:05 - 08:06
that large
08:06 - 08:10
HBM that they've got, it's directly through feedback from us
08:10 - 08:13
and Open AI and then our own Azure Maya 100,
08:13 - 08:16
which you've heard about our own custom silicon, which isn't
08:16 - 08:19
a GPU, it's a custom accelerator for AI.
08:23 - 08:26
The HBM that is continuing to grow, the performance that's continuing
08:26 - 08:29
to grow, the die sizes that are continuing to grow
08:29 - 08:32
is leading to more and more power consumption at the GPU.
08:33 - 08:36
If you take a look now at another dimension of
08:36 - 08:40
GPU evolution, number of watts per GPU, you can see
08:40 - 08:42
this is also growing exponentially.
08:42 - 08:44
Because of those other factors.
08:45 - 08:48
The latest NVIDIA GB200, with all its high bandwidth
08:48 - 08:51
memory and all its transistors, over 200 billion of them,
08:52 - 08:57
is 1.2 kilowatts just for one GPU, and there's eight
08:57 - 09:01
of them in a single server. And that's just the
09:01 - 09:04
GPUs, not the RAM and the CPUs.
09:04 - 09:07
That go with that. So now you're talking about close
09:07 - 09:09
to 10 kilowatts
09:09 - 09:13
of power for just one server, which is just an
09:13 - 09:17
amazing amount of power for a server in a data center.
09:19 - 09:20
Now how do you cool something like that?
09:21 - 09:24
Traditionally it's been air cooled, but we've had liquid
09:24 - 09:26
cooling for a long time. In fact, this is my
09:26 - 09:27
home desktop machine,
09:28 - 09:29
which is liquid
09:29 - 09:32
cooled so that I can play games like Battlefield 2142,
09:32 - 09:33
which is my favorite game
09:34 - 09:36
and this is what it looks like when I run
09:37 - 09:39
it at 80% GPU utilization which is when I'm running
09:39 - 09:41
Battlefield at high res.
09:41 - 09:44
56°C and it's able to run that hot because of
09:44 - 09:48
the liquid cooling, because there's no way fans would be
09:48 - 09:51
able to keep it operating at 80% at this temperature.
09:51 - 09:55
Now this is the operating temperature for a consumer GPU.
09:55 - 09:58
The operating temperature for a data center GPU.
09:59 - 10:00
By the way, this is
10:00 - 10:04
a DALL-E image of our data center if we don't
10:04 - 10:06
have a good cooling solution.
10:07 - 10:11
You can see the H100 operating temperature tops out at 30°C,
10:11 - 10:14
so even way lower than what I've got in my
10:15 - 10:16
consumer GPU.
10:16 - 10:17
So we've got
10:17 - 10:19
to force a ton of air. You can see how
10:19 - 10:21
much air is required to flow through that thing.
10:22 - 10:25
This is just becoming unsustainable. We cannot push enough air
10:25 - 10:28
through our data centers to cool these kinds of systems
10:28 - 10:30
when they get to this kind of scale and still
10:30 - 10:32
get the density that we need in the data center.
10:34 - 10:37
So we're having to turn to other solutions. And Maya
10:37 - 10:40
is our first step towards the new design of data
10:41 - 10:44
centers in the cloud. Maya is a liquid cooled system.
10:44 - 10:47
This is a Maya board. So Maya boards have four
10:48 - 10:51
Maya accelerators. You can see those sheaths there at the
10:51 - 10:56
top. They're covering the Maya accelerator parts underneath them because
10:56 - 10:57
those sheaths
10:59 - 11:03
are what carry the liquid into the plates that are
11:03 - 11:05
on top of the Maya accelerators.
11:06 - 11:10
So these are liquid cooled systems custom designed by us.
11:10 - 11:12
Here's another look at that.
11:13 - 11:15
This is the liquid coolant in and out that goes
11:15 - 11:19
into those sheaths, carrying the liquid into those cold plates.
11:20 - 11:23
And this is allowing us to keep these systems cool
11:23 - 11:26
and to save water and to save energy. So it's
11:26 - 11:30
like a win, win, win. It's more complicated because we're
11:30 - 11:33
having to design cold plates custom for these things and
11:33 - 11:37
design custom liquid cooling systems. We can't go buy these
11:37 - 11:38
things off the shelf,
11:39 - 11:40
so we've been building
11:41 - 11:45
custom liquid cooled sidekicks and you've probably seen this video
11:45 - 11:48
before of the Maya rack, which consists of two sides.
11:48 - 11:50
This is, on the right, the back of the
11:51 - 11:54
Maya rack. There's four Maya servers on the top, four
11:54 - 11:57
Maya servers on the bottom, then the front end networking
11:57 - 12:00
in the middle, and then that second cabinet next to
12:01 - 12:04
it is the closed-loop liquid cooling system. You can
12:04 - 12:07
see the cabling, the cooling that was there on the
12:04 - 12:07
back, those giant cables carrying liquid, cold and hot, into
12:07 - 12:10
and out of the Maya
12:13 - 12:16
rack next to it, and those cables that we just
12:16 - 12:18
saw in the slide before it.
12:19 - 12:21
This is actually,
12:21 - 12:24
like I said, our first deployment of liquid cooling in
12:24 - 12:25
the Azure data center,
12:26 - 12:29
and it is the trend you will see liquid cooling
12:29 - 12:32
for next generations of GPUs coming out from AMD and
12:32 - 12:36
NVIDIA, for their high end offerings into the
12:36 - 12:39
data centers. And we are prepared for this both with
12:39 - 12:42
the sidekick like this that will work with those systems
12:42 - 12:46
as well as liquid cooled natively data center footprints that
12:46 - 12:50
can support liquid cooling without the sidekicks. The sidekicks let
12:50 - 12:53
us take this liquid cooling into any data center.
12:56 - 12:59
When we look at the power consumption of these models,
12:59 - 13:02
we see something interesting for training. Once you get a
13:02 - 13:05
training run going, you get up to close to 100% power
13:06 - 13:09
and stay pretty constant there. As the training run goes,
13:09 - 13:09
there might be
13:10 - 13:13
dips and spikes, but inference is a little bit different.
13:13 - 13:17
We'll take a closer look later, but with inference you
13:17 - 13:20
can see these spikes. Low power, relatively low power,
13:20 - 13:21
and then a big spike.
13:22 - 13:26
What's happening there is the difference between prompt processing and
13:26 - 13:30
token generation, which I'll get into later with some details.
13:30 - 13:33
But what we saw when we took a look at
13:33 - 13:35
these traces in the data center
13:36 - 13:39
is that we can actually save power
13:39 - 13:42
by not provisioning for the rated
13:42 - 13:46
100% power draw, but by oversubscribing the
13:46 - 13:47
power in the data center.
13:49 - 13:51
The idea is when you have so many
13:52 - 13:54
of these jobs happening at the same time, they're not
13:54 - 13:56
all going to spike at the same time. So there's
13:56 - 13:59
going to be an average power utilization which is below that peak.
14:00 - 14:02
And you can see the kind of headroom.
14:02 - 14:03
that we calculate:
14:03 - 14:07
3% for training, 21% for inference. That means we can
14:07 - 14:10
over subscribe power by 20% safely.
14:11 - 14:14
So we're building systems, and have actually started to deploy these
14:14 - 14:17
in our data centers that are monitoring the power draw
14:17 - 14:18
of these GPUs and servers
14:19 - 14:23
and then basically over subscribing the power and with the
14:23 - 14:26
ability to cap or throttle when we are reaching limits
14:26 - 14:29
if they all happen to suddenly spike at the same
14:29 - 14:32
time, we don't want the data center to fail. So
14:32 - 14:36
we throttle the frequency on the servers. We also throttle
14:36 - 14:38
the power going into the racks
14:39 - 14:43
through software managed power control in the data center. And
14:43 - 14:46
what this is doing is allowing us to put 30%
14:46 - 14:49
more servers in our existing data center footprints.
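A minimal sketch of that oversubscribe-and-cap idea; the numbers, names, and thresholds here are illustrative, not the production power-management system.

```python
def servers_per_rack(rack_budget_kw: float, peak_kw_per_server: float,
                     oversubscription: float = 1.3) -> int:
    """How many servers fit if we provision for average rather than peak draw.
    oversubscription=1.3 reflects the ~30% extra servers mentioned above."""
    nameplate = rack_budget_kw // peak_kw_per_server
    return int(nameplate * oversubscription)

def power_control_tick(measured_kw: float, rack_budget_kw: float, cap_gpu_frequency) -> None:
    """One tick of a software power-control loop: throttle if draw nears the real budget."""
    if measured_kw > 0.95 * rack_budget_kw:   # 0.95 is an illustrative safety margin
        cap_gpu_frequency()

# Example: a 40 kW rack of ~10 kW servers holds 4 at nameplate, 5 oversubscribed.
print(servers_per_rack(40, 10))   # -> 5
```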
14:50 - 14:52
This is literally hundreds of millions of dollars of savings
14:53 - 14:54
in power every year. And
14:55 - 14:56
one of the questions I get a lot is, what are you
14:57 - 15:00
doing about sustainability? Lots of things. Liquid cooling is more
15:00 - 15:03
sustainable; it actually uses much less water than our air
15:03 - 15:05
cooled data center. And this is another way that we're
15:05 - 15:08
getting more sustainable is just by making more efficient use
15:08 - 15:09
of that same power.
15:10 - 15:12
Now let's talk about networking.
15:12 - 15:15
In networking. Just to give you an idea of the
15:15 - 15:19
networking requirements of large scale AI, an AI training job
15:19 - 15:23
uses something called Data Parallel processing.
15:23 - 15:25
Where you've got lots of instances of the model all
15:26 - 15:28
learning different parts of the data at the same time.
15:28 - 15:31
And then there's this thing at the end called All
15:31 - 15:34
Reduce, where they all share what they've learned and update the model.
15:35 - 15:40
This sharing requires massive coordination across all the GPUs. They're
15:40 - 15:44
all sharing information with each other across whatever scale they
15:44 - 15:46
run at. And if you take a look at the
15:46 - 15:49
scale of those systems I was talking about, those
15:49 - 15:52
supercomputers, there are tens of thousands of servers
15:53 - 15:56
and they all have to be connected together to make
15:56 - 15:57
that efficient all-reduce happen.
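To make that all-reduce concrete, here is a generic PyTorch data-parallel sketch. It assumes a distributed process group has already been initialized (for example over NCCL, which rides on InfiniBand); it is an illustration of the pattern, not the internal training stack.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model: torch.nn.Module, loss: torch.Tensor) -> None:
    """Each rank computes gradients on its own shard of data, then all ranks
    average (all-reduce) those gradients so every model replica stays in sync."""
    loss.backward()
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # share what was learned
            param.grad /= world_size                            # average across replicas
```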
15:58 - 16:02
So this requires high bandwidth connections between them, and
16:02 - 16:05
low latency. And within the servers you also need low
16:06 - 16:10
latency because there's basically like mini data centers inside those
16:10 - 16:13
servers. There's eight GPUs on one of those
16:13 - 16:15
GB200 or H100 systems,
16:15 - 16:19
or the MI300X, and they're connected through their own
16:19 - 16:23
custom high-bandwidth connections
16:24 - 16:29
that have 1.2 terabytes, terabytes, not terabits,
16:29 - 16:32
per second of network bandwidth between those GPUs. So if
16:33 - 16:34
you can stay within
16:34 - 16:38
the server, you get amazing amounts of network bandwidth.
16:39 - 16:41
Now, coming out of the servers onto the network,
16:42 - 16:46
we now have 400 gigabit InfiniBand links coming out for each GPU,
16:48 - 16:52
so there's a total of 3.2 terabits of networking coming
16:52 - 16:55
out of these servers, and that is all connected
16:56 - 16:58
through our custom InfiniBand
16:59 - 17:02
topology. We're the only cloud that has InfiniBand at this
17:02 - 17:02
kind of scale,
17:04 - 17:08
and really the only difference between the supercomputers
17:08 - 17:10
we build for Open AI and what we make available
17:11 - 17:13
publicly is the scale of the InfiniBand domain.
17:14 - 17:17
In the case of OpenAI, the InfiniBand domain
17:17 - 17:19
covers the entire supercomputer, which is
17:20 - 17:22
tens of thousands of servers. In the case of our
17:22 - 17:25
public systems, where we don't have customers that are looking
17:25 - 17:27
for training at that kind of scale, and even
17:27 - 17:29
we're not training at that kind of scale,
17:30 - 17:34
the InfiniBand domains are 1,000 to 2,000 servers in size,
17:34 - 17:37
which is still 10,000 to 20,000 GPUs, which is a
17:37 - 17:41
massive supercomputer itself. So you can get a massive
17:41 - 17:45
supercomputer of that scale in Azure through our public infrastructure.
17:46 - 17:50
Now, I like to look at pictures of what's inside these
17:50 - 17:54
things. This is an NVIDIA H100 system. You can see the
17:54 - 17:57
8 GPUs up there. These are the back of our
17:57 - 18:02
racks, with those cyan cables. Those are the InfiniBand cables we're laying down.
18:03 - 18:06
We've laid down at this point enough InfiniBand in our
18:06 - 18:08
data centers to wrap the Earth five times.
18:09 - 18:12
And by the way, the amount that we're scaling our
18:12 - 18:16
systems, Kevin Scott talked about this, or I think
18:16 - 18:20
Satya did: 30X since November. We've built out 30X
18:22 - 18:25
the size of our AI infrastructure since November, the equivalent
18:25 - 18:28
of five of those supercomputers I showed you every single
18:28 - 18:30
month, and that rate continues to increase.
18:31 - 18:33
So this means we need to lay a lot of
18:33 - 18:37
cabling both for our front end and back end networking.
18:37 - 18:40
Well, not all the innovation has to be super high tech.
18:41 - 18:44
One of the things that we found is that the
18:44 - 18:49
traditional cabling in data centers was very low tech. It
18:49 - 18:53
required technicians to just go along, pulling the cables,
18:54 - 18:55
stretching them between servers
18:56 - 19:02
and just hugely inefficient. So what our data center incubations
19:02 - 19:06
team came up with is 3D printed parts,
19:09 - 19:12
something they call Jeffries, that allow them to take these
19:14 - 19:17
cables and pull them down sleds that are above the racks
19:18 - 19:21
to pull that cabling, and this makes laying down
19:21 - 19:23
the cabling three times faster.
19:23 - 19:26
And we're using this on our front end networks to
19:26 - 19:29
lay down that much cabling. We started with InfiniBand, but
19:29 - 19:32
we've now got a different solution for InfiniBand, but we're
19:32 - 19:35
using this now in all our data centers as we start to
19:35 - 19:37
build this out. So not really super high tech,
19:37 - 19:41
these are 3D printed things, but innovation, you know,
19:41 - 19:43
comes in whatever way innovation comes.
19:44 - 19:46
Now let me talk a little bit about storage.
19:47 - 19:51
So for training especially, models, like we've
19:51 - 19:54
said, are big. You need to distribute them to all these servers.
19:54 - 19:57
For inference, the same thing: lots of models deployed on
19:57 - 20:00
these clusters. How do you distribute all of that data?
20:00 - 20:02
The models? The model checkpoints?
20:03 - 20:05
The key value caches, if you need to move
20:05 - 20:07
them or reload them hot.
20:08 - 20:11
What we've built inside of our infrastructure is custom cluster
20:11 - 20:14
storage; we call it the Storage Accelerator. The idea here is
20:14 - 20:17
we wanted to make this extremely simple and extremely reliable.
20:17 - 20:20
You don't need really a parallel file system for this.
20:20 - 20:23
You just need something that can pull data in, distribute
20:23 - 20:26
it within the cluster so it doesn't have to go
20:26 - 20:27
all the way back to storage.
20:28 - 20:31
And so the solution is when a worker needs a
20:31 - 20:34
model, for example, it goes and checks
20:35 - 20:36
if it's got it,
20:36 - 20:38
or if the cache has it. If it doesn't,
20:38 - 20:39
it pulls it in.
20:40 - 20:43
And then it distributes it to the other servers,
20:44 - 20:48
which are typically in the GPU cluster that's assigned
20:48 - 20:51
to this particular inference domain, for example. And it distributes
20:51 - 20:55
this through either InfiniBand or Ethernet, depending on the cluster.
20:56 - 20:59
And it does it without interfering with the running workloads
20:59 - 21:01
there. So it can take advantage of all the free
21:01 - 21:04
bandwidth to distribute these things around so that another worker,
21:04 - 21:07
when it needs something, can very quickly get it from
21:08 - 21:09
a nearby cache node, even over InfiniBand.
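A sketch of that pull-through behavior; the class and method names are hypothetical, not the Storage Accelerator API.

```python
class ClusterModelCache:
    """Pull-through cache: check local blocks, then a peer over the cluster fabric,
    and only go back to Azure Storage on a miss everywhere."""

    def __init__(self, local_store: dict, peers: list, azure_storage):
        self.local = local_store        # block store on this node
        self.peers = peers              # other cache nodes reachable over InfiniBand/Ethernet
        self.storage = azure_storage    # origin storage account

    def get(self, model_id: str) -> bytes:
        if model_id in self.local:                 # already here
            return self.local[model_id]
        for peer in self.peers:                    # fast path: copy from a peer's cache
            blob = peer.try_get(model_id)
            if blob is not None:
                self.local[model_id] = blob
                return blob
        blob = self.storage.download(model_id)     # slow path: all the way back to storage
        self.local[model_id] = blob                # keep it for the next worker
        return blob
```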
21:10 - 21:13
So parallel reads from the cache, these are stored in
21:13 - 21:17
blocks. There's no metadata server, there's no replication across the
21:17 - 21:20
servers. It's very simple and very fast, and I'll show
21:21 - 21:23
you a quick demo of that right here.
21:24 - 21:27
So on the left side is without the storage cache
21:27 - 21:29
we're going to load in Llama 3 70B, again, coming
21:30 - 21:33
directly from Azure Storage. On the right side, it's gonna
21:33 - 21:36
pull it from another cache node over InfiniBand in the cluster.
21:38 - 21:40
And what we'll see is
21:41 - 21:43
that it took about 12 minutes to load that from
21:43 - 21:44
Azure storage.
21:44 - 21:47
It's about 270 gigabytes of data, and it took less
21:47 - 21:50
than half that time to pull it from the other
21:50 - 21:52
server, the cache in the cluster.
21:53 - 21:56
This is really important for the model as a service
21:56 - 21:58
that we've announced because we're going to have lots of
21:58 - 22:01
models in these clusters. We need to very efficiently get
22:01 - 22:03
the models up and running on whatever GPUs they get
22:03 - 22:06
assigned to. And this is just a fundamental physics problem.
22:06 - 22:08
How do you get that data loaded?
22:10 - 22:11
So that's a look at
22:12 - 22:15
that kind of hardware and the lowest layers of the
22:15 - 22:19
software for running our AI infrastructure, I'm gonna go one
22:19 - 22:22
level up and talk about how we resource manage all
22:22 - 22:24
of those GPUs in that infrastructure.
22:25 - 22:28
And it's important to keep in mind that what we're aimed
22:28 - 22:32
at with all of this is efficiency, power efficiency, time
22:32 - 22:35
efficiency, making use of those resources as close to
22:35 - 22:39
100% as possible. So we're not wasting anything.
22:39 - 22:43
And to support that, we've got micro optimizations and we've
22:43 - 22:45
got macro optimizations.
22:45 - 22:47
Let's start with the macro optimizations.
22:48 - 22:51
Everything that I'm talking about now is built on something:
22:51 - 22:55
our internal AI workload platform. It's a resource manager
22:55 - 22:58
that knows about an AI job. It also knows about
22:58 - 23:02
AI models for inference and it's called Project Forge. Internally,
23:02 - 23:05
it's got a different name that we decided wasn't right.
23:05 - 23:09
Internally it's called Singularity, which kind of has
23:09 - 23:10
some negative aspects to it,
23:11 - 23:13
so I'm not supposed to tell you that,
23:15 - 23:17
but you can see that it's got a bunch of
23:17 - 23:21
subsystems associated with it supporting these things. One of the
23:21 - 23:23
key ones is this global scheduler.
23:25 - 23:28
The global scheduler
23:28 - 23:30
treats all of the GPUs in all of the regions
23:30 - 23:33
around the world as a single pool and we call
23:33 - 23:33
it one pool.
23:34 - 23:37
And the idea is that when we have that kind of
23:38 - 23:42
capacity to resource manage, we can do it more
23:42 - 23:48
efficiently. Because the challenge that you have with looking at it
23:48 - 23:49
cluster by cluster,
23:50 - 23:53
is something that we had at Microsoft up
23:53 - 23:57
until last year and that we see many enterprises have,
23:57 - 24:00
which is you assign GPUs to individual teams.
24:01 - 24:03
And this has 2 problems. One team doesn't use all
24:03 - 24:06
their GPUs, and the GPUs they're not using
24:06 - 24:08
are sitting there doing nothing.
24:08 - 24:11
Another team has used all their GPUs and would like
24:11 - 24:13
to use more GPUs, but they don't have access to
24:13 - 24:16
them, including the ones that the next team next to
24:16 - 24:17
them is not using.
24:18 - 24:20
So the idea with one pool is.
24:21 - 24:25
Everybody gets a virtual GPU, not physical GPUs. They get
24:25 - 24:29
a certain amount at a certain priority. We've
24:29 - 24:32
got three priority levels: low, standard, and premium,
24:33 - 24:36
and if a premium job comes in and there's
24:36 - 24:38
a low priority job running on the GPUs it needs,
24:38 - 24:42
the low priority job gets evicted. Now, the interesting thing about this
24:43 - 24:46
is that this is a global pool. And just because
24:46 - 24:49
it gets evicted from a cluster in a particular region
24:49 - 24:52
doesn't mean there's not capacity somewhere else that it can run on.
24:54 - 24:57
And Project Forge knows that. So it might say, well,
24:57 - 25:00
you need A100s, this higher priority job
25:00 - 25:03
needs your A100s, you get evicted. You're gonna go
25:03 - 25:06
now to another region and I'll restart you there on
25:06 - 25:09
GPUs that the higher priority job can't use.
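A toy sketch of that global, priority-based placement. The three priority levels come from the talk; everything else (data structures, names) is illustrative.

```python
from dataclasses import dataclass

PRIORITY = {"low": 0, "standard": 1, "premium": 2}

@dataclass
class Job:
    name: str
    priority: str
    gpus: int

def place(job: Job, regions: dict) -> str:
    """Treat every region's GPUs as one pool: place wherever there is room,
    evicting lower-priority work if needed and restarting it elsewhere."""
    for name, region in regions.items():
        if region["free"] >= job.gpus:
            region["free"] -= job.gpus
            region["running"].append(job)
            return name
    for name, region in regions.items():           # no free capacity: look for a victim
        for victim in list(region["running"]):
            if PRIORITY[victim.priority] < PRIORITY[job.priority] and victim.gpus >= job.gpus:
                region["running"].remove(victim)
                region["free"] += victim.gpus
                placed = place(job, regions)       # higher-priority job takes the GPUs
                place(victim, regions)             # evicted job restarts wherever it can
                return placed
    return "queued"
```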
25:10 - 25:13
And if you take a look, we've migrated all of
25:14 - 25:16
our first party training onto One Pool,
25:17 - 25:19
and now if you take a look at three
25:19 - 25:22
different teams here, this is actual charts from three different
25:22 - 25:23
teams' utilization.
25:24 - 25:25
You can see Team
25:25 - 25:29
A has over 100% utilization, and the reason why it's over 100%
25:30 - 25:33
is because 100% is their guaranteed capacity.
25:34 - 25:36
But they were able to use more because these other
25:36 - 25:38
teams aren't using all their capacity.
25:39 - 25:42
And so this is the benefit, and the average total
25:42 - 25:46
utilization now across these three teams, if you
25:46 - 25:48
take a look at them, is about 100%.
25:50 - 25:52
And if you take a look in aggregate across all
25:52 - 25:55
of Microsoft for all of our GPU usage for training.
25:55 - 25:59
We went from 50 to 60% utilization to between 80
25:59 - 26:01
and 90 right now and we expect to get even
26:02 - 26:05
higher. And so this all benefits our bottom
26:05 - 26:07
line and the costs that we pass on.
26:08 - 26:09
Now, another benefit of Project Forge.
26:10 - 26:14
is the reliability system. When you're running these large jobs,
26:14 - 26:16
they take days, weeks, or even months in the case
26:16 - 26:18
of some of those Frontier models,
26:19 - 26:22
and you're inevitably with that amount of hardware, going to
26:22 - 26:25
have failures on a pretty regular basis. We see failures
26:25 - 26:28
on those large scale systems. If you have 1,000 GPUs,
26:28 - 26:30
you're going to see a failure roughly every two or
26:30 - 26:33
three days of some kind. A GPU is going to
26:33 - 26:35
fail, the server is going to fail, RAM is going
26:35 - 26:37
to fail, a network link is going to fail.
26:38 - 26:41
And if you're having to babysit those jobs and manually
26:41 - 26:45
restart them, manually diagnose what's going on, manually move it
26:45 - 26:48
to another healthy server, you're just never going to make
26:48 - 26:52
progress and you're going to have horrible utilization and efficiency.
26:53 - 26:56
So Project Forge is designed with reliability in mind. This
26:57 - 27:00
is an actual trace from the Project Forge dashboard. It's got
27:00 - 27:04
automatic failure detection. So it will automatically detect if any
27:04 - 27:07
of these kinds of parts fail, it'll automatically diagnose it.
27:08 - 27:12
It'll automatically take it out of rotation, automatically file tickets
27:12 - 27:14
for the data center support people,
27:14 - 27:17
and it will automatically restart the job
27:18 - 27:21
again from the checkpoint on healthy servers to let it
27:22 - 27:25
continue. This is the way we get basically automated long
27:25 - 27:28
running reliability for our jobs.
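A minimal sketch of that automated recovery loop; the callback names are hypothetical and the real Project Forge pipeline is much richer.

```python
import time

def run_with_auto_restart(start_job, detect_failure, diagnose, remove_from_rotation,
                          file_ticket, latest_checkpoint, healthy_servers):
    """Keep a long-running training job alive: detect a failure, pull the bad node
    out, file a ticket, and restart from the last checkpoint on healthy servers."""
    job = start_job(healthy_servers(), checkpoint=latest_checkpoint())
    while not job.done():
        failed_node = detect_failure(job)
        if failed_node is not None:
            diagnose(failed_node)                  # automatic diagnosis
            remove_from_rotation(failed_node)      # take it out of the pool
            file_ticket(failed_node)               # notify data center support
            job = start_job(healthy_servers(), checkpoint=latest_checkpoint())
        time.sleep(30)                             # illustrative polling interval
    return job
```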
27:28 - 27:31
This is, by the way, an actual trace. This is, for
27:31 - 27:33
example, some of the Phi-3 runs that were done
27:33 - 27:36
on Project Forge. Like I said, all of our
27:36 - 27:38
runs, including all the Phi runs happen on top of
27:38 - 27:41
this Project Forge infrastructure. This is 1,024 GPUs,
27:42 - 27:45
a job that ran for two days, 11 hours, so
27:45 - 27:47
close to 2 1/2 days.
27:47 - 27:50
On 1100 GPUs.
27:53 - 27:55
So let me take a look at micro optimizations.
27:56 - 27:59
If we take a look at the anatomy of an
27:59 - 28:02
LLM inference, it's broken up into two phases. The first
28:02 - 28:06
phase is called the prompt phase. This is where the
28:06 - 28:10
whole prompt is processed in parallel. This is the really
28:10 - 28:13
efficient phase, because the GPU can do the whole thing
28:13 - 28:17
at once. Boom, using all of the compute on the
28:17 - 28:20
GPU and it gets the prediction for the next token,
28:20 - 28:22
which in this case is yes.
28:23 - 28:26
Now it goes back and it's entering now the token
28:26 - 28:29
phase, or the generation phase or the decode phase. People
28:29 - 28:32
call it different things, and at this point it's doing
28:32 - 28:33
next word prediction one at a time.
28:34 - 28:37
This next word was yes. What's the next word?
28:40 - 28:43
And then that's it, end of sequence, EOS.
28:43 - 28:47
So this is the token generation phase. Now very different
28:47 - 28:51
characteristics from these two, which we'll get into in a
28:51 - 28:56
second, but also different AI applications have different ratios of
28:56 - 28:59
prompt versus token generation, or decode:
29:00 - 29:03
on the top left, content creation, short prompt,
29:03 - 29:05
long generation.
29:05 - 29:07
On the bottom right,
29:08 - 29:11
long prompt for summarization, short generation,
29:12 - 29:14
and then you can see chat bots and enterprise chat
29:14 - 29:16
bots are kind of in the middle.
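In code, those two phases look roughly like this. It's a generic sketch: the model here is a hypothetical callable that returns per-position logits plus a KV cache, not any specific serving stack.

```python
def generate(model, prompt_tokens: list, max_new_tokens: int, eos_id: int) -> list:
    """Prompt (prefill) phase: one parallel, compute-bound pass over the whole prompt.
    Token (decode) phase: memory-bound, one next-token prediction at a time."""
    logits, kv_cache = model(prompt_tokens, kv_cache=None)   # prefill: whole prompt at once
    next_token = logits[-1].argmax()
    output = []
    while len(output) < max_new_tokens and next_token != eos_id:
        output.append(next_token)
        # Decode: feed just the new token, reusing the retained KV cache.
        logits, kv_cache = model([next_token], kv_cache=kv_cache)
        next_token = logits[-1].argmax()
    return output
```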
29:17 - 29:20
If you take a look at naively scheduling
29:21 - 29:24
prompts and generations on the same GPUs,
29:25 - 29:28
what's gonna happen is this: you get a prompt to
29:28 - 29:29
come in from one session.
29:30 - 29:32
The GPU starts to process it.
29:33 - 29:37
And another prompt comes in while it's generating, and
29:37 - 29:40
that first generation slows down because the GPU now is
29:40 - 29:42
busy dealing with that second prompt.
29:43 - 29:45
Now that second prompt when it's finished.
29:46 - 29:49
it goes into generation mode, and now you're memory
29:49 - 29:53
constrained rather than GPU processing constrained. But now
29:53 - 29:57
they can both proceed without interfering with each other. So
29:57 - 30:00
the problem is these prompts that arrive and just
30:00 - 30:03
demand a lot of GPU, interfering with the existing
30:03 - 30:07
generations that are happening. How do you deal with that?
30:07 - 30:11
Well, we developed an internal project called Flywheel to deal with
30:11 - 30:14
this. How many people have heard of PTU, the PTU
30:14 - 30:17
managed offering? So a few of you have. This is
30:17 - 30:19
our serverless GPU offering
30:20 - 30:23
for inference, and it works on top of project Flywheel.
30:23 - 30:26
The idea is we don't want to have to give
30:26 - 30:29
you a whole GPU for your inference where now you're
30:29 - 30:34
responsible for getting that thing to 100% utilization. We'd like
30:34 - 30:35
to give you a
30:35 - 30:36
fraction of a GPU.
30:36 - 30:39
And we're going to share that GPU with other customers,
30:39 - 30:42
of course, in a very secure way. We're not mixing
30:42 - 30:44
prompts and tokens from different customers,
30:45 - 30:46
but we need to do it in a way that
30:46 - 30:50
gives you guaranteed performance, because you don't want a noisy
30:50 - 30:53
neighbor problem: your app is running and suddenly
30:53 - 30:56
another prompt from another customer comes in and now your
30:56 - 30:59
app slows down. So Project Flywheel is aimed at providing that
30:59 - 30:59
enterprise grade
31:00 - 31:02
consistency. So a prompt comes in,
31:03 - 31:07
starts generating tokens at normal speed, and another prompt comes
31:07 - 31:10
in. We don't process the whole prompt, we chunk it,
31:11 - 31:15
and we interleave its prompt generation with the generation of
31:15 - 31:18
the tokens from the first prompt.
31:19 - 31:21
And now you don't get that kind
31:21 - 31:24
of interference effect. And the key here is how do
31:24 - 31:26
you do that and provide a consistent
31:27 - 31:31
delivery of throughput, which is the way that you want to think about it:
31:32 - 31:34
what's the throughput of my generations?
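A toy sketch of that chunked interleaving: split incoming prompts into fixed-size chunks and mix them with one decode step per active session, so no single prompt can monopolize the GPU. The chunk size, budget, and session fields here are illustrative, not Project Flywheel internals.

```python
from collections import deque

CHUNK = 512   # illustrative prompt chunk size, in tokens

def schedule_step(prefill_queue: deque, decode_sessions: list, token_budget: int) -> list:
    """Build one GPU iteration: a decode token for every generating session first,
    then fill what's left of the budget with prompt chunks."""
    batch = []
    for session in decode_sessions:                     # keep generations flowing smoothly
        batch.append(("decode", session, 1))
        token_budget -= 1
    while prefill_queue and token_budget >= CHUNK:      # interleave chunks of new prompts
        session = prefill_queue[0]
        take = min(CHUNK, session.remaining_prompt_tokens)
        batch.append(("prefill", session, take))
        session.remaining_prompt_tokens -= take
        token_budget -= take
        if session.remaining_prompt_tokens == 0:
            prefill_queue.popleft()
            decode_sessions.append(session)             # finished prompt starts generating
    return batch
```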
31:35 - 31:36
So I want to show you a quick demo here
31:36 - 31:37
of Project Flywheel
31:38 - 31:40
Up at the top, this is an actual trace from Project
31:40 - 31:43
Flywheel. You can see naive prompt and token
31:44 - 31:46
processing. The big blocks are the big prompts
31:46 - 31:49
coming in; the small blocks are the token generations for
31:49 - 31:51
different sessions and different colors.
31:52 - 31:55
You can see that they're all over the place, and
31:55 - 31:55
you get big
31:55 - 32:00
gaps of white. Those whites are wasted GPU. Plus
32:00 - 32:02
the latency, the throughput,
32:03 - 32:05
is all over the place for these different sessions. Now
32:05 - 32:08
on the bottom with Project Flywheel, you can see everything's
32:09 - 32:12
nice and consistent. All of the blocks are the same
32:12 - 32:15
size, whether they're prompt or generation.
32:16 - 32:16
Now here's
32:17 - 32:21
three different workloads, one that has big prompts
32:22 - 32:25
and small generations, one that's about balanced, and then
32:25 - 32:29
on the right small prompts, big generations to demonstrate different
32:29 - 32:30
AI workloads.
32:31 - 32:32
They all have different
32:32 - 32:35
throughput requirements, and we're gonna run them
32:36 - 32:37
and then plot
32:38 - 32:41
what throughput Project Flywheel is giving them,
32:42 - 32:45
because they've each signed up for their own throughputs that
32:45 - 32:47
they need for their app to behave well.
32:47 - 32:49
And you can see they're all in different colors. You
32:49 - 32:50
can see prompt tokens per minute.
32:52 - 32:55
You can see the different green, yellow and blue for
32:55 - 32:56
those different workloads.
32:58 - 33:01
And you can see generated tokens per minute. So the
33:01 - 33:05
blue obviously is generating a lot of tokens. That's the one
33:05 - 33:07
that would be doing the kind of
33:08 - 33:11
small prompt, big generation workload. But you can see all of
33:11 - 33:15
them have very consistent throughput, very consistent latency because of
33:15 - 33:19
that multiplexing Project Flywheel is doing. Another benefit
33:19 - 33:22
of this is you can dynamically raise the throughput on
33:22 - 33:25
any one of these, and it will, if there's enough
33:25 - 33:26
capacity on those GPUs.
33:28 - 33:30
So this is the way that you can dynamically without
33:30 - 33:31
having to go buy another GPU,
33:32 - 33:35
be able to get a slice of a GPU and be
33:35 - 33:37
able to dial up and down how much of the
33:37 - 33:40
GPU you need as your job changes, as your application changes.
33:42 - 33:45
So that's an example of a micro optimization. I want to
33:45 - 33:49
show you another one. How many people have heard of
33:49 - 33:52
LoRA, or low rank adaptation fine tuning? A few of you
33:52 - 33:53
have. LoRA
33:53 - 33:54
is the way that
33:54 - 33:57
people fine-tune AI models now. Now let me talk
33:57 - 34:00
about what fine tuning is. Fine tuning is when you
34:00 - 34:02
take a base model and you want to give it
34:02 - 34:05
new data to train on, to change its behavior or
34:05 - 34:08
to give it some new knowledge. It's typically to change behavior,
34:08 - 34:10
like I want you to speak in JSON format or
34:11 - 34:12
I want you to speak like a doctor.
34:14 - 34:17
Those would be examples of fine tuning.
34:18 - 34:21
The way that traditionally it was done is you make
34:21 - 34:24
a whole copy of the model and you
34:24 - 34:28
basically continue the training by giving it a small, targeted data
34:28 - 34:31
set, like, you know, examples of a doctor talking.
34:32 - 34:34
And then you spit out a new model
34:34 - 34:35
that knows how to talk like a doctor.
34:37 - 34:39
This is really inefficient.
34:40 - 34:42
So we came up with a way to
34:43 - 34:47
make fine tuning much more efficient by training what's called
34:47 - 34:51
an adapter. So you create some extra weights, called an adapter,
34:52 - 34:54
instead of copying the whole model.
34:55 - 34:58
Then you train the whole thing, but you don't modify
34:58 - 35:00
the base model, just this adapter's weights,
35:01 - 35:04
and what you end up with is a fine-tuned model
35:04 - 35:07
that is really the combination of the adapter plus the
35:07 - 35:11
base model. So you can have literally thousands of fine-tuned
35:11 - 35:14
versions of the base model. Each one of those adapters
35:14 - 35:15
is small. How small?
35:16 - 35:19
Well, if you take a look at a
35:19 - 35:23
175 billion parameter model, GPT-3,
35:24 - 35:27
you might have an adapter size of just 100 megabytes.
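The reason the adapter is so small is that instead of updating a full weight matrix, LoRA trains two low-rank factors next to it. A minimal PyTorch sketch for a single layer, with illustrative sizes:

```python
import torch

d, r = 4096, 16                                   # hidden size and LoRA rank (illustrative)
W = torch.randn(d, d)                             # frozen base weight
A = torch.nn.Parameter(torch.randn(r, d) * 0.01)  # trainable low-rank factors
B = torch.nn.Parameter(torch.zeros(d, r))         # starts at zero, so the delta starts at zero

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # Base path plus low-rank update: x W^T + (x A^T) B^T. Only A and B are trained.
    return x @ W.T + (x @ A.T) @ B.T

x = torch.randn(2, d)
print(lora_forward(x).shape)                      # torch.Size([2, 4096])
print(W.numel(), A.numel() + B.numel())           # ~16.8M base vs ~131K adapter parameters
```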
35:28 - 35:33
And so to fine-tune it traditionally takes 96 GPUs, versus 24 GPUs
35:33 - 35:36
for LoRA;
35:37 - 35:40
one terabyte of model weights per fine-tuned model for traditional fine tuning,
35:41 - 35:43
versus the one terabyte base plus 200 megabytes per adapter for LoRA.
35:46 - 35:48
And then switching between models
35:49 - 35:52
takes minutes to load that big giant model with a
35:52 - 35:55
new version that is fine-tuned. For LoRA adapters, you just
35:55 - 35:57
need to load the adapter.
35:58 - 36:01
And it has really no downsides: no additional inference
36:01 - 36:04
latency. You get more training throughput at the same time.
36:06 - 36:09
LoRA was developed by Microsoft Research, and it
36:09 - 36:11
is the industry standard now for how to do this.
36:12 - 36:14
But we've taken it one step further inside of our
36:14 - 36:15
production systems.
36:16 - 36:18
Because if you take a look at, you know, we've
36:18 - 36:20
got fine tuning as a service now with GPT-3.5.
36:20 - 36:22
Now we have GPT-4, and we have other
36:22 - 36:24
models coming. Fine tuning as a service where you just
36:24 - 36:26
give us a data set, we take care of the
36:26 - 36:26
fine tuning.
36:27 - 36:30
You then load your fine-tuned models. There's going to be lots
36:30 - 36:32
of different fine-tune adapters on top of these base models.
36:33 - 36:35
We don't want to just be able to load
36:35 - 36:37
one, remove it, and load another one.
36:38 - 36:40
And if you take a look at traditional fine tuning,
36:41 - 36:42
that's what you're doing. But if you take a look
36:43 - 36:45
at something we've developed called multi-LoRA fine tuning, which we've
36:45 - 36:46
got deployed,
36:47 - 36:50
we load multiple LoRA adapters on top of the base
36:50 - 36:53
models, and they're all in the GPU.
36:53 - 36:56
So when a request from one customer comes in for
36:56 - 36:59
the doctor adapter, it gets processed. When another one comes
36:59 - 37:02
in for the lawyer adapter, it gets processed at the
37:02 - 37:05
same time. We don't have to load and unload.
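A sketch of that serving path, with hypothetical names (the generate(prompt, lora=...) interface is an assumption, not a real API): all the adapters stay resident, and each request is just routed to its adapter.

```python
class MultiLoraServer:
    """Keep the base model plus many small LoRA adapters resident on the GPU and
    pick the right adapter per request, instead of swapping whole fine-tuned models."""

    def __init__(self, base_model, adapters: dict):
        self.base = base_model      # shared base weights, loaded once
        self.adapters = adapters    # e.g. {"doctor": doctor_lora, "lawyer": lawyer_lora, ...}

    def infer(self, adapter_name: str, prompt: str):
        adapter = self.adapters[adapter_name]            # no load/unload, just a lookup
        return self.base.generate(prompt, lora=adapter)  # hypothetical inference call
```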
37:06 - 37:08
So I got a quick demo of that.
37:10 - 37:14
So here I'm gonna do it as traditional inference, with
37:14 - 37:18
one LoRA adapter loaded in the model, without multi-LoRA.
37:18 - 37:21
You can see when I try to load a
37:21 - 37:23
second adapter and query it,
37:23 - 37:26
I get a failure because the model's not ready because
37:26 - 37:28
it's trying to load the new adapter.
37:30 - 37:33
And hasn't finished by the time I sent the request
37:33 - 37:34
in. So now I'm gonna load,
37:37 - 37:39
into the second GPU, a thousand models,
37:39 - 37:42
and I'm going to just ping
37:42 - 37:44
a few of those LoRA adapters, and you can see
37:44 - 37:47
that I'm getting responses from all of them because they're
37:47 - 37:50
all sitting there in the GPU. And so this would
37:50 - 37:53
be the equivalent of multiple customers coming in with requests
37:53 - 37:56
at the same time for their different fine tune models,
37:56 - 37:58
and we're able to serve them all because of
37:58 - 38:02
this efficiency. And the question is, does that impact
38:02 - 38:05
the performance of the serving? Do these LoRA adapters interfere with each
38:05 - 38:07
other?
38:08 - 38:10
You see on the left is a single GPT-3.5
38:10 - 38:11
fine-tuned adapter; on the right,
38:12 - 38:15
1,000 of them being queried all at
38:15 - 38:16
the same time,
38:17 - 38:18
and the latency is the same.
38:19 - 38:23
So no performance degradation, no impact, but we're able to
38:23 - 38:27
serve 1,000 models. Basically, think of it as
38:27 - 38:29
1,000 custom versions of GPT-3.5 on the same GPU.
38:31 - 38:36
And this is the concurrency latency run for
38:36 - 38:38
10 models and 25 models,
38:40 - 38:43
calling them with 20 to 25 requests at the same time.
38:44 - 38:46
So this is another really key way for us
38:46 - 38:50
to raise the efficiency of those GPUs in the world
38:50 - 38:53
where we're going to have lots of fine-tuned models out there.
38:55 - 38:58
So that's another example of the kind of innovation that we've got.
38:59 - 39:02
Now I want to switch gears and talk about the
39:02 - 39:04
evolution of computing in general.
39:09 - 39:11
When we take a look at cybersecurity,
39:12 - 39:15
it's been a world of logical protection of data.
39:17 - 39:20
You've got encryption at rest. And this is something that
39:20 - 39:23
all the cloud providers now have both server side keys,
39:23 - 39:26
where we're encrypting the data, and then customer managed keys, where
39:26 - 39:29
you can define your own keys and encrypt the data
39:29 - 39:30
a second time. On top of that,
39:31 - 39:34
you also have encryption of data in transit. All network
39:34 - 39:37
communications are expected to be encrypted now,
39:38 - 39:41
and that protects the data in transit and at rest.
39:41 - 39:42
But what's been missing
39:43 - 39:46
is protecting the data while it's being used.
39:47 - 39:48
It gets to the server
39:49 - 39:52
either loaded from storage or through the network, and now
39:52 - 39:55
it's in the clear when it's decrypted. And now it's
39:55 - 39:58
being processed as part of a training job, part of
39:58 - 40:01
a data analytics job. It's sitting out there in the open.
40:02 - 40:05
And what's it susceptible to, what kind of risks, when it's sitting out
40:05 - 40:07
there
40:07 - 40:10
in the open on the server? Well, it's susceptible to
40:11 - 40:15
Somebody that breaches the infrastructure, it's susceptible to insiders, it's
40:15 - 40:18
susceptible to operators that get access, administrators.
40:20 - 40:22
What we want to do is keep that data as
40:22 - 40:26
secure as possible through its whole life cycle. So
40:27 - 40:31
something called Trusted Computing emerged in the late 2000s with
40:32 - 40:34
ARM TrustZone and Intel SGX.
40:34 - 40:38
And when I went into Azure
40:38 - 40:41
and I was looking at SGX,
40:42 - 40:43
I realized that this would be
40:44 - 40:46
filling in the third leg
40:46 - 40:48
of this data protection:
40:51 - 40:53
protecting data while it's in use.
40:53 - 40:57
Now what are these technologies? They're originally called trusted
40:57 - 41:01
execution environments. We've branded this concept confidential computing:
41:01 - 41:05
protecting data with hardware. And that hardware protection
41:06 - 41:08
is based on a root of trust in the CPU,
41:09 - 41:13
where it creates effectively a box, an encrypted box, in
41:14 - 41:15
the CPU's memory.
41:16 - 41:19
Nothing can get into it after it's been created. Nothing
41:19 - 41:21
can see into it after it's been created.
41:22 - 41:23
And it's even encrypted
41:25 - 41:28
physically when that data leaves the CPU and goes to
41:28 - 41:32
memory. So even somebody with physical access
41:32 - 41:35
can't easily sniff the memory bus, for example, and see
41:35 - 41:39
the data. So you get logical protection and physical protection.
41:39 - 41:40
But it's not just that
41:41 - 41:44
the real key to confidential computing, besides just that
41:44 - 41:48
inherent protection, is being able to prove that you're protected.
41:48 - 41:51
So that workload running inside of that box can ask
41:51 - 41:54
the CPU, give me proof that I can present to
41:54 - 41:56
somebody else that this I am this code, I'm this
41:56 - 41:59
piece of code, and I'm being protected by you. And
41:59 - 42:03
this is called an attestation report. And that attestation report
42:03 - 42:06
can be handed to another application. It can be handed
42:06 - 42:09
to a human with a policy evaluation on top of
42:09 - 42:11
it. It can be handed to a key release
42:11 - 42:14
service, set with a policy that says only release this key
42:15 - 42:17
if the code is this
42:17 - 42:20
and it's being protected by this hardware.
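A simplified sketch of that key-release decision. A real attestation report is a signed hardware quote and verification involves checking that signature chain; this just shows the policy gate.

```python
def release_key(attestation: dict, policy: dict, key: bytes):
    """Release the decryption key only if the attested code identity and the
    hardware/firmware versions match the policy the data owner set."""
    if attestation["code_measurement"] != policy["expected_code_measurement"]:
        return None                                   # not the code we trust
    if attestation["hardware"] not in policy["trusted_hardware"]:
        return None                                   # not protected by hardware we trust
    if attestation["firmware_version"] < policy["min_firmware_version"]:
        return None                                   # out-of-date firmware
    return key   # the workload inside the TEE can now decrypt the model or the prompt
```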
42:22 - 42:26
And this then creates this sealed environment, this confidential computing
42:26 - 42:30
environment, where the data can be processed with minimized
42:30 - 42:33
risk of those other kinds of attacks that
42:33 - 42:34
I talked about.
42:36 - 42:42
Since the early 2010s, we've been working with Intel
42:42 - 42:43
and AMD to bring out
42:44 - 42:47
confidential computing hardware that can support not just what are called
42:47 - 42:50
enclaves or small boxes, but actually full virtual machines. And
42:50 - 42:53
a couple of years ago, we announced with AMD the
42:53 - 42:54
first confidential virtual machines.
42:55 - 42:57
And we released them in Azure. Then we've
42:57 - 43:01
announced confidential virtual machines with Intel, their TDX technology and
43:01 - 43:02
we've released them in Azure.
43:03 - 43:05
But the time for
43:06 - 43:10
confidential computing is really coming, because we've been working for
43:11 - 43:15
the last few years with NVIDIA to codesign confidential computing
43:15 - 43:16
with them into
43:16 - 43:17
their GPU lines.
43:18 - 43:22
And the first line to have confidential GPU, confidential computing
43:22 - 43:24
in GPUs, is the H100.
43:25 - 43:29
The H100 has this Trusted Execution environment on it, which
43:29 - 43:33
protects the model weights, protects the data going into the
43:33 - 43:34
model, encrypting them,
43:36 - 43:38
and it's able to attest to what's inside of
43:38 - 43:41
the GPU. So that, for example, you've got this confidential
43:41 - 43:43
virtual machine on the left.
43:44 - 43:47
It can talk to the confidential GPU, and now it
43:47 - 43:51
can release keys to the GPU to decrypt the model
43:51 - 43:53
or to decrypt a prompt
43:54 - 43:56
and then re-encrypt the response going back to an application.
43:58 - 44:02
This actually is fleshing out now the introduction of confidential
44:02 - 44:06
accelerators, with confidential GPUs for AI confidentiality. And this has
44:07 - 44:10
got me really excited because there's a bunch of scenarios
44:10 - 44:14
that are just very obviously going to benefit from this.
44:14 - 44:15
One of them
44:16 - 44:17
is protecting the model weights,
44:18 - 44:20
and you might want to protect the model weights because
44:20 - 44:22
there's a ton of your IP in the model weights,
44:23 - 44:26
so you don't want them to leak, and confidential computing
44:26 - 44:28
can provide another layer of protection around them.
44:29 - 44:30
A bit more generally,
44:31 - 44:34
When it comes to AI, the data that AI models
44:34 - 44:37
process is extremely sensitive in some cases,
44:38 - 44:41
and what this allows is for you to protect that
44:41 - 44:42
data end to end.
44:43 - 44:45
For things like fine tuning, the data that you're going
44:45 - 44:47
to fine-tune the model on can be protected and
44:47 - 44:48
given only to the GPU.
44:49 - 44:51
The data that you give as your prompt and the
44:51 - 44:54
response you get back can be protected with confidential computing.
44:56 - 44:57
To give you one example that
44:57 - 44:58
got everybody very
44:58 - 45:04
excited: confidential speech translation and speech transcription.
45:05 - 45:09
Speech is incredibly sensitive for a lot of enterprises, and
45:09 - 45:13
with this you'll be able to send encrypted speech in and
45:14 - 45:18
get a transcription back that's encrypted, and nothing
45:18 - 45:20
can see it from the point that you send it
45:20 - 45:23
in to the point you get it back out.
45:23 - 45:24
Other than the GPU.
45:26 - 45:30
And then finally, a really exciting scenario is confidential multiparty
45:30 - 45:34
sharing. Different parties get together and share their data.
45:35 - 45:37
But without each of them being able to see the
45:37 - 45:40
others' data, because they're sharing it really with
45:41 - 45:44
the AI model and the GPU, not with each other.
45:45 - 45:47
So I want to show you a really quick demo
45:47 - 45:47
that kind of
45:47 - 45:50
shows you a nuts and bolts view
45:50 - 45:53
of confidential retrieval augmented generation. So I think all of
45:53 - 45:56
you know what retrieval augmented generation is: it's
45:56 - 45:59
when you give a model some information that it wasn't
45:59 - 46:01
trained on into its context so it can answer questions about it.
46:02 - 46:05
And the idea here is that the RAG data can
46:05 - 46:08
be very sensitive. So I'm going to show you this.
46:09 - 46:13
Here's a website that uses confidential computing attestation to decide
46:13 - 46:14
whether I trust the site.
46:15 - 46:19
And you can see that the attestation information for this
46:19 - 46:19
RAG website here.
46:20 - 46:23
This is a website that we're connecting to, and that
46:23 - 46:27
was its attestation from the GPU and the CPU. I'm
46:27 - 46:28
deciding I trust it.
46:29 - 46:31
I can examine the attestation report, see that it's got
46:31 - 46:33
the right versions of the hardware that I like, the
46:33 - 46:35
right versions of the firmware,
46:36 - 46:39
and then I can decide to create a policy that
46:39 - 46:41
says, I trust this. And now when I query
46:42 - 46:45
the attestation of that website according to that policy, I
46:45 - 46:47
get a green and I can look and see this
46:47 - 46:50
thing conforms to the policies I set. So I trust
46:50 - 46:51
it with my data.
46:52 - 46:56
Now, if I ask this model what is confidential computing,
46:57 - 47:00
this model doesn't know about confidential computing unfortunately, and it's
47:01 - 47:02
going to say I don't know.
47:04 - 47:06
But now, this is a confidential RAG app running in
47:06 - 47:10
a container on a confidential virtual machine with a confidential
47:10 - 47:13
GPU. I can upload my very sensitive information. In this
47:14 - 47:16
case it's confidential computing documentation
47:18 - 47:20
that goes into a vector database which is sitting in the
47:20 - 47:23
confidential virtual machine. Same with embeddings there
47:24 - 47:25
and now I can ask it.
47:26 - 47:27
What is confidential computing? It knows.
47:28 - 47:31
The whole thing was protected from my client
47:32 - 47:34
side on the browser all the way into the GPU
47:34 - 47:38
and back, including the data. The document that I uploaded was
47:38 - 47:42
protected, so nothing outside of that hardware boundary could see it.
47:43 - 47:45
And so this, I mean, I think I see some
47:45 - 47:48
nods. You can see just how powerful this is when
47:49 - 47:52
we bring it to scale at providing just another level,
47:52 - 47:55
a massive jump in the level of data protection that
47:55 - 47:58
we can achieve. And by the way, this is just
47:58 - 48:01
showing if I tweak the version that's in my policy
48:02 - 48:04
now, I get, hey, the website doesn't conform to the
48:04 - 48:07
policy you specified because it's got the wrong version of
48:07 - 48:10
a particular piece of firmware. So I know now, ohh,
48:10 - 48:12
I don't trust this thing anymore. It's not the versions
48:12 - 48:13
that I trust.
48:15 - 48:17
Now I want to conclude. That kind of gives
48:17 - 48:20
you an overview of some of the innovation we've got
48:20 - 48:23
going on in Azure. And you're gonna see confidential services
48:23 - 48:26
coming. You're going to see more innovation in data center
48:26 - 48:30
efficiency and serving efficiency and hardware efficiency, the innovations that
48:30 - 48:33
we'll have coming with our next generation of Maya
48:33 - 48:36
accelerators. I want to just transition and give you a
48:36 - 48:39
little look at some of the research that I'm doing
48:39 - 48:41
as a conclusion, because I'm having a lot of fun
48:41 - 48:43
with I myself and this.
48:43 - 48:46
Is my own research which is getting my hands dirty.
48:46 - 48:49
With AI, Python, Pytorch, our own systems like Project Forge,
48:49 - 48:52
so that I have a really good understanding of the
48:52 - 48:56
technology underneath and the way that we deliver our technology
48:56 - 48:59
through things like Visual Studio Code and Copilot,
48:59 - 49:01
because I've been using Copilot a lot.
49:02 - 49:05
So one of the things that I did with another
49:05 - 49:08
researcher in Microsoft Research that works on the Phi team
49:08 - 49:11
is look at the question: can we have a model
49:11 - 49:12
forget things?
49:13 - 49:16
Let me set this up with why you would
49:16 - 49:18
want a model to forget something.
49:19 - 49:22
When you talk about these training jobs, even a few
49:22 - 49:25
days on 1,000 H100s is a
49:25 - 49:29
huge amount of money. And certainly if you're talking about
49:30 - 49:34
jobs that are even bigger or running longer, that's
49:34 - 49:38
a tremendous amount of money you're spending on training one
49:38 - 49:41
job. What happens if you find out that you trained
49:41 - 49:45
on data that is problematic? It could be problematic because
49:45 - 49:46
it's got PII in it:
49:47 - 49:50
you're leaking private information that you didn't realize was in
49:50 - 49:53
the training data. What happens if it's got copyrighted data
49:53 - 49:56
that you are unlicensed to train on? What happens if
49:56 - 49:59
it's got poisoned data that's gotten into your training set
49:59 - 50:00
and it's causing the model to misbehave?
50:01 - 50:04
Do you wanna go retrain from scratch? It would be
50:04 - 50:07
nice if you could just have the model forget those
50:07 - 50:09
things. And so that's what we set out to do.
50:10 - 50:12
Now it turns out we said, what can we have
50:12 - 50:15
a model forget that really gives us a challenge,
50:15 - 50:18
and where people can see that it really forgot
50:18 - 50:21
something? What do all LLMs know? What
50:21 - 50:25
topics do they know so clearly that it's obvious that
50:25 - 50:26
they really, deeply know them?
50:27 - 50:30
Well, that's gonna depend on how much of the data
50:30 - 50:32
they've seen as they train.
50:33 - 50:35
So we thought about it for a while, and then
50:35 - 50:37
we realized one of the things that all these models
50:37 - 50:39
seemed to know is the world of Harry Potter.
50:41 - 50:44
They know Harry Potter because there's so much Harry Potter.
50:44 - 50:46
The Harry Potter books are all over the web.
50:47 - 50:50
And so these models that are just scraping the web,
50:50 - 50:53
sucking a whole bunch of copies of Harry Potter into
50:53 - 50:56
their training data, they see so many copies of it
50:56 - 51:00
that they almost memorize, if not outright memorize, Harry Potter. So
51:00 - 51:02
we're like, if we can get these models to forget
51:02 - 51:05
Harry Potter, that would be really cool.
51:06 - 51:08
And so we set out to do that, and we
51:08 - 51:11
succeeded. And so this is a research paper that we wrote. You can
51:11 - 51:14
find it on arXiv: "Who's Harry Potter?"
51:15 - 51:17
I want to show you a quick demo of our
51:17 - 51:19
Who's Harry Potter model and how it works.
51:20 - 51:23
So on the left side you can see Llama 7B,
51:25 - 51:28
and you can see when Harry went back to class,
51:28 - 51:31
he saw that his best friends... What do you think
51:31 - 51:33
it's gonna say? All we said is Harry, in class.
51:35 - 51:37
And it says: to see Ron and Hermione. Think
51:38 - 51:39
about that for a second.
51:40 - 51:43
Harry in school: Harry Potter.
51:44 - 51:45
Like there's no other Harry in the world.
51:49 - 51:52
So they really, really know and like to talk about
51:52 - 51:55
Harry Potter. So here's the unlearned one. And it says
51:56 - 51:58
instead: saw his friends Sarah and Emily.
51:59 - 52:01
And it's very consistent too with that generation. So this
52:01 - 52:03
was the key. How do we make it forget but
52:03 - 52:06
yet be very natural and not degrade its performance otherwise?
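For the curious, the rough mechanism in the paper is: further train a "reinforced" copy of the model on the content you want forgotten, use the gap between its logits and the baseline's to find the tokens that content boosts, and then fine-tune the baseline toward "generic" alternative predictions built from that gap. A heavily simplified sketch in PyTorch follows; the alpha value and tensor shapes are illustrative, not the paper's exact recipe.

    # Sketch of building "generic" target distributions for unlearning.
    import torch
    import torch.nn.functional as F

    def generic_targets(baseline_logits: torch.Tensor,
                        reinforced_logits: torch.Tensor,
                        alpha: float = 5.0) -> torch.Tensor:
        # Tokens the reinforced model likes more than the baseline are presumed
        # to carry the unwanted knowledge; push their logits back down.
        boost = torch.relu(reinforced_logits - baseline_logits)
        return F.softmax(baseline_logits - alpha * boost, dim=-1)

    # Toy shapes: 8 token positions over a 32,000-token vocabulary.
    base = torch.randn(8, 32000)
    reinf = base + torch.randn(8, 32000).clamp(min=0)
    targets = generic_targets(base, reinf)   # per-position distributions to fine-tune toward
    print(targets.shape)

The unlearning fine-tune then minimizes cross-entropy between the baseline model's predictions on the target text and these generic distributions, which is why the replacement continuations come out sounding natural instead of leaving obvious holes.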
52:08 - 52:08
So that's
52:09 - 52:12
unlearning. And now the second thing we're
52:12 - 52:14
working on is how we can take it and make it
52:14 - 52:18
forget particular topics, like particular words or particular concepts.
52:18 - 52:21
So we're continuing down this line of research, and the
52:21 - 52:22
latest thing we are doing is,
52:25 - 52:27
for example, showing how we can make it forget,
52:28 - 52:29
like, profanity.
52:29 - 52:33
So this is the Mistral model, which isn't
52:33 - 52:35
ashamed to swear; it just gives a warning.
52:36 - 52:38
And we're gonna ask it: write a rant about inflation
52:38 - 52:39
filled with profanity.
52:41 - 52:43
And we blurred this.
52:46 - 52:47
But you can see that it's like, sure, here you
52:47 - 52:49
go. And then we say, what are the top five
52:49 - 52:51
most profane words in the English language?
52:54 - 52:56
I don't know which those words are.
52:57 - 53:00
Now here's the forget version. So we want to, again, make
53:00 - 53:02
it very natural. Like, how do you make it not
53:02 - 53:05
talk about profanity but sound natural? Here's a rant,
53:06 - 53:06
the same prompt.
53:09 - 53:11
And this is actually using the model itself to
53:11 - 53:14
figure out what it should say instead; that is the key
53:14 - 53:17
to this research. And then here's the top most profane words.
53:18 - 53:20
And by the way, the first one we didn't put
53:20 - 53:23
in the dictionary, so we were like, that word is,
53:23 - 53:25
yeah, not that bad, alright. But you can see, you
53:25 - 53:27
get the idea. Anyway, so I've been having a lot
53:27 - 53:29
of fun and I can tell you, let me just
53:29 - 53:32
sum it up with one thing: Copilot has been indispensable.
53:32 - 53:36
I've become an expert in Python and PyTorch.
53:36 - 53:39
I can do stuff so fast with Python and
53:39 - 53:41
PyTorch because of Copilot.
53:42 - 53:46
My brain is really lazy. So if you took Copilot away,
53:47 - 53:50
I'm still like a Python noob: have
53:50 - 53:53
me do a list comprehension and I'm looking at the
53:53 - 53:56
documentation if I don't have Copilot. But
53:56 - 53:59
I've gotten to the point where I just have
53:59 - 54:02
AI generate all of that kind of
54:02 - 54:04
code for me. I have it give me advice on
54:04 - 54:05
how to optimize code.
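The kind of boilerplate being described is, for example, a one-liner like this (a made-up example, not from his project), which an assistant autocompletes instantly but sends an occasional Python user back to the docs:

    values = [3, 4, 7, 10, 11]
    even_squares = [v * v for v in values if v % 2 == 0]   # squares of the even numbers -> [16, 100]
    print(even_squares)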
54:07 - 54:09
So this has been a huge boon to my productivity.
54:09 - 54:11
I'm doing this kind of stuff as
54:11 - 54:14
side projects that, if I didn't have Copilot, I just
54:14 - 54:16
would not have the bandwidth to do.
54:17 - 54:20
That said, if you're a software developer, you have
54:20 - 54:22
nothing to fear, at least for the near future,
54:23 - 54:25
because these models, while they're great at that
54:26 - 54:30
kind of help and autocomplete, and "here's a short little
54:30 - 54:32
function," for more complicated, nuanced work
54:33 - 54:36
just aren't there yet. And so there have been
54:36 - 54:40
many rounds where there's bugs and even flagrant things like:
54:40 - 54:42
you just wrote me a function where you use a
54:42 - 54:45
variable you didn't define. It's "Oh, I'm sorry, here's the
54:45 - 54:48
revised version." All right, you still didn't define it.
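A hypothetical illustration of that failure mode (not actual Copilot output):

    # Assistant-generated function that happily uses a variable it never defined.
    def average_latency(samples):
        for s in samples:
            total += s          # NameError: 'total' was never initialized
        return total / len(samples)

    # The fix that still takes a human, or another round of prompting:
    def average_latency_fixed(samples):
        total = 0
        for s in samples:
            total += s
        return total / len(samples)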
54:50 - 54:52
We're still at that kind of level. But, you
54:52 - 54:55
know, like I said, I just can't live without
54:55 - 54:57
it. It is so powerful and they're just gonna get better.
54:58 - 55:00
I just want to leave you on that note on
55:00 - 55:03
use of AI for programming. And with that, I want to
55:03 - 55:06
wrap it up and hope you found this an interesting
55:06 - 55:09
tour of some of the innovations we've got, going from hardware to
55:09 - 55:11
data center design to the way that we manage the
55:12 - 55:16
infrastructure with micro and macro optimizations, to the future of confidential
55:16 - 55:18
computing, and then a look at some of the kind
55:18 - 55:21
of fun research that you can still do
55:22 - 55:23
with just a single
55:24 - 55:25
server of GPUs.
55:25 - 55:28
So this is kind of a really fun time to
55:28 - 55:31
be working on innovation at any level of the stack.
55:31 - 55:31
So with that.
55:31 - 55:33
Hope you have a great Build. Hope you have a
55:33 - 55:35
great party tonight. Hope to see you next year.