00:00 - 00:05
hey I'm Dave welcome to my shop this is
00:03 - 00:07
NVIDIA's Jetson Orin Nano Super and it's
00:05 - 00:09
an impressive Edge computer capable of
00:07 - 00:12
running DeepSeek R1 models right on the
00:09 - 00:14
device it's got 1024 CUDA cores 32
00:12 - 00:18
Tensor Cores 8 GB of
00:14 - 00:20
LPDDR5 six Arm CPU cores SSD expansion
00:18 - 00:22
and much much more we're going to use it
00:20 - 00:24
today to flip the script on AI as I show
00:22 - 00:27
you how to run it locally on your own
00:24 - 00:30
desktop or on the Nano and then for
00:27 - 00:31
comparison I'll show you the massive 671
00:30 - 00:34
billion parameter version running
00:31 - 00:36
uncorked on a top-end Threadripper you
00:34 - 00:37
see when it comes to AI most of us are
00:36 - 00:39
used to asking questions in a web
00:37 - 00:41
browser window and then waiting for the
00:39 - 00:43
cloud to do its thing but what if you
00:41 - 00:44
didn't need the cloud at all what if you
00:43 - 00:46
could ask the same questions and get
00:44 - 00:49
answers from an AI running right on your
00:46 - 00:51
desk in the privacy of your home lab or
00:49 - 00:54
maybe even your own garage that's where
00:51 - 00:56
DeepSeek R1 comes in it's a next-gen
00:54 - 00:58
conversational AI that unlike its cloud-locked
00:56 - 01:00
cousins can be self-hosted at
00:58 - 01:02
home the advantages of this are clear
01:00 - 01:04
once you think about them you get full
01:02 - 01:06
control over your data privacy isn't
01:04 - 01:07
somebody else's problem and you avoid
01:06 - 01:10
the recurring subscription fees that
01:07 - 01:12
many services charge and perhaps best of
01:10 - 01:14
all it can be just plain faster or at
01:12 - 01:16
least more responsive when you're not at
01:14 - 01:18
the mercy of server latency or network
01:16 - 01:19
outages you've suddenly got yourself an AI
01:18 - 01:22
assistant that's truly yours no
01:19 - 01:23
middleman required and if like me you're
01:22 - 01:26
working on something that has complex
01:23 - 01:27
code that requires a large context window
01:26 - 01:29
you won't burn through your OpenAI
01:27 - 01:31
subscription meter quite as quickly now
01:29 - 01:33
the specs on this little guy as I said
01:31 - 01:38
earlier are pretty impressive 1024 CUDA
01:33 - 01:41
cores 32 Tensor Cores six CPU cores 8 GB
01:38 - 01:43
of RAM uh what do we got in here 1 TB
01:41 - 01:45
SSD as it's configured but what does all
01:43 - 01:46
this mean in Practical terms well it's a
01:45 - 01:48
bit like having the brain of a
01:46 - 01:50
workstation GPU packed into something
01:48 - 01:52
small enough to well almost fit in the
01:50 - 01:53
palm of your hand but the kicker here is
01:52 - 01:55
that it's specifically tuned for AI
01:53 - 01:58
workloads that makes it the perfect
01:55 - 02:00
platform for DeepSeek R1 an AI model that
01:58 - 02:03
thrives on edge Hardware so let's talk
02:00 - 02:04
setup to get DeepSeek R1 up and running
02:03 - 02:06
at home we're using a program called
02:04 - 02:08
Ollama if you're not familiar with it
02:06 - 02:09
think of it like a streamlined
02:08 - 02:12
deployment tool for large language
02:09 - 02:14
models you run Ollama and it downloads and
02:12 - 02:16
runs the models for you Ollama simplifies
02:14 - 02:18
the process of downloading setting up
02:16 - 02:20
and configuring AI models without
02:18 - 02:22
needing to be a wizard or really even
02:20 - 02:23
know that much about them I know some of
02:22 - 02:25
you probably love the command line as
02:23 - 02:26
much as I do especially if you grew up
02:25 - 02:28
compiling kernels on machines that
02:26 - 02:30
couldn't load web pages yet but trust me
02:28 - 02:32
when I say that Ollama does make life a lot
02:30 - 02:34
easier you'll be up and running in
02:32 - 02:36
minutes not hours and the good news is
02:34 - 02:37
you can still use the command line to
02:36 - 02:40
operate it if you prefer once it's
02:37 - 02:41
running I'll set Ollama up on the Orin
02:40 - 02:44
Nano just as you might on your own
02:41 - 02:45
desktop PC and we'll use it the same way
02:44 - 02:48
so everything we cover here works with
02:45 - 02:49
your own desktop GPU as well the
02:48 - 02:51
installation is straightforward and once
02:49 - 02:54
it's done we'll pull the DeepSeek R1
02:51 - 02:58
model down from Ollama's catalog we use the
02:54 - 03:02
following command ollama pull deepseek-r1:1.5b
02:58 - 03:03
and yes this step does
03:02 - 03:04
require an internet connection but
03:03 - 03:06
here's the beauty of it once the model
03:04 - 03:08
is then downloaded you're done with the
03:06 - 03:10
web you could pull the cable everything
03:08 - 03:12
after that is completely local
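For anyone following along, here's a minimal sketch of that setup on the Orin Nano or any Linux desktop, assuming you're using Ollama's standard Linux install script and the same model tag we just pulled:

# Install Ollama (official Linux convenience script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the 1.5 billion parameter DeepSeek R1 model from Ollama's catalog
ollama pull deepseek-r1:1.5b

# Start an interactive chat session with the model, fully offline from here on
ollama run deepseek-r1:1.5b

Once the pull completes the weights live on your SSD, and every later run works without a network connection.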
03:10 - 03:14
and why does this matter well for one privacy
03:12 - 03:16
when you run DeepSeek R1 locally your
03:14 - 03:18
queries and data never leave your
03:16 - 03:20
machine if you've ever hesitated to ask
03:18 - 03:22
a sensitive question to a cloud-based AI
03:20 - 03:24
you're not alone the idea that your
03:22 - 03:26
inquiries might live on forever in some
03:24 - 03:28
Far Away server or state that represents
03:26 - 03:30
you can be a bit unnerving with DeepSeek
03:28 - 03:32
R1 what you ask stays right where
03:30 - 03:34
you ask it on the Jetson Nano sitting on
03:32 - 03:36
your desk but privacy isn't the only win
03:34 - 03:38
here there's something satisfying about
03:36 - 03:40
the idea of self-hosting it's the same
03:38 - 03:42
appeal that Drew many of us into running
03:40 - 03:43
our own web servers back in the day I
03:42 - 03:45
mean I didn't need to be running
03:43 - 03:47
Exchange Server at home for my email but
03:45 - 03:49
I was and running deep seek R1 locally
03:47 - 03:51
scratches that same kind of itch it's a
03:49 - 03:53
project that you control and there's a
03:51 - 03:54
sense of ownership that comes with that
03:53 - 03:56
plus you get the added benefit of
03:54 - 03:58
knowing that your setup can run even
03:56 - 04:00
when your internet connection doesn't
03:58 - 04:02
once Ollama is installed and the model is
04:00 - 04:03
loaded running queries is as simple as
04:02 - 04:05
opening a terminal or connecting to its
04:03 - 04:07
web interface you can input your
04:05 - 04:09
questions just like you would any other
04:07 - 04:11
AI chatbot and the responses come back
04:09 - 04:12
in near real time assuming you're not
04:11 - 04:15
asking it to write the Great American
04:12 - 04:17
novel or do innovative fluid dynamics
04:15 - 04:18
simulations
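For reference, querying it from a terminal or from a script looks roughly like this; the REST call below assumes Ollama's default local API port of 11434:

# One-off question from the command line
ollama run deepseek-r1:1.5b "Why are no two snowflakes exactly alike?"

# The same question over the local REST API, usable from any script or app on the machine
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Why are no two snowflakes exactly alike?",
  "stream": false
}'

Either way the request never leaves the box.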
04:17 - 04:20
now this is also a reasoning model so it does think for a while
04:18 - 04:22
before it generates an answer but the
04:20 - 04:24
thinking is fast and starts immediately
04:22 - 04:25
the Jetson handles most
04:24 - 04:28
conversational queries with ease thanks
04:25 - 04:30
to its optimized tensor cores and GPU
04:28 - 04:31
compute capabilities
04:30 - 04:33
let's consider the Practical side of
04:31 - 04:36
things say you're working on a coding
04:33 - 04:37
project maybe something in Python or C++
04:36 - 04:40
now I've managed to burn through my OpenAI
04:37 - 04:42
monthly credits in just a few days by
04:40 - 04:44
iterating with the AI on a complex piece
04:42 - 04:46
of code because the longer the context
04:44 - 04:47
window gets the more resources it
04:46 - 04:50
consumes but if you're running it
04:47 - 04:51
locally you don't care you just want the
04:50 - 04:53
code it produces to work and you don't
04:51 - 04:55
want to be billed for it as it goes about
04:53 - 04:57
it and what about home automation
04:55 - 04:59
enthusiasts well this setup can serve as
04:57 - 05:01
the brains behind your smart home taking
04:59 - 05:03
voice commands analyzing sensor data
05:01 - 05:04
offering suggestions all without needing
05:03 - 05:07
to send a single byte of your
05:04 - 05:09
information to a Cloud Server imagine
05:07 - 05:11
asking your AI to analyze the security
05:09 - 05:14
footage to find a particular person all
05:11 - 05:15
handled locally and securely in a
05:14 - 05:17
previous video you might have seen how I
05:15 - 05:20
rigged the Orin Nano up to monitor the
05:17 - 05:22
feed for my own driveway it used PyTorch
05:20 - 05:24
and YOLO to watch for and announce as new
05:22 - 05:26
vehicles came and left and I think
05:24 - 05:28
that's a killer feature of the Nano it's
05:26 - 05:30
small but it's not a toy it's got the
05:28 - 05:32
hardware to do real work and it does it
05:30 - 05:34
admirably well of course the Jetson Orin
05:32 - 05:36
Nano isn't the only hardware capable of
05:34 - 05:38
running DeepSeek R1 but it's arguably
05:36 - 05:40
one of the most cost-effective options
05:38 - 05:42
for its level of performance there's no
05:40 - 05:45
need to invest thousands into Enterprise
05:42 - 05:47
grade GPUs or cloud credits because for
05:45 - 05:48
under 250 bucks you've got a system that's
05:47 - 05:50
powerful enough for most personal AI
05:48 - 05:52
workloads and flexible enough to handle
05:50 - 05:55
a variety of projects Beyond just
05:52 - 05:56
chat-based queries and because the
05:55 - 05:57
Jetson series is designed for edge
05:56 - 05:59
computing it's also well suited for
05:57 - 06:01
mobile or embedded use cases meaning you
05:59 - 06:04
could deploy it in everything from
06:01 - 06:05
robots to custom IoT devices but at this
06:04 - 06:07
point you might be wondering what's the
06:05 - 06:09
catch well honestly there isn't much of
06:07 - 06:11
one sure there are limitations to
06:09 - 06:12
running AI models locally you're
06:11 - 06:14
constrained by the hardware and you're
06:12 - 06:16
not going to train a large language
06:14 - 06:19
model on the Jetson Nano but that's not
06:16 - 06:21
the point here for inference to actually
06:19 - 06:22
use the AI to generate answers the
06:21 - 06:25
Jetson Nano punches well above its
06:22 - 06:27
weight to prove that point let's start
06:25 - 06:29
with the smallest model with only 1.5
06:27 - 06:31
billion parameters I'll ask it a simple
06:29 - 06:33
science question like why no two
06:31 - 06:35
snowflakes are apparently alike and see
06:33 - 06:36
what it comes up with it processes the
06:35 - 06:38
prompt and begins thinking almost
06:36 - 06:40
immediately in what appears to be less
06:38 - 06:42
than 1 second it then goes into its
06:40 - 06:44
reasoning phase because you see DeepSeek
06:42 - 06:46
is not just a regular large
06:44 - 06:48
language model but a reasoning model a
06:46 - 06:50
reasoning model is a type of AI system
06:48 - 06:52
specifically designed to go beyond
06:50 - 06:54
surface level responses and to provide
06:52 - 06:56
conclusions based on deeper contextual
06:54 - 06:58
understanding and logical deductions
06:56 - 07:00
unlike traditional large language models
06:58 - 07:01
which focus on predicting the next word or
07:00 - 07:04
token based on patterns found in
07:01 - 07:06
massive data sets reasoning models are
07:04 - 07:08
engineered to evaluate facts consider
07:06 - 07:10
possible outcomes and synthesize answers
07:08 - 07:12
that demonstrate a level of structured
07:10 - 07:14
thought and here's where DeepSeek R1
07:12 - 07:16
stands apart it's not just regurgitating
07:14 - 07:18
patterns from its training data that it
07:16 - 07:20
saw on the web somewhere it's capable of
07:18 - 07:22
understanding the relationships between
07:20 - 07:24
Concepts and applying deductive or
07:22 - 07:26
inductive or abductive reasoning
07:24 - 07:28
processes deductive reasoning works by
07:26 - 07:31
applying general rules to specific cases
07:28 - 07:33
such as all humans are mortal Socrates
07:31 - 07:36
is a human and therefore Socrates is
07:33 - 07:38
Mortal inductive reasoning generalizes
07:36 - 07:40
based on observations for example I've
07:38 - 07:43
seen many swans and they've always been
07:40 - 07:44
white therefore swans are likely white
07:43 - 07:46
abductive reasoning deals with the best
07:44 - 07:48
explanation given the evidence often
07:46 - 07:50
used in scenarios where multiple
07:48 - 07:53
hypotheses could explain an observation
07:50 - 07:54
DeepSeek as a reasoning model handles
07:53 - 07:57
queries by considering how multiple
07:54 - 07:58
pieces of information relate and whether
07:57 - 08:01
a given response fits logically within
07:58 - 08:03
the presented context for example if you
08:01 - 08:05
asked a reasoning model to explain why a
08:03 - 08:07
system might be overheating it wouldn't
08:05 - 08:09
just list common causes from the
08:07 - 08:11
training data instead it would evaluate
08:09 - 08:13
context-specific variables like airflow
08:11 - 08:16
or component specs or recent system
08:13 - 08:18
behavior and produce a well-thought-out
08:16 - 08:20
diagnosis this is a significant Leap
08:18 - 08:23
Forward for self-hosted AI a reasoning
08:20 - 08:25
model like DeepSeek on local hardware
08:23 - 08:26
doesn't just save bandwidth it brings
08:25 - 08:28
meaningful decision-making directly to
08:26 - 08:30
your machine making it perfect for
08:28 - 08:32
environments where privacy latency or
08:30 - 08:35
cost are critical whether you're
08:32 - 08:37
analyzing system logs making predictions
08:35 - 08:39
or solving complex problems a reasoning
08:37 - 08:41
model adds the structured thinking that
08:39 - 08:43
large language models otherwise
08:41 - 08:45
sometimes Overlook with the smallest
08:43 - 08:47
model the 1.5 billion parameter model we
08:45 - 08:49
saw a performance of about 32 tokens per
08:47 - 08:51
second which is fast enough for almost
08:49 - 08:53
all interactive purposes that I can
08:51 - 08:55
think of at least once the thinking part
08:53 - 08:57
is over
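If you want to reproduce these throughput numbers on your own hardware, Ollama will report them itself; the --verbose flag prints timing statistics after each response, including an eval rate in tokens per second:

# Ask a question and print timing stats (look for the eval rate line) when it finishes
ollama run --verbose deepseek-r1:1.5b "Explain why no two snowflakes are alike."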
08:55 - 08:59
if we step up to the next larger model which is a 7 billion parameter
08:57 - 09:01
model we find that it can produce
08:59 - 09:03
reasoning at a rate of about 12 tokens
09:01 - 09:04
per second that's a fair bit slower than
09:03 - 09:06
the smallest model but it's still
09:04 - 09:07
reasonable performance akin to what
09:06 - 09:10
you're going to experience in the cloud
09:07 - 09:12
at least for Speed I find that it's also
09:10 - 09:14
just slightly slower than my reading
09:12 - 09:16
speed so I can read its line of thinking
09:14 - 09:18
in that model at about the rate that it
09:16 - 09:19
produces it and it's all still local and
09:18 - 09:21
it's all still running on affordable
09:19 - 09:22
Hardware we could just keep working our
09:21 - 09:23
way up the food chain until we couldn't
09:22 - 09:25
load one of the models and that's
09:23 - 09:26
precisely what I did but I won't make
09:25 - 09:28
you watch me load and test them all
09:26 - 09:30
because anything bigger than 8 GB is
09:28 - 09:32
not going to fit into memory and that limits
09:30 - 09:34
us to about the 7 billion parameter
09:32 - 09:36
model size if we want to run a larger
09:34 - 09:37
model then we're going to have to leave
09:36 - 09:39
the Orin Nano behind for a moment and
09:37 - 09:41
break out another one of nvidia's big
09:39 - 09:44
party tricks this one in the form of an
09:41 - 09:47
RTX 6000 Ada GPU which can still push
09:44 - 09:52
$10,000 on the retail market with its
09:47 - 09:55
48 GB of GDDR6 18,176 CUDA cores and 91
09:52 - 09:57
teraflops of floating point performance
09:55 - 10:00
we'll pair it with a CPU of a similar
09:57 - 10:03
price the AMD Threadripper 7995WX and then
10:00 - 10:05
throw in 512 GB of RAM to make sure that
10:03 - 10:07
there is room for even the largest of the
10:05 - 10:09
large models and we're going to need it
10:07 - 10:12
because the largest DeepSeek R1 model has
10:09 - 10:16
671 billion parameters now thankfully
10:12 - 10:17
I'm on 5 gigabit fiber because it's 404 GB to
10:16 - 10:19
download and it's still a lengthy
10:17 - 10:21
download though only about 20 minutes I
10:19 - 10:23
think I recall it being
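For completeness, pulling the full-size model works exactly like the small ones; the tag below is the one I believe Ollama's library uses for the 671 billion parameter build, so double-check the listing before you commit the disk space:

# Pull the full 671B DeepSeek R1 model; budget hundreds of gigabytes of disk and a long download
ollama pull deepseek-r1:671b

# See which models are on disk and how large each one is
ollama list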
10:21 - 10:25
but even once you have the model downloaded verifying
10:23 - 10:27
the hash will take many minutes as will
10:25 - 10:29
simply loading the model each time you
10:27 - 10:33
go to start it after all the model
10:29 - 10:35
is 404 GB and if your SSD manages 4 GB
10:33 - 10:37
per second in sustained reads that's
10:35 - 10:38
still 100 seconds minimum to load that
10:37 - 10:40
much data and since it's not perfectly
10:38 - 10:43
efficient you're realistically looking
10:40 - 10:44
at a couple of minutes to load the model
10:43 - 10:47
once it loads though it works fine and
10:44 - 10:49
has impressive reasoning skills in fact
10:47 - 10:50
on the now famous performance slide
10:49 - 10:52
that's been making the rounds with DeepSeek
10:50 - 10:54
you can see that it even bests
10:52 - 10:57
ChatGPT's o1 in some tasks and effectively
10:54 - 10:58
equals it on the remainder the
10:57 - 11:00
performance however does leave something
10:58 - 11:02
to be desired in terms of real-time
11:00 - 11:04
interaction even with this Mighty
11:02 - 11:05
Hardware that we've brought to the task
11:04 - 11:08
the system manages a best of only
11:05 - 11:10
about four tokens per second I also
11:08 - 11:12
found that on Windows Ollama isn't great
11:10 - 11:14
about taking advantage of all your CPU
11:12 - 11:15
cores at least if you have more than 64
11:14 - 11:17
of them if you do it's important to
11:15 - 11:18
issue a command in the interpreter to
11:17 - 11:21
set the maximum number of threads to
11:18 - 11:23
match your CPU and that way it will take
11:21 - 11:25
advantage of all your cores see the
11:23 - 11:27
video description
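The command in question is Ollama's interactive /set parameter command; as a rough sketch (assuming the interactive session is already running and your chip has 96 cores like the 7995WX), it looks like this:

# Inside an interactive ollama run session, set the worker thread count to match your core count
/set parameter num_thread 96

Adjust the number to your own physical core count; this mainly matters for the big models that spill out of VRAM and run largely on the CPU.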
11:25 - 11:29
on the Threadripper the CPU is pegged at 100% but with the
11:27 - 11:31
smaller models though more of it runs on
11:29 - 11:34
the GPU and you'll see your GPU loads
11:31 - 11:36
approaching 100%
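If you want to watch that CPU/GPU split yourself, the standard tools are enough; nothing here is specific to DeepSeek:

# Poll GPU utilization and memory use once a second on the desktop
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Ask Ollama how much of the loaded model is resident on the GPU versus the CPU
ollama ps

# On the Jetson, tegrastats gives a similar rolling readout of GPU and CPU load
tegrastats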
11:34 - 11:37
and now for one last trick the smallest model and the fastest
11:36 - 11:40
Hardware just so we can see how many
11:37 - 11:41
tokens per second that it can generate
11:40 - 11:43
I'll ask DeepSeek to tell me a long and
11:41 - 11:46
interesting story so it'll spend some time
11:43 - 11:48
thinking and as it does we see a GPU
11:46 - 11:51
load of 100% And this time it's in
11:48 - 11:53
contrast to a largely idle CPU and when
11:51 - 11:55
running the 1.5 billion parameter model
11:53 - 11:59
the big RTX 6000 cranks out an
11:55 - 12:01
impressive 233 tokens per second if
11:59 - 12:02
you've enjoyed today's little foray into
12:01 - 12:04
deep seek on both ends of the hardware
12:02 - 12:05
Spectrum remember I'm mostly in this for
12:04 - 12:07
the subs and likes so I'd be honored if
12:05 - 12:09
you consider subscribing to my channel
12:07 - 12:11
to get more like it and if you're
12:09 - 12:12
already subscribed thank you don't
12:11 - 12:14
forget to turn on the Bell icon leave a
12:12 - 12:15
like on the video and maybe click on
12:14 - 12:17
share to send it to a friend who might
12:15 - 12:19
also be interested I always appreciate
12:17 - 12:21
any organic efforts to hack the YouTube
12:19 - 12:22
algorithm as that other guy likes to say
12:21 - 12:24
and if you have any interest in matters
12:22 - 12:26
related to the autism spectrum please be
12:24 - 12:28
sure to check out the sample of my book
12:26 - 12:29
on Amazon Link in the video description
12:28 - 12:31
it's everything I know now about living
12:29 - 12:34
your best life on the spectrum that I
12:31 - 12:35
wish I'd known years ago in the meantime
12:34 - 12:38
and in between time hope to see you next
12:35 - 12:41
time right here in Dave's Garage hello
12:38 - 12:43
my baby hello my honey hello my ragtime