00:00 - 00:05
hey I'm Dave welcome to my shop this is
00:03 - 00:07
NVIDIA's Jetson Orin Nano Super and it's
00:05 - 00:09
an impressive Edge computer capable of
00:07 - 00:12
running DeepSeek R1 models right on the
00:09 - 00:14
device it's got 1024 CUDA cores 32
00:12 - 00:18
Tensor Cores 8 GB of
00:14 - 00:20
LPDDR5 six Arm CPU cores SSD expansion
00:18 - 00:22
and much much more we're going to use it
00:20 - 00:24
today to flip the script on AI as I show
00:22 - 00:27
you how to run it locally on your own
00:24 - 00:30
desktop or on the Nano and then for
00:27 - 00:31
comparison I'll show you the massive 671
00:30 - 00:34
billion parameter version running
00:31 - 00:36
uncorked on a top-end Threadripper you
00:34 - 00:37
see when it comes to AI most of us are
00:36 - 00:39
used to asking questions in a web
00:37 - 00:41
browser window and then waiting for the
00:39 - 00:43
cloud to do its thing but what if you
00:41 - 00:44
didn't need the cloud at all what if you
00:43 - 00:46
could ask the same questions and get
00:44 - 00:49
answers from an AI running right on your
00:46 - 00:51
desk in the privacy of your home lab or
00:49 - 00:54
maybe even your own garage that's where
00:51 - 00:56
DeepSeek R1 comes in it's a next-gen
00:54 - 00:58
conversational AI that unlike its cloud-locked
00:56 - 01:00
cousins can be self-hosted at
00:58 - 01:02
home the advantages of this are clear
01:00 - 01:04
once you think about them you get full
01:02 - 01:06
control over your data privacy isn't
01:04 - 01:07
somebody else's problem and you avoid
01:06 - 01:10
the recurring subscription fees that
01:07 - 01:12
many services charge and perhaps best of
01:10 - 01:14
all it can be just plain faster or at
01:12 - 01:16
least more responsive when you're not at
01:14 - 01:18
the mercy of server latency or network
01:16 - 01:19
outages you've suddenly got yourself an AI
01:18 - 01:22
assistant that's truly yours no
01:19 - 01:23
middleman required and if like me you're
01:22 - 01:26
working on something that has complex
01:23 - 01:27
code that requires a large context window
01:26 - 01:29
you won't burn through your OpenAI
01:27 - 01:31
subscription meter quite as quickly now
01:29 - 01:33
the specs on this little guy as I said
01:31 - 01:38
earlier are pretty impressive 1024 CUDA
01:33 - 01:41
cores 32 Tensor Cores six CPU cores 8 GB
01:38 - 01:43
of RAM uh what do we got in here 1 TB
01:41 - 01:45
SSD as it's configured but what does all
01:43 - 01:46
this mean in Practical terms well it's a
01:45 - 01:48
bit like having the brain of a
01:46 - 01:50
workstation GPU packed into something
01:48 - 01:52
small enough to well almost fit in the
01:50 - 01:53
palm of your hand but the kicker here is
01:52 - 01:55
that it's specifically tuned for AI
01:53 - 01:58
workloads that makes it the perfect
01:55 - 02:00
platform for DeepSeek R1 an AI model that
01:58 - 02:03
thrives on edge Hardware so let's talk
02:00 - 02:04
setup to get DeepSeek R1 up and running
02:03 - 02:06
at home we're using a program called
02:04 - 02:08
Ollama if you're not familiar with it
02:06 - 02:09
think of it like a streamlined
02:08 - 02:12
deployment tool for large language
02:09 - 02:14
models you run Ollama and it downloads and
02:12 - 02:16
runs the models for you Ollama simplifies
02:14 - 02:18
the process of downloading setting up
02:16 - 02:20
and configuring AI models without
02:18 - 02:22
needing to be a wizard or really even
02:20 - 02:23
know that much about them I know some of
02:22 - 02:25
you probably love the command line as
02:23 - 02:26
much as I do especially if you grew up
02:25 - 02:28
compiling kernels on machines that
02:26 - 02:30
couldn't load web pages yet but trust me
02:28 - 02:32
when I say that Ollama does make life a lot
02:30 - 02:34
easier you'll be up and running in
02:32 - 02:36
minutes not hours and the good news is
02:34 - 02:37
you can still use the command line to
02:36 - 02:40
operate it if you prefer once it's
02:37 - 02:41
running I'll set Ollama up on the Orin
02:40 - 02:44
Nano just as you might on your own
02:41 - 02:45
desktop PC and we'll use it the same way
02:44 - 02:48
so everything we cover here works with
02:45 - 02:49
your own desktop GPU as well the
02:48 - 02:51
installation is straightforward and once
02:49 - 02:54
it's done we'll pull the DeepSeek R1
02:51 - 02:58
model down from Ollama's catalog we use the
02:54 - 03:02
following command ollama pull deepseek-r1:1.5b
02:58 - 03:03
and yes this step does
03:02 - 03:04
require an internet connection but
03:03 - 03:06
here's the beauty of it once the model
03:04 - 03:08
is then downloaded you're done with the
03:06 - 03:10
web you could pull the cable everything
03:08 - 03:12
after that is completely local
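For anyone following along, here's a minimal sketch of that setup on the Orin Nano or any Linux desktop, assuming you're using Ollama's standard Linux install script and the same model tag we just pulled:

# Install Ollama (official Linux convenience script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the 1.5 billion parameter DeepSeek R1 model from Ollama's catalog
ollama pull deepseek-r1:1.5b

# Start an interactive chat session with the model, fully offline from here on
ollama run deepseek-r1:1.5b

Once the pull completes the weights live on your SSD, and every later run works without a network connection.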
03:10 - 03:14
and why does this matter well for one privacy
03:12 - 03:16
when you run DeepSeek R1 locally your
03:14 - 03:18
queries and data never leave your
03:16 - 03:20
machine if you've ever hesitated to ask
03:18 - 03:22
a sensitive question to a cloud-based AI
03:20 - 03:24
you're not alone the idea that your
03:22 - 03:26
inquiries might live on forever in some
03:24 - 03:28
Far Away server or state that represents
03:26 - 03:30
you can be a bit unnerving with DeepSeek
03:28 - 03:32
R1 what you ask stays right where
03:30 - 03:34
you ask it on the Jetson Nano sitting on
03:32 - 03:36
your desk but privacy isn't the only win
03:34 - 03:38
here there's something satisfying about
03:36 - 03:40
the idea of self-hosting it's the same
03:38 - 03:42
appeal that Drew many of us into running
03:40 - 03:43
our own web servers back in the day I
03:42 - 03:45
mean I didn't need to be running
03:43 - 03:47
Exchange Server at home for my email but
03:45 - 03:49
I was and running deep seek R1 locally
03:47 - 03:51
scratches that same kind of itch it's a
03:49 - 03:53
project that you control and there's a
03:51 - 03:54
sense of ownership that comes with that
03:53 - 03:56
plus you get the added benefit of
03:54 - 03:58
knowing that your setup can run even
03:56 - 04:00
when your internet connection doesn't
03:58 - 04:02
once Ollama is installed and the model is
04:00 - 04:03
loaded running queries is as simple as
04:02 - 04:05
opening a terminal or connecting to its
04:03 - 04:07
web interface you can input your
04:05 - 04:09
questions just like you would any other
04:07 - 04:11
AI chatbot and the responses come back
04:09 - 04:12
in near real time assuming you're not
04:11 - 04:15
asking it to write the Great American
04:12 - 04:17
novel or do innovative fluid dynamics
04:15 - 04:18
simulations
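For reference, querying it from a terminal or from a script looks roughly like this; the REST call below assumes Ollama's default local API port of 11434:

# One-off question from the command line
ollama run deepseek-r1:1.5b "Why are no two snowflakes exactly alike?"

# The same question over the local REST API, usable from any script or app on the machine
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Why are no two snowflakes exactly alike?",
  "stream": false
}'

Either way the request never leaves the box.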
04:17 - 04:20
now this is also a reasoning model so it does think for a while
04:18 - 04:22
before it generates an answer but the
04:20 - 04:24
thinking is fast and starts immediately
04:22 - 04:25
the Jetson handles most
04:24 - 04:28
conversational queries with ease thanks
04:25 - 04:30
to its optimized tensor cores and GPU
04:28 - 04:31
compute capabilities
04:30 - 04:33
let's consider the Practical side of
04:31 - 04:36
things say you're working on a coding
04:33 - 04:37
project maybe something in Python or C++
04:36 - 04:40
now I've managed to burn through my OpenAI
04:37 - 04:42
monthly credits in just a few days by
04:40 - 04:44
iterating with the AI on a complex piece
04:42 - 04:46
of code because the longer the context
04:44 - 04:47
window gets the more resources it
04:46 - 04:50
consumes but if you're running it
04:47 - 04:51
locally you don't care you just want the
04:50 - 04:53
code it produces to work and you don't
04:51 - 04:55
want to be billed for it as it goes about
04:53 - 04:57
it and what about home automation
04:55 - 04:59
enthusiasts well this setup can serve as
04:57 - 05:01
the brains behind your smart home taking
04:59 - 05:03
voice commands analyzing sensor data
05:01 - 05:04
offering suggestions all without needing
05:03 - 05:07
to send a single byte of your
05:04 - 05:09
information to a Cloud Server imagine
05:07 - 05:11
asking your AI to analyze the security
05:09 - 05:14
footage to find a particular person all
05:11 - 05:15
handled locally and securely in a
05:14 - 05:17
previous video you might have seen how I
05:15 - 05:20
rigged the Orin Nano up to monitor the
05:17 - 05:22
feed for my own driveway it used PyTorch
05:20 - 05:24
and YOLO to watch for and announce as new
05:22 - 05:26
vehicles came and left and I think
05:24 - 05:28
that's a killer feature of the Nano it's
05:26 - 05:30
small but it's not a toy it's got the
05:28 - 05:32
hardware to do real work and it does it
05:30 - 05:34
admirably well of course the Jetson Orin
05:32 - 05:36
Nano isn't the only hardware capable of
05:34 - 05:38
running DeepSeek R1 but it's arguably
05:36 - 05:40
one of the most cost-effective options
05:38 - 05:42
for its level of performance there's no
05:40 - 05:45
need to invest thousands into Enterprise
05:42 - 05:47
grade GPUs or cloud credits because for
05:45 - 05:48
under 250 bucks you've got a system that's
05:47 - 05:50
powerful enough for most personal AI
05:48 - 05:52
workloads and flexible enough to handle
05:50 - 05:55
a variety of projects Beyond just
05:52 - 05:56
chat-based queries and because the
05:55 - 05:57
Jetson series is designed for edge
05:56 - 05:59
computing it's also well suited for
05:57 - 06:01
mobile or embedded use cases meaning you
05:59 - 06:04
could deploy it in everything from
06:01 - 06:05
robots to custom IoT devices but at this
06:04 - 06:07
point you might be wondering what's the
06:05 - 06:09
catch well honestly there isn't much of
06:07 - 06:11
one sure there are limitations to
06:09 - 06:12
running AI models locally you're
06:11 - 06:14
constrained by the hardware and you're
06:12 - 06:16
not going to train a large language
06:14 - 06:19
model on the Jetson Nano but that's not
06:16 - 06:21
the point here for inference to actually
06:19 - 06:22
use the AI to generate answers the
06:21 - 06:25
Jetson Nano punches well above its
06:22 - 06:27
weight to prove that point let's start
06:25 - 06:29
with the smallest model with only 1.5
06:27 - 06:31
billion parameters I'll ask it a simple
06:29 - 06:33
science question like why no two
06:31 - 06:35
snowflakes are apparently alike and see
06:33 - 06:36
what it comes up with it processes the
06:35 - 06:38
prompt and begins thinking almost
06:36 - 06:40
immediately in what appears to be less
06:38 - 06:42
than 1 second it then goes into its
06:40 - 06:44
reasoning phase because you see DeepSeek
06:42 - 06:46
is not just a regular large
06:44 - 06:48
language model but a reasoning model a
06:46 - 06:50
reasoning model is a type of AI system
06:48 - 06:52
specifically designed to go beyond
06:50 - 06:54
surface level responses and to provide
06:52 - 06:56
conclusions based on deeper contextual
06:54 - 06:58
understanding and logical deductions
06:56 - 07:00
unlike traditional large language models
06:58 - 07:01
which focus on predicting the next word or
07:00 - 07:04
token based on patterns found in
07:01 - 07:06
massive data sets reasoning models are
07:04 - 07:08
engineered to evaluate facts consider
07:06 - 07:10
possible outcomes and synthesize answers
07:08 - 07:12
that demonstrate a level of structured
07:10 - 07:14
thought and here's where DeepSeek R1
07:12 - 07:16
stands apart it's not just regurgitating
07:14 - 07:18
patterns from its training data that it
07:16 - 07:20
saw on the web somewhere it's capable of
07:18 - 07:22
understanding the relationships between
07:20 - 07:24
Concepts and applying deductive or
07:22 - 07:26
inductive or abductive reasoning
07:24 - 07:28
processes deductive reasoning works by
07:26 - 07:31
applying general rules to specific cases
07:28 - 07:33
such as all humans are mortal Socrates
07:31 - 07:36
is a human and therefore Socrates is
07:33 - 07:38
Mortal inductive reasoning generalizes
07:36 - 07:40
based on observations for example I've
07:38 - 07:43
seen many swans and they've always been
07:40 - 07:44
white therefore swans are likely white
07:43 - 07:46
abductive reasoning deals with the best
07:44 - 07:48
explanation given the evidence often
07:46 - 07:50
used in scenarios where multiple
07:48 - 07:53
hypotheses could explain an observation
07:50 - 07:54
DeepSeek as a reasoning model handles
07:53 - 07:57
queries by considering how multiple
07:54 - 07:58
pieces of information relate and whether
07:57 - 08:01
a given response fits logically within
07:58 - 08:03
the presented context for example if you
08:01 - 08:05
asked a reasoning model to explain why a
08:03 - 08:07
system might be overheating it wouldn't
08:05 - 08:09
just list common causes from the
08:07 - 08:11
training data instead it would evaluate
08:09 - 08:13
context-specific variables like airflow
08:11 - 08:16
or component specs or recent system
08:13 - 08:18
behavior and produce a well-thought-out
08:16 - 08:20
diagnosis this is a significant Leap
08:18 - 08:23
Forward for self-hosted AI a reasoning
08:20 - 08:25
model like DeepSeek on local hardware
08:23 - 08:26
doesn't just save bandwidth it brings
08:25 - 08:28
meaningful decision-making directly to
08:26 - 08:30
your machine making it perfect for
08:28 - 08:32
environments where privacy latency or
08:30 - 08:35
cost are critical whether you're
08:32 - 08:37
analyzing system logs making predictions
08:35 - 08:39
or solving complex problems a reasoning
08:37 - 08:41
model adds the structured thinking that
08:39 - 08:43
large language models otherwise
08:41 - 08:45
sometimes Overlook with the smallest
08:43 - 08:47
model the 1.5 billion parameter model we
08:45 - 08:49
saw a performance of about 32 tokens per
08:47 - 08:51
second which is fast enough for almost
08:49 - 08:53
all interactive purposes that I can
08:51 - 08:55
think of at least once the thinking part
08:53 - 08:57
is over
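If you want to reproduce these throughput numbers on your own hardware, Ollama will report them itself; the --verbose flag prints timing statistics after each response, including an eval rate in tokens per second:

# Ask a question and print timing stats (look for the eval rate line) when it finishes
ollama run --verbose deepseek-r1:1.5b "Explain why no two snowflakes are alike."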
08:55 - 08:59
if we step up to the next larger model which is a 7 billion parameter
08:57 - 09:01
model we find that it can produce
08:59 - 09:03
reasoning at a rate of about 12 tokens
09:01 - 09:04
per second that's a fair bit slower than
09:03 - 09:06
the smallest model but it's still
09:04 - 09:07
reasonable performance akin to what
09:06 - 09:10
you're going to experience in the cloud
09:07 - 09:12
at least for Speed I find that it's also
09:10 - 09:14
just slightly slower than my reading
09:12 - 09:16
speed so I can read its line of thinking
09:14 - 09:18
in that model at about the rate that it
09:16 - 09:19
produces it and it's all still local and
09:18 - 09:21
it's all still running on affordable
09:19 - 09:22
Hardware we could just keep working our
09:21 - 09:23
way up the food chain until we couldn't
09:22 - 09:25
load one of the models and that's
09:23 - 09:26
precisely what I did but I won't make
09:25 - 09:28
you watch me load and test them all
09:26 - 09:30
because anything bigger than 8 GB is
09:28 - 09:32
not going to fit into memory and that limits
09:30 - 09:34
us to about the 7 billion parameter
09:32 - 09:36
model size if we want to run a larger
09:34 - 09:37
model then we're going to have to leave
09:36 - 09:39
the Orin Nano behind for a moment and
09:37 - 09:41
break out another one of nvidia's big
09:39 - 09:44
party tricks this one in the form of an
09:41 - 09:47
RTX 6000 Ada GPU which can still push
09:44 - 09:52
$10,000 on the retail market with its
09:47 - 09:55
48 GB of GDDR6 18,176 CUDA cores and 91
09:52 - 09:57
teraflops of floating point performance
09:55 - 10:00
we'll pair it with a CPU of a similar
09:57 - 10:03
price the AMD Threadripper 7995WX and then
10:00 - 10:05
throw in 512 GB of RAM to make sure that
10:03 - 10:07
there is room for even the largest of the
10:05 - 10:09
large models and we're going to need it
10:07 - 10:12
because the largest DeepSeek R1 model has
10:09 - 10:16
671 billion parameters now thankfully
10:12 - 10:17
I'm on 5 gigabit fiber because it's 404 GB to
10:16 - 10:19
download and it's still a lengthy
10:17 - 10:21
download though only about 20 minutes I
10:19 - 10:23
think I recall it being
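For completeness, pulling the full-size model works exactly like the small ones; the tag below is the one I believe Ollama's library uses for the 671 billion parameter build, so double-check the listing before you commit the disk space:

# Pull the full 671B DeepSeek R1 model; budget hundreds of gigabytes of disk and a long download
ollama pull deepseek-r1:671b

# See which models are on disk and how large each one is
ollama list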
10:21 - 10:25
but even once you have the model downloaded verifying
10:23 - 10:27
the hash will take many minutes as will
10:25 - 10:29
simply loading the model each time you
10:27 - 10:33
go to start it after all the model
10:29 - 10:35
is 404 GB and if your SSD manages 4 GB
10:33 - 10:37
per second in sustained reads that's
10:35 - 10:38
still 100 seconds minimum to load that
10:37 - 10:40
much data and since it's not perfectly
10:38 - 10:43
efficient you're realistically looking
10:40 - 10:44
at a couple of minutes to load the model
10:43 - 10:47
once it loads though it works fine and
10:44 - 10:49
has impressive reasoning skills in fact
10:47 - 10:50
on the now famous performance slide
10:49 - 10:52
that's been making the rounds with DeepSeek
10:50 - 10:54
you can see that it even bests
10:52 - 10:57
ChatGPT's o1 in some tasks and effectively
10:54 - 10:58
equals it on the remainder the
10:57 - 11:00
performance however does leave something
10:58 - 11:02
to be desired in terms of real-time
11:00 - 11:04
interaction even with this Mighty
11:02 - 11:05
Hardware that we've brought to the task
11:04 - 11:08
the system manages a best of only
11:05 - 11:10
about four tokens per second I also
11:08 - 11:12
found that on Windows Ollama isn't great
11:10 - 11:14
about taking advantage of all your CPU
11:12 - 11:15
cores at least if you have more than 64
11:14 - 11:17
of them if you do it's important to
11:15 - 11:18
issue a command in the interpreter to
11:17 - 11:21
set the maximum number of threads to
11:18 - 11:23
match your CPU and that way it will take
11:21 - 11:25
advantage of all your cores see the
11:23 - 11:27
video description
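The command in question is Ollama's interactive /set parameter command; as a rough sketch (assuming the interactive session is already running and your chip has 96 cores like the 7995WX), it looks like this:

# Inside an interactive ollama run session, set the worker thread count to match your core count
/set parameter num_thread 96

Adjust the number to your own physical core count; this mainly matters for the big models that spill out of VRAM and run largely on the CPU.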
11:25 - 11:29
on the Threadripper the CPU is pegged at 100% but with the
11:27 - 11:31
smaller models though more of it runs on
11:29 - 11:34
the GPU and you'll see your GPU loads
11:31 - 11:36
approaching 100%
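If you want to watch that CPU/GPU split yourself, the standard tools are enough; nothing here is specific to DeepSeek:

# Poll GPU utilization and memory use once a second on the desktop
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

# Ask Ollama how much of the loaded model is resident on the GPU versus the CPU
ollama ps

# On the Jetson, tegrastats gives a similar rolling readout of GPU and CPU load
tegrastats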
11:34 - 11:37
and now for one last trick the smallest model and the fastest
11:36 - 11:40
Hardware just so we can see how many
11:37 - 11:41
tokens per second that it can generate
11:40 - 11:43
I'll ask DeepSeek to tell me a long and
11:41 - 11:46
interesting story so it'll spend some time
11:43 - 11:48
thinking and as it does we see a GPU
11:46 - 11:51
load of 100% And this time it's in
11:48 - 11:53
contrast to a largely idle CPU and when
11:51 - 11:55
running the 1.5 billion parameter model
11:53 - 11:59
the big RTX 6000 cranks out an
11:55 - 12:01
impressive 233 tokens per second if
11:59 - 12:02
you've enjoyed today's little foray into
12:01 - 12:04
deep seek on both ends of the hardware
12:02 - 12:05
Spectrum remember I'm mostly in this for
12:04 - 12:07
the subs and likes so I'd be honored if
12:05 - 12:09
you consider subscribing to my channel
12:07 - 12:11
to get more like it and if you're
12:09 - 12:12
already subscribed thank you don't
12:11 - 12:14
forget to turn on the Bell icon leave a
12:12 - 12:15
like on the video and maybe click on
12:14 - 12:17
share to send it to a friend who might
12:15 - 12:19
also be interested I always appreciate
12:17 - 12:21
any organic efforts to hack the YouTube
12:19 - 12:22
algorithm as that other guy likes to say
12:21 - 12:24
and if you have any interest in matters
12:22 - 12:26
related to the autism spectrum please be
12:24 - 12:28
sure to check out the sample of my book
12:26 - 12:29
on Amazon Link in the video description
12:28 - 12:31
it's everything I know now about living
12:29 - 12:34
your best life on the spectrum that I
12:31 - 12:35
wish I'd known years ago in the meantime
12:34 - 12:38
and in between time hope to see you next
12:35 - 12:41
time right here in Dave's Garage hello
12:38 - 12:43
my baby hello my honey hello my ragtime