Welcome back! This is another video of our introduction. In the previous video we talked about Docker and why it's useful for data engineers, and we took a look at how we can create a simple data pipeline and put it in Docker. Now that we know what Docker is and how to use it, we'll run Postgres in Docker and put some data into this Postgres instance. For putting the data there, we'll use a simple Python script. So let's start.

This is where we finished last time: we have two files, a Dockerfile and a pipeline script. Now we want to get some practice with SQL, so we'll run Postgres and use it for practicing SQL before we run SQL queries in BigQuery later in the course. To run Postgres we'll use the official Postgres Docker image; you can just look it up. What you can also do: in week two we'll use Airflow (we'll talk more about what Airflow is and why we need it later), and Airflow internally uses Postgres, so we can take a look at what image is used there and simply reuse it. Right now that code lives in a branch, and you'll see it in action in week 2. It's a Docker Compose file; Docker Compose is a way of running multiple Docker images, and we'll talk about it in the next video. I want to copy the Postgres part and use it for running Postgres locally. Let me copy it, and then open Visual Studio Code in this directory.

Now I want to come up with the command we need to run in order to have Postgres running. Remember, to run something in Docker we do docker run, and we also add -it so we can interact with it and stop it later. Then we have the image: it's postgres:13, the image name and the version. Next we need to configure Postgres. For configuring Postgres we need a bunch of things, and these are environment variables: a way of configuring whatever we have inside the Docker container. In this particular case we want to say what the user will be, what the password for this user will be, and what we want to call our database. Let me copy them. To set environment variables when we run Docker with docker run, we just use the -e flag and then say what the value for each particular variable should be. I'll use root for the user, root for the password as well, and for the database name I'll use ny_taxi, because that's the name of the data set we'll use for practicing SQL, the data set we'll use throughout the course. We'll talk a bit more about this data set later, but I'll call the database ny_taxi: New York taxi.

So that's one part of the configuration we need. The other part is the volumes part. Volumes are a way of mapping a folder from our file system on the host machine to a folder in the container. Because Postgres is a database, it needs to keep files in a file system so that it can save records: it physically saves each record, in its own specific format, to the file system, and later it goes back there to read the records. If we do this in Docker, remember that Docker doesn't keep state, so the next time we run it the state would be lost. We don't want that: the next time we run Postgres we want to make sure our data is still there. That's why we need to map a folder on our host machine to a folder in the container, and this is called mounting. Let me create a folder; I'll call it ny_taxi_postgres_data. Now I'll map this folder to a folder in the container. For mounting we use the -v flag, then the path to the folder on the host machine, then a colon, and then the directory in the container we want to map it to.

We actually need to do one more thing (I'm not sure why it's not mentioned here): we also need to specify the port. In order to access the database, we need to map a port on our host machine to a port on the container. This is needed to send requests to the database: we'll send SQL queries to it, and it needs to reply with answers, so we need to be able to reach a specific port of Postgres. If I remember correctly, this port is 5432, so we map the port with this number on our host machine to the same port in the container. That's all the configuration we need.

Actually, it will not work yet, because Docker needs a full path for the volume. Since I'm on Windows, I need to use the long, absolute path, and I need to write it the Windows way: instead of starting with /c, I start with c: and then the rest of the path. So I need the full absolute path to the folder. On Linux or macOS I would just use the current directory, like $(pwd). I can do pwd here as well, but it returns a Linux/Unix-style path, and Docker on Windows will not like that; it will complain that it cannot find the path. So on Windows you need to spell the path out this way, and if you're on a "normal" operating system you can just use $(pwd). Let's run it.
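Putting the pieces together, the full command we've built up looks like this (the data directory inside the official Postgres image is /var/lib/postgresql/data; on Windows, replace $(pwd) with the absolute host path):

```bash
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13
```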

Hopefully it will work... OK, I must be running something on this port already, probably another instance of Postgres. Let me do docker ps. Right, this is something still running from my previous experiments. Let me quickly remove it; I'll cut this part out because it's not interesting for you, so I'll see you soon. I stopped the container; let me check what is running again. Nothing is running, which means I can run this thing again.
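For reference, checking for and stopping a leftover container looks like this (the container ID is whatever docker ps prints on your machine):

```bash
docker ps                   # list running containers
docker stop <container_id>  # stop the one occupying port 5432
```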

So let me do that. I think it's running, so let's see what happens next. Because it's running for the first time, it needs to do some initialization; when we run it a second time, it should be faster. OK, now it says "database system is ready to accept connections", which means we can actually start working with it. (I'm not sure what that other message is; I'll just ignore it and hope it's working.) Now I can take a look at what's in the ny_taxi_postgres_data folder, the folder we mapped to a folder in the container, and we see that Postgres created some stuff there. I have no idea what it's doing exactly; this is some internal representation of the data for Postgres, so let's not look at it. Now we can try accessing this database. Let me create another terminal window.

Now I want to run a CLI client for accessing the database. The client I use is called pgcli; it's a Python library, and you can install it with pip install pgcli. If I run that, it tells me "requirement already satisfied", which means I already have it; you probably don't, so just run pip install pgcli. Then I can use this CLI to access the database and run some queries. Let me run it; it should output some help. OK, I think I need to see the help; maybe I need to do --help. Yeah.

So this is what we need to specify in order to log into the database. Let me do pgcli. We'll start with the host: -h, which for us is localhost. Then -p, I think, is the port; we don't need to specify it because it's the standard Postgres port, but we might as well, because why not. Then the username: for us, -u is root, which is what we specified in the Docker command. Then the password and the database: for the database I think it's -d, and ours is ny_taxi. I don't think we need to specify anything else, so let me just run it. Now it will ask... OK, for you it will probably ask for a password. I have already accessed pgcli on localhost as root, so that's why it didn't ask me; I assume you will get a password prompt and will need to type the password for root, which in our case is root. You type it in, and then you get something like this.
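The full connection command:

```bash
pgcli -h localhost -p 5432 -u root -d ny_taxi
```

pgcli then prompts for the password, which in our case is root.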

We're now connected to Postgres and can run some queries. We don't have anything there yet, so all we can do is look at the list of tables available in our database; there is nothing, because of course we haven't created anything. Or we can do something like SELECT 1, and it will give us a response with just one column and one row containing 1. Not super useful, but at least we can test the connection to the database and see that it's working.
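In pgcli, that smoke test looks like this:

```sql
-- list tables (nothing there yet)
\dt

-- test that the database responds
SELECT 1;
```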

What we'll do now is take a look at the data set we'll use for this course, and we'll load this data set into our Postgres. I'll open another terminal window.

In this window I'll run Jupyter Notebook. If you installed Python using Anaconda, you already have Jupyter; if you installed Python differently, you can do pip install jupyter and then you'll have it. I already have it, and to run it I do jupyter notebook, which gives me access to Jupyter. I want to use Python to read the data set and put it into Postgres, and Jupyter is a convenient interactive environment for doing this. You don't have to use Jupyter; you can use, for example, Visual Studio Code, or PyCharm, or whatever you want. I just prefer Jupyter, and that's why I'll use it here. So we have Jupyter now, and I'll create a new notebook. By the way, you can ignore this list: if you just installed Jupyter, you'll probably have only one kernel here, and it doesn't matter. When you create a new notebook, click on Python 3, and we get this window.

First let's check that we have pandas. (It should be pd, not spd.) Yes, we have pandas; the version I have is 1.3.4.
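The check is just:

```python
import pandas as pd

pd.__version__  # '1.3.4' in this video
```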

Let me also rename the notebook; I'll call it upload-data. Your Jupyter might look different: mine does because my header and toolbar are hidden, which gives me more space for recording videos. For you it looks a little bit different, and that's fine. Now we have that.

Now let's take a look at the data set we actually want to use. There should be a link in our repo... this is the link... OK, that's a Medium article, not the link I was looking for. Let me just google "ny taxi data". Yeah, this is the link I was looking for; let me copy it, and I'll put it in the repo as well. This data set contains taxi trips from New York City. There are actually two data sets, or even three: yellow taxis and green taxis, and I don't know what the difference between the two is. There's quite a lot of data here, and for this lesson we'll take this one: yellow taxi trip records from January 2021. I'll click "copy link address".

Let me open yet another terminal window. I'll use wget to download this CSV file. What you can also do is just right-click the link and do "save link as". On Windows you don't have wget by default and will need to download it; I already have it. It's a very convenient command-line utility for downloading things, and I'll now use it for downloading the CSV file. I don't know what's wrong with my internet, it's quite slow, but I actually already have the file, so I don't need to download it; I'll just copy it from a different location. There, I have it; you can pretend you didn't see this and that I downloaded it.
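The download itself is just wget with the copied URL; the exact address is whatever "copy link address" gave you (it resolves to a file named yellow_tripdata_2021-01.csv):

```bash
# paste the URL copied from the TLC page
wget <copied-link-address>
```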

We can take a first look at this data set using less, another command-line utility for looking at text data. We see that this CSV file has a bunch of fields; the first one is VendorID. Actually, maybe we should use Excel or something similar for looking at this data. I have LibreOffice Calc, which is an Excel-like program; let me use it to open this file.

It's quite a big document, actually; let me stop it. What I'll do instead: on Linux there is a very convenient command called head, which outputs only the first, let's say, 100 rows of a data set. So it just outputs the first 100 records, and I can save them to a different file; I'll call it yellow_head.csv. Now I'll stop LibreOffice, because it will kill my computer, and open the sample instead. The sample should be a lot faster to open because it's just 100 rows. By the way, we can also take a look at the number of rows using wc -l. wc is "word count", and with -l, instead of counting words it counts lines, so wc -l means: count how many lines there are in this file.
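Concretely (assuming the downloaded file is named yellow_tripdata_2021-01.csv):

```bash
# save the first 100 lines into a small sample file
head -n 100 yellow_tripdata_2021-01.csv > yellow_head.csv

# count the lines in the full file
wc -l yellow_tripdata_2021-01.csv
```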

There are one million three hundred thousand-something rows, so it's quite a large data set, I would say. That's why LibreOffice was a bit slow opening it, and we needed to help it a little by taking only the first 100 rows. Let's take a look at what we have here. We have quite a few columns, and you know what, we actually have a document that describes exactly what each field is.

I'm looking for the yellow trip data dictionary; I'll also copy this link to our repository. This document tells us exactly what each column means. It says that VendorID is the provider (operator) of the taxi. The pickup datetime and drop-off datetime are timestamps for when the pickup happened and when the drop-off happened. Then there's the number of passengers and the trip distance, which I think is in miles; probably, yeah, miles. RatecodeID says exactly what that is; PULocationID is the pickup location ID and DOLocationID is the drop-off location ID, so they say where the passenger was picked up and where they were dropped off. We actually also have this data: the taxi zone lookup table, which is another CSV file; let me just open it. So for this customer, they were picked up in this location... let me quickly find it... they were picked up in Manhattan, at Lincoln Square East, and then dropped off at location 43, which is Manhattan, Central Park. I don't know if that's far or not; maybe they could have walked. So this is the type of information we have, plus a bunch of other things; there are quite a few columns. Another interesting one for us could be the total amount, or the tips: how much the passenger tipped the driver. This is what the data set looks like, and this document describes exactly what we're looking at and what kind of columns there are; it's helpful for making sense of the data.

So we took a look at the data; now let's load it with pandas. Don't worry if you don't know pandas. If you want to learn more about it, you can go to a different course we also run at DataTalks.Club, called Machine Learning Zoomcamp: in the first week of that course, in the introduction, there's a video about pandas you can check out (module one, lesson nine). We won't go into details here; we just need a few commands. Pandas is a convenient way of dealing with tabular data, and a CSV file is tabular data, so that's why we'll use it. For reading the data we use the read_csv function, and we just need to specify that we want to read this particular data set. (I don't know what this warning is.) I didn't want to read the entire data set into memory, since it's quite large, so I also said I want to read only the first 100 rows; reading the first 100 was quite fast.
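That call looks like this (again assuming the file name from the TLC site):

```python
import pandas as pd

# read only the first 100 rows to keep things fast
df = pd.read_csv('yellow_tripdata_2021-01.csv', nrows=100)
df
```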

We see the same data as we saw in LibreOffice. Now what we can do next is take this data set and put it into our Postgres. For that we first need to generate a schema: a CREATE TABLE instruction that specifies what columns there are, what types they have, and so on. For that, pandas has a module called io; inside io there is another package, sql, and in sql there's a function called get_schema. We can use this function to convert the data frame (this df object) to DDL. DDL, data definition language, is the part of SQL used for specifying schemas. We need to specify the name of the table; let's say the name will be yellow_taxi_data. We haven't created a table yet; this just prints the statement so we can see what it will look like. You'll see \n characters in the output; these are line breaks, and Jupyter doesn't render them, which is why we wrap the call in the print function.
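Like this:

```python
# print() renders the \n line breaks that Jupyter would otherwise show literally
print(pd.io.sql.get_schema(df, name='yellow_taxi_data'))
```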

Now we have this statement: a statement in so-called DDL, data definition language, which describes how the schema of the table should look in SQL. One thing we can see immediately is that for the pickup datetime and drop-off datetime it actually uses TEXT. It shouldn't use TEXT; pandas just doesn't know that these are actually timestamps, so we need to tell it. Pandas has a function called to_datetime that takes a date as a string, parses it, and creates a datetime object. Applying it to the pickup time, you see that the type is now datetime64. (It doesn't matter what the 64 means; what matters is that this is a datetime, which is good, because that's what we want here.) For the rest, it's not a super optimal schema: it uses REAL for the total amount, where we'd really want DECIMAL, but for our purposes here it's fine. So the only issue is that we need to say these two fields are timestamps, and we'll do that by parsing them with pandas and writing them back to the data frame. This is how we do it.
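In the yellow taxi data these two columns are named tpep_pickup_datetime and tpep_dropoff_datetime:

```python
# parse the timestamp columns and write them back to the dataframe
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
```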

OK, so now we parsed these columns as timestamps and wrote them back to the data frame, and when we generate the statement again we see that these two are TIMESTAMP. Now, I'm not sure if we can just copy this statement and paste it into Postgres; maybe it will work, maybe it won't. So we actually want to generate this DDL statement, this CREATE TABLE statement, in something Postgres will understand for sure. For that we need to tell pandas that we want to put this into Postgres, and then it will generate a statement that works specifically for PostgreSQL. That means we need to create a connection to Postgres, and we'll need this connection anyway to be able to write data to our database. So let's do that.

Pandas uses a library called SQLAlchemy, a Python library for dealing with SQL; that's what it uses under the hood, so we need to install it. I already have it; I think it's part of Anaconda, so if you installed Anaconda you have SQLAlchemy. If you don't have Anaconda, you just do pip install sqlalchemy (you may want to double-check, but I'm pretty sure this command works). So let's import it.

We need to import create_engine. Now let's create the engine: it will be create_engine, and we need to specify the type of database we have. For us it's Postgres, so we write postgresql, the full name; then the user, which is root; then the password; then an @ with the host name, localhost; then the port, 5432; and then the database name, ny_taxi. Let me check... yeah, it's ny_taxi; looks correct. Let me check if it actually works. It didn't complain. What happens if I type something wrong here? It also doesn't complain... maybe I need to run connect. OK, now it says something is wrong: password authentication failed. And if I put the correct credentials back, now it works. (OK, I just deleted a cell; I pressed the X key by accident.)
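The whole thing:

```python
from sqlalchemy import create_engine

# postgresql://user:password@host:port/database
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
engine.connect()
```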

Now we need to specify the connection in get_schema: the connection is the engine we just created, and with it we get the definition in the Postgres dialect. (The earlier one might have worked too; I'm not sure if Postgres knows what REAL is and treats it as float. Maybe it does, maybe it doesn't, I don't know.) But this is the statement pandas will execute when it creates the table.
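That is:

```python
# with con=engine, pandas generates DDL in the Postgres dialect
print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))
```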

Well, I think now we are ready. We know what to do: we took a look at the data, and we saw that we only need to parse the timestamp fields; the rest is not ideal, but it will work fine for this video, and later, in other lessons in other weeks, we'll see how to define these types properly. Now, we only read the first hundred rows, and we want to read the entire data frame. But the entire data frame is too big: we don't want to insert over a million rows into our database all at once; I don't know how it would react.

That's why we need to batch, to chunk this CSV file into smaller pieces. I'll use iterators from pandas; an iterator lets us read the CSV file as a sequence of smaller data frames. For that we pass iterator=True, and there's also a chunksize parameter. (To see the help, what I do is press Shift+Tab, and it shows me the documentation.) Let's set the chunk size to 100,000 rows at once: we'll break our CSV file into chunks of this size and read them one chunk at a time, and for each chunk we'll put it into the database; after a dozen or so chunks, the entire data set will be in our database. I'll call the result df_iter, meaning it's an iterator. If I print it right now, it shows that this is not a data frame; it's an iterator. To get the first value, I'll do next; in Python, next is a function that returns the next element of an iterator. I tried it, and I have a data frame whose size is exactly 100,000, like we specified.
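In the notebook (pd comes from the earlier cells):

```python
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)

df = next(df_iter)  # first chunk: a regular dataframe of 100,000 rows
len(df)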

For this chunk we also need to do the two to_datetime calls to make sure those columns are timestamps, so let's do that. Then we'll generate the statement and create the table. For that, what we can do is use the data frame's head: head returns the first five records, or the first n records; we can specify however many rows we want, and if we specify n=0 it returns only the header. We want to use that to send only the table definition, with no rows, to our database. Effectively, when we do this, we'll only execute the CREATE TABLE and not insert any data, because we're doing the inserts in chunks: first we create the table, then we insert the data chunk by chunk.

The data frame has a method called to_sql, and what this method does is insert. We give it a connection, our engine, and a name; what was the name... yellow_taxi_data. What it will do is go to our database and insert all the rows: first it creates the table, then it inserts everything. That's why we only want the first part here, the table creation, and we'll insert the rows separately. Another thing we need to specify is what happens when a table with this name already exists. There are multiple options: fail, replace, and append. First we'll use replace: it checks whether a table with this name already exists, and if so, it drops it and creates a new one; if a table with this name doesn't exist, it just creates it. And since I'm calling this only on the column names, via head(n=0), I only create the table and don't insert anything yet. It executed successfully.
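So the table-creation step is:

```python
# head(n=0) keeps only the column headers: this runs CREATE TABLE without inserting rows
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
```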

Now let's see what kind of tables we have. In pgcli, \dt returns the list of tables, and there's our yellow_taxi_data. I think we can also describe the table... I think desc is from MySQL; here I just need \d yellow_taxi_data. (Wrong slash at first... OK.) And now we see the schema that we created.
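In pgcli:

```sql
-- list tables
\dt

-- describe the table we just created
\d yellow_taxi_data
```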

Now we can finally put some data into this table. We'll execute the same to_sql, but without the head(n=0), and with if_exists set to append instead of replace: with append, it inserts new values into the existing table, which is exactly what we want to do for each chunk from the iterator. You can actually time it: the %time magic tells us how much time it took to run the line, and it took about 10 seconds. Now we can do a count, SELECT count(1) FROM yellow_taxi_data, which tells us how many rows we have in our table, and we see 100,000 rows, as expected.
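The first-chunk insert (in Jupyter, %time reports how long the line took):

```python
%time df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

# then, in pgcli:
#   SELECT count(1) FROM yellow_taxi_data;  -- 100000
```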

26:43 - 26:47

what we need to do is we just need to

26:45 - 26:49

insert the rest of the data frame to our

26:47 - 26:52

table i will use maybe it's not the

26:49 - 26:54

cleanest code when i do this data frame

26:52 - 26:56

next i will get the next chunk and then

26:54 - 26:59

i want to do it in a loop and then the

26:56 - 27:01

way iterators in python work is when you

26:59 - 27:04

no longer can get next it will throw an

27:01 - 27:06

exception it will say there is no data

27:04 - 27:08

anymore and our loop will exit this

27:06 - 27:11

that's why i'll have an infinite loop

27:08 - 27:13

here and then i will take next the next

27:11 - 27:15

next next chunk until there are no

27:13 - 27:19

chunks left and then for each size chunk

27:15 - 27:22

i will first do this thing and then i

27:19 - 27:24

will insert data to our database

27:22 - 27:25

to have some visibility into what's

27:24 - 27:27

going on let me add some print

27:25 - 27:29

statements i'll write insert that

27:27 - 27:31

another chunk

27:29 - 27:34

and we can also measure how much time

27:31 - 27:35

each chunk takes so i'll do from time

27:34 - 27:39

import time

27:35 - 27:41

permanence benchmarking do the start

27:39 - 27:44

this returns the current timestamp in

27:41 - 27:45

seconds at the beginning and then at the

27:44 - 27:48

end

27:45 - 27:50

and then if we subtract start from the

27:48 - 27:53

end then we will know how much time we

27:50 - 27:55

took all right it took

27:53 - 27:58

that many seconds and now we will just

27:55 - 27:59

do the subtraction so this notation here

27:58 - 28:02

if you're new to python it means that it

27:59 - 28:04

will treat it as flawed and it will have

28:02 - 28:06

three decimal digits after

28:04 - 28:10

you will see right now so now we are

28:06 - 28:12

inserting data to our table i hope there

28:10 - 28:14

will be no mistakes
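A sketch of that loop as described (df_iter, pd, and engine come from the earlier cells; when the iterator is exhausted, next raises StopIteration and the cell stops):

```python
from time import time

while True:
    t_start = time()

    df = next(df_iter)  # raises StopIteration once the CSV is exhausted

    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

    t_end = time()
    print('inserted another chunk, took %.3f seconds' % (t_end - t_start))
```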

OK, there's the first chunk. We can already check: yes, the count is 200,000 now. It will take some time, and I won't wait until it finishes, because we're done with this video. So: we ran Postgres using Docker, we ran it locally, and we were able to connect to this database; then we took a look at the data we'll use throughout the course, and we started loading this data into our Postgres database. What we'll do next, in the next lesson, is use a special tool called pgAdmin to connect to our database. It's quite convenient, more convenient than using pgcli, and that's exactly what we'll do in the next lesson. So, see you soon.

Running Postgres and Loading Data Sets into Postgres Using Python

In this video, we dive into running Postgres in Docker, setting up its configuration, and loading a data set into our Postgres database using Python. By focusing on key areas like docker run commands, configuring environment variables, mapping folders, and interacting with the database using SQL and Python scripts, we gain valuable hands-on experience with data for SQL practice and beyond.

Running Postgres in Docker for SQL Practice

After setting up Docker and creating our Dockerfile and pipeline in the previous video, we delve into running Postgres in Docker. Using the official Postgres Docker image, we configure environment variables for the user, password, and database name (ny_taxi), map a host folder to the Postgres container for data storage, and publish the port (5432) used for SQL queries.

Loading Data Sets into Postgres

Next, we explore the yellow taxi trip data set from January 2021 in New York City, understanding the columns and data dictionary provided. Using Python's Pandas library, we chunk the large CSV data set, parse timestamp columns, generate a table schema, and establish a connection to Postgres using SQLAlchemy. With data inserted chunk by chunk, we ensure efficient data loading and schema creation within our Postgres database.

Conclusion

As we wrap up, we reflect on the successful loading of data into Postgres and look forward to using tools like pgAdmin for enhanced database management in the upcoming lesson. Exciting opportunities lie ahead as we deepen our understanding of SQL practices, data manipulation, and database interactions in real-world scenarios.

Stay tuned for more hands-on tutorials and practical insights into mastering database management tools and techniques. Get ready to level up your data engineering skills with each lesson!

Let's continue this data journey and unlock the true potential of data engineering in our ever-evolving tech landscape!