00:00 - 00:05
welcome back this is another video of
00:03 - 00:07
our introduction in the previous video
00:05 - 00:09
we talked about docker and why it's
00:07 - 00:12
useful for data engineers we also took a
00:09 - 00:14
look at how we can create
00:12 - 00:16
air codes pipeline and put it in docker
00:14 - 00:18
now we know what docker is and how to
00:16 - 00:22
use it now we will run postgres in
00:18 - 00:24
docker and we will put some data to
00:22 - 00:27
this postgres
00:24 - 00:29
and and for putting the data there we'll
00:27 - 00:31
use a simple python script so let's
00:29 - 00:33
start this is where we finished last
00:31 - 00:35
time we have two files we have a docker
00:33 - 00:38
file and we have pipeline and now we
00:35 - 00:40
want to get some practice with sql and
00:38 - 00:42
we want to run postgres and we can use
00:40 - 00:45
this postgres for practice in sql before
00:42 - 00:47
we run the sql queries in bigquery later
00:45 - 00:49
in the course to run postgres we will
00:47 - 00:51
use the official docker image of
00:49 - 00:54
postgres you can just look it up what
00:51 - 00:57
you can also do is for week two we will
00:54 - 00:59
use airflow and airflow internally we
00:57 - 01:01
will talk more about what airflow is and
00:59 - 01:04
why do we need this but airflow
01:01 - 01:06
internally uses postgres and we can take
01:04 - 01:08
a look at what's the image we use there
01:06 - 01:12
and we can just reuse that so i think
01:08 - 01:12
right now the code is in branch
01:14 - 01:19
so we are interested in this so you will
01:16 - 01:21
see this code in action in the week 2.
01:19 - 01:23
we have we will talk about docker
01:21 - 01:25
composer docker compose as a way of
01:23 - 01:28
running multiple docker images we will
01:25 - 01:31
talk about this in the next video i want
01:28 - 01:34
to copy it and i want to use it for
01:31 - 01:36
running postgres locally let me copy it
01:34 - 01:39
then i open visual studio code in this
01:39 - 01:43
now i want to come up with a comment
01:40 - 01:46
that we need to run in order to
01:43 - 01:48
have postgres brand remember to run
01:46 - 01:50
something in docker we need to do docker
01:48 - 01:53
run and yeah also we do minus a t to be
01:50 - 01:56
able to stop it then we see this we have
01:53 - 01:58
image it's postgres 13. this is the
01:56 - 02:01
image and the version and we need to
01:58 - 02:03
configure postgres so for postgres for
02:01 - 02:05
configuring postgres we need a bunch of
02:03 - 02:07
things so these are environment
02:05 - 02:09
variables so this is a way of
02:07 - 02:11
configuring whatever we have in the
02:09 - 02:14
docker container so in this particular
02:11 - 02:16
case we want to say what the user will
02:14 - 02:19
be what the password for this user will
02:16 - 02:21
be and how we want to call our database
02:19 - 02:23
so these are environmental variables let
02:21 - 02:26
me copy them and to set environmental
02:23 - 02:29
variables when we run docker with docker
02:26 - 02:32
run we just need to use minus e flag and
02:29 - 02:34
then say what the parameter for this
02:32 - 02:36
particular variable should be i'll just
02:34 - 02:41
use root for the user
02:36 - 02:44
then for password i'll also use root
02:41 - 02:44
and then for the database name i will
02:44 - 02:49
ny taxi so because this is the name of
02:47 - 02:51
the data set we will use for practicing
02:49 - 02:53
sql and this is the data set will use
02:51 - 02:56
throughout the course we'll talk a bit
02:53 - 02:59
about this data set later but i will
02:56 - 03:01
call the database ny taxi new york taxi
02:59 - 03:04
so that's one part of the configuration
03:01 - 03:07
we need there is other part we need this
03:04 - 03:10
volumes part so this volumes is a way of
03:07 - 03:13
mapping folder that we have in our file
03:10 - 03:15
system on the host machine to a folder
03:13 - 03:17
in the container and because postgres is
03:15 - 03:20
a database it needs to keep files in a
03:17 - 03:22
file system if we want to be able to
03:20 - 03:25
save requests there so it will go
03:22 - 03:27
physically save this record in its
03:25 - 03:29
specific format to your file system and
03:27 - 03:30
then later it will also go there to read
03:30 - 03:34
and if we do this in docker remember
03:32 - 03:36
docker doesn't keep the state so next
03:34 - 03:38
time we will run the state will be lost
03:36 - 03:40
we don't want that we want when we next
03:38 - 03:42
time run postgres we want to make sure
03:40 - 03:45
that our data is still there so that's
03:42 - 03:47
why we need to map a folder we have on
03:45 - 03:49
our host machine to a folder in the
03:47 - 03:51
container and this is called mounting
03:49 - 03:56
and let me create a folder i'll call
03:51 - 04:00
this folder ny taxi postgres data
03:56 - 04:03
and now i will map this folder to to
04:00 - 04:06
this folder on the container so for this
04:03 - 04:08
for mounting we use this minus v flag
04:06 - 04:10
and then we have the path to the folder
04:08 - 04:13
on the host machine and then we use
04:10 - 04:15
colon to say to which directory on the
04:13 - 04:17
container we want to map it to
04:15 - 04:19
we actually need to do one more thing
04:17 - 04:21
i'm not sure why it's not mentioned here
04:19 - 04:24
we also need to specify the port in
04:21 - 04:27
order to access the port we need to map
04:24 - 04:28
a port on our host machine to a port on
04:28 - 04:32
this is needed to send a request to your
04:30 - 04:34
database because for database we will
04:32 - 04:37
send sql queries to this database and
04:34 - 04:39
then it needs to reply with answers so
04:37 - 04:41
we need to be able to access a specific
04:39 - 04:45
port of postgres and this port if i
04:41 - 04:47
remember correctly is 5432 and so we map
04:45 - 04:49
a port on our host machine with this
04:47 - 04:51
number to a port in the container that's
04:49 - 04:53
all the configuration we need it's
04:51 - 04:56
actually it will not work because docker
04:53 - 04:59
needs a full path since i'm on windows i
04:56 - 05:01
need to use this long path
04:59 - 05:03
and they actually need to use the
05:01 - 05:06
windows way of setting the path so
05:03 - 05:07
instead of slash c i need to start with
05:07 - 05:12
colon and then the rest of the path so i
05:10 - 05:14
need to have the full name and the full
05:12 - 05:17
path the absolute path to the folder on
05:14 - 05:19
linux on mac os i would do something
05:19 - 05:26
current directory like this pwd i can do
05:22 - 05:28
that here as well pwd but this returns a
05:26 - 05:30
linux unix like path and docker and
05:28 - 05:33
windows will not like this and it will
05:30 - 05:34
complain that it cannot find this path
05:35 - 05:39
yeah for windows you need to do it this
05:37 - 05:42
way and if you're on a normal operating
05:39 - 05:44
system so you can just do this pwd let's
05:44 - 05:49
hopefully it will work okay i must be
05:47 - 05:51
running something on this port probably
05:49 - 05:54
another instance of postgres so let me
05:54 - 05:58
so this is something that is still
05:56 - 06:02
running from my previous experiments let
05:58 - 06:03
me quickly remove that and i'll cut this
06:02 - 06:06
part out because this is not interesting
06:03 - 06:08
for you so i'll see you soon i stopped
06:06 - 06:10
the container let me check what is
06:08 - 06:13
running again nothing is running this
06:10 - 06:15
means i can run this thing again
06:13 - 06:16
so let me do that
06:15 - 06:18
i think it's running so let's see what
06:16 - 06:21
happens next and because it's running
06:18 - 06:22
for the first time it needs to do some
06:21 - 06:24
instantiation when we run it for the
06:22 - 06:27
second time i think it should be faster
06:24 - 06:30
okay and now it says database system is
06:27 - 06:32
ready to accept connections it means
06:30 - 06:33
that we can actually start working with
06:33 - 06:37
i'm not sure what that is i will just
06:35 - 06:39
ignore that i hope it's working so now i
06:37 - 06:42
can see i can take a look at what is in
06:39 - 06:45
this folder and why taxi postgres data
06:42 - 06:47
so this is the folder that we mapped to
06:45 - 06:49
a folder in the container and we see
06:47 - 06:52
that postgres created some stuff there i
06:49 - 06:54
have no idea what it's doing exactly
06:52 - 06:57
this is some internal representation of
06:54 - 06:59
data for postgres let's not look at this
06:57 - 07:01
and now we can try accessing this
06:59 - 07:02
database let me create another terminal
07:02 - 07:07
and now i want to run a cli client for
07:05 - 07:10
accessing the database the client i use
07:07 - 07:12
is called pgcli this is a library in
07:10 - 07:16
python you can install it using peep
07:12 - 07:18
install pg cli i think if i run it it
07:16 - 07:20
will tell me that it it's already
07:18 - 07:22
installed it says requirement already
07:20 - 07:24
satisfied so that means i already have
07:22 - 07:26
it you probably don't then you just do
07:24 - 07:29
pip install pgcli and then i can use
07:26 - 07:30
this cli to access the database and then
07:29 - 07:32
run some queries there so let me just
07:30 - 07:35
run it it will output help
07:32 - 07:38
okay i think i need to see help i think
07:35 - 07:40
i need to do maybe minus minus help
07:40 - 07:45
so this is what we need to specify in
07:42 - 07:48
order to log into the database so let me
07:45 - 07:52
do pg cli then we'll start with host
07:48 - 07:55
minus h host for us it's localhost then
07:52 - 07:56
minus p i think it's port we don't need
07:55 - 07:59
to specify it because this is the
07:56 - 08:01
standard web port from postgres or we
07:59 - 08:05
might as well specify it because why not
08:01 - 08:07
then user name so for us user minus u is
08:05 - 08:09
root this is what we specified here so
08:07 - 08:11
user password
08:09 - 08:13
and database so this is what we
08:11 - 08:17
specified for database i think it's
08:13 - 08:19
minus d and then it's ny underscore taxi
08:17 - 08:21
i don't think we need to specify
08:19 - 08:24
anything else here so let me just run it
08:21 - 08:26
now it will ask for okay for you
08:24 - 08:30
probably it will ask her password i have
08:26 - 08:32
already accessed spgcli on localhost for
08:30 - 08:34
root so that's why i didn't ask me i
08:32 - 08:35
assume for you you will get a password
08:34 - 08:38
prompt and you will need to type the
08:35 - 08:40
password forward which in our case will
08:38 - 08:42
be root so you type this in and then you
08:40 - 08:45
will get something like that
08:42 - 08:47
we now connect it to postgres and we can
08:45 - 08:50
run some queries we don't have anything
08:47 - 08:52
there yet so all we can do is look at a
08:50 - 08:55
list of tables that is available in our
08:52 - 08:57
database there is nothing because of
08:55 - 08:59
course we haven't created anything or we
08:57 - 09:01
can do something like select one and
08:59 - 09:03
then it will give us a response with
09:01 - 09:05
just one column one row one there but
09:03 - 09:08
not super useful but at least we can
09:05 - 09:10
test the connection to the database and
09:08 - 09:13
we can see that it's working what we
09:10 - 09:15
will do now is now we'll take a look at
09:13 - 09:17
the data set we will use for this course
09:17 - 09:22
take this data set and load this to our
09:20 - 09:23
postgres i'll open another terminal
09:23 - 09:28
and in this window i will run jupyter
09:26 - 09:30
notebook if you install python using
09:28 - 09:32
anaconda you already have jupyter if you
09:30 - 09:36
installed python differently then you
09:32 - 09:38
can do pip install jupyter and then you
09:36 - 09:42
will have it i already have it and to
09:38 - 09:43
run it i will do jupiter notebook and i
09:42 - 09:46
will get access to jupyter because now i
09:43 - 09:50
want to use python to be able to read
09:46 - 09:52
the data set and load this data set and
09:50 - 09:54
put it to postgres so i want to use
09:52 - 09:56
python for that and jupiter is a
09:54 - 09:58
convenient interactive environment for
09:56 - 10:00
doing this you don't have to use jupyter
09:58 - 10:02
you can use for example visual studio
10:00 - 10:05
code or you can use pycharm or whatever
10:02 - 10:07
you want i just prefer using jupiter and
10:05 - 10:09
that's why i will use it as well
10:07 - 10:12
so we have jupiter right now i will
10:09 - 10:14
create a new notebook here by the way
10:12 - 10:16
you can ignore this so if you just
10:14 - 10:18
installed jupiter you will probably have
10:16 - 10:20
only one kernel here it doesn't matter
10:18 - 10:23
when you create a new notebook click
10:20 - 10:26
here on python 3 and we have this window
10:23 - 10:29
first let's check if we have wonders
10:26 - 10:32
it should be pd not spd yeah so we have
10:29 - 10:34
pandas the version of pandas i have is
10:32 - 10:38
one three four let me also rename it
10:34 - 10:40
i will call it upload data
10:38 - 10:42
your jupiter might look differently it's
10:40 - 10:45
because my header is
10:42 - 10:48
hidden and my toolbar is hidden as well
10:45 - 10:50
this way i have more space for recording
10:48 - 10:52
videos but for you it looks a little bit
10:50 - 10:53
different so that's fine now we have
10:53 - 10:57
now let's take a look at the data set we
10:56 - 11:00
actually want to use there should be a
10:57 - 11:03
link in our report
11:00 - 11:03
this is the link
11:03 - 11:08
okay so there is an article medium this
11:05 - 11:10
is not the link i was looking for i'm
11:08 - 11:13
actually looking for
11:10 - 11:16
let me just go to google and why taxi
11:13 - 11:19
data and yeah so this is the link i was
11:16 - 11:21
looking for let me copy it and i'll put
11:19 - 11:24
this in the repo as well so this data
11:21 - 11:26
set contains taxi trips from new york
11:24 - 11:28
city there are actually two data sets or
11:26 - 11:31
even three like yellow taxis and green
11:28 - 11:33
taxis don't know what is the difference
11:31 - 11:36
between the these two and we have quite
11:33 - 11:38
a few quite a lot of data here and for
11:36 - 11:41
this lesson we'll take this thing here
11:38 - 11:45
so it's yellow taxi trip records from
11:41 - 11:47
january 2021 and i will click copy link
11:45 - 11:49
address and what i will do let me
11:47 - 11:52
actually open yet another terminal
11:49 - 11:55
window so now i will use wget for
11:52 - 11:58
downloading this csv file what you can
11:55 - 12:00
do you can also just right click on this
11:58 - 12:03
and do save link as for example on
12:00 - 12:04
windows you do not have it by default
12:03 - 12:07
you will need to download this i already
12:04 - 12:10
have it it's a very convenient command
12:07 - 12:12
line utility for downloading things i
12:10 - 12:14
will now use it for downloading the csv
12:12 - 12:17
file i don't know what is wrong with my
12:14 - 12:19
internet it's quite slow but i actually
12:17 - 12:22
already have it i don't need to download
12:19 - 12:23
it i'll just copy it from a different
12:23 - 12:28
i have it so you can pretend you didn't
12:25 - 12:31
see this and i downloaded so i have it
12:28 - 12:33
and we can take a first look at this
12:31 - 12:36
data set using less so this is another
12:33 - 12:38
command line utility for looking at text
12:36 - 12:41
data and we see that in this csv file we
12:38 - 12:43
have a bunch of fields so the first one
12:41 - 12:45
is vendor id maybe actually we should
12:43 - 12:46
use like excel or
12:45 - 12:50
something else we're looking at this
12:46 - 12:52
data i have libreoffice calc which is a
12:50 - 12:56
excel like program let me use it for
12:52 - 12:56
opening this file
12:59 - 13:04
it's quite a big document actually let
13:01 - 13:06
me stop it so what i'll do is in
13:04 - 13:09
linux there is a very convenient command
13:06 - 13:11
called hat which outputs only first
13:09 - 13:12
let's say 100 rows of a data set and
13:12 - 13:18
yeah so just outputs first 100 records
13:15 - 13:20
and then i can save it to a different
13:18 - 13:21
file using this command so i can say
13:21 - 13:27
hat csv then now i will stop it because
13:24 - 13:30
it will kill my computer
13:27 - 13:32
and i will open a sample
13:30 - 13:34
this sample should be a lot faster to
13:32 - 13:36
open because it's just 100 rows and by
13:34 - 13:39
the way we can also take a look at the
13:36 - 13:41
number of rows using this vc minus l
13:39 - 13:44
command vc is word count minus l instead
13:41 - 13:47
of counting words it will count lines so
13:44 - 13:49
vc minus l means count how many lines
13:47 - 13:52
are there in this file
13:49 - 13:55
so there are one million three hundred
13:52 - 13:58
thousand something rows so it's quite a
13:55 - 14:00
large data set i would say so that's why
13:58 - 14:02
libreoffice it was a bit slow when
14:00 - 14:05
opening all this and we needed to help
14:02 - 14:07
it a little bit by taking only the first
14:05 - 14:09
100 rows let's take a look at what we
14:07 - 14:11
have here so we have quite a few columns
14:09 - 14:13
you know what we actually have a
14:11 - 14:15
document that describes exactly what
14:13 - 14:16
each field is
14:15 - 14:19
so i'm looking for
14:16 - 14:21
this yellow trip data dictionary i will
14:19 - 14:25
also copy this link
14:21 - 14:27
to our repository and this link tells us
14:25 - 14:28
what exactly each column means it says
14:27 - 14:31
that vendor id this is who is the
14:28 - 14:33
provider operator of the taxi then this
14:31 - 14:36
peak update time drop-off date time is
14:33 - 14:39
when pickup happened and then the drop
14:36 - 14:42
off happened these are timestamps how
14:39 - 14:44
many passengers are there the distance i
14:42 - 14:47
think it's miles probably yeah miles
14:44 - 14:50
then rate call id says what exactly it
14:47 - 14:53
is this one pu it's pickup location id
14:50 - 14:55
do drop down location id so it says
14:53 - 14:57
where the passenger was picked up and
14:55 - 15:00
where they were dropped off we actually
14:57 - 15:02
also have this data
15:00 - 15:06
this taxi zone lookup table and that's
15:02 - 15:08
another csv file let me just open it
15:06 - 15:11
for this customer so they were picked up
15:08 - 15:14
in this location so let me quickly find
15:11 - 15:17
it's uh they were picked up in manhattan
15:14 - 15:19
on lincoln square east and then they
15:17 - 15:21
were dropped off
15:19 - 15:23
in location 43 which is manhattan
15:21 - 15:25
central park
15:23 - 15:27
i don't know if it's far not maybe they
15:25 - 15:29
could have walked yeah so this is the
15:27 - 15:30
type of information we have and a bunch
15:29 - 15:32
of other things there are quite a few
15:30 - 15:34
columns another interesting one for us
15:32 - 15:37
could be this total amount or tips how
15:34 - 15:39
many tips they gave to the driver this
15:37 - 15:42
is how the data set looks like so this
15:39 - 15:43
this document describes what exactly we
15:42 - 15:46
are looking at what kind of columns are
15:43 - 15:48
there and it's helpful to make sense of
15:46 - 15:51
this data so we took a look at the data
15:48 - 15:53
now let's load this data with pandas
15:51 - 15:55
don't worry if you don't know pandas
15:53 - 15:57
actually if you want to learn more about
15:55 - 16:00
pandas you can go to a different course
15:57 - 16:02
that we also run in data talks club this
16:00 - 16:04
course is called machine learning zoom
16:02 - 16:06
camp and in the first week of this
16:04 - 16:09
course in the introduction
16:06 - 16:11
we had a video about pandas
16:09 - 16:14
and you can check it out this is module
16:11 - 16:16
one lesson nine we will not go into
16:14 - 16:18
details we will just need a few comments
16:16 - 16:21
there pandas is just a convenient way of
16:18 - 16:22
dealing with tabular data csv file is a
16:21 - 16:25
tabular data so that's why we'll use it
16:22 - 16:28
for reading data we will use the csv
16:25 - 16:29
command and then we just need to specify
16:28 - 16:31
that we want to read this particular
16:31 - 16:36
i don't know what this warning is
16:34 - 16:38
i didn't want to read the entire data
16:36 - 16:40
set in memory and that's quite large so
16:38 - 16:43
i also wanted to say that i want to read
16:40 - 16:45
only the first 100 rows so then reading
16:43 - 16:47
the first 100 was quite fast
16:45 - 16:49
and we see that we have the same we see
16:47 - 16:52
the same data as we saw in our
16:49 - 16:54
libreoffice so now what we can do next
16:52 - 16:57
is take this data set and put it to our
16:54 - 17:00
postgres for that we need to first
16:57 - 17:02
generate a schema to say an instruction
17:00 - 17:04
like create table that specifies what
17:02 - 17:06
kind of columns are there what kind of
17:04 - 17:09
types are there and so on for that in
17:06 - 17:13
pandas there is a model called io and in
17:09 - 17:15
io we have another package sql and then
17:13 - 17:18
in sql we have this thing called get
17:15 - 17:21
schema and we can actually
17:18 - 17:25
use this function to convert the data
17:21 - 17:27
frame we have this df thing to a ddl
17:25 - 17:31
data definition language is a part of
17:27 - 17:33
sql that is used for specifying schemas
17:31 - 17:35
so this will have it yeah we need to
17:33 - 17:39
specify the name of the table so let's
17:35 - 17:42
say the name will be yellow taxi data
17:39 - 17:44
for example data we haven't created a
17:42 - 17:46
table yet so this just prints a
17:44 - 17:48
statement for us to see how the
17:46 - 17:51
statement will look like you see that we
17:48 - 17:54
have this slash n things so these are
17:51 - 17:55
line breaks jupiter doesn't print them
17:54 - 17:58
that's why we need to use this print
17:55 - 18:01
function now we have this statement so
17:58 - 18:04
this is a statement in so-called ddl
18:01 - 18:06
data definition language which describes
18:04 - 18:09
how the schema the table should look
18:06 - 18:11
like in sql one thing we can see
18:09 - 18:13
immediately is that for pickup date time
18:11 - 18:15
and drop-off daytime it actually uses
18:13 - 18:17
text it shouldn't use slacks because
18:15 - 18:20
pandas doesn't know that this is
18:17 - 18:22
actually timestamp we need to tell
18:20 - 18:25
pandas that this is timestamp in pandas
18:22 - 18:28
there is a function called two datetime
18:25 - 18:31
that takes date in string and parses it
18:28 - 18:34
and creates a datetime object it's pick
18:31 - 18:37
up time and you see that now the type
18:34 - 18:40
here is datetime64. it doesn't matter
18:37 - 18:41
what 64 is but we see that this is a
18:40 - 18:44
datetime which is good because this is
18:41 - 18:46
what we want to have here for the rest
18:44 - 18:49
it's not a super optimal schema because
18:46 - 18:50
it uses real for total amount so we
18:49 - 18:53
actually need to use decimal but for our
18:50 - 18:55
purposes here is fine
18:53 - 18:58
so the only issue we have here is that
18:55 - 19:00
we need to say that these two fields are
19:00 - 19:04
and we will do this by parsing them with
19:02 - 19:06
pandas and now we also need to write
19:04 - 19:08
them back to the data frame and this is
19:06 - 19:12
how we do it
19:08 - 19:13
okay so now we parsed these columns as
19:12 - 19:15
timestamps and wrote them back to the
19:13 - 19:18
data frame and now when we generate the
19:15 - 19:20
statement we will see that these two are
19:18 - 19:22
timestamps now this statement i'm not
19:20 - 19:24
sure if we can just copy this and paste
19:22 - 19:26
two postgres maybe it will work maybe it
19:24 - 19:28
will not so we actually need to generate
19:26 - 19:30
this ddl statement this create table
19:28 - 19:33
statement in something that postgres
19:30 - 19:35
will understand for sure so for that we
19:33 - 19:37
need to tell pandas that we want to put
19:35 - 19:39
this into postgres and then it will
19:37 - 19:41
generate a statement that works for
19:39 - 19:43
postgres so for that we actually need to
19:41 - 19:45
create a connection to postgres and then
19:43 - 19:48
it will say okay this is a connection to
19:45 - 19:50
postgres and then it will generate a
19:48 - 19:52
statement that is specifically for
19:50 - 19:54
postgresql and we will need to have this
19:52 - 19:57
connection anyways to be able to write
19:54 - 20:00
data to our database so let's do that
19:57 - 20:03
pandas uses a library called sql alchemy
20:00 - 20:06
this is a python library for dealing
20:03 - 20:09
with sql this is what it uses so we need
20:06 - 20:11
to install it i already have it i think
20:09 - 20:12
it's a part of anaconda so when you
20:11 - 20:15
install anaconda you have this sql
20:12 - 20:19
alchemy if you don't have anaconda then
20:15 - 20:20
you just do pip install scale okay me i
20:19 - 20:22
think you'll need to double check i'm
20:20 - 20:24
pretty sure this comment will work so
20:22 - 20:28
let's import this
20:24 - 20:31
so we need to import this create engine
20:28 - 20:34
and now let's have engine so for us it
20:31 - 20:35
will be create engine and then we will
20:34 - 20:36
need to specify the type of the database
20:36 - 20:43
for us the type of database is postgres
20:40 - 20:45
so we need to write postgres postgresql
20:43 - 20:47
the full name and then we need to
20:45 - 20:50
specify the user which is the root then
20:47 - 20:53
the password and then we need to say at
20:50 - 20:56
the host name the local host and then
20:53 - 20:59
the port is 5432 and then the database
20:56 - 21:02
name is ny taxi let me check yeah it's
20:59 - 21:04
ny taxi looks correct let me check if it
21:02 - 21:07
actually works it didn't complain what
21:04 - 21:09
happens if i type something wrong here
21:07 - 21:11
yeah it also doesn't complain maybe i
21:09 - 21:12
need to run this connect okay and now it
21:12 - 21:15
something is wrong
21:13 - 21:17
here password authentication and if i
21:15 - 21:20
put the correct credentials yeah now it
21:17 - 21:22
works okay i just deleted the cell i
21:20 - 21:24
pressed the x key so i just pressed it
21:22 - 21:26
now we need to specify the connection so
21:24 - 21:27
connection is the thing we just created
21:27 - 21:33
now we have the definition in
21:30 - 21:36
postgres dialogue that one would have
21:33 - 21:38
worked i'm not sure if postgres knows
21:36 - 21:41
what real is and if it treats it as
21:38 - 21:43
flawed maybe it does maybe doesn't i
21:41 - 21:46
don't know but this is the statement
21:43 - 21:49
that pandas will execute when it will
21:46 - 21:50
try to create this table well i think
21:50 - 21:54
we are ready so we know what to do we
21:52 - 21:56
took a look at the data we see that we
21:54 - 21:59
need to want to parse only this thing
21:56 - 22:02
and the rest it's not ideal but it will
21:59 - 22:05
work fine for us for this video and
22:02 - 22:06
later in other lessons in other weeks we
22:05 - 22:09
will see how to define these types
22:06 - 22:12
properly actually we only
22:09 - 22:13
read the first hundred rows we want to
22:12 - 22:16
read the entire data frame the entire
22:13 - 22:18
data frame is too big we don't want to
22:16 - 22:20
insert one million six hundred thousand
22:18 - 22:21
rows at the same time to our database i
22:20 - 22:24
don't know how it will react so that's
22:21 - 22:28
why we need to budget to chunk this csv
22:24 - 22:30
file into smaller batches and i will use
22:28 - 22:33
iterators from pandas iterator allows us
22:30 - 22:35
to chunk the csv file into smaller
22:33 - 22:38
dataframes for that i think we need yeah
22:35 - 22:40
so we have this iterator through
22:38 - 22:43
iterator true and then there is chunk
22:40 - 22:46
size to see help what i do is i do shift
22:43 - 22:49
tab and then i see this help
22:46 - 22:51
chunk size let's say 100 000 rows at
22:49 - 22:53
once so we'll
22:51 - 22:55
split we will break our cc file into
22:53 - 22:58
chunks of this size and then we will
22:55 - 23:01
read them one chunk at a time and then
22:58 - 23:05
for each chunk we will put the chunk to
23:01 - 23:07
database and then after 16 17 chunks the
23:05 - 23:10
entire dataset will be in our database
23:07 - 23:12
so i'll call it dfi term meaning this is
23:10 - 23:14
an iterator if i do this
23:12 - 23:16
if i print it right now it shows that
23:14 - 23:19
this is not a data frame this is an
23:16 - 23:20
iterator and to get the first value i'll
23:20 - 23:25
next the next in python this is a
23:22 - 23:27
function that returns the next element
23:25 - 23:29
in an iterator so now i tried this so i
23:27 - 23:33
have this data frame and the size of
23:29 - 23:34
this data frame is exactly 100 000 like
23:33 - 23:37
we specified
23:34 - 23:39
and we need to do these two things here
23:37 - 23:42
to make sure that this is timestamps so
23:39 - 23:43
let us do that and then now we will
23:43 - 23:47
this statement and we'll create a table
23:46 - 23:50
for that what we can do is we can just
23:47 - 23:53
do data frame head and yeah the frame
23:50 - 23:55
head returns uh the first five records
23:53 - 23:58
in the first n records we can specify
23:55 - 24:01
whatever many rows we want if we specify
23:58 - 24:03
n0 it will only return the header and
24:01 - 24:05
now we want to
24:03 - 24:07
use that to insert all these data
24:05 - 24:09
meaning there are no rows only the table
24:07 - 24:11
definition to our database what will
24:09 - 24:14
happen effectively when we do this we
24:11 - 24:16
will only execute this create table but
24:14 - 24:18
we will not insert any data because we
24:16 - 24:20
are doing it in chunks so first we want
24:18 - 24:22
to create a table and then insert the
24:20 - 24:25
data chunk by chunk in this data frame
24:22 - 24:28
there is a method called to sql and what
24:25 - 24:31
this method is doing it's inserting so
24:28 - 24:34
if i do a connection this engine and
24:31 - 24:38
name name what's the name yellow taxi
24:34 - 24:40
data so now what it will do it will go
24:38 - 24:42
to our database and insert all the rows
24:40 - 24:44
so first it will create a table and then
24:42 - 24:46
insert all the rows that's why we only
24:44 - 24:48
want to have the first part and insert
24:46 - 24:51
rows separately and another thing we
24:48 - 24:53
need to specify here is what happens
24:51 - 24:57
when a table with this name already
24:53 - 24:59
exists i can see that
24:57 - 25:01
yeah there are multiple options so fail
24:59 - 25:03
replace an append so first we will
25:01 - 25:06
replace means that it will see if a
25:03 - 25:08
database with this name already exists
25:06 - 25:10
then it will drop it and it will create
25:08 - 25:12
a new one and if a database with this
25:10 - 25:14
name doesn't exist it will not do
25:12 - 25:17
anything actually i wanted to do this
25:14 - 25:18
only on the column names so i only want
25:17 - 25:21
to create a table i do not want to
25:18 - 25:24
insert anything yet so it executed
25:21 - 25:27
successfully and so now let's see what
25:24 - 25:29
kind of tables we have so now this slash
25:27 - 25:32
dt returns a list of tables
25:29 - 25:37
in this yellow taxi data and i think we
25:32 - 25:37
can also do describe a table
25:37 - 25:42
i think desk is from mysql just need to
25:42 - 25:49
yellow taxi data
25:46 - 25:52
wrong slash okay yeah so we
25:49 - 25:54
now see that this was the schema that we
25:54 - 25:59
and now we can finally put some data to
25:57 - 26:03
this table yeah we will execute the same
25:59 - 26:06
thing but without this hat 0 and then if
26:03 - 26:08
exists instead of replacing we will use
26:06 - 26:10
append i think it's apparent
26:08 - 26:12
if exists it will append insert new
26:10 - 26:14
values to the existing table so this is
26:12 - 26:16
exactly what we want to do and we want
26:14 - 26:19
to do this for each chunk of the
26:16 - 26:20
iterator you can actually time it
26:20 - 26:25
time thing will tell us how much time it
26:22 - 26:29
actually took to run this line so it
26:25 - 26:32
took 10 seconds and now we can do count
26:29 - 26:36
actually select count from
26:32 - 26:38
what was the name from yellow taxi data
26:36 - 26:40
which would tell us how many rows we
26:38 - 26:43
have in our table and we see that we
26:40 - 26:45
have 100 000 rows as expected so now
26:43 - 26:47
what we need to do is we just need to
26:45 - 26:49
insert the rest of the data frame to our
26:47 - 26:52
table i will use maybe it's not the
26:49 - 26:54
cleanest code when i do this data frame
26:52 - 26:56
next i will get the next chunk and then
26:54 - 26:59
i want to do it in a loop and then the
26:56 - 27:01
way iterators in python work is when you
26:59 - 27:04
no longer can get next it will throw an
27:01 - 27:06
exception it will say there is no data
27:04 - 27:08
anymore and our loop will exit this
27:06 - 27:11
that's why i'll have an infinite loop
27:08 - 27:13
here and then i will take next the next
27:11 - 27:15
next next chunk until there are no
27:13 - 27:19
chunks left and then for each size chunk
27:15 - 27:22
i will first do this thing and then i
27:19 - 27:24
will insert data to our database
27:22 - 27:25
to have some visibility into what's
27:24 - 27:27
going on let me add some print
27:25 - 27:29
statements i'll write insert that
27:27 - 27:31
another chunk
27:29 - 27:34
and we can also measure how much time
27:31 - 27:35
each chunk takes so i'll do from time
27:35 - 27:41
permanence benchmarking do the start
27:39 - 27:44
this returns the current timestamp in
27:41 - 27:45
seconds at the beginning and then at the
27:45 - 27:50
and then if we subtract start from the
27:48 - 27:53
end then we will know how much time we
27:50 - 27:55
took all right it took
27:53 - 27:58
that many seconds and now we will just
27:55 - 27:59
do the subtraction so this notation here
27:58 - 28:02
if you're new to python it means that it
27:59 - 28:04
will treat it as flawed and it will have
28:02 - 28:06
three decimal digits after
28:04 - 28:10
you will see right now so now we are
28:06 - 28:12
inserting data to our table i hope there
28:10 - 28:14
will be no mistakes
28:12 - 28:17
okay here the first chunk
28:14 - 28:19
we can check already that yeah so now
28:17 - 28:22
it's 200 yeah and it will take some time
28:19 - 28:25
now i will not wait now till finishes it
28:22 - 28:27
will take a bit of time because now we
28:25 - 28:30
are done with this video so we run
28:27 - 28:32
postgres using docker we run it locally
28:30 - 28:34
and then we were able to connect to
28:32 - 28:36
these database then we took a look at
28:34 - 28:37
the data that we will
28:36 - 28:40
use throughout the course and then we
28:37 - 28:42
started putting this data to our
28:40 - 28:45
postgres database so what we will do
28:42 - 28:48
next in the next lesson is we will use a
28:45 - 28:50
special tool called pg admin to be able
28:48 - 28:52
to connect to our database this is quite
28:50 - 28:56
convenient this is more convenient than
28:52 - 28:58
using this pg scli and this is exactly
28:56 - 28:59
what we will do in the next lesson
28:59 - 29:02
see you soon