In the world of data processing, there are two main approaches: batch processing and stream processing. We have already discussed stream processing in my previous video; it is the process of analyzing and processing data in real time, as it is being generated. Batch processing, on the other hand, is the process of analyzing and processing data in large chunks, or batches, after it has been collected and stored. Unlike stream processing, where data is analyzed as it is generated, batch processing collects and processes data over a period of time, typically on a schedule. So while stream processing is ideal for real-time insights, batch processing is often used for tasks that don't require immediate results, or where data can be collected over time before analysis.

Batch processing is old school, but it is still a very powerful data processing method that every software engineer should know. In this video, I'll start with the use cases for batch processing and how businesses can benefit from it, followed by its core technical aspects. By the end of this video, you should have an idea of how to work with batches effectively in your working environment. So let's get started.

In today's interconnected world,
every human action translates into an event within a system, whether it's buying clothes online or in person, browsing through social media feeds, or riding with services like Uber. Naturally, each of these events undergoes some form of processing. Some events demand swift action and are processed instantly; for example, after concluding a trip with Uber, you promptly receive the ride receipt within moments. Typically, the relationship between input and output in such cases is one to one. Alternatively, certain events are more valuable when processed together in the background. For instance, think about creating monthly reports, where all transactions from that month are combined. Here, many inputs result in one output, meaning many to one. This is known as batch processing.
Typically, we opt for batch processing for two primary reasons: business necessity and efficiency. Some outputs rely on the availability of a series of records. For instance, generating end-of-month reports, processing payroll, and managing billing and invoicing systems all necessitate a continuous stream of data; missing even a single day's transactions could lead to inaccuracies in the final output. Here is an example of common end-of-day payment transactions in banking, along with some sample fields they might contain: the transaction type, such as debit card purchases or credit card payments; the date and time of the transaction; and a reference number, which could be a unique identifier for the transaction.

Now, certain data processing tasks, such as archiving, filtering, and computation, can be resource-intensive when performed on individual records. We'll explore this further in the subsequent technical section, but to illustrate, consider a trip to the supermarket where you purchase 10 products. It's far more efficient to check out all items in one go rather than making 10 separate trips back and forth. This same efficiency principle underpins batch processing, which is widely employed across various domains.

The majority of batch processing jobs follow a repetitive schedule, whether daily or monthly. Developers leverage scheduling mechanisms to automate batch jobs, reducing the need for manual intervention and enhancing overall efficiency.
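As a concrete illustration of such a scheduling mechanism, a nightly batch job could be wired up with a cron entry like the one below; the script path, log path, and run time here are hypothetical, not taken from the video.

```shell
# Run the end-of-day batch job every day at 1:00 a.m.
# (fields: minute hour day-of-month month day-of-week command)
0 1 * * * /opt/jobs/run_eod_batch.sh >> /var/log/eod_batch.log 2>&1
```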
At a high level, batch processing involves three steps: data collection, data processing, and data storage or output. In batch processing, data is collected over a period of time until a sufficient amount is accumulated for processing. This data can come from various sources, such as databases, logs, files, or even streaming sources where the data is collected and stored for analysis. Once a batch of data is collected, it is processed in bulk. This involves applying various operations, like filtering, sorting, aggregating, and analyzing the data according to predefined criteria or business logic. After processing, the results are typically stored in a data warehouse, database, or other storage system for further analysis, reporting, or decision-making. Batch processing jobs may also generate reports or visualizations that provide insights into the processed data.
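To make those three steps concrete, here is a minimal sketch in Python; the record fields, the filtering rule, and the output file name are my own illustrative assumptions, not part of the video.

```python
import json

# Step 1: data collection — records accumulated over a period of time.
# In a real system these would come from databases, logs, or files.
collected = [
    {"method": "visa", "amount": 40.0},
    {"method": "mastercard", "amount": 15.5},
    {"method": "visa", "amount": 10.25},
]

def process_batch(records):
    """Step 2: bulk processing — filter, then aggregate per payment method."""
    totals = {}
    for r in records:
        if r["amount"] <= 0:  # predefined business rule: drop invalid amounts
            continue
        totals[r["method"]] = totals.get(r["method"], 0.0) + r["amount"]
    return totals

def store_output(result, path):
    """Step 3: storage — persist the result for reporting or further analysis."""
    with open(path, "w") as f:
        json.dump(result, f)

report = process_batch(collected)
store_output(report, "daily_report.json")
print(report)  # {'visa': 50.25, 'mastercard': 15.5}
```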
Now, let's take a closer look at the technical aspects and how to design a batch processing system.
A batch is a group of records with the same attributes, and each record represents a fact, like a bank transaction. Imagine the following example: you implement logic to sum the quantity of all the products in a batch. Rows with non-numeric values will certainly break the code. A common strategy involves applying a schema check to the batch. You can store the data in a relational database with data types defined, or you can use a schema file, such as Apache Avro, Protobuf, or XSD (XML Schema Definition), to examine the data. The more rigid the schema, the more resilient the code becomes, making it less prone to breaking.
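Here is a minimal sketch of such a schema check in plain Python; the field names are hypothetical, and in practice a schema file like Avro or Protobuf would enforce the types for you.

```python
# Hypothetical schema: every row must have a product name (str)
# and a numeric quantity (int or float).
SCHEMA = {"product": str, "quantity": (int, float)}

def validate(row):
    """Return True only if the row matches the expected field types."""
    return all(
        field in row and isinstance(row[field], expected)
        for field, expected in SCHEMA.items()
    )

def sum_quantities(batch):
    """Reject the whole batch if any row violates the schema, then sum."""
    bad = [row for row in batch if not validate(row)]
    if bad:
        raise ValueError(f"schema check failed for {len(bad)} row(s)")
    return sum(row["quantity"] for row in batch)

batch = [{"product": "apple", "quantity": 3}, {"product": "banana", "quantity": 5}]
print(sum_quantities(batch))  # 8
```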
Let's look at an example of a large bank processing millions of transactions daily. The bank aims to generate hourly reports to assess the total transactions within the hour for each payment method (for example, Mastercard, Visa, etc.). Now the question is: how would you design a batch? A batch per day, per hour, per minute, or maybe per payment method? We can certainly create a batch per hour, for example for all records from Jan 1st 2024 10 a.m. to Jan 1st 2024 11 a.m. So now, if you need to calculate the sum for each payment method, your query might look like this.
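The on-screen query is roughly of this shape; the table and column names below are my assumptions, sketched here against an in-memory SQLite database so the GROUP BY behavior can be seen end to end.

```python
import sqlite3

# In-memory stand-in for the bank's transaction store; the table and
# column names (transactions, payment_method, amount, ts) are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (payment_method TEXT, amount REAL, ts TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("mastercard", 20.0, "2024-01-01 10:15:00"),
        ("visa",       35.0, "2024-01-01 10:30:00"),
        ("mastercard",  5.0, "2024-01-01 10:45:00"),
        ("visa",       10.0, "2024-01-01 11:10:00"),  # outside the 10-11 a.m. batch
    ],
)

# Sum per payment method for the one-hour batch, using GROUP BY.
rows = conn.execute(
    """
    SELECT payment_method, SUM(amount)
    FROM transactions
    WHERE ts >= '2024-01-01 10:00:00' AND ts < '2024-01-01 11:00:00'
    GROUP BY payment_method
    """
).fetchall()
print(dict(rows))  # {'mastercard': 25.0, 'visa': 35.0}
```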
Notice that here we utilize the GROUP BY clause to compute the sum for each payment method separately. But what if we calculate the sums for all payment methods simultaneously? In fact, we'll create one batch for each payment method per hour, compute the sum for each batch in parallel, and combine the results at the end. This way we create smaller batches, but we potentially improve the performance. What we are attempting to accomplish here is to mimic a distributed data processing system, by splitting a large batch into smaller batches and processing them concurrently to achieve the best performance.
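A sketch of that split-and-combine idea, using Python's standard thread pool; the per-method batches here are tiny in-memory lists standing in for real partitions on real workers.

```python
from concurrent.futures import ThreadPoolExecutor

# One small batch per payment method for the same hour (hypothetical data).
batches = {
    "mastercard": [20.0, 5.0],
    "visa": [35.0],
    "amex": [12.5, 7.5],
}

def sum_batch(item):
    """Process one per-method batch independently of the others."""
    method, amounts = item
    return method, sum(amounts)

# Process the smaller batches concurrently, then combine the partial results.
with ThreadPoolExecutor(max_workers=3) as pool:
    combined = dict(pool.map(sum_batch, batches.items()))

print(combined)  # {'mastercard': 25.0, 'visa': 35.0, 'amex': 20.0}
```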
But what if there are more than 10 million records in one batch? With a powerful machine, we may manage; without such capacity, the calculation will become inefficient. So choosing a batch size involves a lot of trade-offs. A large batch simplifies operations, as we only need to deal with a single file and fewer I/O operations; however, the performance can be bottlenecked by a single resource. At the same time, we don't want to create too many small batches, because merging them will become a heavy task, nullifying the time saved earlier. Using high-cardinality columns, like IDs or timestamps, for batch splitting is a recipe for creating many small and ineffective batches. So always aim for low-cardinality columns, like date, category, or method, instead; they have fewer unique values and will lead to larger, more efficient batches that get your work done quicker. Additionally, the quantity of small batches should ideally align with the available resources. For instance, if there are 100 small batches but only 10 servers available, a maximum of 10 batches will be processed concurrently, rather than the entire set of 100. So it's crucial to identify the bottleneck. To enhance the situation, we can either augment the number of resources or merge small batches into medium-sized ones, effectively leveraging the available resources to their maximum capacity.
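One way to picture that merge step: group many small batches into exactly as many medium-sized ones as there are servers. The numbers below are illustrative, matching the 100-batches, 10-servers example.

```python
# 100 small batches (represented here just by batch IDs), but only 10 servers.
small_batches = list(range(100))
servers = 10

def merge_to_fit(batches, slots):
    """Merge small batches into at most `slots` medium-sized ones,
    so every available server gets one batch to process."""
    size = -(-len(batches) // slots)  # ceiling division
    return [batches[i:i + size] for i in range(0, len(batches), size)]

medium = merge_to_fit(small_batches, servers)
print(len(medium), len(medium[0]))  # 10 10
```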
Now, we have touched on the time-saving advantages of batch processing, but where exactly do they come from? Let's revisit the supermarket analogy: which task consumes the most time? Is it scanning the products or calculating the amount? Not quite: it's the journey to retrieve a product from the shelf and return to the checkout counter, and the farther the shelf, the longer the process takes. This task of fetching products corresponds to an I/O operation in a data processing job. When the data resides in memory, the distance is shorter; however, if the data is stored on a remote server, the distance becomes more significant. The concept is clear: batch processing significantly improves job performance by reducing the number of I/O operations required.
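A tiny illustration of that saving, counting simulated round trips rather than timing a real network; the `RemoteStore` class is a made-up stand-in for any remote system where each call is expensive.

```python
# Simulate a remote store where every call is one expensive round trip (I/O).
class RemoteStore:
    def __init__(self):
        self.round_trips = 0
        self.rows = []

    def write(self, batch):
        """One call = one round trip, regardless of how many rows it carries."""
        self.round_trips += 1
        self.rows.extend(batch)

records = list(range(1000))

# Per-record processing: 1000 round trips.
one_by_one = RemoteStore()
for r in records:
    one_by_one.write([r])

# Batch processing: the same 1000 rows in 10 round trips of 100 rows each.
batched = RemoteStore()
for i in range(0, len(records), 100):
    batched.write(records[i:i + 100])

print(one_by_one.round_trips, batched.round_trips)  # 1000 10
```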
Now, every batch process follows a life cycle, and understanding this life cycle is crucial for efficient processing. Let's revisit our example of the end-of-day payment transactions of a bank. While the list of transactions forms the core content of the batch, supplementary details, such as the batch start and end times from a business perspective and the batch ingestion time from a technical perspective, are equally essential. These additional pieces of information are particularly valuable during reprocessing: they aid developers in comprehending the batch status and detecting any anomalies. For instance, if a batch containing payments made between January 1st 2024 10 a.m. and January 1st 2024 11 a.m. is only ingested on January 2nd 2024, it indicates a potential issue with the ingestion layer. Tracking the ingestion time also helps in preventing duplicate entries. Furthermore, associating a code version with the batch enables developers to trace and troubleshoot issues linked to specific code versions. These metadata elements can be incorporated into a batch in various ways. One approach is to include a metadata column within the batch itself; while this simplifies information retrieval, it adds overhead to the storage. For example, Apache Kafka, although primarily a real-time processing engine, operates with batches, and each message contains metadata in its header. Alternatively, the metadata can be stored separately and linked to the batch, enhancing the efficiency of metadata querying.
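A sketch of such separately stored batch metadata, with a simple late-ingestion check; the field names and the one-hour lateness threshold are my own assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class BatchMetadata:
    batch_id: str
    window_start: datetime   # batch start time, business perspective
    window_end: datetime     # batch end time, business perspective
    ingested_at: datetime    # ingestion time, technical perspective
    code_version: str        # code version that produced the batch

    def is_late(self, threshold=timedelta(hours=1)):
        """Flag batches ingested long after their business window closed."""
        return self.ingested_at - self.window_end > threshold

meta = BatchMetadata(
    batch_id="payments-2024-01-01-10",
    window_start=datetime(2024, 1, 1, 10),
    window_end=datetime(2024, 1, 1, 11),
    ingested_at=datetime(2024, 1, 2, 9),   # only ingested the next day
    code_version="v1.4.2",
)
print(meta.is_late())  # True — points to a potential ingestion-layer issue
```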
All right, on to the final point. Another prevalent use of batch processing is to handle a collection of CDC, or change data capture, events. For instance, let's consider a supermarket aiming to monitor the daily inventory status of all products, with the data source being a stream of restock and sale events. So how do we tackle this? Now, a straightforward solution might involve aggregating all events since the shop's inception, every day. I know, it doesn't sound quite right. The challenge here lies in the fact that the data source consists of deltas, and aggregating only today's events won't yield the accurate final inventory, since it merely represents the total delta for the day. A more effective approach is to generate daily inventory snapshots. Let's assume the shop opens on Jan 1st 2024 and initially restocks 100 apples and 200 bananas. This initial stock serves as the inventory for Jan 1st 2024. On the subsequent day, instead of aggregating the events from both days, we merge the inventory from day one with the delta events from day two. Similarly, on day three, we obtain the result by combining the inventory from day two with the delta events from day three. This method significantly boosts efficiency by producing a daily inventory snapshot, thereby reducing the data size for each job. We can further minimize the data size by generating hourly snapshots, and so on. Additionally, the daily snapshot table can aid the business in making strategic decisions regarding inventory management.
Since batch processing deals with large data sets, it often requires distributed computing frameworks like Apache Hadoop or Apache Spark to efficiently process the data in parallel across multiple nodes. Batch processing is a good choice for many tasks, such as generating reports: batch processing is often used to generate reports on historical data. For example, a company might use batch processing to generate monthly sales reports, or a bank might use batch processing to process daily transactions at night. Batch processing can also be used to train machine learning models on large data sets; in fact, even a streaming service might use batch processing to train recommendation models. This would involve collecting a large data set of user viewing history and then processing it in a single batch job to train the model.

While batch processing offers advantages in handling large volumes of data efficiently, there are scenarios where it may not be the most suitable approach. When real-time data processing is crucial, such as in financial trading, online gaming, or systems that involve continuous streaming data requiring immediate responses, batch processing is not the best method, and I highly recommend diving deeper into real-time processing methods, a topic that I have extensively covered in two of my previous videos. In my basics-of-streaming video, I provide a comprehensive overview with example usage in AWS and with microservices, while in my dedicated video-streaming video, I delve into the intricacies of high-quality real-time streaming and protocols. So if you are keen on mastering the art of seamless streaming experiences, be sure to check out those resources.

Now, while batch processing may not provide real-time insights like stream processing, it is well suited for tasks that involve historical analysis, periodic reporting, and batch-oriented data processing. By understanding the principles and practical applications of batch processing, organizations can harness its potential to gain valuable insights and drive informed decision-making.