In the world of data processing, there are two main approaches: batch processing and stream processing. We have already discussed stream processing in my previous video; it is the process of analyzing and processing data in real time, as it is being generated. Batch processing, on the other hand, is the process of analyzing and processing data in large chunks, or batches, after it has been collected and stored. Unlike stream processing, where data is analyzed as it is generated, batch processing collects and processes data over a period of time, typically in scheduled intervals. So while stream processing is ideal for real-time insights, batch processing is often used for tasks that don't require immediate results, or where data can be collected over time before analysis. Batch processing is old school, but it is still a very powerful data processing method that every software engineer should know. In this video, I'll start with the use cases for batch processing and how businesses can benefit from it, followed by its core technical aspects. By the end of this video, you should have an idea of how to work with batches effectively in your working environment. So let's get started.

In today's interconnected world,

every human action translates into an event within a system, whether it's buying clothes online or in person, browsing through social media feeds, or riding with services like Uber. Naturally, each of these events undergoes some form of processing. Some events demand swift action and are processed instantly; for example, after concluding a trip with Uber, you promptly receive the ride receipt within moments. Typically, the relationship between input and output in such cases is one to one. Alternatively, certain events are more valuable when processed together in the background. For instance, think about creating monthly reports where all transactions from that month are combined: here, many inputs result in one output, meaning many to one. This is known as batch processing.

Typically, we opt for batch processing for two primary reasons: business necessity and efficiency. Some outputs rely on the availability of a series of records. For instance, generating end-of-month reports, processing payroll, and managing billing and invoicing systems all necessitate a continuous stream of data; missing even a single day's transactions could lead to inaccuracies in the final output. Here is an example of common end-of-day payment transactions in banking, along with some sample fields they might contain: the transaction type, such as debit card purchases or credit card payments; the date and time of the transaction; and a reference number, which could be a unique identifier for the transaction. Now, certain data processing tasks, such as archiving, filtering, and computation, can be resource-intensive when performed on individual records.

We'll explore this further in the subsequent technical section, but to illustrate here, consider a trip to the supermarket where you purchase ten products. It's far more efficient to check out all the items in one go rather than making ten separate trips back and forth. This same efficiency principle underpins batch processing, which is widely employed across various domains.
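The same principle shows up directly in code: writing records one at a time pays the round-trip cost once per record, while a batched write pays it once per batch. A minimal sketch using SQLite; the `products` table and its columns are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, quantity INTEGER)")

items = [("apple", 2), ("banana", 5), ("milk", 1)]

# Row-at-a-time: one statement and one commit (one "trip") per record.
for name, qty in items:
    conn.execute("INSERT INTO products VALUES (?, ?)", (name, qty))
    conn.commit()  # each commit is a separate round trip

conn.execute("DELETE FROM products")

# Batched: all records in one call and a single commit.
conn.executemany("INSERT INTO products VALUES (?, ?)", items)
conn.commit()

total = conn.execute("SELECT SUM(quantity) FROM products").fetchone()[0]
print(total)  # 8
```

Both paths store the same three rows; the batched version simply does it with far fewer round trips, which is where batch jobs recover their time.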

The majority of batch processing jobs follow a repetitive schedule, whether executed hourly, daily, or monthly. Developers leverage scheduling mechanisms to automate batch jobs, reducing the need for manual intervention and enhancing overall efficiency. At a high level, batch

processing involves three steps: data collection, data processing, and data storage or output. In batch processing, data is collected over a period of time until a sufficient amount is accumulated for processing. This data can come from various sources, such as database logs, files, or even streaming sources, where the data is collected and stored for later analysis. Once a batch of data is collected, it is processed in bulk. This involves applying various operations, like filtering, sorting, aggregating, and analyzing the data according to predefined criteria or business logic.
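The three steps can be sketched in a few lines. The field names and the "drop refunds" business rule here are illustrative assumptions, not from the video:

```python
from collections import defaultdict

# 1. Collection: records accumulated over a period (here, an in-memory list).
collected = [
    {"method": "visa", "amount": 40.0},
    {"method": "mastercard", "amount": 25.0},
    {"method": "visa", "amount": -5.0},   # e.g. a refund
    {"method": "visa", "amount": 35.0},
]

# 2. Processing in bulk: filter by a business rule, then aggregate per method.
valid = [r for r in collected if r["amount"] > 0]
totals = defaultdict(float)
for r in valid:
    totals[r["method"]] += r["amount"]

# 3. Storage/output: persist or report the result.
report = dict(sorted(totals.items()))
print(report)  # {'mastercard': 25.0, 'visa': 75.0}
```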

After processing, the results are typically stored in a data warehouse, database, or other storage system for further analysis, reporting, or decision-making. Batch processing jobs may also generate reports or visualizations that provide insights into the processed data.

Now let's take a closer look at the technical aspects and how to design a batch processing system. A batch is a group of records

with the same attributes, and each record can be a fact, like a bank transaction. Imagine the following example: you implement logic to sum the quantity of all the products in a batch. A row with non-numeric values will certainly break the code. A common strategy involves applying a schema check to the batch. You can store the data in a relational database with data types defined, or you can use a schema file, such as Apache Avro, Protobuf, or XSD (which stands for XML Schema Definition), to examine the data. The more rigid the schema, the more resilient the code becomes, making it less prone to breaking.
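A schema check can be as simple as validating each row's fields and types before the summing logic runs. This is only a hand-rolled sketch with hypothetical field names; a real pipeline would lean on Avro, Protobuf, or XSD as mentioned:

```python
# Hypothetical schema: each row must have exactly these fields, with these types.
SCHEMA = {"product": str, "quantity": int}

def validate(row: dict) -> bool:
    """Return True only if the row matches the schema exactly."""
    return (set(row) == set(SCHEMA)
            and all(isinstance(row[k], t) for k, t in SCHEMA.items()))

batch = [
    {"product": "apple", "quantity": 3},
    {"product": "banana", "quantity": "three"},  # non-numeric: would break a sum
]

# Reject malformed rows up front instead of letting them crash the job.
clean = [r for r in batch if validate(r)]
total = sum(r["quantity"] for r in clean)
print(total)  # 3
```

In practice you would also route the rejected rows somewhere (a dead-letter file, an alert) rather than silently dropping them.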

Let's look at an example of a large bank processing millions of transactions daily. The bank aims to generate hourly reports to assess the total transactions within the hour for each payment method, for example Mastercard, Visa, and so on. Now the question is: how would you design a batch? A batch per day, per hour, per minute, or maybe per payment method? We can certainly create a batch per hour, for example for all records from Jan 1st 2024, 10 a.m. to Jan 1st 2024, 11 a.m. So now, if you need to calculate the sum for each payment method, your query might look like this.
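The query shown on screen in the video isn't in the captions, so here is a hedged reconstruction with assumed table and column names, runnable against SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    payment_method TEXT, amount REAL, created_at TEXT)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("mastercard", 20.0, "2024-01-01 10:05:00"),
     ("visa",       15.0, "2024-01-01 10:20:00"),
     ("visa",       30.0, "2024-01-01 10:45:00")])

# Sum per payment method for the 10:00-11:00 batch window.
rows = conn.execute("""
    SELECT payment_method, SUM(amount)
    FROM transactions
    WHERE created_at >= '2024-01-01 10:00:00'
      AND created_at <  '2024-01-01 11:00:00'
    GROUP BY payment_method
    ORDER BY payment_method
""").fetchall()
print(rows)  # [('mastercard', 20.0), ('visa', 45.0)]
```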

Notice that here we utilize the GROUP BY clause to compute the sum for each payment method separately. What if we calculated the sum for all payment methods simultaneously? In that case, we'll create one batch for each payment method per hour, compute the sum for each batch in parallel, and combine the results at the end. This way we create smaller batches, but we will potentially improve the performance. What we are attempting to accomplish here is to mimic a distributed data processing system by splitting a large batch into smaller batches and processing them concurrently to achieve the best performance. But what if there are more than 10 million records in one batch?
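The split-and-merge idea just described can be sketched with a thread pool standing in for a cluster of workers; the toy amounts are made up, and CPython threads won't truly parallelize CPU-bound sums, so the structure rather than the speed-up is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# One small batch per payment method for the hour.
batches = {
    "visa":       [15.0, 30.0],
    "mastercard": [20.0],
    "amex":       [10.0, 5.0],
}

def process_batch(item):
    method, amounts = item
    return method, sum(amounts)  # each batch is summed independently

# Process the per-method batches concurrently, then combine the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    combined = dict(pool.map(process_batch, batches.items()))

print(combined)  # {'visa': 45.0, 'mastercard': 20.0, 'amex': 15.0}
```

In a real system each batch would go to a separate worker or node, and the final merge step would combine partial results exactly as the `dict(...)` call does here.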

With a powerful machine we may manage; without such capacity, the calculation will become inefficient. So choosing a batch size involves a lot of trade-offs. A large batch simplifies operations, as we only need to deal with a single file and fewer I/O operations; however, the performance can be bottlenecked by a single computation resource. At the same time, we don't want to create too many small batches, because merging them will become a heavy task,

nullifying the time saved earlier. Using high-cardinality columns, like IDs or timestamps, for batch splitting is a recipe for creating many small, ineffective batches. So always aim for low-cardinality columns, like date, category, or method, instead; they have fewer unique values and will lead to larger, more efficient batches that get your work done quicker. Additionally, the

quantity of small batches should ideally align with the available resources. For instance, if there are 100 small batches but only 10 servers available, a maximum of 10 batches will be processed concurrently, rather than the entire set of 100. So it's crucial to identify the bottleneck. To improve the situation, we can either augment the number of resources or merge small batches into medium-sized ones, effectively leveraging the available resources to their maximum capacity.
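Merging small batches down to the worker count, as just described, can be sketched as a toy round-robin; a real scheduler would also balance by batch size:

```python
def rebatch(batches, num_workers):
    """Merge many small batches into at most `num_workers` medium-sized ones,
    so every available worker gets roughly one batch."""
    merged = [[] for _ in range(num_workers)]
    for i, batch in enumerate(batches):
        merged[i % num_workers].extend(batch)  # round-robin distribution
    return [b for b in merged if b]

# 100 small batches but only 10 servers: merge down to 10 medium batches.
small = [[n] for n in range(100)]
medium = rebatch(small, 10)
print(len(medium), len(medium[0]))  # 10 10
```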

Now, we have touched on the time-saving advantages of batch processing, but where exactly do they come from? Let's revisit the supermarket analogy: which task consumes the most time? Is it scanning the products, or calculating the amount? Not quite; it's the journey to retrieve a product from the shelf and return to the checkout counter. The farther the shelf, the longer the process takes. This task of fetching products corresponds to an I/O operation in a data processing job. When data resides in memory, the distance is shorter; however, if the data is stored on a remote server, the distance becomes more significant. The concept is clear: batch processing significantly improves job performance by reducing the number of I/O operations

required.

Now, every batch process follows a life cycle, and understanding this life cycle is crucial for efficient processing. Let's revisit our example of the end-of-day payment transactions of a bank. While the list of transactions forms the core content of the batch, supplementary details, such as the batch start and end times from a business perspective and the batch ingestion time from a technical perspective, are equally essential. These additional pieces of information are particularly valuable during reprocessing: they aid developers in comprehending the batch status and detecting any anomalies. For instance, if a batch containing payments made between January 1st 2024, 10 a.m. and January 1st 2024, 11 a.m. is only ingested on January 2nd 2024, it indicates a potential issue with the ingestion layer. Tracking the ingestion time also helps in preventing duplicate entries. Furthermore, associating a code version with the batch processing enables developers to trace and troubleshoot issues linked to specific code versions. These metadata elements

can be incorporated into a batch in various ways. One approach is to include a metadata column within the batch itself; while this simplifies information retrieval, it adds overhead to the storage. For example, Apache Kafka, although primarily a real-time processing engine, operates with batches, with each message containing metadata in its header. Alternatively, metadata can be stored separately and linked to the batch, enhancing the efficiency of metadata querying.
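As a sketch of the "metadata stored separately and linked to the batch" option, here is what such a record might carry; every field name and value below is an illustrative assumption:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class BatchMetadata:
    batch_id: str
    window_start: str     # business time: start of the period the batch covers
    window_end: str       # business time: end of that period
    ingested_at: str      # technical time: when the batch entered the system
    code_version: str     # build that processed it, for troubleshooting

meta = BatchMetadata(
    batch_id="payments-2024-01-01-10",
    window_start="2024-01-01T10:00:00Z",
    window_end="2024-01-01T11:00:00Z",
    ingested_at=datetime.now(timezone.utc).isoformat(),
    code_version="v1.4.2",
)

# Stored in its own table and joined on batch_id, this record stays cheap to
# query without bloating every row of the batch itself.
print(asdict(meta)["batch_id"])  # payments-2024-01-01-10
```

Comparing `window_end` with `ingested_at` is exactly the late-ingestion check described above, and `code_version` supports the troubleshooting use case.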

All right, on to the final point. Another prevalent use of batch processing is to handle a collection of CDC, or change data capture, events. For instance, let's consider a supermarket aiming to monitor the daily inventory status of all products, with the data source being a stream of restock and sell events. So how do we tackle this?

Now, a straightforward solution might involve aggregating all events since the shop's inception, every day. I know, it doesn't sound quite right. The challenge here lies in the fact that the data source consists of deltas, and aggregating only today's events won't yield the accurate final inventory, since it merely represents the total delta for the day. A more effective approach is to generate daily inventory snapshots. Let's assume the shop opens on Jan 1st 2024 and initially restocks 100 apples and 200 bananas. This initial stock serves as the inventory for Jan 1st 2024. On the subsequent day, instead of aggregating events from two days, we merge the inventory from day one with the delta events from day two. Similarly, on day three, we obtain the result by combining the inventory from day two with the delta events from day three.
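The snapshot-plus-delta merge can be sketched with a counter; the day-two delta numbers are made up for illustration:

```python
from collections import Counter

# Day 1: the initial restock becomes the first snapshot.
snapshot = Counter({"apple": 100, "banana": 200})

# Day 2: only that day's delta events are applied to yesterday's snapshot
# (restocks positive, sales negative).
day2_deltas = [("apple", -30), ("banana", -50), ("apple", +20)]

for product, change in day2_deltas:
    snapshot[product] += change

print(snapshot["apple"], snapshot["banana"])  # 90 150
```

The day-two result then becomes the base for day three, so each daily job only ever touches one snapshot plus one day of deltas.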

This method significantly boosts efficiency by producing a daily inventory snapshot, thereby reducing the data size for each job. We can further minimize the data size by generating hourly snapshots, and so on. Additionally, the daily snapshot table can aid the business in making strategic decisions regarding inventory management. Since batch processing deals with large data sets, it often requires distributed computing frameworks, like Apache Hadoop or Apache Spark, to efficiently process the data in parallel across multiple nodes.

Batch processing is a good choice for many tasks, such as generating reports: it is often used to generate reports on historical data. For example, a company might use batch processing to generate monthly sales reports, or a bank might use batch processing to process daily transactions at night. Batch processing can also be used to train machine learning models on large data sets. In fact, even a streaming service might use batch processing to train recommendation models; this would involve collecting a large data set of user viewing history and then processing it in a single batch job to train the model. While batch processing offers

advantages in handling large volumes of data efficiently, there are scenarios where it may not be the most suitable approach. When real-time data processing is crucial, such as in financial trading, online gaming, or systems that involve continuous streaming data requiring an immediate response, batch processing is not the best method, and I highly recommend diving deeper into real-time processing methods, a topic that I have covered extensively in two of my previous videos. In my basics-of-streaming video I provide a comprehensive overview, with example usage in AWS and with microservices, while in my dedicated streaming video I delve into the intricacies of high-quality real-time streaming and protocols. So if you are keen on mastering the art of seamless streaming experiences, be sure to check out those resources.

Now, while batch processing may not provide real-time insights like stream processing, it is well suited for tasks that involve historical analysis, periodic reporting, and batch-oriented data processing. By understanding the principles and practical applications of batch processing, organizations can harness its potential to gain valuable insights and drive informed decision-making.

Unveiling the Power: Batch Processing in Data Science

Discover the world of batch processing, a timeless yet powerful data processing method every software engineer should master. This method involves analyzing and processing data in large chunks, or batches, after collection, making it ideal for tasks that don't require immediate results. In this article, we will explore the benefits, technical aspects, and design considerations of batch processing, shedding light on how businesses can leverage this method effectively to gain valuable insights.

Understanding Batch Processing: Benefits and Use Cases

By processing data in scheduled chunks over time, batch processing ensures data accuracy and consistency, making it indispensable for tasks like end-of-month reports, payroll management, and billing systems. Ensuring a continuous stream of data, batch processing prevents inaccuracies in final output and is vital for business necessity and efficiency. For instance, in banking, analyzing end-of-day payment transactions requires continuous and accurate data collection, highlighting the importance of batch processing for reliable results.

Technical Insights: Designing a Batch Processing System

Designing a robust batch processing system involves structuring data into batches with consistent attributes and effective schema checks to ensure data integrity. Considerations like batch size, schema rigidity, and resource optimization play a crucial role in enhancing performance. By creating smaller, efficiently processed batches and leveraging available resources, you can maximize processing efficiency and streamline data analysis effectively.

Batch Lifecycle and Metadata Integration

Understanding the lifecycle of batch processing, from collection to processing and storage, is essential for efficient data processing. Incorporating metadata elements into batch processing enhances error detection, processing efficiency, and traceability, enabling developers to troubleshoot issues effectively. By associating metadata with each batch, developers can ensure data integrity, prevent duplicate entries, and improve overall batch processing performance.

Batch Processing for Historical Analysis and Strategic Decision-Making

Batch processing offers significant advantages in handling large data sets for historical analysis, report generation, and training machine learning models. Organizations can leverage batch processing to generate monthly reports, process large volumes of data efficiently, and make informed decisions based on historical data insights. Despite its strengths, batch processing may not be suitable for real-time data processing scenarios, emphasizing the importance of exploring real-time processing methods for immediate response requirements.

As you delve into the world of batch processing, mastering its principles and practical applications can empower organizations to unlock valuable insights, drive informed decision-making, and optimize data processing efficiency. Embrace the versatility of batch processing and elevate your data processing capabilities to new heights!

In the realm of data science, batch processing stands as a stalwart method, offering a timeless approach to data analysis and insights. Dive into the realm of batch processing and harness its potential for transformative data processing experiences!