
Hadoop Ecosystem

Partitioning - MapReduce


As always, looking at the diagram makes this much easier to understand.

Just a personal impression, but one thing I keep feeling as I study IT topics is that you have to read three or four books on the same subject before you either find a proper explanation, or can compare and combine them well enough to pull an overly abstract, up-in-the-air concept down to solid ground.



The partitioning pattern moves the records into categories (i.e., shards, partitions, or bins) but it doesn’t really care about the order of records.


Intent

The intent is to take similar records in a data set and partition them into distinct, smaller data sets.


Motivation

If you want to look at a particular set of data—such as postings made on a particular date—the data items are normally spread out across the entire data set. So looking at just one of these subsets requires an entire scan of all of the data. Partitioning means breaking a large set of data into smaller subsets, which can be chosen by some criterion relevant to your analysis. To improve performance, you can run a job that takes the data set and breaks the partitions out into separate files. Then, when a particular subset of the data is to be analyzed, the job needs only to look at that data.
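The core move described above can be sketched in a few lines. This is a minimal, hypothetical simulation (not the book's sample code): a `key_fn` plays the role of the partitioner, routing each record to the bin it belongs to, so a later analysis touches only one bin instead of scanning everything.

```python
from collections import defaultdict

def partition(records, key_fn):
    """Group records into smaller data sets by some criterion.

    key_fn acts as the partitioner: it maps each record to its bin.
    All names here are illustrative assumptions.
    """
    bins = defaultdict(list)
    for record in records:
        bins[key_fn(record)].append(record)
    return dict(bins)

# Example: partition forum postings by the date they were made.
postings = [
    {"id": 1, "date": "2012-07-01", "text": "hello"},
    {"id": 2, "date": "2012-07-02", "text": "world"},
    {"id": 3, "date": "2012-07-01", "text": "again"},
]
by_date = partition(postings, key_fn=lambda p: p["date"])
# Analyzing one date now reads only that bin, not the whole data set.
```

In a real MapReduce job the same idea appears as a custom partitioner that sends each record to the reducer responsible for its bin, with each reducer writing one output file.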



The translated edition came out a few days ago. I've already placed my order on yes24.

Big data isn't making me any money; instead it's making me blow a pile of money buying books in yet another new field. Sigh ㅠ.ㅠ

I hope big data isn't secretly a business that makes its money by making people buy books . . .


[ source : MapReduce Design Patterns - Figure 4-2. The structure of the partitioning pattern ]




Known uses

- Partition pruning by continuous value

You have some sort of continuous variable, such as a date or numerical value, and at any one time you care about only a certain subset of that data. Partitioning the data into bins will allow your jobs to load only pertinent data.
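Binning a continuous value comes down to a single arithmetic step. A tiny sketch, where `bin_width` is an assumed tuning knob rather than anything the book prescribes:

```python
def bin_for(value, bin_width):
    """Map a continuous value to a discrete bin index.

    A job interested only in values in [100, 200) with bin_width=10
    can load just bins 10..19 and skip every other partition.
    """
    return int(value // bin_width)
```

Dates work the same way: truncating a timestamp to its month (or day) is just a coarser `bin_for`.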


- Partition pruning by category

Instead of having some sort of continuous variable, the records fit into one of several clearly defined categories, such as country, phone area code, or language.


- Sharding

A system in your architecture has divisions of data—such as different disks—and you need to partition the data into these existing shards.
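When the shard count is fixed by the architecture, routing usually means hashing the record key modulo the number of shards. A hypothetical sketch (`NUM_SHARDS` and the key format are assumptions, not from the book):

```python
import hashlib

NUM_SHARDS = 4  # matches the number of existing divisions (e.g., disks)

def shard_for(record_key: str) -> int:
    """Deterministically route a record to one of the existing shards.

    A stable digest keeps the same key on the same shard across runs;
    Python's built-in hash() is randomized per process, so it is avoided.
    """
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The important property is determinism: every run of the job must send the same key to the same shard, or the existing divisions stop being authoritative for their data.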


Resemblances

SQL

Some SQL databases allow for automatically partitioned tables. This allows “partition pruning,” which allows the database to exclude large portions of irrelevant data before running the SQL.
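The file-based equivalent of SQL partition pruning is selecting which partition directories to read before touching any data. A minimal, hypothetical sketch using Hive-style `key=value` directory names (an assumed layout):

```python
def prune_partitions(partition_names, wanted):
    """Exclude irrelevant partitions before any data is read,
    analogous to a database pruning a partitioned table."""
    return [name for name in partition_names if name in wanted]

dirs = ["date=2012-06", "date=2012-07", "date=2012-08"]
# Only the July partition is scanned; the others are skipped entirely.
pruned = prune_partitions(dirs, wanted={"date=2012-07"})
```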



Performance analysis

The main performance concern with this pattern is that the resulting partitions will likely not have a similar number of records. Perhaps one partition turns out to hold 50% of the data of a very large data set. If implemented naively, all of this data will get sent to one reducer and will slow down processing significantly.


It’s pretty easy to get around this, though. Split very large partitions into several smaller partitions, even if just randomly. Assign multiple reducers to one partition and then randomly assign records into each to spread it out a bit better.
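The random splitting described above is often implemented by "salting" the partition key of the oversized partition. A sketch under assumed names (`HOT_PARTITION` and `SUB_PARTITIONS` are illustrative, not from the book):

```python
import random

HOT_PARTITION = "2012-07"   # the skewed partition, e.g. the most recent month
SUB_PARTITIONS = 4          # how many reducers to spread it across

def partition_key(record_month: str) -> str:
    """Append a random salt to an oversized partition's key so its
    records spread across several reducers instead of overloading one."""
    if record_month == HOT_PARTITION:
        return f"{record_month}-{random.randrange(SUB_PARTITIONS)}"
    return record_month
```

Downstream, `2012-07-0` through `2012-07-3` are simply read together as one logical partition.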


For example, consider the “last access date” field for a user in StackOverflow. If we partitioned on this property equally over months, the most recent month will very likely be much larger than any other month. To prevent skew, it may make sense to partition the most recent month into days, or perhaps just randomly.


This method doesn’t affect processing over partitions, since you know that this set of files represents one larger partition. Just include all of them as input.