As always, the diagram makes this much easier to understand.
This is just a personal impression, but as I study IT topics I keep finding that you have to read three or four books on the same subject before you either come across a properly written explanation or, by comparing and combining them, manage to pull an overly abstract, free-floating concept down to solid ground where it can actually be understood.
The partitioning pattern moves the records into categories (i.e., shards, partitions, or
bins) but it doesn’t really care about the order of records.
Intent
The intent is to take similar records in a data set and partition them into distinct, smaller
data sets.
Motivation
If you want to look at a particular set of data—such as postings made on a particular
date—the data items are normally spread out across the entire data set. So looking at
just one of these subsets requires an entire scan of all of the data. Partitioning means
breaking a large set of data into smaller subsets, which can be chosen by some criterion
relevant to your analysis. To improve performance, you can run a job that takes the data
set and breaks the partitions out into separate files. Then, when a particular subset of
the data is to be analyzed, the job needs only to look at that data.
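To make the pattern concrete, here is a minimal sketch of how it could look in Hadoop MapReduce. This is not the book's code; the key type, the MIN_YEAR constant, and the one-year-per-reducer scheme are all assumptions for illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch: records keyed by year are routed so that each year
// lands on its own reducer, producing one output file per partition.
public class YearPartitioner extends Partitioner<IntWritable, Text> {

    private static final int MIN_YEAR = 2008; // assumed earliest year in the data

    @Override
    public int getPartition(IntWritable year, Text record, int numPartitions) {
        // Map year 2008 -> reducer 0, 2009 -> reducer 1, and so on.
        return (year.get() - MIN_YEAR) % numPartitions;
    }
}
```

The driver would register this with job.setPartitionerClass(YearPartitioner.class) and set job.setNumReduceTasks(...) to the number of year buckets, so each reducer's output file corresponds to exactly one partition.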
The Korean translation came out a few days ago, and I've already ordered it on yes24.
Big data isn't making me any money; instead it's making me spend a fortune buying books in yet another new field. Sigh.
I do hope big data isn't secretly a business model for selling books . . .
[ source : MapReduce Design Patterns - Figure 4-2 The structure of the partitioning pattern ]
Known uses
- Partition pruning by continuous value
You have some sort of continuous variable, such as a date or numerical value, and
at any one time you care about only a certain subset of that data. Partitioning the
data into bins will allow your jobs to load only pertinent data (see the sketch after this list).
- Partition pruning by category
Instead of having some sort of continuous variable, the records fit into one of several
clearly defined categories, such as country, phone area code, or language.
- Sharding
A system in your architecture has divisions of data—such as different disks—and
you need to partition the data into these existing shards.
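As a sketch of the pruning idea referenced above (not from the book; the directory layout and helper name are assumptions): once a partitioning job has written, say, one directory per month, a downstream job prunes simply by adding only the relevant directories as input paths.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical helper: load only the month partitions a job cares about,
// assuming an earlier job laid the data out as <baseDir>/<yyyy-MM>/.
public class PartitionPruning {

    public static void addMonths(Job job, String baseDir, String... months)
            throws IOException {
        for (String month : months) {
            FileInputFormat.addInputPath(job, new Path(baseDir, month));
        }
    }
}
```

A call like addMonths(job, "/data/posts", "2013-06", "2013-07") would then scan two months instead of the whole data set; FileInputFormat also accepts glob paths such as /data/posts/2013-*.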
Resemblances
SQL
Some SQL databases allow for automatically partitioned tables. This enables “partition
pruning,” which lets the database exclude large portions of irrelevant data before
running the SQL.
Performance analysis
The main performance concern with this pattern is that the resulting partitions will
likely not have a similar number of records. Perhaps one partition turns out to hold 50%
of the data of a very large data set. If implemented naively, all of this data will get sent
to one reducer and will slow down processing significantly.
It’s pretty easy to get around this, though. Split very large partitions into several smaller
partitions, even if just randomly. Assign multiple reducers to one partition and then
randomly assign records into each to spread it out a bit better.
For example, consider the “last access date” field for a user in StackOverflow. If we
partitioned on this property equally over months, the most recent month will very likely
be much larger than any other month. To prevent skew, it may make sense to partition
the most recent month into days, or perhaps just randomly.
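A sketch of that workaround (again an assumption, not the book's code): reserve several reducers for the known-hot bucket and scatter its records among them at random, while every other bucket keeps a single reducer.

```java
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: bucket 0 is assumed to be the oversized "most recent month";
// its records are spread randomly over HOT_REDUCERS reducers, and their
// combined output files together represent that one logical partition.
public class SkewAwarePartitioner extends Partitioner<IntWritable, Text> {

    private static final int HOT_BUCKET = 0;   // assumed oversized partition
    private static final int HOT_REDUCERS = 4; // reducers reserved for it

    private final Random random = new Random();

    @Override
    public int getPartition(IntWritable bucket, Text record, int numPartitions) {
        if (bucket.get() == HOT_BUCKET) {
            return random.nextInt(HOT_REDUCERS);
        }
        // All other buckets share the remaining reducers
        // (requires numPartitions > HOT_REDUCERS).
        return HOT_REDUCERS + (bucket.get() - 1) % (numPartitions - HOT_REDUCERS);
    }
}
```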
This method doesn’t affect processing over partitions, since you know that these set of
files represent one larger partition. Just include all of them as input.