This post explains the purpose of shuffling and why you would do it.
The total order sorting and shuffling patterns are opposites in terms of effect, but the
latter is also concerned with the order of data in records.
Intent
You have a set of records that you want to completely randomize.
Motivation
This whole chapter has been about applying some sort of order to your data set, except
for this pattern, which is instead about completely destroying the order.
The use cases for doing such a thing are definitely few and far between, but two stand
out. One is shuffling the data for the purposes of anonymizing it. Another is randomizing
the data set for repeatable random sampling.
Anonymizing data has recently become important for organizations that want to maintain
their users’ privacy, but still run analytics. The order of the data can provide some
information that might lead to the identity of a user. By shuffling the entire data set, the
organization is taking an extra step to anonymize the data.
Another reason for shuffling data is to be able to perform some sort of repeatable random
sampling. For example, once the data set has been shuffled, the first hundred records form
a simple random sample. Every time we pull the first hundred records, we'll get the same
sample. This allows analytics that run over a random sample to have a repeatable result.
Also, a separate job won't have to be run to produce a simple random sample every time you
need a new one.
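To make the repeatable-sampling idea concrete, here is a minimal local sketch in Python. It is not a MapReduce job: `shuffle_once` and `take_sample` are hypothetical helper names, and `random.Random(seed).shuffle` merely stands in for the output of the shuffle job. The point is only that once the shuffled order is stored, reading the first records always yields the same sample.

```python
import random
from itertools import islice


def shuffle_once(records, seed=None):
    """Stand-in for the shuffle job: return the records in a random order."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def take_sample(shuffled_records, n=100):
    """Read the first n records of the already-shuffled data set."""
    return list(islice(shuffled_records, n))


# Shuffle one time and keep the result (in practice, the job's output files).
stored = shuffle_once(range(10_000), seed=7)

# Every later read of the first hundred records sees the same sample.
sample_a = take_sample(stored)
sample_b = take_sample(stored)
```

A larger or smaller sample is just a different prefix of the same stored order, so a sample of 10 is always contained in the sample of 100.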
Structure
- All the mapper does is output the record as the value along with a random key.
- The records arriving at each reducer are sorted by their random keys, further randomizing the data. The reducer then simply outputs the values.
In other words, each record is sent to a random reducer. Then the records in each reducer
are ordered by their random keys, producing a random order in that reducer.
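The structure above can be sketched as a small local simulation in Python. This is only an illustration under stated assumptions, not Hadoop code: `shuffle_map` and `shuffle_job` are hypothetical names, `key % num_reducers` stands in for the partitioner, and the `sorted` call stands in for the sort that happens before each reducer sees its input.

```python
import random
from collections import defaultdict


def shuffle_map(record, rng):
    """Mapper: emit the record as the value under a random integer key."""
    return rng.randrange(2**31), record


def shuffle_job(records, num_reducers=3, seed=None):
    """Simulate the pattern locally; returns one output list per reducer."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for record in records:
        key, value = shuffle_map(record, rng)
        # Stand-in for the partitioner: a random key lands on a random reducer.
        partitions[key % num_reducers].append((key, value))

    outputs = []
    for reducer_id in range(num_reducers):
        # Stand-in for the sort phase: order this reducer's input by key.
        pairs = sorted(partitions[reducer_id], key=lambda kv: kv[0])
        # Reducer: drop the random key and emit only the original record.
        outputs.append([value for _, value in pairs])
    return outputs


# Each inner list plays the role of one reducer's output file.
files = shuffle_job(range(1000), num_reducers=4, seed=1)
```

The seed is there only to make the simulation reproducible for testing; in a real job the random keys would normally differ on every run.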
Consequences
Each reducer outputs a file containing random records.