Shuffling - MapReduce


This post explains the purpose of shuffling and why it is done.


The total order sorting and shuffling patterns are opposites in terms of effect, but the latter is also concerned with the order of data in records.


Intent

You have a set of records that you want to completely randomize.


Motivation

This whole chapter has been about applying some sort of order to your data set, except for this pattern, which is instead about completely destroying the order. The use cases for doing such a thing are few and far between, but two stand out. One is shuffling the data for the purposes of anonymizing it. Another is randomizing the data set for repeatable random sampling.


Anonymizing data has recently become important for organizations that want to maintain their users’ privacy, but still run analytics. The order of the data can provide some information that might lead to the identity of a user. By shuffling the entire data set, the organization is taking an extra step to anonymize the data.


Another reason for shuffling data is to be able to perform some sort of repeatable random sampling. For example, once the data set has been shuffled, the first hundred records form a simple random sample. Every time we pull the first hundred records, we’ll get the same sample. This allows analytics that run over a random sample to have a repeatable result. Also, a separate job won’t have to be run to produce a simple random sample every time you need a new one.
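
The repeatable-sampling idea can be illustrated with a small in-memory sketch. It is not a MapReduce job: `shuffle_once` and `head_sample` are hypothetical helper names, and the in-memory list stands in for the shuffled output that a real job would write to storage. The point is only that, once the shuffled order is stored, reading the first N records always yields the same sample.

```python
import random
from itertools import islice


def shuffle_once(records, seed=None):
    """Return a shuffled copy of the records (stand-in for the stored job output)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def head_sample(shuffled, n):
    """Take the first n records of the already-shuffled data as the sample."""
    return list(islice(shuffled, n))


if __name__ == "__main__":
    # Hypothetical data set: 1,000 integer records.
    stored = shuffle_once(range(1000), seed=7)
    first = head_sample(stored, 100)
    second = head_sample(stored, 100)
    print(first == second)  # the same stored order gives the same sample
```

A smaller sample taken this way is always a prefix of a larger one, since both read from the start of the same stored order.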




Structure

 - All the mapper does is output the record as the value along with a random key.

 - The reducer sorts the random keys, further randomizing the data.


In other words, each record is sent to a random reducer. Then, each reducer sorts on the random keys in the records, producing a random order in that reducer.
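
The structure above can be sketched as a plain Python simulation, assuming no Hadoop runtime at all. The function names (`shuffle_map`, `partition`, `shuffle_reduce`, `run_shuffle`) are hypothetical, and the range-based `partition` is an assumption made for illustration rather than Hadoop's default partitioner. In a real job the framework would perform the partitioning and key sort between the map and reduce phases.

```python
import random
from collections import defaultdict


def shuffle_map(record, rng):
    """Mapper: emit the record as the value, paired with a random key."""
    return rng.random(), record


def partition(key, num_reducers):
    """Hypothetical partitioner: map a key in [0, 1) to a reducer index."""
    return min(int(key * num_reducers), num_reducers - 1)


def shuffle_reduce(pairs):
    """Reducer: order the pairs by random key, then emit only the records."""
    return [record for _, record in sorted(pairs, key=lambda kv: kv[0])]


def run_shuffle(records, num_reducers=3, seed=None):
    """Simulate the job; returns one output list per reducer."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for record in records:
        key, value = shuffle_map(record, rng)
        buckets[partition(key, num_reducers)].append((key, value))
    return [shuffle_reduce(buckets[i]) for i in range(num_reducers)]


if __name__ == "__main__":
    records = [f"record-{i}" for i in range(10)]
    for i, part in enumerate(run_shuffle(records, num_reducers=3, seed=42)):
        print(f"part-{i}: {part}")
```

Each inner list corresponds to one reducer's output file. The `seed` parameter exists only to make the simulation reproducible; a real shuffle would normally leave the keys unseeded.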


Consequences

Each reducer outputs a file containing records in random order.



