Dev tips and tips

Hadoop: The Definitive Guide, weather data processing

[Based on the sites below]

http://hadoopbook.com/code.html

The book’s example code is available from GitHub at http://github.com/tomwhite/hadoop-book/

The code for the third edition is at https://github.com/tomwhite/hadoop-book/tree/3e

A sample of the NCDC weather dataset that is used throughout the book can be found at https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all




[hadoop@h001 MaxTemperaturebyMonth]$ javac -classpath /home/hadoop/hadoop/hadoop-core-1.0.4.jar -d . *.java
                                       [ "-d" <-- destination directory for the compiled classes   "*.java" <-- source files to compile ]

[hadoop@h001 javafolder]$ jar -cvf   ./FindMax.jar   ./*.class

added manifest

adding: MaxTemperature.class(in = 1418) (out= 800)(deflated 43%)

adding: MaxTemperatureMapper.class(in = 1876) (out= 804)(deflated 57%)

adding: MaxTemperatureReducer.class(in = 1660) (out= 704)(deflated 57%)


cf. To extract the contents of a jar: [hadoop@h001 Temp]$ jar xf FindMaxTemperature.jar

[hadoop@h001 hadoop]$ ./bin/hadoop jar FindMax.jar MaxTemperature /user/hadoop/wx/ /user/hadoop/wx/out
                                       [ If the classes are declared in a package, pass the fully qualified class name, e.g. kr.jacob.mr.MaxTemperature ]
13/07/05 19:45:19 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/05 19:45:19 INFO input.FileInputFormat: Total input paths to process : 2
13/07/05 19:45:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/05 19:45:20 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/05 19:45:20 INFO mapred.JobClient: Running job: job_201307051837_0001
13/07/05 19:45:21 INFO mapred.JobClient:  map 0% reduce 0%
13/07/05 19:45:38 INFO mapred.JobClient:  map 50% reduce 0%
13/07/05 19:45:44 INFO mapred.JobClient:  map 100% reduce 0%
13/07/05 19:45:53 INFO mapred.JobClient:  map 100% reduce 100%
13/07/05 19:45:58 INFO mapred.JobClient: Job complete: job_201307051837_0001
13/07/05 19:45:58 INFO mapred.JobClient: Counters: 29
13/07/05 19:45:58 INFO mapred.JobClient:   Job Counters 
13/07/05 19:45:58 INFO mapred.JobClient:     Launched reduce tasks=1
13/07/05 19:45:58 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=24398
13/07/05 19:45:58 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/05 19:45:58 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/05 19:45:58 INFO mapred.JobClient:     Launched map tasks=2
13/07/05 19:45:58 INFO mapred.JobClient:     Data-local map tasks=2
13/07/05 19:45:58 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=12526
13/07/05 19:45:58 INFO mapred.JobClient:   File Output Format Counters 
13/07/05 19:45:58 INFO mapred.JobClient:     Bytes Written=18
13/07/05 19:45:58 INFO mapred.JobClient:   FileSystemCounters
13/07/05 19:45:58 INFO mapred.JobClient:     FILE_BYTES_READ=144425
13/07/05 19:45:58 INFO mapred.JobClient:     HDFS_BYTES_READ=1777370
13/07/05 19:45:58 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=353220
13/07/05 19:45:58 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=18
13/07/05 19:45:58 INFO mapred.JobClient:   File Input Format Counters 
13/07/05 19:45:58 INFO mapred.JobClient:     Bytes Read=1777168
13/07/05 19:45:58 INFO mapred.JobClient:   Map-Reduce Framework
13/07/05 19:45:58 INFO mapred.JobClient:     Map output materialized bytes=144431
13/07/05 19:45:58 INFO mapred.JobClient:     Map input records=13130
13/07/05 19:45:58 INFO mapred.JobClient:     Reduce shuffle bytes=144431
13/07/05 19:45:58 INFO mapred.JobClient:     Spilled Records=26258
13/07/05 19:45:59 INFO mapred.JobClient:     Map output bytes=118161
13/07/05 19:45:59 INFO mapred.JobClient:     Total committed heap usage (bytes)=336338944
13/07/05 19:45:59 INFO mapred.JobClient:     CPU time spent (ms)=5610
13/07/05 19:45:59 INFO mapred.JobClient:     Combine input records=0
13/07/05 19:45:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=202
13/07/05 19:45:59 INFO mapred.JobClient:     Reduce input records=13129
13/07/05 19:45:59 INFO mapred.JobClient:     Reduce input groups=2
13/07/05 19:45:59 INFO mapred.JobClient:     Combine output records=0
13/07/05 19:45:59 INFO mapred.JobClient:     Physical memory (bytes) snapshot=430219264
13/07/05 19:45:59 INFO mapred.JobClient:     Reduce output records=2
13/07/05 19:45:59 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2167259136
13/07/05 19:45:59 INFO mapred.JobClient:     Map output records=13129
[hadoop@h001 hadoop]$ 

[hadoop@h001 MaxTemperaturebyMonth]$ hadoop fs -cat /user/hadoop/wx/out/part-r-00000
1901 317
1902 244

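>> After changing the key from year to month, check how the Map, Combine, and Reduce input/output record counts change.
>> Map input/output records stay the same (13130 / 13129) and the Combine counters stay at 0, while Reduce input groups and Reduce output records go from 2 to 24. Map output bytes grow from 118161 to 144419, i.e. 2 more key bytes per record.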

13/07/05 20:09:08 INFO mapred.JobClient:   Map-Reduce Framework
13/07/05 20:09:08 INFO mapred.JobClient:     Map output materialized bytes=170689
13/07/05 20:09:08 INFO mapred.JobClient:     Map input records=13130
13/07/05 20:09:08 INFO mapred.JobClient:     Reduce shuffle bytes=170689
13/07/05 20:09:08 INFO mapred.JobClient:     Spilled Records=26258
13/07/05 20:09:08 INFO mapred.JobClient:     Map output bytes=144419
13/07/05 20:09:08 INFO mapred.JobClient:     Total committed heap usage (bytes)=336338944
13/07/05 20:09:08 INFO mapred.JobClient:     CPU time spent (ms)=4870
13/07/05 20:09:08 INFO mapred.JobClient:     Combine input records=0
13/07/05 20:09:08 INFO mapred.JobClient:     SPLIT_RAW_BYTES=202
13/07/05 20:09:08 INFO mapred.JobClient:     Reduce input records=13129
13/07/05 20:09:08 INFO mapred.JobClient:     Reduce input groups=24
13/07/05 20:09:08 INFO mapred.JobClient:     Combine output records=0
13/07/05 20:09:08 INFO mapred.JobClient:     Physical memory (bytes) snapshot=435335168
13/07/05 20:09:08 INFO mapred.JobClient:     Reduce output records=24
13/07/05 20:09:08 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2167193600
13/07/05 20:09:08 INFO mapred.JobClient:     Map output records=13129
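
Both runs report Combine input records=0, which suggests that no combiner was set on the job, so every map output record (13129) reaches the reducer. The following is only a plain-Java sketch of what a max-style combiner would do to those counts. It uses no Hadoop classes, and the class name CombinerSketch, the two "splits", and all values are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Collapse one map task's (key, value) pairs to a single maximum per key.
    // This mimics what a max-style combiner would emit before the shuffle.
    static Map<String, Integer> combine(String[] keys, int[] values) {
        Map<String, Integer> local = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            local.merge(keys[i], values[i], Integer::max);
        }
        return local;
    }

    // Merge the per-task results into the final maximum per key (the reduce step).
    static Map<String, Integer> reduce(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> result = new TreeMap<>(a);
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            result.merge(e.getKey(), e.getValue(), Integer::max);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical output of two map tasks, keyed by yyyyMM.
        String[] keys1 = {"190107", "190107", "190108"};
        int[] values1 = {300, 317, 283};
        String[] keys2 = {"190107", "190108"};
        int[] values2 = {250, 270};

        Map<String, Integer> c1 = combine(keys1, values1);
        Map<String, Integer> c2 = combine(keys2, values2);

        System.out.println("records without combiner: " + (keys1.length + keys2.length));
        System.out.println("records with combiner: " + (c1.size() + c2.size()));
        System.out.println(reduce(c1, c2));
    }
}
```

With a real job, the Combine input/output counters would become non-zero and Reduce input records would drop accordingly; how much depends on how many distinct keys each map task sees.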


[hadoop@h001 MaxTemperaturebyMonth]$ hadoop fs -cat /user/hadoop/wxout02/part-r-00000
190101  44
190102  17
190103  50
190104  194
190105  256
190106  278
190107  317
190108  283
190109  211
190110  156
190111  89
190112  117
190201  33
190202  117
190203  44
190204  83
190205  211
190206  239
190207  244
190208  206
190209  183
190210  106
190211  94
190212  50
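
The mapper source is not shown in this post, so the following is only a sketch of how the year-to-month change could look. It assumes the record layout used by the book's MaxTemperatureMapper (date as yyyyMMdd starting at offset 15, signed temperature at offsets 87 to 91). The class name MonthKeySketch and the syntheticRecord helper are hypothetical, and the records are generated filler rather than real NCDC lines.

```java
import java.util.Map;
import java.util.TreeMap;

public class MonthKeySketch {
    // Year key: the first 4 characters of the date field (assumed offset 15).
    static String yearKey(String line) {
        return line.substring(15, 19);
    }

    // Month key: 2 more characters, giving yyyyMM such as "190107".
    static String monthKey(String line) {
        return line.substring(15, 21);
    }

    // Temperature in tenths of a degree; a leading '+' is skipped before
    // parsing, as the book's mapper does.
    static int airTemperature(String line) {
        int start = line.charAt(87) == '+' ? 88 : 87;
        return Integer.parseInt(line.substring(start, 92));
    }

    // Hypothetical helper: builds a filler record with only the date and
    // temperature placed at the assumed offsets.
    static String syntheticRecord(String yyyymmdd, int temp) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 15; i++) sb.append('0');
        sb.append(yyyymmdd);                      // offsets 15..22
        while (sb.length() < 87) sb.append('9');
        sb.append(String.format("%+05d", temp));  // offsets 87..91
        sb.append('1');                           // quality code at offset 92
        return sb.toString();
    }

    // Maximum temperature per key, sorted by key like the job output.
    static Map<String, Integer> maxByKey(String[] lines, boolean byMonth) {
        Map<String, Integer> max = new TreeMap<>();
        for (String line : lines) {
            String key = byMonth ? monthKey(line) : yearKey(line);
            max.merge(key, airTemperature(line), Integer::max);
        }
        return max;
    }

    public static void main(String[] args) {
        String[] lines = {
            syntheticRecord("19010105", 44),
            syntheticRecord("19010720", 317),
            syntheticRecord("19010721", 250),
            syntheticRecord("19020710", 244),
        };
        System.out.println(maxByKey(lines, false)); // {1901=317, 1902=244}
        System.out.println(maxByKey(lines, true));  // {190101=44, 190107=317, 190207=244}
    }
}
```

Only the key changes between the two runs; the reducer logic (taking the maximum per key) can stay the same, which matches the counters above where just the number of reduce groups and output records differs.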