
Bigdata Videos

RHadoop install [Hadoop video lecture]

The video will be uploaded soon.


[referred to http://cran.r-project.org/bin/linux/suse/README.html]

[referred to http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/]

[referred to http://kertz.egloos.com/199360]


CentOS (not Ubuntu)

> install.packages( c('RJSONIO', 'itertools', 'digest', 'Rcpp', 'bitops', 'functional', 'stringr', 'plyr', 'reshape2', 'rJava'), repos='http://cran.revolutionanalytics.com')  

Then download the RHadoop package archives and install them from the command line:

R CMD INSTALL rmr2_2.2.1.tar.gz

R CMD INSTALL rhdfs_1.0.6.tar.gz


R CMD INSTALL rhbase_1.2.0.tar.gz


[Follow the instructions at https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase before installing the rhbase package]

#yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel


Download thrift-0.9.0.tar.gz

[hadoop@h001 thrift-0.9.0]$ ./configure --with-boost=/usr/local

configure: error: "Error: libcrypto required."

[root@h001 ~]# yum install libssl-dev       [exists on Ubuntu, so this did not work for me]

[root@h001 ~]# yum install openssl-devel  [the CentOS equivalent]
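With openssl-devel in place, the rest of the thrift 0.9.0 build is the usual autotools sequence. A dry-run sketch (set RUN=1 to actually execute; run the install step as root):

```shell
# Dry-run wrapper: prints each build step unless RUN=1 is set in the environment.
RUN=${RUN:-0}
run() {
  if [ "$RUN" = "1" ]; then "$@"; else echo "dry-run: $*"; fi
}

run ./configure --with-boost=/usr/local
run make
run make install    # needs root (or sudo) for a system-wide install
```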



# /usr/bin/hbase thrift start
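After starting the Thrift server, it helps to confirm something is actually listening before installing rhbase. A minimal bash check, assuming the default Thrift port 9090 (hbase-site.xml can override it):

```shell
# Probe 127.0.0.1:9090 with bash's /dev/tcp redirection; prints "up" or "down".
thrift_up() {
  (exec 3<>"/dev/tcp/127.0.0.1/9090") 2>/dev/null && echo up || echo down
}
thrift_up
```

Note that /dev/tcp is a bash feature, not POSIX sh.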



[This error appeared after installing RHadoop on the NameNode only]

13/06/29 02:38:39 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201306290032_0003_m_000000

13/06/29 02:38:39 INFO streaming.StreamJob: killJob...

Streaming Command Failed!

Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  : 

  hadoop streaming failed with error code 1


-- JobTracker log

2013-06-29 02:38:27,382 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201306290032_0003_m_000001_3: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)

at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)




[root@h001 Downloads]# R CMD INSTALL rmr2_2.2.1.tar.gz
WARNING: ignoring environment value of R_HOME
* installing to library ‘/usr/lib64/R/library’
ERROR: dependencies ‘Rcpp’, ‘bitops’, ‘functional’, ‘stringr’, ‘plyr’, ‘reshape2’ are not available for package ‘rmr2’
* removing ‘/usr/lib64/R/library/rmr2’
> install.packages( c('Rcpp', 'bitops', 'functional', 'stringr', 'plyr', 'reshape2'), repos='http://cran.revolutionanalytics.com')


> install.packages( 'plys', repos='http://cran.revolutionanalytics.com')
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
Warning message:
package ‘plys’ is not available (for R version 3.0.1)
('plys' here was a typo for 'plyr'.)

[root@h001 Downloads]# R CMD INSTALL rmr2_2.2.1.tar.gz

> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2

> install.packages('rJava')


[root@h001 Downloads]# R CMD INSTALL rhdfs_1.0.6.tar.gz


> library(rJava)
> library(rhdfs)
HADOOP_CMD=/home/hadoop/hadoop/bin/hadoop
Be sure to run hdfs.init()
> hdfs.init()


[hadoop@h001 Downloads]$ echo $JAVA_HOME
/usr/local/java
[hadoop@h001 Downloads]$ echo $R_HOME
/home/hadoop/R
[hadoop@h001 Downloads]$ R CMD javareconf
WARNING: ignoring environment value of R_HOME
Java interpreter : /usr/local/java/jre/bin/java
Java version     : 1.7.0_21
Java home path   : /usr/local/java
Java compiler    : /usr/local/java/bin/javac
Java headers gen.: /usr/local/java/bin/javah
Java archive tool: /usr/local/java/bin/jar

trying to compile and link a JNI program
detected JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
detected JNI linker flags : -L/usr/local/jdk1.7.0_21/jre/lib/amd64/server -ljvm
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -I/usr/local/java/include -I/usr/local/java/include/linux -I/usr/local/include    -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic  -c conftest.c -o conftest.o
gcc -m64 -std=gnu99 -shared -L/usr/local/lib64 -o conftest.so conftest.o -L/usr/local/jdk1.7.0_21/jre/lib/amd64/server -ljvm -L/usr/lib64/R/lib -lR


Java library path: /usr/local/jdk1.7.0_21/jre/lib/amd64/server
JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
JNI linker flags : -L/usr/local/jdk1.7.0_21/jre/lib/amd64/server -ljvm
Updating Java configuration in /usr/lib64/R
/usr/lib64/R/bin/javareconf: line 396: /usr/lib64/R/etc/Makeconf.new: Permission denied
*** cannot create /usr/lib64/R/etc/Makeconf.new
*** Please run as root if required.

[hadoop@h001 Downloads]$


1) R
a.time <- proc.time()
small.ints2 <- 1:900000
result.normal <- sapply(small.ints2, function(x) x^2)
proc.time() - a.time


2) RHadoop
b.time <- proc.time()
small.ints <- to.dfs(1:900000)
result <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
proc.time() - b.time


rmr

https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr

Overview

This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.





rhdfs
https://github.com/RevolutionAnalytics/RHadoop/wiki/rhdfs


Overview

This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS. The following functions are part of this package:

File Manipulations
hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
File Read/Write
hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
Directory
hdfs.dircreate, hdfs.mkdir
Utility
hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
Initialization
hdfs.init, hdfs.defaults


rhbase

https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase
Overview

This R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBase. The following functions are part of this package:

Table Manipulation
hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
Read/Write
hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan, hb.scan.ex
Utility
hb.list.tables
Initialization
hb.defaults, hb.init


R and Hadoop complement each other very well; they are a natural match in big data analytics and visualization. One of the best-known R packages that brings Hadoop functionality to R is RHadoop, developed by Revolution Analytics.


Installing RHadoop


RHadoop is a collection of three R packages: rmr, rhdfs and rhbase. 


The rmr package provides Hadoop MapReduce functionality in R,

rhdfs provides HDFS file management in R, and

rhbase provides HBase database management from within R.



We need to install the RHadoop packages together with their dependencies:
rmr requires Rcpp, RJSONIO, digest, functional, stringr and plyr,
while rhdfs requires rJava.


The installation requires downloading the corresponding tar.gz archives and then running the R CMD INSTALL command with sudo privileges. As part of the installation, we need to reconfigure Java for the rJava package and set the HADOOP_CMD variable for the rhdfs package.


#  R CMD INSTALL Rcpp_0.10.2.tar.gz

#  R CMD INSTALL RJSONIO_1.0-1.tar.gz

#  R CMD INSTALL digest_0.6.2.tar.gz

#  R CMD INSTALL functional_0.1.tar.gz

#  R CMD INSTALL stringr_0.6.2.tar.gz

#  R CMD INSTALL plyr_1.8.tar.gz

#  R CMD INSTALL rmr2_2.0.2.tar.gz


#  JAVA_HOME=/usr/local/java R CMD javareconf

#  R CMD INSTALL rJava_0.9-3.tar.gz

#  HADOOP_CMD=/home/hadoop/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz



sudo R CMD INSTALL Rcpp_0.10.2.tar.gz

sudo R CMD INSTALL RJSONIO_1.0-1.tar.gz

sudo R CMD INSTALL digest_0.6.2.tar.gz

sudo R CMD INSTALL functional_0.1.tar.gz

sudo R CMD INSTALL stringr_0.6.2.tar.gz

sudo R CMD INSTALL plyr_1.8.tar.gz

sudo R CMD INSTALL rmr2_2.0.2.tar.gz


sudo JAVA_HOME=/home/istvan/jdk1.6.0_38/jre R CMD javareconf

sudo R CMD INSTALL rJava_0.9-3.tar.gz

sudo HADOOP_CMD=/home/istvan/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz


I added the lines below for rhdfs and applied them. (Sys.setenv() is R syntax, so these belong in ~/.Rprofile or an R script rather than in /etc/profile itself.)

Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
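For /etc/profile itself, plain shell export lines are needed (Sys.setenv is R syntax, which /etc/profile cannot execute). The equivalent, using the same paths:

```shell
# /etc/profile fragment: make Hadoop visible to rhdfs in every shell session.
export HADOOP_HOME=/home/istvan/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
```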