
Data Analytics

QL WITH BIGDATA


CAPS  concurrency and consistency > SCN (System Commit Number), UNDO, OWI mechanism, hash chain ==> bottleneck points arise

 But what problems do we face if we try to satisfy this?   RAC (Real Application Cluster)  => high cost

 EXA


Design starts from business understanding.

If you do not know what you are trying to do, you cannot build anything. Or else 


Processing large volumes, parallel processing: CUBRID SHARDING, MONGO DB ==> SHARDING KEY, REPLICA SET
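A rough sketch of what declaring a shard key looks like from a client (assuming pymongo and an already-running sharded cluster behind a mongos router; the "shop.orders" namespace and "customer_id" key are made-up examples). Each shard would normally itself be a replica set for redundancy:

from pymongo import MongoClient

# Connect to a mongos router of an already-configured sharded cluster
# (host/port and the "shop.orders" namespace are illustrative assumptions).
client = MongoClient("mongodb://localhost:27017")

# Enable sharding on the database, then declare the shard key for the collection.
client.admin.command("enableSharding", "shop")
client.admin.command(
    "shardCollection",
    "shop.orders",
    key={"customer_id": "hashed"},  # a hashed shard key spreads writes across shards
)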



1. QL


2. MODELING


SQL VS NOSQL & RDB VS HDFS & STRUCTURED VS UNSTRUCTURED

NEWSQL ? ? ?  H-STORE   MASSIVELY PARALLEL   SHARED-NOTHING

http://newsql.sourceforge.net/


RDB : ACID ( Atomicity, Consistency, Isolation, Durability)

RDB _ FILE BASED  ==> MONGO DB JOIN, HIVE JOIN

  RDB ANALYTICS => ANALYTIC FUNCTIONS, PARTITION
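As a loose illustration of the analytic-function-plus-partition idea (pandas standing in for SQL's RANK() OVER (PARTITION BY ...); the column names below are invented):

import pandas as pd

# Toy sales data; "region" plays the role of the PARTITION BY column.
df = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "amount": [100, 250, 80, 300, 120],
})

# Equivalent in spirit to: RANK() OVER (PARTITION BY region ORDER BY amount DESC)
df["rank_in_region"] = (
    df.groupby("region")["amount"].rank(method="min", ascending=False)
)
print(df.sort_values(["region", "rank_in_region"]))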


From the viewpoint of storage structure / the viewpoint of I/O / how data is fetched / where the data is read from: buffer read vs. physical I/O read

Underneath, the view that a dbf file is being written is the same.


It has a structure that stores files.

=> Decide in advance what values will go in, then put them in.

     Put values in without deciding them in advance. => There are no boundaries; you do not define what values will come in.

      You do not care what values come in.


    ==> Is that the end of it, then?

           To do analysis, there must be patterns a person can recognize to some degree, and criteria by which things can be separated out.

            And even if I try to build the data I want, the raw data is still far too large.

                  => needs to become more intelligent >


The problem of preprocessing arises: with REG_EXP (regular expressions) you must be able to produce the data you want.
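For example, a small regular-expression pass that pulls structured fields out of a raw line (the log format below is an invented one):

import re

# Hypothetical raw log line; in practice this would be read from a file or a stream.
raw = "2014-03-01 12:34:56 user=alice action=login status=OK"

# Extract key=value pairs so the record can be loaded as structured columns.
pattern = re.compile(r"(\w+)=(\w+)")
record = dict(pattern.findall(raw))
print(record)   # {'user': 'alice', 'action': 'login', 'status': 'OK'}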


Table A   JOIN   Table B


No, we will just put the data in as it is.


What is the size of 100 rows of data after removing duplicates?

What is the size of 100 rows of data when duplication is tolerated? => We build Interface / Abstract / Common classes to account for reusability, yet we write the source without considering reuse?
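A back-of-the-envelope comparison of the two cases (the record contents and counts are pure assumptions, only to make the trade-off concrete):

# 100 raw records, many of them duplicates of one another.
records = ["event-%d" % (i % 20) for i in range(100)]   # only 20 distinct values

deduplicated = set(records)

# Rough size comparison: storing duplicates costs more space,
# but avoids the join/lookup needed to re-expand the data later.
raw_bytes = sum(len(r) for r in records)
dedup_bytes = sum(len(r) for r in deduplicated)
print(len(records), raw_bytes)          # 100 rows, full size
print(len(deduplicated), dedup_bytes)   # 20 rows, much smaller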


HADOOP _ FILE BASED


HQL, HBASE, PIG

HIVEQL

R-LANGUAGE


RDB VS NOSQL DB


SQL-ON-HADOOP ( HIVE TAJO IMPALA )


STORM, ESPER

Do we really analyze data that big in real time?

But why do real-time analysis at all? => We want to catch what is happening right now, in real time.

Then what role can historical data play?


Is what is happening now similar to something that happened in the past? ==> What is the definition of "what is happening now", and what is its unit? Are we talking about some form of pattern ....
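One rough way to read "similar to the past" (the per-window features and labels below are invented) is to compare a feature vector for the current time window against historical windows:

import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical per-window features: [request count, error count, avg latency]
history = {
    "2013-11-01 peak day": [900, 40, 210],
    "2013-12-24 quiet day": [120, 2, 80],
}
current = [870, 35, 200]

# "What is happening now" = the current window; similarity to each past window.
for label, past in history.items():
    print(label, round(cosine(current, past), 3))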


Oooooooooooh -- I've seen the same figures and charts



Big Data Use Cases 

 Example #1 : Machine-Generated Data

                    Online Reservations

                    Multi-Channel Marketing and Sentiment Analysis




MODELING


RDB MODELING : NORMALIZATION


NOSQL MODELING ?  : WHY NOT ? 


STATISTICAL MODELING

 - ANALYTICS -


RDB -> DW APPLIANCE (EXADATA, SAP HANA)  OLAP

 UNSTRUCTURED => VISUALIZING UNSTRUCTURED DATA AS-IS ==> IMPOSSIBLE



Big data: what is it that we actually want to do?

A story about data analysis





Big data analysis involves making “sense” out of large volumes of varied data that in its raw form lacks a data model to define what each element means in the context of the others. There are several new issues you should consider as you embark on this new type of analysis:

• Discovery – In many cases you don’t really know what you have and how different data sets relate to each other. You must figure it out through a process of exploration and discovery.

• Iteration – Because the actual relationships are not always known in advance, uncovering insight is often an iterative process as you find the answers that you seek. The nature of iteration is that it sometimes leads you down a path that turns out to be a dead end. That’s okay – experimentation is part of the process. Many analysts and industry experts suggest that you start with small, well-defined projects, learn from each iteration, and gradually move on to the next idea or field of inquiry.

• Flexible Capacity – Because of the iterative nature of big data analysis, be prepared to spend more time and utilize more resources to solve problems.

• Mining and Predicting – Big data analysis is not black and white. You don’t always know how the various data elements relate to each other. As you mine the data to discover patterns and relationships, predictive analytics can yield the insights that you seek.

• Decision Management – Consider the transaction volume and velocity. If you are using big data analytics to drive many operational decisions (such as personalizing a web site or prompting call center agents about the habits and activities of consumers) then you need to consider how to automate and optimize the implementation of all those actions.

For example you may have no idea whether or not social data sheds light on sales trends. The challenge comes with figuring out which data elements relate to which other data elements, and in what capacity. The process of discovery not only involves exploring the data to understand how you can use it but also determining how it relates to your traditional enterprise data.



New types of inquiry entail not only what happened, but why. For example, a key metric for many companies is customer churn. It’s fairly easy to quantify churn. But why does it happen? Studying call data records, customer support inquiries, social media commentary, and other customer feedback can all help explain why customers defect. Similar approaches can be used with other types of data and in other situations. Why did sales fall in a given store? Why do certain patients survive longer than others? The trick is to find the right data, discover the hidden relationships, and analyze it correctly.




Analysis - big data analysis? How does it differ from traditional statistics and analysis?


         Isn't it just that the importance of data is being discussed more often, and given more weight?



What is the difference between traditional statistical analytics and big data analytics?




bigdataanalyticswpoaa-1930891.pdf





The first computer program I ever wrote (in 1979, if you must know) was in the statistical package SPSS (Statistical Package for the Social Sciences), and the second computer platform I used was SAS (Statistical Analysis System). Both of these systems are still around today—SPSS was acquired by IBM as part of its BI portfolio, and SAS is now the world’s largest privately held software company. The longevity of these platforms—they have essentially outlived almost all contemporary software packages—speaks to the perennial importance of data analysis to computing.

Packages such as SAS and SPSS gained traction in academic settings because they allowed scientists and researchers to analyze experimental and research data without the tedium of coding in low level languages such as FORTRAN and COBOL. As computing moved into the mainstream of business process, these statistical packages became an important part of decision support systems that seeded the current massive market for business intelligence tools.  Not surprisingly SAS and SPSS rode this wave to commercial success.

Ironically, the success of these academically spawned packages made them less attractive for academia. Price tags increased, while the focus on business intelligence did not always align with academic desires. 

As a result, professional statisticians sought alternatives to commercial packages. The “S” language, which was designed for statistical programming, seemed an attractive foundation technology. Eventually, an open source implementation of S—called “R”—was released in the late 1990s. 

Bo Cowgill from Google summed up R nicely when he said, “The best thing about R is that it was developed by statisticians. The worst thing about R is that ... it was developed by statisticians.” R has a syntax that is idiosyncratic and disconnected from most other languages. However, R makes up for this in extensibility. Anyone can add a statistical routine to R, and thousands of such routines are available in the CRAN package repository. This repository probably represents the most significant open collection of statistical computer algorithms ever assembled.

Possibly the greatest current weakness of R is scalability. R originally was designed to process in-memory sets using single processor machines. Multithreaded computers and massively large data sets pose a real problem for R.

Revolution Analytics has released a commercial distribution of R based on the open source core that addresses some of these multithreading and memory issues, by linking R to multi-threaded math libraries and adding packages for large data set processing. 

Last year, Oracle released a version of R integrated within its database and big data appliance. The Oracle distribution of R also attempts to provide better threading and memory handling in the base product. In addition, Oracle has included versions of R packages in which the core processing is offloaded into the database. These packages allow the database engine to parallelize the core number crunching (sums, sums of squares, etc.) that is at the foundation of many statistical techniques.

If the term “big data analytics” has any concrete meaning today, it is in the analytics of fine-grained, massively large data sets in Hadoop and similar systems such as Cassandra. So, it’s not surprising that R and Hadoop are two of the key technologies that form the big data analytic stack. Unfortunately, R’s in-memory and threading limitations don’t align well with Hadoop’s massive parallelism and data scale. Not surprisingly, there are significant efforts underway to tie the two together—projects such as RHadoop, RHIPE, and RHIVE are all worth taking a look at.

R arguably represents the most accessible and feature-rich set of statistical routines available. Despite some limitations, it seems poised to be a key technology in big data. 


Latch: it is like sticking a post-it note that says "I am reading this data right now, do not change it." You set out to read blocks 1 through 999, you have read up to block 100, and then block 700 is changed. Should block 700 be read as A, or as A'?

So you read, taking blocks 1 through 700 as they were at the moment block 1 was read. That is read consistency.

And the transaction that wants to change block 700 is also allowed to go ahead. Is that possible? That is exactly what is being done.
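A toy sketch of that read-consistency idea (not Oracle's actual implementation; the block numbers and version list are invented), where the reader keeps seeing data as of the SCN at which its scan started, even while a writer commits a newer version:

# Each block keeps a list of (scn, value) versions, newest last.
versions = {700: [(10, "A")]}

def read_consistent(block, as_of_scn):
    """Return the newest version of the block that is not newer than the snapshot SCN."""
    candidates = [(scn, val) for scn, val in versions[block] if scn <= as_of_scn]
    return max(candidates)[1]

snapshot_scn = 20          # the SCN when the long scan started at block 1

# While the scan is still running, another transaction changes block 700.
versions[700].append((30, "A'"))   # committed at SCN 30 > snapshot

print(read_consistent(700, snapshot_scn))   # -> "A": the reader still sees A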


It is a war of methods over how fast you can read and process.

"I'm faster." "No you're not - I load everything into memory and speed things up this way and that."

Loading into memory is fast, but then you have to attach as much memory capacity as possible.

If you want to finish quickly with distributed processing, do it SHARED NOTHING - only then is "done" really done.

"I'll rework the architecture for you. What exactly do you need ... how should it be done?"


SINGLE CORE, MULTI CORE, CPU, MEMORY, DISK, SSD, NETWORK, RACK TO RACK, ALGORITHM, THREAD


We inserted A, and then A gets inserted again. Which one is the right one? >..<


OldSQL

+ Legacy RDBMS vendors

 NoSQL

+ Give up SQL and ACID for performance

 NewSQL

+ Preserve SQL and ACID

+ Get performance from a new architecture