CAPS / concurrency and consistency > SCN (System Change Number), UNDO, OWI (Oracle Wait Interface) mechanism, hash chains ==> this is where the bottlenecks appear
But what problem do we run into when we try to satisfy all of that? RAC (Real Application Clusters) => high cost
EXA (Exadata)
Design starts from business understanding.
If you do not know what you are trying to do, you cannot build anything. Or else:
Large volumes and parallel processing: CUBRID SHARDING, MongoDB ==> SHARDING KEY, REPLICA SET
1. QL
2. MODELING
SQL VS NOSQL & RDB VS HDFS & STRUCTURED VS UNSTRUCTURED
NEWSQL ??? H-Store: massively parallel, shared-nothing
http://newsql.sourceforge.net/
RDB : ACID ( Atomicity, Consistency, Isolation, Durability)
RDB _ FILE BASED ==> MONGO DB JOIN, HIVE JOIN
RDB ANALYTICS => ANALYTIC FUNCTIONS, PARTITION
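For instance, a minimal sketch of an analytic (window) function, assuming a hypothetical daily_sales(store_id, sales_dt, amount) table; the table and column names are illustrative only:
-- Running total per store, ordered by date, using PARTITION BY
-- instead of a self-join or a procedural loop.
SELECT store_id,
       sales_dt,
       amount,
       SUM(amount) OVER (PARTITION BY store_id ORDER BY sales_dt) AS running_total
FROM   daily_sales;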
From the viewpoint of the storage structure / from the viewpoint of I/O / fetching the data / where the data is read from: buffer read vs. physical I/O read.
At the bottom, the picture of "a dbf file gets written" is the same for both.
Both have a structure for storing files.
=> Decide in advance what values will go in, then insert them.
Insert without deciding what values will go in. => There is no boundary; what values may come in is never defined.
It does not care what values come in.
==> So is that the end of it?
To do any analysis, there must be patterns that a person can recognize to some degree, and criteria by which the data can be separated out.
And even when I try to build the data I want, the raw data is simply far too large.
=> it has to become more intelligent >
So the problem of preprocessing appears: with REG_EXP (regular expressions) you must be able to produce the data you want.
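As a rough sketch of that kind of preprocessing, assuming a hypothetical raw_log table with a single text column named line (Oracle-style REGEXP functions):
-- Pull an IP-looking token out of each raw log line;
-- lines with no match are filtered out by REGEXP_LIKE.
SELECT line,
       REGEXP_SUBSTR(line, '([0-9]{1,3}\.){3}[0-9]{1,3}') AS client_ip
FROM   raw_log
WHERE  REGEXP_LIKE(line, '([0-9]{1,3}\.){3}[0-9]{1,3}');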
Table A JOIN Table B
No, we will just put the data in as it is.
What is the size of 100 rows after duplicates are removed?
What is the size of 100 rows when duplication is tolerated? => We build Interface / Abstract / Common classes precisely to get reusability, and now we are going to write source code without considering reuse at all?
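A small sketch of how you might actually measure that, against a hypothetical event_log table (names are made up) in which duplicates were simply accepted:
-- Compare the raw row count with the deduplicated count;
-- the gap is the price paid for "just put the data in".
SELECT COUNT(*)                 AS rows_with_duplicates,
       COUNT(DISTINCT event_id) AS rows_deduplicated
FROM   event_log;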
HADOOP _ FILE BASED
HQL, HBASE, PIG
HIVEQL
R-LANGUAGE
RDB VS NOSQL DB
SQL-ON-HADOOP ( HIVE TAJO IMPALA )
STORM, ESPER
Do we really analyze data that big in real time?
But why do real-time analysis in the first place? => We want to catch what is happening right now, as it happens.
Then what role can the historical data play?
Is what is happening now similar to something that happened in the past? ==> Then what is the definition of "what is happening now", and what is its unit? Are we talking about some form of pattern? ....
Oooooooooooh -- I've seen the same figures and charts
Big Data Use Cases
Example #1: Machine-Generated Data
Online Reservations
Multi-Channel Marketing and Sentiment Analysis
MODELING
RDB MODELING : NORMALIZATION
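For example, a minimal sketch of what normalization means here, splitting a repeated fact out of a flat table; the tables and columns are illustrative only:
-- Denormalized: customer_name would be repeated on every order row.
-- Normalized: the repeated fact moves into its own table.
CREATE TABLE customer (
  customer_id   NUMBER PRIMARY KEY,
  customer_name VARCHAR2(100)
);
CREATE TABLE orders (
  order_id    NUMBER PRIMARY KEY,
  customer_id NUMBER REFERENCES customer(customer_id),
  order_dt    DATE
);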
NOSQL MODELING ? : WHY NOT ?
STATISTICAL MODELING
- ANALYTICS -
RDB -> DW APPLIANCE (EXADATA, SAP HANA), OLAP
UNSTRUCTURED => visualize the unstructured data as it is ==> impossible
Big data: what is it that we actually want to do?
A story about data analysis
Big data analysis involves making “sense” out of large volumes of varied data that in its raw form lacks a data
model to define what each element means in the context of the others. There are several new issues you should
consider as you embark on this new type of analysis:
• Discovery – In many cases you don’t really know what you have and how different data sets relate to each
other. You must figure it out through a process of exploration and discovery.
• Iteration – Because the actual relationships are not always known in advance, uncovering insight is often an
iterative process as you find the answers that you seek. The nature of iteration is that it sometimes leads you
down a path that turns out to be a dead end. That’s okay – experimentation is part of the process. Many
analysts and industry experts suggest that you start with small, well-defined projects, learn from each
iteration, and gradually move on to the next idea or field of inquiry.
• Flexible Capacity – Because of the iterative nature of big data analysis, be prepared to spend more time and
utilize more resources to solve problems.
• Mining and Predicting – Big data analysis is not black and white. You don’t always know how the various
data elements relate to each other. As you mine the data to discover patterns and relationships, predictive
analytics can yield the insights that you seek.
• Decision Management – Consider the transaction volume and velocity. If you are using big data analytics to
drive many operational decisions (such as personalizing a web site or prompting call center agents about the
habits and activities of consumers) then you need to consider how to automate and optimize the
implementation of all those actions.
For example you may have no idea whether or not social data sheds light on sales trends. The challenge comes
with figuring out which data elements relate to which other data elements, and in what capacity. The process of
discovery not only involves exploring the data to understand how you can use it but also determining how it
relates to your traditional enterprise data.
New types of inquiry entail not only what happened, but why. For example, a key metric for many companies is
customer churn. It’s fairly easy to quantify churn. But why does it happen? Studying call data records, customer
support inquiries, social media commentary, and other customer feedback can all help explain why customers
defect. Similar approaches can be used with other types of data and in other situations. Why did sales fall in a
given store? Why do certain patients survive longer than others? The trick is to find the right data, discover the
hidden relationships, and analyze it correctly.
Analysis - big data analysis? How exactly is it different from traditional statistics and analysis?
Isn't it just that the importance of data is being treated more often, and with more weight?
What is the difference between traditional statistical analytics and big data analytics?
bigdataanalyticswpoaa-1930891.pdf
The first computer program I ever wrote (in 1979, if you must know) was in the statistical package SPSS (Statistical Package for the Social Sciences), and the second computer platform I used was SAS (Statistical Analysis System). Both of these systems are still around today—SPSS was acquired by IBM as part of its BI portfolio, and SAS is now the world’s largest privately held software company. The longevity of these platforms—they have essentially outlived almost all contemporary software packages—speaks to the perennial importance of data analysis to computing.
Packages such as SAS and SPSS gained traction in academic settings because they allowed scientists and researchers to analyze experimental and research data without the tedium of coding in low level languages such as FORTRAN and COBOL. As computing moved into the mainstream of business process, these statistical packages became an important part of decision support systems that seeded the current massive market for business intelligence tools. Not surprisingly SAS and SPSS rode this wave to commercial success.
Ironically, the success of these academically spawned packages made them less attractive for academia. Price tags increased, while the focus on business intelligence did not always align with academic desires.
As a result, professional statisticians sought alternatives to commercial packages. The “S” language, which was designed for statistical programming, seemed an attractive foundation technology. Eventually, an open source implementation of S—called “R”—was released in the late 1990s.
Bo Cowgill from Google summed up R nicely when he said, “The best thing about R is that it was developed by statisticians. The worst thing about R is that ... it was developed by statisticians.” R has a syntax that is idiosyncratic and disconnected from most other languages. However, R makes up for this in extensibility. Anyone can add a statistical routine to R, and thousands of such routines are available in the CRAN package repository. This repository probably represents the most significant open collection of statistical computer algorithms ever assembled.
Possibly the greatest current weakness of R is scalability. R originally was designed to process in-memory sets using single processor machines. Multithreaded computers and massively large data sets pose a real problem for R.
Revolution Analytics has released a commercial distribution of R based on the open source core that addresses some of these multithreading and memory issues, by linking R to multi-threaded math libraries and adding packages for large data set processing.
Last year, Oracle released a version of R integrated within its database and big data appliance. The Oracle distribution of R also attempts to provide better threading and memory handling in the base product. In addition, Oracle has included versions of R packages in which the core processing is offloaded into the database. These packages allow the database engine to parallelize the core number crunching (sums, sums of squares, etc.) that is at the foundation of many statistical techniques.
If the term “big data analytics” has any concrete meaning today, it is in the analytics of fine-grained, massively large data sets in Hadoop and similar systems such as Cassandra. So, it’s not surprising that R and Hadoop are two of the key technologies that form the big data analytic stack. Unfortunately, R’s in-memory and threading limitations don’t align well with Hadoop’s massive parallelism and data scale. Not surprisingly, there are significant efforts underway to tie the two together—projects such as RHadoop, RHIPE, and RHIVE are all worth taking a look at.
R arguably represents the most accessible and feature-rich set of statistical routines available. Despite some limitations, it seems poised to be a key technology in big data.
A latch is like sticking a Post-it note on the data that says, "I am reading this, do not change it." Say I want to read blocks 1 through 999; I have read up to 100 when block 700 is changed. Should 700 be read as A, or as A'?
So the 1-700 range is read based on the data as of the point in time when block 1 was read. That is read consistency.
And the transaction that wants to change 700 is also allowed to go ahead. Is that even possible? That is exactly what is being done.
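A hedged sketch of that scenario as two concurrent sessions, Oracle-style; the account table and its columns are made up for illustration:
-- Session 1: a long-running read that starts at SCN t0.
SELECT SUM(balance) FROM account;
-- Session 2: changes "block 700" while session 1 is still reading, then commits.
UPDATE account SET balance = balance + 100 WHERE account_id = 700;
COMMIT;
-- Session 1 never sees the new value: the changed block is rolled back
-- from UNDO to its image as of t0, so the SUM stays consistent, and the
-- writer in session 2 was never blocked by the reader.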
It is a war of methods over how fast you can read and process.
"I am faster." "I doubt it; I load everything into memory and speed things up this way and that."
Loading into memory is fast, but then you have to attach as much capacity as possible.
If you want distributed processing to finish quickly, go SHARED NOTHING, because only then is done really done.
"I will tear the architecture apart and rebuild it. What exactly do you need... what do you want me to do?"
SINGLE CORE, MULTI CORE, CPU, MEMORY, DISK, SSD, NETWORK, RACK TO RACK, ALGORITHM, THREAD
A was inserted, and then A gets inserted again. Which one is right, anyway? >..<
OldSQL
+ Legacy RDBMS vendors
NoSQL
+ Give up SQL and ACID for performance
NewSQL
+ Preserve SQL and ACID
+ Get performance from a new architecture