SQL Performance Study

SQL 실험장 // Experiment
Scenarios of Workloads
Point Select
Point Select + Range Scan + Random I/O
Point Select +
Range Scan + Aggregation
First of all Start,
====================== 
100. Basic Check with your machine and H/W spec (cpus / disks / networks) by simple sysbench saturation test.
200. Configuration (Size of Buffer Pool or Number of connections) and Settings
300. Real-like data and SQL query test
400. Painful Query Test (point select / range scan included / many joins related / the heaviest table included / the most frequent and intensive workloads )
500. Main Business Query Test 

In order to win PoCs, 
I should have explored SQL query plan and performance with ClustrixDB (aka Real Distributed System and Cloud RDB)
It has uniqueness.
We have some necessary and efficient factor built-in (Cloud RDB에 걸맞게)
 - Cloud RDB has different major factors
   1. Network
   2. Distribution (should be evenly across nodes)

   (Distribution in Data Level)
   *Container(Unique) Key for btrees / Distribution(= a prefix of Container Key) Key for Consistent Hashing - Slice Selection (Row Level)
   *Slices (Paralellism) / 
   *Replicas(ALLNODES for less network forwards & ranking replica for select) / 
   *Distribute Key (how to evenly distribute or to store data w/ access pattern) /
   Splitting Slices and Re-ranking replica for better balance with node workloads (avoid one hot and busy node situation) 
   Column Order 


SQL 바르게 쓰기가 서비스/업무/제품/업무/생산성/비용에 미치는 효과에 대한 연구
=> 현 상황에서 무엇을 도울 수 있는가 ? 
=> Query Case 를 보라 ? 만만치 않다
   당신의 요구사항이 복잡할 수록 쿼리는 복잡해지고 길어지고 많은 Resource 를 요구한다
   그럼에도 User Experience / Response Time / getting it done in timely manner 당연하게 나올 것이라 생각하는가 
   일정의 무리가 생기고 대충 매우 적은 데이터 / 비현실적인 데이터 건수 수준을 가지고 결과가 나오면 된 것 처럼 생각하고 가고
   원하는 데이터가 맞게 나오는지 검증도 없이 쓰는 경우가 존재해 버린다. (*실시간 대쉬보드 수치와 같이 잘못된 값임에도 쉽게 검증이 어려운 경우가 그러하다)

  *데이터디자인
    - PK 가 충분히 검토되어 최적정으로 선정하지 않았을 때  (At first, please make sure that PK is not AutoIncrement Column in general)
    - PK 가 Access Pattern 에 대한 고민 없이 만들어 졌을 때 (Where 절에 사용될 수 있는 정도 / Business Understanding 이해)
    - 임시처방적으로 인덱스를 설계하거나 중복적인 인덱스가 있을 때 // 과연 정말 필요한 인덱스인가 (수/컬럼구성/컬럼구성순서/컬럼타입/값)
    - 테이블을 통합형태가 아닌 지나친 분리를 시도하였을 때 불필요한 UNION ALL 들이 (여러 테이블에 대한 다중 접근이 발생) 나타나는 경향이 있음
    - Index SkipScan 형태 그리고 인덱스의 앞선 컬럼의 데이터타입에 따라 뒤에 따라오는 값이 인덱스를 타지 않는 경우 ex. price / quantity 
    - 사람은 아는 특정값 패턴임에도 기계(통계정보)가 확보해서 사용/적용할 수 없는 경우
    - 논리적으로 스마트한 조건추가로 데이터 범위 줄이기
    - x BETWEEN A and A9999/A235959 =>바꿔써라=>   x >= A and x < B 명확하게


  *Data Skewness 
    - 일반적인 데이터 값에서는 잘 작동하나, 극단적인 값(특정 Account / 자정거래 / 특정 인기 폭발 앱)의 경우에는 잘 작동하지 않을 때 (그러나 비즈니스 차원에서는 중요)
    - 기존 Index 도 돕지 못하는 경우 (*괜찮은 인덱스 임에도 // 좀 더 치밀한 설계가 필요)

  *Query Patterns ()
    - Subquery 
    - INLIST (Lots of items such as tens thousands)
    - Not exist
    - status_code <> 'Y' , status_code = 'N' or status_code IN ('N', 'U', 'K')
    - Left Join
    - Data Quality (because of NULL unexpected)
    - Union all and Union
    - SELECT for UPDATE 남발 (꼭 필요한 때에 / 범위를 줄여서)
    - Batch 형 Update 를 비효율적으로 짜는 경우
    - Batch 형태를 단 한방에 하고 싶은 경우 (봐서 어떤 때는 Divide and Conquer 을 써라)
    - 불안해서 사용하는 NULL 처리 (데이터 품질 문제)
    - i.e. FNULL(SUM(negative)
    - select 절 에서 * 남발


    - Unnecessary Lookup wider range scan more than required
    - Not Access but filtering after looking(reading) up
    - (occasionally) Wrong results without proper columns explicitly related to GROUP BY clause
    - Index Ignored (wrong selection of Driving Table)
    - No Bind Variables used - causing expensive Hard Parsing(Generating Execution Plan)
    - Too many indexes related to tables with intensvie writes(insert and update)
    - SELECT FOR UPDATE WITH excessive range / divide and conquer strategy required
    - Long lasting transactions which cause lots of locks with wider range more than required (Begin and Commit)
    - No smart conditions narrowing data with inline view(subset)
    - Much functions and operations in early stage of the query (inline view - inner query) *cf. unnecessary ops in earlier step 

    -*Many things come from and originate from Schema Designs (not well desgined)
    - Longer access pattern (i.e. Always Lookup index then random access to table because of no right primary key)

*Compared to Oracle
    - less smart Query Execution Plan Generation (Join Order / Push Predicate)
    - Smart Scan Offloading


  *Tuning Points
   - Lookup data as much as necessrily
   - Narrow the data range as possible as you can
   - Check the rows of table and cardinality/selectivity when lookup data (group by a, b, c => count(*))
   - Wrong selection of Driving Table
   - Adding Logically Possible Conditinos
   
*현실의 쿼리 (in real battle field) 

*우리가 가진 무기
  Index 의 종류
  Partition 의 종류
  Container 의 종류
  Explain 유형
  Functions

* Hot Potatoes
  Primary Key vs Surragte Key 
  Surrogate Key :: Auto_Increment
  Foreign Key (especially with Distributed System like ClustrixDB, due to Network burden)

* Must Consider Followings  
  Selectivity
  Cardinality
  Access Pattern
  Degree of Narrowness
  Clustering Factor (in Oracle)
 
저작자표시 비영리 변경금지
'SQL' 카테고리의 다른 글

부하 테스트 (0)	2017.05.09
Clustrix, one of the NewSQL (0)	2017.05.02
oracle hint 정리 단순참고용 (0)	2011.08.26
SQLP 2회 후기 (작성중) (0)	2011.08.03
2011. 03 월 2회 SQLP 도전 --- 결과 ^^ (작성중) (1)	2011.08.03
Jacob's Cafe

SQL Performance Study

'SQL' 카테고리의 다른 글

티스토리툴바

SQL Performance Study

'SQL' 카테고리의 다른 글

'SQL' Related Articles

티스토리툴바