Index – Ordering of Columns

Post my previous Blog on Consistent Gets for an Index Scan, there were few other queries that I received over the chat. The queries were related to Clustering Factor of an Index and therefore, the recommendation or comment was that the Columns should be ordered to keep the value of a Clustering Factor as low as possible. The other comment was that High Cardinality or Low Cardinality plays a critical role with an exception for Unique or Primary Key Indexes. Again, Clustering Factor was cited as the reason behind this comment. Thought of clarifying this as well. In my opinion, this is again a MYTH.

The myth that I am going to address here is – Indexes and Column Ordering should be designed such that the Clustering Factor is as low as possible. This means, an Index on (B,A) should be preferred than (A,B) if the clustering factor of (A,B) is higher than (B,A).

Assuming, I have application queries, like :

select C, D from TABLE_A where A=:b1;

select C, D from TABLE_A where B=:b1 and A=:b2;

If I go by the above myth, shall I then create an Index on (B,A) ? Will this index be used for Query#1 ? Remember, Index Skip Scan is an Optimizer decision and is chosen if the leading column is a low cardinality column. The definition of low cardinality would be an Optimizer decision based on the costing. If an index on (B,A) is not used by Query#1, then shall I go ahead and create another Index on A ? Indexes improve query performance but are overhead for DML’s and therefore, less the number of Indexes, better the DML performance would be. Further, if a column is badly clustered, whether it is a leading column or a trailing column, it doesn’t matter and we will see this with an example. Therefore, as I mentioned in earlier and in my previous blog, Column Ordering of an Index should be basis your application queries and should not be designed basis any other table or column level statistics.

In this case, I am creating a table VIVEK_TEST with 100K rows. Column BC stands for BAD_CLUSTERED and will dominate the clustering factor. There are other 2 columns of interest – Object_ID which is a non unique column and RNUM_UNQ which is a Unique Column.


exec dbms_random.seed(0);

create table vivek_test as
with test as
(select * from all_objects where rownum between 1 and 10000)
select  trunc((rownum-1)/1000) object_id,
	rownum	rnum_unq,
        round(dbms_random.value(1,100),0) bc,
        a.object_name,
        a.temporary,
        a.created
from    test a,
        test b
where  rownum between 1 and 10000;

exec dbms_stats.gather_table_stats(user,'VIVEK_TEST');

column owner for a30
select owner, num_rows, blocks from dba_tables where table_name='VIVEK_TEST';

OWNER                  NUM_ROWS     BLOCKS
-------------------- ---------- ----------
SCOTT                     10000         62

First, I will create 4 Indexes – 2 of these are Single Column Indexes and 2 are Multi-Column.

create index vivek_test_oid on vivek_test(object_id);
create index vivek_test_oid_bc on vivek_test(object_id, bc);
create index vivek_test_bc on vivek_test(bc);
create index vivek_test_bc_oid on vivek_test(bc,object_id);

column index_name for a30
select index_name, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
from    dba_indexes
where   table_name='VIVEK_TEST'
order by 1;

INDEX_NAME                       NUM_ROWS     BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
------------------------------ ---------- ---------- ----------- ------------- -----------------
VIVEK_TEST_BC                       10000          1          20           100              4642 <-- Single Column on BC (Bad_Clustered)
VIVEK_TEST_BC_OID                   10000          1          24          1000              4642 <-- BC is the Leading Column
VIVEK_TEST_OID                      10000          1          20            10                56
VIVEK_TEST_OID_BC                   10000          1          24          1000              4856 <-- BC is the Trailing Column

As seen in the example above, column ordering doesn’t matter when it comes to the Clustering Factor. My choice of Indexing would be :

VIVEK_TEST_BC_OID if most of my queries are on 
	BC=:b1;
	BC =:b1 and object_id=:b2;

VIVEK_TEST_OID_BC if most of my queries are on 
	object_id=:b1;
	BC=:b1 and object_id=:b2;

Now, I will drop these indexes and create new indexes on the combination of BC and RNUM_UNQ. RNUM_UNQ is 100% Unique Column.

create index vivek_test_rnum on vivek_test(rnum_unq);
create index vivek_test_rnum_bc on vivek_test(rnum_unq, bc);
create index vivek_test_bc on vivek_test(bc);
create index vivek_test_bc_rnum on vivek_test(bc, rnum_unq);


column index_name for a30
select index_name, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
from    dba_indexes
where   table_name='VIVEK_TEST'
order by 1;

INDEX_NAME                       NUM_ROWS     BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
------------------------------ ---------- ---------- ----------- ------------- -----------------
VIVEK_TEST_BC                       10000          1          20           100              4642 <-- Single Column on BC (Bad_Clustered)
VIVEK_TEST_BC_RNUM                  10000          1          26         10000              4642 <-- Leading Column dominates CF
VIVEK_TEST_RNUM                     10000          1          21         10000                56
VIVEK_TEST_RNUM_BC                  10000          1          26         10000                56


For an Index with a Column, either leading or trailing, with 100% Unique values, the leading column would dictate the calculation of Clustering Factor. This is obvious as for the Clustering factor calculation, the values are sorted on the first_column then on Second_Column. For a better idea of how is it calculated, you may run the following queries :

## CF Calculation for BC, RNUM_UNQ (it matches the value as per DBA_INDEXES)

select  sum(block_change) from (
select  block_fno, bc, RNUM_UNQ, prev_bfno,
        (case when nvl(prev_bfno,0)!=block_fno then 1 else 0 end) block_change from (
        select  block_fno, bc, RNUM_UNQ, lag(block_fno) over (order by bc, rnum_unq) prev_bfno from (
                select  dbms_rowid.rowid_block_number(rowid)||'.'||
                        dbms_rowid.ROWID_TO_ABSOLUTE_FNO(rowid,'&Schema','&table_name') block_fno,
                        bc, RNUM_UNQ
                from  VIVEK_TEST
                order by bc, RNUM_UNQ, block_fno)
        )
);

SUM(BLOCK_CHANGE)
-----------------
             4642


## CF Calculation for RNUM_UNQ, BC (it matches the value as per DBA_INDEXES)

select  sum(block_change) from (
select  block_fno, RNUM_UNQ, bc, prev_bfno,
        (case when nvl(prev_bfno,0)!=block_fno then 1 else 0 end) block_change from (
        select  block_fno, RNUM_UNQ, bc, lag(block_fno) over (order by rnum_unq, bc) prev_bfno from (
                select  dbms_rowid.rowid_block_number(rowid)||'.'||
                        dbms_rowid.ROWID_TO_ABSOLUTE_FNO(rowid,'&Schema','&table_name') block_fno,
                        bc, RNUM_UNQ
                from  VIVEK_TEST
                order by RNUM_UNQ, bc, block_fno)
        )
);

SUM(BLOCK_CHANGE)
-----------------
               56

This query can be further modified to get the calculation. Remove the SUM(BLOCK_CHANGE) SELECT from the query and you could generate the output with the calculation. For every Index Entry, if the table block is changed, the BLOCK_CHANGE value is set to 1. This query will help you understand the calculation. This holds true for a combination of Non-Unique columns as well. It is not that for column involving Unique Column, the calculation is different and this will be clear if you deep dive into the queries pasted above.

During my session, I also mentioned a FLAW in Clustering Factor. It is that it doesn’t take into consideration the caching effect and to address this, Oracle introduced TABLE_CACHED_BLOCKS preference. I am against setting manually setting clustering_factor to a lower value as it will be overwritten by the next statistics gathering process. Also, I am STRONGLY against CTAS using ORDER BY to improve the clustering_factor. By using CTAS, you can improve the clustering_factor of one index, but it will impact this for other indexes. The best way is to use the TABLE_CACHED_BLOCKS preference as depicted below.

I will use the queries pasted above to come out with the actual clustering_factor and will then set the TABLE_CACHED_BLOCKS preference. For demonstration, I will work on an Index on BC, RNUM_UNQ. This process can be used for any of the indexes.

## I will create a temporary table to hold the clustering_factor calculation


create table cluf_factor as
select bc, blkno,
        lag(blkno,1,blkno) over(order by bc) prev_blkno,
        case when blkno!=lag(blkno,1,blkno) over(order by bc) or rownum=1
           then 1 else null end cluf_ft from
(select bc, rnum_unq, dbms_rowid.rowid_block_number(rowid) blkno
from    VIVEK_TEST
where   BC is not null
order by bc);

compute sum of cnt on report
break on report
select blkno, count(*) cnt from cluf_factor group by blkno order by 1;

     BLKNO        CNT
---------- ----------
    255883        188
    255884        185
    255885        185
    255886        185
    255887        185
    255936        182
    255937        180
    255938        180
    255939        180
    255940        180
    255941        180
    255942        180
    255943        180
    255945        180
    255946        180
    255947        180
    255948        180
    255949        180
    255950        180
    255951        180
    258896        180
    258897        180
    258898        180
    258899        180
    258900        180
    258901        180
    258902        180
    258903        180
    258905        180
    258906        180
    258907        180
    258908        180
    258909        180
    258910        180
    258911        180
    260480        180
    260481        180
    260482        180
    260483        180
    260484        180
    260485        180
    260486        180
    260487        180
    260489        180
    260490        180
    260491        180
    260492        180
    260493        180
    260494        180
    260495        180
    260496        180
    260497        180
    260498        180
    260499        180
    260500        180
    260501         70
           ----------
sum             10000

56 rows selected.

As per the output, each of the block was touched around 180 times. Each of these touch were considered as it these were read from the Disk. However, Oracle purely works in Memory (Buffer_Cache, Shared_Pool etc) and once the block is read from disk, it is cached in the Buffer_cache thus avoiding the disk i/o. This flaw causes unwanted high clustering_factor and therefore, is addressed by way of TABLE_CACHED_BLOCKS.

TABLE_CACHED_BLOCKS value defaults to 1. Increasing this value (to, say 250) means, clustering_factor for an Index will not be incremented, if the table block being referenced by the current Index entry has been referenced by any of the previous 250 Index Entries. In our case, each of the table blocks were referenced by previous 180 index entries. Therefore, I will change this value to 180.

select index_name, leaf_blocks, clustering_factor from dba_indexes
where table_name = 'VIVEK_TEST';

INDEX_NAME                     LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ----------- -----------------
VIVEK_TEST_RNUM                         21                56
VIVEK_TEST_RNUM_BC                      26                56
VIVEK_TEST_BC                           20              4642
VIVEK_TEST_BC_RNUM                      26              4642

select dbms_stats.get_prefs(pname=>'TABLE_CACHED_BLOCKS',ownname=>'SCOTT',tabname=>'VIVEK_TEST') preference from dual;

PREFERENCE
-----------------------------------------------------------
1

## Setting TABLE_CACHED_BLOCKS to 180
exec dbms_stats.set_table_prefs(ownname=>'SCOTT',tabname=>'VIVEK_TEST',pname=>'TABLE_CACHED_BLOCKS',PVALUE=>180);

select dbms_stats.get_prefs(pname=>'TABLE_CACHED_BLOCKS',ownname=>'SCOTT',tabname=>'VIVEK_TEST') preference from dual;

PREFERENCE
------------------------------------------------------------
180

exec dbms_stats.gather_table_stats(user,'VIVEK_TEST',method_opt=>'for all columns size 1', no_invalidate=>false);

select index_name, leaf_blocks, clustering_factor from dba_indexes
where table_name = 'VIVEK_TEST';

INDEX_NAME                     LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ----------- -----------------
VIVEK_TEST_RNUM                         21                56
VIVEK_TEST_RNUM_BC                      26                56
VIVEK_TEST_BC                           20                56
VIVEK_TEST_BC_RNUM                      26                56

Clustering Factor is calculated for Unique Indexes as well and is critical for costing. The only time when clustering_factos is not considered or not relevant for Unique or Primary Keys is when querying for all the columns part of the Unique or Primary Key i.e. INDEX UNIQUE SCAN. However, clustering_factor will play a critical role when accessing the Unique or Primary Key using INDEX RANGE SCAN.

With this blog, I tried addressing following Myth’s –

- 	Indexes and Column Ordering should be designed such that the Clustering Factor is as low as possible. 
- 	Clustering_Factor is irrelevant for Unique and Primary Key Indexes.
-	High Cardinality or Low Cardinality play important roles with an exception for Unique & Primary Key.

Be it a Unique Index / Primary Key Index or any other Index, column ordering has to be basis your application queries.

One interesting fact on TABLE_CACHED_BLOCKS : While testing this, I actually created a table with 100k rows. The number of blocks in the table were around 469 and the table block were referenced by around 230 previous index entries. Therefore, I modified this preference to 230 and regathered the statistics. It didn’t work. The same table creation script with 10k rows, everything worked as expected. I am investigating this and will update the blog, once I have something to share.

Advertisements

Consistent Gets for an Index Scan

This question was raised by a participant during my User Group Session on “Indexing : Fact & Myth Session”. Therefore, I thought of writing about it with the same example that I demonstrated..

I was demonstrating on a Myth that “High Cardinality Column should be a Leading Column of an Index” and for this, I created following table with 2 Indexes.

create table t1 as
select * from all_objects;

exec dbms_stats.gather_table_stats(user,'T1');

SQL> select owner, num_rows, blocks from dba_tables where table_name='T1' and owner='SCOTT';

OWNER                  NUM_ROWS     BLOCKS
-------------------- ---------- ----------
SCOTT                     68605       1377

SQL> select column_name, num_distinct, num_nulls from dba_tab_columns
where   owner='SCOTT'
and     table_name='T1'
and     column_name in ('OBJECT_ID','TEMPORARY')
order by 1;

COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS
------------------------------ ------------ ----------
OBJECT_ID                             68605          0
TEMPORARY                                 2          0

SQL> create index t1_ot on t1(object_id, temporary);

Index created.

SQL> create index t1_to on t1(temporary,object_id);

Index created.

SQL> select index_name, blevel, leaf_blocks, clustering_factor from dba_indexes
where   table_name='T1'
order by 1;

INDEX_NAME                         BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ----------- -----------------
T1_OT                                   1         171              1458
T1_TO                                   1         171              1494

So, I have a table with 68605 rows. Object_ID is 100% Distinct and Temporary has 2 Distinct Values. I have 2 indexes on it, which are on object_id & temporary. T1_OT is on (Object_ID, Temporary) and T1_TO is on (Temporary, Object_ID). The naming convention stands for the first letter of the columns in the order they are in the Index. So, for T1_OT O->Object_id and T->Temporary.

The Index Statistics shows that the two indexes are almost same, in terms of Height (BLEVEL), number of Leaf Blocks. A minor different in the Clustering_factor though.

I than executed the queries to demonstrate that the I/O’s done by queries using any of the 2 indexes is exactly same. For this, I executed the query and it used Index T1_OT and and then hinted the query to make use of T1_TO Index. The Cost and IO’s for both the queries are exactly same, which leads to a conclusion that cardinality doesn’t matter when designing a Index Strategy. Index Columns should be based on Application Queries and the Columns. The queries were executed twice to ensure that the consistent gets that we are post the hard parsing.


## IO's for the Query using an Index T1_OT

select owner, object_name from t1
where      object_id = 58
and        temporary='N';

OWNER                OBJECT_NAME
-------------------- ------------------------------
SYS                  I_CCOL2

set autot trace
select owner, object_name from t1
where      object_id = 58
and        temporary='N';

Execution Plan
----------------------------------------------------------
Plan hash value: 3109227855

---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |     1 |    48 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |     1 |    48 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T1_OT |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=58 AND "TEMPORARY"='N')

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets 
          0  physical reads
          0  redo size
...
...
          1  rows processed

## IO's with Index T1_TO

SQL> select /*+ index(t1,t1_to) */ owner, object_name from t1
     where      object_id = 58
     and        temporary='N';

OWNER                OBJECT_NAME
-------------------- ------------------------------
SYS                  I_CCOL2

Elapsed: 00:00:00.00
SQL> pause;

set autot trace
select /*+ index(t1,t1_to) */ owner, object_name from t1
where      object_id = 58
and        temporary='N';


Execution Plan
----------------------------------------------------------
Plan hash value: 1129381402

---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |     1 |    48 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |     1 |    48 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T1_TO |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("TEMPORARY"='N' AND "OBJECT_ID"=58)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
...
...
          1  rows processed

While I have 2 Indexes on the same columns only the ordered changed, optimizer chose an Index on Object_ID and Temporary. As my sessions are interactive (and this sometimes mean that my sessions take more time than alloted :)), I asked the participants the reason behind this optimizer decision and there were lot many assumptions. Anju Garg came out with the correct guess. However, will disclose this later.

At this point, the question raised was, why there are 4 Consistent I/O’s? The assumption here was that it should be 3 (BLEVEL + LEAF BLOCK Access + Table Block).

Next, I dropped any one of the Index and re-created it as a Unique Index. Remember, Object_ID is 100% Distinct. So, I will drop and re-create T1_OT.

SQL> drop index t1_ot;

Index dropped.

Elapsed: 00:00:00.04
SQL> create unique index t1_ot_uq on t1(object_id, temporary);

Index created.

SQL> select index_name, blevel, leaf_blocks, clustering_factor, uniqueness from dba_indexes
where   table_name='T1'
order by 1;
  2    3
INDEX_NAME                         BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR UNIQUENES
------------------------------ ---------- ----------- ----------------- ---------
T1_OT_UQ                                1         162              1457 UNIQUE
T1_TO                                   1         171              1493 NONUNIQUE

I will then execute the queries to check for the consistent gets. This time, the consistent gets for the execution plan with Unique Index is 3 (as against 4 for the same non-unique index).

select owner, object_name from t1
where      object_id = 58
and        temporary='N';

Execution Plan
----------------------------------------------------------
Plan hash value: 1959391432

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     1 |    48 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1       |     1 |    48 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | T1_OT_UQ |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=58 AND "TEMPORARY"='N')


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          0  physical reads

There is a difference in the Consistent Gets with Unique and Non-Unique Index. The height of both these Indexes are exactly same. This difference was important to get to original question. So, I generated a 10046 trace for the 2 Queries (with Unique and Without Unique Scan) and the relevant portion from the trace is as under, which will explain the reason behind 4 Consistent Gets.

## 10046 portion for Non-Unique index. Please see the BOLD and UNDERLINED portion. The extra cr=1 for FETCH#18446604434610142176.
## In this case, once the Blocks are fetched from an Index (cr=2) and Table (cr=1) Total cr=3, there is an extra cr immediately after 
## SQL*Net message to client. So, the total cr = 2 + 1 + 1 (extra) = 4.

PARSE #18446604434610142176:c=118,e=119,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1259244381,tim=7093069466
EXEC #18446604434610142176:c=88,e=93,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1259244381,tim=7093070117
WAIT #18446604434610142176: nam='SQL*Net message to client' ela= 8 driver id=1413697536 #bytes=1 p3=0 obj#=0 tim=7093070421
WAIT #18446604434610142176: nam='db file scattered read' ela= 2292 file#=12 block#=275312 blocks=8 obj#=77545 tim=7093073241
WAIT #18446604434610142176: nam='db file sequential read' ela= 49 file#=12 block#=255875 blocks=1 obj#=77543 tim=7093073719
FETCH #18446604434610142176:c=2088,e=3043,p=9,cr=3,cu=0,mis=0,r=1,dep=0,og=1,plh=1259244381,tim=7093073891
WAIT #18446604434610142176: nam='SQL*Net message from client' ela= 624 driver id=1413697536 #bytes=1 p3=0 obj#=77543 tim=7093074878
FETCH #18446604434610142176:c=104,e=104,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=1259244381,tim=7093075164
STAT #18446604434610142176 id=1 cnt=1 pid=0 pos=1 obj=77543 op='TABLE ACCESS BY INDEX ROWID T1 (cr=4 pr=9 pw=0 str=1 time=3109 us cost=2 size=48 card=1)'
STAT #18446604434610142176 id=2 cnt=1 pid=1 pos=1 obj=77545 op='INDEX RANGE SCAN T1_TO (cr=3 pr=8 pw=0 str=1 time=2835 us cost=1 size=0 card=1)'
WAIT #18446604434610142176: nam='SQL*Net message to client' ela= 6 driver id=1413697536 #bytes=1 p3=0 obj#=77543 tim=7093076331

## 10046 portion for Unique index. In this case, once the Blocks are fetched from an Index (cr=2) and Table (cr=1), there is no extra cr.

PARSE #18446604434610123376:c=134,e=135,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1959391432,tim=7120639467
EXEC #18446604434610123376:c=89,e=90,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1959391432,tim=7120640009
WAIT #18446604434610123376: nam='SQL*Net message to client' ela= 8 driver id=1413697536 #bytes=1 p3=0 obj#=0 tim=7120640248
WAIT #18446604434610123376: nam='db file scattered read' ela= 5875 file#=12 block#=260496 blocks=8 obj#=77546 tim=7120646397
WAIT #18446604434610123376: nam='db file sequential read' ela= 64 file#=12 block#=255875 blocks=1 obj#=77543 tim=7120646786
FETCH #18446604434610123376:c=1407,e=6569,p=9,cr=3,cu=0,mis=0,r=1,dep=0,og=1,plh=1959391432,tim=7120646947
STAT #18446604434610123376 id=1 cnt=1 pid=0 pos=1 obj=77543 op='TABLE ACCESS BY INDEX ROWID T1 (cr=3 pr=9 pw=0 str=1 time=6549 us cost=2 size=48 card=1)'
STAT #18446604434610123376 id=2 cnt=1 pid=1 pos=1 obj=77546 op='INDEX UNIQUE SCAN T1_OT_UQ (cr=2 pr=8 pw=0 str=1 time=6240 us cost=1 size=0 card=1)'
WAIT #18446604434610123376: nam='SQL*Net message from client' ela= 583 driver id=1413697536 #bytes=1 p3=0 obj#=77543 tim=7120652705
FETCH #18446604434610123376:c=15,e=15,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,plh=1959391432,tim=7120652925
WAIT #18446604434610123376: nam='SQL*Net message to client' ela= 5 driver id=1413697536 #bytes=1 p3=0 obj#=77543 tim=7120653074

From this, I can assume that the steps carried out for a Non-Unique Index is as under (as it an Index-Range Scan).

-	Read the Root Block to get the address of the Leaf Block (IO = 1). 
- 	Read the Leaf Block to check for the matching Object_ID and Temporary Column. Get the ROWID for the table block. (IO = 2).
-	Go to the Table Block to read the other required columns listed in the SELECT list. (IO=3).
-	Go back to the Index Block to check for any other Object_ID and Temporary Values. (IO=4).
-	It is here that it gets to know that there are no more rows.

Bullet Point 4 is not required for a Unique Scan as Oracle is aware that it is a Unique Index and therefore, there will be no additional read required. Further, to confirm this, I executed a query on OBJECT_ID using an Unique Index. Remember, while Object_Id is 100% Unique, but the Unique Index is on the 2 columns and I am not referring the 2nd column in the query, which will change the plan from UNIQUE SCAN to RANGE SCAN.

select owner, object_name from t1
where      object_id = 58;

Execution Plan
----------------------------------------------------------
Plan hash value: 1834913555

------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     1 |    46 |     3   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1       |     1 |    46 |     3   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T1_OT_UQ |     1 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OBJECT_ID"=58)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets 

So, this answers the question raised during the session on the rationale behind 4 Consistent Reads. Unique and Non-Unique Indexes are internally same with only a difference in the way these are accessed. With Index Range Scan, the behaviour of both the indexes are exactly same.

Now, for the another question on why the optimizer used T1_OT (by default) and not T1_TO ? The reason was a TIE between the 2 indexes, which is usually very rare with Cost Based Optimization and due to the TIE, optimizer preferred an Index in an alphabetical order. TI_OT comes before TI_TO. To demonstrate this, I dropped and recreated T1_TO as T1_1TO and optimized started using T1_1TO by default.

Cloud Day : North India User Group Chapter

Presenting multiple sessions during Cloud Day scheduled in Gurgaon on 11th November 2017. Will be speaking on “Oracle IaaS and PaaS”, “Oracle In-Memory Database” and “Indexes : Myths and Misconceptions”. For registration, click on :

https://www.meraevents.com/event/aioug-nic-cloud-day

Performance Tuning Day – Mumbai Chapter

Presenting a Full Day session on Performance Optimisation for Mumbai Chapter of All India Oracle User Group. This is on 5th August 2017. For registration, click on this Performance Tuning Day. I have revised the content of this session, to incorporate the latest Optimiser behaviour and changes.

See you all there…

Oracle OpenWorld 2017 Delhi – My Session

I will be presenting at the Oracle OpenWorld 2017, Delhi scheduled for 9th and 10th May 2017. My session detail :

Session ID: CON1092
Session Title: High-Performance Database in Oracle Bare Metal Cloud Services

Hope to see you all at the OOW.

Create Table like External – Hive

There is no end to the Technical Learning. I have been working on Oracle Database Technologies for almost 15 years and thought of learning something new. This thirst got me into Big Data and Hadoop. I started a kind of a self-learning and believe me, I found it very interesting. This blog is about one of my encounter during a practice session on Hive. Usually, I blog to help my readers to know about my experiences. This is again on the same line, but will need resolution or validation from experts across the globe.

My understanding on External and Internal Table is as under:

  • Internal Tables store the data populated into a directory specified under metastore.warehouse.dir or if the location is explicitly specified during table creation. Once the table is dropped, the underlying data is removed as well. This gives you full control over the data. I can relate this to Oracle Tables with a difference that the data is stored inside the database and it is a logical structure.
  • External Tables do not store data but points to the data. Hive doesn’t have the control over the data as it is shared by other tools like Pig etc. Dropping of External table does not  remove the data from the storage. This is similar to the External Tables of Oracle, where we create the structure of the table similar to the format of the txt or csv file.

I created an Internal Table and could observe the behavior explained above. Dropping a table removes the data as well. External table, on the other hand, does not remove it and therefore, my understanding looks fine here. Further, I was also working on a CREATE TABLE syntax and the understanding here was :

  • CREATE TABLE will create an Internal Table as this is the default.
  • For External Table, we need to specify CREATE EXTERNAL TABLE command
  • However, CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table.

 

The first 2 bullet points are fine. The problem was with the third bullet point. The outcome is as under :

## let me use the database that I created for my own practice
use vivek;

drop table emp;

create table emp (
empno int comment 'This is Employee Number',
ename string comment 'Employee Name',
country string,
city string,
zipcode int)
row format delimited fields terminated by ',';

load data local inpath '/home/cloudera/emp.txt' into table emp;

hive (vivek)> select * from emp limit 5;
OK
emp.empno       emp.ename       emp.country     emp.city        emp.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005

## Describing the table. See the Bold and underlined text

hive (vivek)> describe formatted emp;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:22:40 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/hive/warehouse/vivek.db/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        totalSize               3501
        transient_lastDdlTime   1486016570

Time taken: 0.255 seconds, Fetched: 34 row(s)

## Now, lets create an External Table. The data for which is already copied onto Hadoop

create external table emp_ext (
empno int comment 'This is Employee Number',
ename string comment 'Employee Name',
country string,
city string,
zipcode int)
row format delimited fields terminated by ','
location '/user/cloudera/emp';

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.116 seconds, Fetched: 5 row(s)

## Describing the table. See the Bold and underlined text

hive (vivek)> describe formatted emp_ext;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:31:25 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486017085

Time taken: 0.216 seconds, Fetched: 37 row(s)

Both the tables contain same data with the difference that EMP is Internal Table and EMP_EXT is an External Table. Internal Table created a directory into HDFS as can be seen from the DESCRIBE FORMATTED OUTPUT for EMP. The directory is /user/hive/warehouse/vivek.db/emp. Dropping EMP will remove this data as well, which is tried and tested. Next, I will create another table from EMP_EXT and will not specify INTERNAL or EXTERNAL keyword.


hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp';
OK
Time taken: 0.21 seconds
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.107 seconds, Fetched: 5 row(s)

## Next, Describe the new table. See the bold and underlined text

hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:41:23 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486017683

Time taken: 0.275 seconds, Fetched: 36 row(s)

The table metadata says that its an Internal Table also referred as MANAGED Tables. My understanding that creating a table from an external table will implicitly create an External table looks to be wrong here. Now, in this case, the location is /user/cloudera/emp which is also shared by EMP_EXT table. I dropped the EMP_EXT1 table to check the impact of this on EMP_EXT.

hive (vivek)> drop table emp_ext1;
OK
Time taken: 0.482 seconds
hive (vivek)> show tables;
OK
tab_name
emp
emp_ext
Time taken: 0.097 seconds, Fetched: 2 row(s)
hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
Time taken: 0.122 seconds

The drop command removed the data from HDFS and querying EMP_EXT fetched no results. This means, even if you create a table from an External table, without specifying EXTERNAL keyword for the CREATE TABLE command, it will create an INTERNAL TABLE. I reconstructed the data again i.e. dropped EMP_EXT and EMP_EXT1.

[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/emp
Found 1 items
-rw-r--r--   1 cloudera cloudera       3501 2017-02-01 22:49 /user/cloudera/emp/emp.txt

use vivek;

hive (vivek)> create external table emp_ext (
            > empno int comment 'This is Employee Number',
            > ename string comment 'Employee Name',
            > country string,
            > city string,
            > zipcode int)
            > row format delimited fields terminated by ','
            > location '/user/cloudera/emp';
OK
Time taken: 0.601 seconds

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.698 seconds, Fetched: 5 row(s)

hive (vivek)> describe formatted emp_ext;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:51:46 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486018306

Time taken: 0.287 seconds, Fetched: 37 row(s)

## Will Create EMP_EXT1

hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp';
OK
Time taken: 0.155 seconds
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.131 seconds, Fetched: 5 row(s)

hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:53:31 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486018411

Time taken: 0.258 seconds, Fetched: 36 row(s)

The output is same as what it was during our previous execution. I am investigating it further whether my understanding “CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table” is wrong or whether it was valid for the earlier versions of Hive. Just for curiosity, I made following changes and Wow..dropping the table didn’t remove the underlying data from HDFS.

hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('table type'='EXTERNAL_TABLE');
OK
Time taken: 0.375 seconds
hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('EXTERNAL'='TRUE');
OK
Time taken: 0.268 seconds
hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:53:31 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        last_modified_by        cloudera
        last_modified_time      1486019158
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        table type              EXTERNAL_TABLE
        totalSize               0
        transient_lastDdlTime   1486019158

Time taken: 0.206 seconds, Fetched: 40 row(s)
 
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.157 seconds, Fetched: 5 row(s)
hive (vivek)> drop table emp_ext1;
OK
Time taken: 0.207 seconds

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.159 seconds, Fetched: 5 row(s)

The ALTER TABLE command changed the properties of the Table from Managed to External and droping this table didn’t drop tbe underlying data from HDFS, which was evident from the fact that I could query from EMP_EXT.

Looking for the answer to my query – CREATE TABLE table_name LIKE external_table – Will it create an Internal Table or an External Table ?

Mumbai Chapter AIOUG Event – 21st Jan 2017

Mumbai Chapter of All India Oracle User group has organized a Tech Day on 21st January 2017. I am presenting too and this time on a new topic “Basics of Big Data”. Please do join us and motivate the Mumbai Chapter Team for more such events in future. For Registration and Detailed Agenda, please click on the following link:

Mumbai Chapter 21st Jan 2017

Effective SQL: 61 specific ways of writing effective SQL

Few months back, I completed a technical review of an upcoming book titled “Effective SQL: 61 specific ways of writing effective SQL”. This book focuses on widely used Relational Databases and, no doubt, Oracle is one of these. My review was specifically on Oracle SQL. Got to know from the publisher that this book is scheduled to be out in the market by Friday, 23rd December 2016. The book can be ordered from the following link:

effective-sql-61-specific-ways-to-write-better-sql-9780134578897

 effectivesql

Optimizer – Part IV (12c Enhancements)

This is the final part of my 4 part series on Optimizer. The previous 3 parts can be viewed clicking the following links:

In this blog, I will go through the 12c enhancements. Before we go through the 12c enhancement, let me also briefly cover the real life example that motivated me to write this series. I covered this in Part III. In that post, I missed to paste the relevant output of 10053 trace file, so let me take this up again in this post.

The problem query and it’s run time plan is pasted below.

select count(*) from nca.s_p_attributes a1
WHERE   a1.value='olwassenen';

    COUNT(*)
------------
      591168

SQL> select plan_table_output from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
----------------------------------------------------------
SQL_ID	79dfpvydpk710, child number 0
-------------------------------------
select count(*) from nca.s_p_attributes a1 WHERE
a1.value='olwassenen’
----------------------------------------------------------
| Id  | Operation	        | Name 	        | Rows	| 
----------------------------------------------------------
|   0 | SELECT STATEMENT 	|		|	|
|   1 |  SORT AGGREGATE  	|		|      1| 
|*  2 |   INDEX SKIP SCAN	| SP_P_IND3     |      8| 
----------------------------------------------------------

As can be seen, the optimizer calculation is way out (Actuals = 591168 v/s Assumptions=8). Believe me, the table and the column used in the WHERE predicate has a height balanced histogram on it. This will be clearly visible in the 10053 trace. The relevant portion of the 10053 trace is as under.

SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for S_P_ATTRIBUTES[A1] 
  Column (#3): 
    NewDensity:0.000000, OldDensity:0.001202 BktCnt:254, PopBktCnt:122, PopValCnt:20, NDV:35078144
  Column (#3): VALUE(
    AvgLen: 11 NDV: 35078144 Nulls: 0 Density: 0.000000
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 153
  Table: S_P_ATTRIBUTES  Alias: A1
    Card: Original: 541600373.000000  Rounded: 8  Computed: 8.02  Non Adjusted: 8.02

This was a Non-Popular value and therefore, the selectivity calculation for Non-Popular value (NewDensity in 10053) is as under:


[(NPBKTCNT)/(BKTCNT * (NDV – POPVALCNT))]
[(254-122)/(254 * (35078144-20))] = 132/(254 * 35078124) = 132/8909843496 = 0.0000000148

The calculated selectivity multiplied by the number of rows in the table is the expected cardinality i.e. 0.0000000148*541600373=8.

Now, let’s briefly discuss the 12c Enhancements. Oracle Database 12c introduced additional histograms. These are : Top-n Frequency and Hybrid Histograms. There are few criteria’s for these histograms to be generated. The criteria is :

Following variables are used -
n - Number of Buckets either explicitly specified or left to default (254)
NDV - Number of Distinct Values in the Column
p - Internal Percentage, which is calculated as (1-(1/n))*100

If Data Skew Observed ?
   If NDV > n ?
      If estimate_percent => auto_sample_size ?
         if %of rows for topn frequent values >= p ?
            generate topn frequency histogram;
         else if
            generate hybrid histogram;
         end if;
      else if
         generate height balanced histogram;
      end if;
   else if
      generate frequency histogram;
   end if;
end if;

For the simplicity, I have color coded each IF-ELSE-ENDIF statement. From the preceding pseudo-code, it is clear that to generate 12c specific histograms, we need to specify estimate_percent as auto_sample_size, which is a default. So, if your statistics gathering policy states manual percentage for estimate, then it won’t gather these new histograms. The new auto_sample_size algorithm has improved a lot and therefore, you can safely change the manual estimate to auto.

The value of p requires some explanation. It is calculated as (1-(1/n))*100, where n is the number of buckets specified during stats gathering. Consider following 4 examples :

Example 1 : PROD_ID has 72 Distinct Values. As the bucket size specified is 100, which is more than the number of distinct values, Optimizer will generate Frequency Histogram.

execute dbms_stats.gather_table_stats(ownname=>'SH',tabname=>'SALES',method_opt=>'for columns PROD_ID size 100');

Example 2 : PROD_ID has 72 Distinct Values. In this case, the number of bucket (n) is less than the number of distinct values (72). However, the script explicitly estimates the percent as 30 and therefore, it will generate Height Balanced Histogram.

execute dbms_stats.gather_table_stats(ownname=>'SH',tabname=>'SALES',method_opt=>'for columns PROD_ID size 50', estimate_percent=>30);

Example 3 : PROD_ID has 72 Distinct Values. In this case, the number of bucket (n) is less than the number of distinct values (72). No estimate_percent specified means, it is left to default, which is AUTO_SAMPLE_SIZE. In this case, the decision on Hybrid or Top-N Frequency will be basis the value of p. The calcution for p is (1-(1/50))*100 = 98%. This means, if the top 50 distinct PROD_ID occupy more than 98% rows then, it will be a TopN Frequency, else it will be a Hybrid Histogram. We shall see this in action.

execute dbms_stats.gather_table_stats(ownname=>'SH',tabname=>'SALES',method_opt=>'for columns PROD_ID size 50');

Example : Our own table (TEST_SELECTIVITY) that we created for testing purpose in Optimizer – Part I. In this case, I am generating statistics using method_opt=>’for all columns size AUTO’. The columns will be considered basis the data available in COL_USAGE$, which is populated post execution of any query against a table and column. Column AMOUNT_SOLD has 636 Distinct values, which is higher than 254 (default buckets). The value of n is 254 and p will be (1-(1/254))*100 = 99.61. This means, if the number of rows occupied by top 254 distinct values is more than 99.61%, then it will be TopN Frequency, else it will be Hybrid.

execute dbms_stats.gather_table_stats(user,'TEST_SELECTIVITY',method_opt=>'FOR ALL COLUMNS SIZE AUTO');

select column_id, column_name, num_distinct, num_nulls,
	density, histogram
from	dba_tab_columns
where	owner='SCOTT'
and	table_name='TEST_SELECTIVITY'
and	column_name='AMOUNT_SOLD'
order by 1;

 COLUMN_ID COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM
---------- ------------------------------ ------------ ---------- ---------- -------------------------
         6 AMOUNT_SOLD                             636          0    .000361 HYBRID

SQL> select round((1-(1/254))*100,2) value_p from dual;

   VALUE_P
----------
     99.61

## I have 800000 Rows in the table. So, 99.61% of 800000 is 796850.
## My Statistics Gathering command generated HYBRID, which means, 
## the top 254 distinct values occupy less than 99.61% or less than 796850 rows

SQL> select round((1-(1/254))*800000,0) value_p from dual;

   VALUE_P
----------
    796850

## Query to check Running Count and %Age 
## My Top 254 values occupy 776524, which is less than 796850
## My Top 254 values occupy 97.09%, which is less than 99.61%
## Therefore, the script generated HYBRID Histogram.

SQL> select rownum, amount_sold, cnt, running_cnt, running_perc from (
  2  select amount_sold, cnt, running_cnt,
  3                  sum(r_perc) over(order by cnt desc) running_perc from (
  4  select amount_sold, cnt, sum(cnt) over(order by cnt desc) running_cnt,
  5                  round(ratio_to_report(cnt) over()*100,2) r_perc from (
  6  select amount_sold, count(*) cnt from test_selectivity group by amount_sold order by 2 desc))
  7  order by cnt desc);

    ROWNUM AMOUNT_SOLD        CNT RUNNING_CNT RUNNING_PERC
---------- ----------- ---------- ----------- ------------
         1          10      33600       33600          4.2
         2           9      31964       65564          8.2
         3          11      28850       94414        11.81
         4          13      26968      121382        15.18
         5          47      26934      148316        18.55
         6          48      26114      174430        21.81
         7           8      24906      199336        24.92
         8          49      24239      223575        27.95
         9          46      18784      242359         30.3
        10          23      17014      259373        32.43
        11          21      16965      276338        34.55
        12          24      16246      292584        36.58
        13          22      15954      308538        38.57
        14          25      15782      324320        40.54
        15          51      14906      339226         42.4
        16          30      12932      352158        44.02
        17           7      12030      364188        45.52
        18          26      11958      376146        47.01
        19          17      11585      387731        48.46
        20          12      11547      399278         49.9
        21          52      11344      410622        51.32
        22          28      11227      421849        52.72
        23          29      10728      432577        54.06
        24          14      10180      442757        55.33
        25          53      10059      452816        56.59
        26          31       9907      462723        57.83
        27          19       9682      472405        59.04
        28          39       9571      481976        60.24
        29          33       9145      491121        61.38
        30          38       9027      500148        62.51
        31          54       8468      508616        63.57
        32          42       8177      516793        64.59
        33          32       8017      524810        65.59
        34          27       7651      532461        66.55
        35          34       7617      540078         67.5
        36          63       6607      546685        68.33
        37          20       6582      553267        69.15
        38          64       6575      559842        69.97
        39          56       6204      566046        70.75
        40          40       6060      572106        71.51
        41          57       5723      577829        72.23
        42          50       5511      583340        72.92
        43          35       5388      588728        73.59
        44          15       5219      593947        74.24
        45          60       5195      599142        74.89
        46          36       5062      604204        75.52
        47          58       5031      609235        76.15
        48          16       4903      614138        76.76
        49          41       4831      618969        77.36
        50          18       4562      623531        77.93
        51          62       4542      628073         78.5
        52          59       4421      632494        79.05
        53          45       4311      636805        79.59
        54          43       3946      640751        80.08
        55          65       3625      644376        80.53
        56          55       3372      647748        80.95
        57         123       2852      650600        81.31
        58          44       2327      652927         81.6
        59         126       2296      655223        81.89
        60         203       2286      657509        82.18
        61         115       2200      659709        82.46
        62          61       2187      661896        82.73
        63         211       2105      664001        82.99
        64         213       2032      666033        83.24
        65          91       2022      668055        83.49
        66         128       1953      670008        83.73
        67         113       1837      671845        83.96
        68         629       1799      673644        84.18
        69          73       1656      675300        84.39
        70         100       1641      676941         84.6
        71          72       1596      678537         84.8
        72         602       1577      680114           85
        73          37       1547      681661        85.19
        74         101       1533      683194        85.38
        75          66       1450      684644        85.56
        76         127       1402      686046        85.74
        77          71       1333      687379        85.91
        78          74       1314      688693        86.07
        79         163       1243      689936        86.23
        80          94       1237      691173        86.38
        81         210       1214      692387        86.53
        82         121       1139      693526        86.67
        83         160       1127      694653        86.81
        84          69       1111      695764        86.95
        85          70       1106      696870        87.09
        86         225       1101      697971        87.23
        87         168       1097      699068        87.37
        88         116       1088      700156        87.51
        89         307       1069      701225        87.64
        90         152       1063      702288        87.77
        91         134       1055      703343         87.9
        92         170       1034      704377        88.03
        93          77       1020      705397        88.16
        94          97       1012      706409        88.29
        95         151       1011      707420        88.42
        96         594        992      708412        88.54
        97         630        919      709331        88.65
        98          67        906      711143        88.87
        99         214        906      711143        88.87
       100          76        905      712048        88.98
       101          96        897      712945        89.09
       102         180        864      713809         89.2
       103         136        855      714664        89.31
       104         208        847      715511        89.42
       105         114        835      716346        89.52
       106         135        829      717175        89.62
       107         610        809      717984        89.72
       108         167        803      718787        89.82
       109         202        787      719574        89.92
       110         133        784      720358        90.02
       111         600        779      721137        90.12
       112         120        774      721911        90.22
       113          84        771      722682        90.32
       114         178        769      723451        90.42
       115         117        764      724215        90.52
       116          95        761      724976        90.62
       117        1000        758      725734        90.71
       118         306        735      726469         90.8
       119         199        734      727203        90.89
       120         125        718      727921        90.98
       121         303        688      728609        91.07
       122         112        686      729295        91.16
       123         639        679      729974        91.24
       124         179        670      730644        91.32
       125         228        665      731309         91.4
       126         177        663      731972        91.48
       127        1577        646      732618        91.56
       128         159        630      733248        91.64
       129         539        628      733876        91.72
       130         205        623      734499         91.8
       131          89        614      735113        91.88
       132          90        599      735712        91.95
       133         216        582      736294        92.02
       134         547        579      736873        92.09
       135        1050        564      737437        92.16
       136         557        557      737994        92.23
       137         119        552      738546         92.3
       138          99        536      739082        92.37
       139        1016        528      739610        92.44
       140          78        521      740131        92.51
       141          93        510      740641        92.57
       142         217        485      741126        92.63
       143         304        481      741607        92.69
       144        1053        480      742087        92.75
       145        1068        477      742564        92.81
       146        1496        476      743040        92.87
       147        1014        475      743515        92.93
       148         206        474      743989        92.99
       149        1004        468      744457        93.05
       150        1065        462      744919        93.11
       151        1566        461      745380        93.17
       152         156        458      745838        93.23
       153         207        457      746295        93.29
       154         140        452      746747        93.35
       155         142        451      747198        93.41
       156        1260        448      747646        93.47
       157         552        447      748093        93.53
       158        1599        442      748535        93.59
       159         222        433      748968        93.64
       160         212        424      749392        93.69
       161        1698        422      749814        93.74
       162         215        416      750230        93.79
       163        1240        415      750645        93.84
       164          85        408      751053        93.89
       165         155        403      751456        93.94
       166         148        401      751857        93.99
       167         554        399      752256        94.04
       168         900        395      752651        94.09
       169         556        391      753042        94.14
       170        1118        388      753430        94.19
       171          75        384      753814        94.24
       172          88        382      754196        94.29
       173         302        375      754571        94.34
       174        1057        372      754943        94.39
       175        1029        368      755311        94.44
       176        1321        366      755677        94.49
       177        1109        365      756042        94.54
       178          79        356      756398        94.58
       179         102        354      756752        94.62
       180        1729        344      757096        94.66
       181         204        341      757437         94.7
       182         659        339      758454        94.82
       183         195        339      758454        94.82
       184        1551        339      758454        94.82
       185          98        337      758791        94.86
       186         169        323      759114         94.9
       187         562        322      759436        94.94
       188        1195        320      759756        94.98
       189        1215        318      760074        95.02
       190         936        317      760391        95.06
       191          68        313      760704         95.1
       192         187        310      761014        95.14
       193         548        309      761632        95.22
       194        1714        309      761632        95.22
       195        1217        308      761940        95.26
       196         468        307      762554        95.34
       197        1297        307      762554        95.34
       198        1656        305      762859        95.38
       199         608        303      763162        95.42
       200         124        301      763463        95.46
       201         150        299      763762         95.5
       202        1206        294      764056        95.54
       203         181        293      764642        95.62
       204         106        293      764642        95.62
       205         288        288      764930        95.66
       206        1078        286      765216         95.7
       207         158        283      765499        95.74
       208        1633        282      765781        95.78
       209         154        277      766058        95.81
       210        1556        274      766606        95.87
       211        1061        274      766606        95.87
       212         161        273      766879         95.9
       213        1176        270      767149        95.93
       214         914        268      767417        95.96
       215         490        267      767684        95.99
       216         289        261      767945        96.02
       217        1758        256      768713        96.11
       218        1353        256      768713        96.11
       219        1045        256      768713        96.11
       220        1508        255      768968        96.14
       221         196        249      769217        96.17
       222         183        248      769465         96.2
       223        1674        246      769711        96.23
       224        1237        244      769955        96.26
       225        1531        243      770198        96.29
       226        1421        242      771166        96.41
       227         118        242      771166        96.41
       228         544        242      771166        96.41
       229        1501        242      771166        96.41
       230        1304        235      771636        96.47
       231        1076        235      771636        96.47
       232         842        232      771868         96.5
       233         561        230      772098        96.53
       234         143        228      772326        96.56
       235         632        222      772548        96.59
       236         193        216      772980        96.65
       237         305        216      772980        96.65
       238         596        215      773195        96.68
       239         194        209      773613        96.74
       240        1003        209      773613        96.74
       241         131        207      773820        96.77
       242        1553        206      774026         96.8
       243         937        202      774430        96.86
       244         580        202      774430        96.86
       245        1125        201      774631        96.89
       246         200        194      774825        96.91
       247         524        193      775018        96.93
       248         947        192      775594        96.99
       249         141        192      775594        96.99
       250        1709        192      775594        96.99
       251        1488        189      775783        97.01
       252        1334        187      775970        97.03
       253         281        186      776156        97.05
       254        1227        184      776524        97.09 <-- Top 254 Distinct Values
       255        1703        184      776524        97.09
.....
.....
.....
.....
       624        1231          1      800000        99.94
       625        1177          1      800000        99.94
       626        1638          1      800000        99.94
       627        1189          1      800000        99.94
       628         162          1      800000        99.94
       629          81          1      800000        99.94
       630         647          1      800000        99.94
       631         584          1      800000        99.94
       632         619          1      800000        99.94
       633        1036          1      800000        99.94
       634         708          1      800000        99.94
       635         518          1      800000        99.94
       636         522          1      800000        99.94

Next, I gather statistics on PRODUCTS table under SH Schema. This table has 72 rows with 22 Distinct values in PROD_SUBCATEGORY_ID column. First, I gather statistics using 10 Buckets. Optimizer will generate HYBRID Histogram, as the number of rows occupied by Top 10 Distinct values is less than than the value of p (1-(1/10)) = 90%. Then, I will generate using 18 Buckets and this time, it will be TopN Frequency, as the number of rows occupied by Top 18 Distinct Values is more than the value of p (1-(1/18)) = 94%.

SQL> select table_name, num_rows from dba_tables where table_name='PRODUCTS';

TABLE_NAME                       NUM_ROWS
------------------------------ ----------
PRODUCTS                               72

SQL> select column_name, num_distinct from dba_tab_columns
  2  where      table_name='PRODUCTS'
  3  and        owner='SH'
  4  and        column_name='PROD_SUBCATEGORY_ID';

COLUMN_NAME                    NUM_DISTINCT
------------------------------ ------------
PROD_SUBCATEGORY_ID                      22


SQL> select 1-(1/10) from dual;

  1-(1/10)
----------
        .9

SQL> select round((1-(1/10))*72,0) from dual;

ROUND((1-(1/10))*72,0)
----------------------
                    65


SQL> compute sum of running_sum on report
SQL> break on report
SQL> select rownum, prod_subcategory_id, cnt, running_cnt, running_perc from (
  2  select prod_subcategory_id, cnt,
  3                  sum(cnt) over(order by cnt desc) running_cnt,
  4                  sum(r_perc) over(order by cnt desc) running_perc from (
  5  select prod_subcategory_id, cnt, round(ratio_to_report(cnt) over()*100,2) r_perc from (
  6  select prod_subcategory_id, count(*) cnt from products group by prod_subcategory_id order by 2 desc))
  7  order by cnt desc);

    ROWNUM PROD_SUBCATEGORY_ID        CNT RUNNING_CNT RUNNING_PERC
---------- ------------------- ---------- ----------- ------------
         1                2014          8           8        11.11
         2                2055          7          15        20.83
         3                2032          6          27        37.49
         4                2054          6          27        37.49
         5                2056          5          47        65.25
         6                2051          5          47        65.25
         7                2031          5          47        65.25
         8                2042          5          47        65.25
         9                2036          4          51        70.81
        10                2043          3          54        74.98 <-- Less than 90% or 65 Rows (HYBRID)
        11                2033          2          66        91.66
        12                2035          2          66        91.66
        13                2053          2          66        91.66
        14                2012          2          66        91.66
        15                2013          2          66        91.66
        16                2034          2          66        91.66
        17                2021          1          72          100
        18                2011          1          72          100
        19                2044          1          72          100
        20                2041          1          72          100
        21                2022          1          72          100
        22                2052          1          72          100

22 rows selected.

## STATS WITH BUCKET 10
exec dbms_stats.gather_table_stats(ownname=>'SH',tabname=>'PRODUCTS',method_opt=>'for columns prod_subcategory_id size 10');



SQL> select column_id, column_name, num_distinct, num_nulls,
  2                  density, histogram
  3  from       dba_tab_columns
  4  where      owner='SH'
  5  and        table_name='PRODUCTS'
  6  and             column_name='PROD_SUBCATEGORY_ID';

 COLUMN_ID COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM
---------- ------------------------------ ------------ ---------- ---------- -------------------------
         5 PROD_SUBCATEGORY_ID                      22          0    .044976 HYBRID


SQL> select (1-(1/18)) from dual;

(1-(1/18))
----------
.944444444


SQL> select round((1-(1/18))*72,0) from dual;

ROUND((1-(1/18))*72,0)
----------------------
                    68

SQL> select rownum, prod_subcategory_id, cnt, running_cnt, running_perc from (
  2  select prod_subcategory_id, cnt,
  3                  sum(cnt) over(order by cnt desc) running_cnt,
  4                  sum(r_perc) over(order by cnt desc) running_perc from (
  5  select prod_subcategory_id, cnt, round(ratio_to_report(cnt) over()*100,2) r_perc from (
  6  select prod_subcategory_id, count(*) cnt from products group by prod_subcategory_id order by 2 desc))
  7  order by cnt desc);

    ROWNUM PROD_SUBCATEGORY_ID        CNT RUNNING_CNT RUNNING_PERC
---------- ------------------- ---------- ----------- ------------
         1                2014          8           8        11.11
         2                2055          7          15        20.83
         3                2032          6          27        37.49
         4                2054          6          27        37.49
         5                2056          5          47        65.25
         6                2051          5          47        65.25
         7                2031          5          47        65.25
         8                2042          5          47        65.25
         9                2036          4          51        70.81
        10                2043          3          54        74.98
        11                2033          2          66        91.66
        12                2035          2          66        91.66
        13                2053          2          66        91.66
        14                2012          2          66        91.66
        15                2013          2          66        91.66
        16                2034          2          66        91.66
        17                2021          1          72          100
        18                2011          1          72          100 <-- More than 94% or 68 Rows
        19                2044          1          72          100
        20                2041          1          72          100
        21                2022          1          72          100
        22                2052          1          72          100

22 rows selected.

SQL> exec dbms_stats.gather_table_stats(ownname=>'SH',tabname=>'PRODUCTS',method_opt=>'for columns prod_subcategory_id size 18');

PL/SQL procedure successfully completed.

SQL> select column_id, column_name, num_distinct, num_nulls,
  2                  density, histogram
  3  from       dba_tab_columns
  4  where      owner='SH'
  5  and        table_name='PRODUCTS'
  6  and             column_name='PROD_SUBCATEGORY_ID';

 COLUMN_ID COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM
---------- ------------------------------ ------------ ---------- ---------- -------------------------
         5 PROD_SUBCATEGORY_ID                      22          0 .006944444 TOP-FREQUENCY

In this blog, I have tried to explain the underlying calculation by way of which 12c introduced histograms are generated. Hope, the explanation were clear. As always, comments are welcome.

Sangam 2016

Sangam 2016, an Annual Event organized by All India Oracle User Group is just a 35 days from now. It is scheduled for 11th and 12th November in Bangalore. A chance to meet and listen to World renowned speakers from all over the world. Visit Sangam 2016 for more details about the event, the speakers, their sessions etc.

I am presenting 2 back to back sessions on Cost Based Optimizer. My session is on 12th November 2016. Hope to see you all in large numbers :). I will also be hosting an In-Memory Demo Booth, along with the Product Management Team from US.

i_am_speaking

%d bloggers like this: