Post my previous Blog on Consistent Gets for an Index Scan, there were few other queries that I received over the chat. The queries were related to Clustering Factor of an Index and therefore, the recommendation or comment was that the Columns should be ordered to keep the value of a Clustering Factor as low as possible. The other comment was that High Cardinality or Low Cardinality plays a critical role with an exception for Unique or Primary Key Indexes. Again, Clustering Factor was cited as the reason behind this comment. Thought of clarifying this as well. In my opinion, this is again a MYTH.
The myth that I am going to address here is – Indexes and Column Ordering should be designed such that the Clustering Factor is as low as possible. This means, an Index on (B,A) should be preferred than (A,B) if the clustering factor of (A,B) is higher than (B,A).
Assuming, I have application queries, like :
select C, D from TABLE_A where A=:b1;
select C, D from TABLE_A where B=:b1 and A=:b2;
If I go by the above myth, shall I then create an Index on (B,A) ? Will this index be used for Query#1 ? Remember, Index Skip Scan is an Optimizer decision and is chosen if the leading column is a low cardinality column. The definition of low cardinality would be an Optimizer decision based on the costing. If an index on (B,A) is not used by Query#1, then shall I go ahead and create another Index on A ? Indexes improve query performance but are overhead for DML’s and therefore, less the number of Indexes, better the DML performance would be. Further, if a column is badly clustered, whether it is a leading column or a trailing column, it doesn’t matter and we will see this with an example. Therefore, as I mentioned in earlier and in my previous blog, Column Ordering of an Index should be basis your application queries and should not be designed basis any other table or column level statistics.
In this case, I am creating a table VIVEK_TEST with 100K rows. Column BC stands for BAD_CLUSTERED and will dominate the clustering factor. There are other 2 columns of interest – Object_ID which is a non unique column and RNUM_UNQ which is a Unique Column.
exec dbms_random.seed(0);
create table vivek_test as
with test as
(select * from all_objects where rownum between 1 and 10000)
select trunc((rownum-1)/1000) object_id,
rownum rnum_unq,
round(dbms_random.value(1,100),0) bc,
a.object_name,
a.temporary,
a.created
from test a,
test b
where rownum between 1 and 10000;
exec dbms_stats.gather_table_stats(user,'VIVEK_TEST');
column owner for a30
select owner, num_rows, blocks from dba_tables where table_name='VIVEK_TEST';
OWNER NUM_ROWS BLOCKS
-------------------- ---------- ----------
SCOTT 10000 62
First, I will create 4 Indexes – 2 of these are Single Column Indexes and 2 are Multi-Column.
create index vivek_test_oid on vivek_test(object_id);
create index vivek_test_oid_bc on vivek_test(object_id, bc);
create index vivek_test_bc on vivek_test(bc);
create index vivek_test_bc_oid on vivek_test(bc,object_id);
column index_name for a30
select index_name, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
from dba_indexes
where table_name='VIVEK_TEST'
order by 1;
INDEX_NAME NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
------------------------------ ---------- ---------- ----------- ------------- -----------------
VIVEK_TEST_BC 10000 1 20 100 4642 <-- Single Column on BC (Bad_Clustered)
VIVEK_TEST_BC_OID 10000 1 24 1000 4642 <-- BC is the Leading Column
VIVEK_TEST_OID 10000 1 20 10 56
VIVEK_TEST_OID_BC 10000 1 24 1000 4856 <-- BC is the Trailing Column
As seen in the example above, column ordering doesn’t matter when it comes to the Clustering Factor. My choice of Indexing would be :
VIVEK_TEST_BC_OID if most of my queries are on
BC=:b1;
BC =:b1 and object_id=:b2;
VIVEK_TEST_OID_BC if most of my queries are on
object_id=:b1;
BC=:b1 and object_id=:b2;
Now, I will drop these indexes and create new indexes on the combination of BC and RNUM_UNQ. RNUM_UNQ is 100% Unique Column.
create index vivek_test_rnum on vivek_test(rnum_unq);
create index vivek_test_rnum_bc on vivek_test(rnum_unq, bc);
create index vivek_test_bc on vivek_test(bc);
create index vivek_test_bc_rnum on vivek_test(bc, rnum_unq);
column index_name for a30
select index_name, num_rows, blevel, leaf_blocks, distinct_keys, clustering_factor
from dba_indexes
where table_name='VIVEK_TEST'
order by 1;
INDEX_NAME NUM_ROWS BLEVEL LEAF_BLOCKS DISTINCT_KEYS CLUSTERING_FACTOR
------------------------------ ---------- ---------- ----------- ------------- -----------------
VIVEK_TEST_BC 10000 1 20 100 4642 <-- Single Column on BC (Bad_Clustered)
VIVEK_TEST_BC_RNUM 10000 1 26 10000 4642 <-- Leading Column dominates CF
VIVEK_TEST_RNUM 10000 1 21 10000 56
VIVEK_TEST_RNUM_BC 10000 1 26 10000 56
For an Index with a Column, either leading or trailing, with 100% Unique values, the leading column would dictate the calculation of Clustering Factor. This is obvious as for the Clustering factor calculation, the values are sorted on the first_column then on Second_Column. For a better idea of how is it calculated, you may run the following queries :
## CF Calculation for BC, RNUM_UNQ (it matches the value as per DBA_INDEXES)
select sum(block_change) from (
select block_fno, bc, RNUM_UNQ, prev_bfno,
(case when nvl(prev_bfno,0)!=block_fno then 1 else 0 end) block_change from (
select block_fno, bc, RNUM_UNQ, lag(block_fno) over (order by bc, rnum_unq) prev_bfno from (
select dbms_rowid.rowid_block_number(rowid)||'.'||
dbms_rowid.ROWID_TO_ABSOLUTE_FNO(rowid,'&Schema','&table_name') block_fno,
bc, RNUM_UNQ
from VIVEK_TEST
order by bc, RNUM_UNQ, block_fno)
)
);
SUM(BLOCK_CHANGE)
-----------------
4642
## CF Calculation for RNUM_UNQ, BC (it matches the value as per DBA_INDEXES)
select sum(block_change) from (
select block_fno, RNUM_UNQ, bc, prev_bfno,
(case when nvl(prev_bfno,0)!=block_fno then 1 else 0 end) block_change from (
select block_fno, RNUM_UNQ, bc, lag(block_fno) over (order by rnum_unq, bc) prev_bfno from (
select dbms_rowid.rowid_block_number(rowid)||'.'||
dbms_rowid.ROWID_TO_ABSOLUTE_FNO(rowid,'&Schema','&table_name') block_fno,
bc, RNUM_UNQ
from VIVEK_TEST
order by RNUM_UNQ, bc, block_fno)
)
);
SUM(BLOCK_CHANGE)
-----------------
56
This query can be further modified to get the calculation. Remove the SUM(BLOCK_CHANGE) SELECT from the query and you could generate the output with the calculation. For every Index Entry, if the table block is changed, the BLOCK_CHANGE value is set to 1. This query will help you understand the calculation. This holds true for a combination of Non-Unique columns as well. It is not that for column involving Unique Column, the calculation is different and this will be clear if you deep dive into the queries pasted above.
During my session, I also mentioned a FLAW in Clustering Factor. It is that it doesn’t take into consideration the caching effect and to address this, Oracle introduced TABLE_CACHED_BLOCKS preference. I am against setting manually setting clustering_factor to a lower value as it will be overwritten by the next statistics gathering process. Also, I am STRONGLY against CTAS using ORDER BY to improve the clustering_factor. By using CTAS, you can improve the clustering_factor of one index, but it will impact this for other indexes. The best way is to use the TABLE_CACHED_BLOCKS preference as depicted below.
I will use the queries pasted above to come out with the actual clustering_factor and will then set the TABLE_CACHED_BLOCKS preference. For demonstration, I will work on an Index on BC, RNUM_UNQ. This process can be used for any of the indexes.
## I will create a temporary table to hold the clustering_factor calculation
create table cluf_factor as
select bc, blkno,
lag(blkno,1,blkno) over(order by bc) prev_blkno,
case when blkno!=lag(blkno,1,blkno) over(order by bc) or rownum=1
then 1 else null end cluf_ft from
(select bc, rnum_unq, dbms_rowid.rowid_block_number(rowid) blkno
from VIVEK_TEST
where BC is not null
order by bc);
compute sum of cnt on report
break on report
select blkno, count(*) cnt from cluf_factor group by blkno order by 1;
BLKNO CNT
---------- ----------
255883 188
255884 185
255885 185
255886 185
255887 185
255936 182
255937 180
255938 180
255939 180
255940 180
255941 180
255942 180
255943 180
255945 180
255946 180
255947 180
255948 180
255949 180
255950 180
255951 180
258896 180
258897 180
258898 180
258899 180
258900 180
258901 180
258902 180
258903 180
258905 180
258906 180
258907 180
258908 180
258909 180
258910 180
258911 180
260480 180
260481 180
260482 180
260483 180
260484 180
260485 180
260486 180
260487 180
260489 180
260490 180
260491 180
260492 180
260493 180
260494 180
260495 180
260496 180
260497 180
260498 180
260499 180
260500 180
260501 70
----------
sum 10000
56 rows selected.
As per the output, each of the block was touched around 180 times. Each of these touch were considered as it these were read from the Disk. However, Oracle purely works in Memory (Buffer_Cache, Shared_Pool etc) and once the block is read from disk, it is cached in the Buffer_cache thus avoiding the disk i/o. This flaw causes unwanted high clustering_factor and therefore, is addressed by way of TABLE_CACHED_BLOCKS.
TABLE_CACHED_BLOCKS value defaults to 1. Increasing this value (to, say 250) means, clustering_factor for an Index will not be incremented, if the table block being referenced by the current Index entry has been referenced by any of the previous 250 Index Entries. In our case, each of the table blocks were referenced by previous 180 index entries. Therefore, I will change this value to 180.
select index_name, leaf_blocks, clustering_factor from dba_indexes
where table_name = 'VIVEK_TEST';
INDEX_NAME LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ----------- -----------------
VIVEK_TEST_RNUM 21 56
VIVEK_TEST_RNUM_BC 26 56
VIVEK_TEST_BC 20 4642
VIVEK_TEST_BC_RNUM 26 4642
select dbms_stats.get_prefs(pname=>'TABLE_CACHED_BLOCKS',ownname=>'SCOTT',tabname=>'VIVEK_TEST') preference from dual;
PREFERENCE
-----------------------------------------------------------
1
## Setting TABLE_CACHED_BLOCKS to 180
exec dbms_stats.set_table_prefs(ownname=>'SCOTT',tabname=>'VIVEK_TEST',pname=>'TABLE_CACHED_BLOCKS',PVALUE=>180);
select dbms_stats.get_prefs(pname=>'TABLE_CACHED_BLOCKS',ownname=>'SCOTT',tabname=>'VIVEK_TEST') preference from dual;
PREFERENCE
------------------------------------------------------------
180
exec dbms_stats.gather_table_stats(user,'VIVEK_TEST',method_opt=>'for all columns size 1', no_invalidate=>false);
select index_name, leaf_blocks, clustering_factor from dba_indexes
where table_name = 'VIVEK_TEST';
INDEX_NAME LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ----------- -----------------
VIVEK_TEST_RNUM 21 56
VIVEK_TEST_RNUM_BC 26 56
VIVEK_TEST_BC 20 56
VIVEK_TEST_BC_RNUM 26 56
Clustering Factor is calculated for Unique Indexes as well and is critical for costing. The only time when clustering_factos is not considered or not relevant for Unique or Primary Keys is when querying for all the columns part of the Unique or Primary Key i.e. INDEX UNIQUE SCAN. However, clustering_factor will play a critical role when accessing the Unique or Primary Key using INDEX RANGE SCAN.
With this blog, I tried addressing following Myth’s –
- Indexes and Column Ordering should be designed such that the Clustering Factor is as low as possible.
- Clustering_Factor is irrelevant for Unique and Primary Key Indexes.
- High Cardinality or Low Cardinality play important roles with an exception for Unique & Primary Key.
Be it a Unique Index / Primary Key Index or any other Index, column ordering has to be basis your application queries.
One interesting fact on TABLE_CACHED_BLOCKS : While testing this, I actually created a table with 100k rows. The number of blocks in the table were around 469 and the table block were referenced by around 230 previous index entries. Therefore, I modified this preference to 230 and regathered the statistics. It didn’t work. The same table creation script with 10k rows, everything worked as expected. I am investigating this and will update the blog, once I have something to share.
Like this:
Like Loading...