Benefits of SQL Baseline! Yet another performance issue post 10g Upgrade

First, let me apologize for the brief pause in blogging. I was out of action for most of this time and, as I always maintain, I write only when I am active and have something to share with the Oracle community.

A few years back, I had suggested to a customer that they maintain a SQL baseline that collects the execution plan and other details such as SQL_TEXT. While the customer had scheduled a job that periodically collected this information into a set of tables, the collected data was never used; in fact, the need to look into it never arose. Finally, after this database was upgraded to 10g, a severe performance degradation was reported, and the information collected in the SQL baseline came in handy. Usually, before upgrading a database to 10g, I strongly recommend performance testing, but in this case the customer went ahead with the production upgrade with minimal testing.

Another challenge in this upgrade was the stats gathering policy. During the pre-upgrade discussion, the customer mentioned:

  1. Stats for very few tables are gathered periodically,
  2. Stats for most of the tables are stale and, if gathered, impact the performance,
  3. Stats for many of the tables are not collected at all.

With such minimal testing and these optimizer statistics issues, performance problems were inevitable, and the customer was well aware of this fact. However, post-upgrade the impact was unexpectedly severe and the database was almost unworkable. As a result, and as expected in firefighting mode, various optimization techniques were implemented with little effect. At this point our investigation started.

As mentioned earlier, the challenge here was the unavailability of test results from the 9i instance. There were significant waits on cache buffers chains (CBC) latches and "read by other session", and one table common to the impacted queries (almost 90% of them) was TH_DBF. During my investigation, I concentrated on one top query that was executed multiple times and spent most of its time waiting on the CBC latch. The text of the query is given below, along with the execution plan from 10g.

select T1.TIMESTAMP, T1.M_IDENTITY, T1.M_TRN_GRP, T1.M_TRN_TYPO, T1.M_BINTERNAL, T1.M_BENTITY, 
T1.M_SINTERNAL, T1.M_SENTITY, T1.M_TRN_STATUS, T1.M_TRN_EXP, T1.M_VAL_STATUS, 
T1.M_MOP_LAST,T2.TIMESTAMP, T2.M_IDENTITY, T2.M_NB, T2.M_LINE, T2.M_SRC_PFOLIO, T2.M_DST_PFOLIO, 
T2.M_CNTRP, T2.M_FEE, T2.M_CUR, T2.M_CODE, T2.M_AUTF, T2.M_AUTS, T2.M_TYPE, 
T2.M_VALUE_TYPE, T2.M_VALUE_DATE, T2.M_AVP 
from 	TH_DBF T1,
	TB_DBF T2  
where 	(T1.M_NB   =  T2.M_NB ) 
and 	((((((((T1.M_BINTERNAL like 'N%') or (T1.M_SINTERNAL like 'N%')) 
and 	(T1.M_TRN_EXP>=to_date('20100105','YYYYMMDD'))) 
and 	(((T1.M_VAL_STATUS<>'MVAL') and (T1.M_VAL_STATUS<>'MCON')) 
and 	(T1.M_VAL_STATUS<>'MCAP'))) and (T1.M_MOP_LAST<>3)) 
and 	(T1.M_TRN_STATUS<>'DEAD')) and (T1.M_TRN_GRP like 'FXD%')) 
or 	((((T1.M_BENTITY<>'INDIA MUM') and (T1.M_SENTITY<>'INDIA MUM')) 
and 	(instr(T1.M_TRN_TYPO, 'PAYCUR') >0 )) 
and 	(T1.M_TRN_EXP>to_date('20100105','YYYYMMDD'))));

Plan hash value: 2180463251

-----------------------------------------------------------------------------
| Id  | Operation          | Name           | Rows  | Bytes |TempSpc| Cost  |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                |       |       |       | 52386 |
|*  1 |  HASH JOIN         |                |   367K|    50M|  9512K| 52386 |
|*  2 |   TABLE ACCESS FULL| TH_DBF         | 91836 |  8430K|       | 46682 |
|   3 |   TABLE ACCESS FULL| TB_DBF         |  2541K|   121M|       |  3161 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T1"."M_NB"="T2"."M_NB")
   2 - filter(((("T1"."M_BINTERNAL" LIKE :SYS_B_00 OR
              "T1"."M_SINTERNAL" LIKE :SYS_B_01) AND "T1"."M_TRN_EXP">=:SYS_B_02 AND
              "T1"."M_VAL_STATUS":SYS_B_03 AND "T1"."M_VAL_STATUS":SYS_B_04 AND
              "T1"."M_VAL_STATUS":SYS_B_05 AND "T1"."M_MOP_LAST":SYS_B_06 AND
              "T1"."M_TRN_STATUS":SYS_B_07 AND "T1"."M_TRN_GRP" LIKE :SYS_B_08) OR
              ("T1"."M_BENTITY":SYS_B_09 AND "T1"."M_SENTITY":SYS_B_10 AND
              INSTR("T1"."M_TRN_TYPO",:SYS_B_11)>:SYS_B_12 AND
              "T1"."M_TRN_EXP">:SYS_B_13)))

SQL Baseline

Fortunately, the customer had maintained the SQL baseline for all the queries, and the good plan from 9i was found in these baseline tables. For 9i, following a suggestion from Thomas Kyte, these are the tables that are critical and should be maintained on a regular basis:

-- One-time setup: snapshot the SQL text and basic execution statistics
create table sqlarea_baseline as select hash_value, sql_text, executions, buffer_gets from v$sqlarea;
create index sqlarea_baseline_idx on sqlarea_baseline(hash_value);

-- Scheduled periodically: capture only statements not already recorded
insert into sqlarea_baseline
select hash_value, sql_text, executions, buffer_gets from v$sqlarea a
where not exists
(select 'x' from sqlarea_baseline b where b.hash_value = a.hash_value);

-- One-time setup: snapshot the execution plans
create table sqlplan_baseline as select * from v$sql_plan;
create index sqlplan_baseline_idx on sqlplan_baseline(hash_value);

-- Scheduled periodically: capture plans of newly seen statements
insert into sqlplan_baseline
select * from v$sql_plan a
where not exists
(select 'x' from sqlplan_baseline b where b.hash_value = a.hash_value);
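With these snapshots in place, the old plan of any statement can be pulled back by its HASH_VALUE. A minimal sketch of such a lookup (the LIKE pattern and the bind variable below are illustrative, not the exact statements used in this case):

-- Locate the statement of interest in the captured SQL text
select a.hash_value, a.sql_text
from   sqlarea_baseline a
where  upper(a.sql_text) like '%FROM%TH_DBF%TB_DBF%';

-- Then pull its captured plan, ordered by plan line id
select p.id, p.parent_id, p.operation, p.options, p.object_name,
       p.cardinality, p.bytes, p.cost
from   sqlplan_baseline p
where  p.hash_value = :hash_value
order  by p.id;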

The data collected using the above scripts came in handy, and we could get the 9i plan of this SQL. The plan from 9i is pasted below:

9i Plan
-------
--------------------------------------------------------------------------------
| Operation                      | PHV/Object Name     |  Rows | Bytes|   Cost |
--------------------------------------------------------------------------------
|SELECT STATEMENT                |----- 4101123407 ----|       |      |  48785 |
| NESTED LOOPS                   |                     |    28K|    3M|  48785 |
|  TABLE ACCESS FULL             |TH_DBF               |     7K|  643K|  46682 |
|  INDEX RANGE SCAN              |TB_ND1               |     4 |      |      2 |
--------------------------------------------------------------------------------

In both cases, the optimizer has opted for a full table scan of the TH_DBF table. The difference is the nested loop in 9i against the hash join and full scan of the second table in 10g. The investigation was now restricted to finding the change that triggered this hash join. If you look at the number of rows the optimizer expects from the full table scan of TH_DBF, there is a vast difference in the cardinality computed by the two versions: the 9i optimizer computes 7K rows, whereas it is 91K in 10g. The optimizer's choice of a hash join is well justified if the rows returned by the initial step are huge. Just to check whether the plan changes when 7K is passed as the cardinality of the driving step, I executed the query with a CARDINALITY(t1,7000) hint and, as expected, the 9i plan was restored, i.e. it changed from hash join to nested loop, and the response time of the query improved as well.
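For reference, a trimmed sketch of that diagnostic test. Only the hint is new, and the statement is shortened here for readability; it is not the full production query:

-- The cardinality hint tells the CBO to assume roughly 7000 rows from TH_DBF,
-- i.e. an estimate close to what 9i computed. Used only as a diagnostic test,
-- not as a permanent fix.
select /*+ cardinality(t1, 7000) */
       t1.m_nb, t2.m_nb
from   th_dbf t1,
       tb_dbf t2
where  t1.m_nb = t2.m_nb
and    t1.m_trn_exp >= to_date('20100105', 'YYYYMMDD');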

Now that the problem was clear, a detailed investigation into the optimizer's calculation was required. Modifying the query with the CARDINALITY hint was not possible, as it required a change at the application level and could have taken time. Arbitrarily overriding the statistics was also not feasible, as this table is the heart of the application and changing its stats might impact other queries. Therefore, the only choice was to investigate the root cause and fix the issue there.

INVESTIGATION

With the analysis mentioned above, it was evident that the optimizer was miscomputing the expected cardinality of the row source generated by the driving step. Therefore, the columns of the TH_DBF table used in the query were examined, and one such column is M_TRN_EXP. This is a date column, and whenever the query is executed, the value passed to it at runtime is SYSDATE. The actual number of rows returned is fewer than 5000, which was validated when the query was executed with the CARDINALITY hint. With roughly 5000 actual rows against the 91K predicted by the optimizer, I suspected something amiss with the column-level statistics, and therefore I got into those details.

Using the dbms_stats.convert_raw_value procedure, I could convert the raw data stored in LOW_VALUE and HIGH_VALUE of dba_tab_columns. Surprisingly, the result after converting these values is as under (the dates are in YYYYMMDD format):

COLUMN_NAME                    NUM_DISTINCT GET_DATE GET_DATE
------------------------------ ------------ -------- --------
M_TRN_EXP                              4355 19810101 20660901
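For reference, a sketch of how such a conversion can be wrapped for use in SQL. GET_DATE here is my own guess at the helper behind the column headings above, and it assumes the DATE overload of dbms_stats.convert_raw_value:

-- Hypothetical helper: decode the RAW low/high statistic value of a DATE column.
create or replace function get_date (p_raw in raw) return varchar2 as
  l_date date;
begin
  dbms_stats.convert_raw_value(p_raw, l_date);   -- DATE overload assumed
  return to_char(l_date, 'YYYYMMDD');
end;
/

select column_name, num_distinct,
       get_date(low_value)  low_date,
       get_date(high_value) high_date
from   dba_tab_columns
where  table_name  = 'TH_DBF'
and    column_name = 'M_TRN_EXP';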

With M_TRN_EXP > SYSDATE, the optimizer assumes that it has to fetch the data of 56 years out of a range of 85 years, which is nearly 65%, and this calculation is enough to misguide its computation. I have always maintained that the optimizer is a piece of code and it has to come up with a result based on the input we give it, in the form of statistics. There were only 4 rows with dates stretching out towards 2066, and the optimizer was not aware of this fact.
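A quick back-of-the-envelope check of that selectivity, using the low/high values above and the date from the query (a sketch, just to illustrate the arithmetic):

-- Fraction of the (LOW_VALUE, HIGH_VALUE) date range lying above the query date:
-- roughly (2066-09-01 - 2010-01-05) / (2066-09-01 - 1981-01-01)
select round( (to_date('20660901','YYYYMMDD') - to_date('20100105','YYYYMMDD'))
            / (to_date('20660901','YYYYMMDD') - to_date('19810101','YYYYMMDD')), 2) as selectivity
from   dual;

-- Returns ~0.66, i.e. the optimizer believes almost two thirds of the rows
-- satisfy the M_TRN_EXP predicate, which inflates the cardinality estimate.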

SOLUTION

Now it was quite evident that this is purely an application issue, caused by the data entered into the application table. The customer requested an immediate fix, and the solutions suggested were:

  1. Rectifying the data : Not possible immediately
  2. Modifying Column Level Statistics (High Value)
  3. Generating Histograms on this column

Just to confirm that the column-level statistics, skewed by the future-dated data, were indeed the problem, we modified the statistics for the M_TRN_EXP column and changed the HIGH_VALUE to 20111231, and the query plan changed for the better (similar to 9i). The better approach discussed, therefore, was to generate a histogram on this column. Once the histogram was generated, the query response time improved and the plan from 9i was restored. Surprisingly, with this change, the overall performance of the database improved drastically, as most of the queries executed are on this table and all of them have similar logic on M_TRN_EXP.
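For reference, a sketch of both steps. The owner, dates and bucket count below are illustrative only; the exact calls used at the site may have differed:

-- Step 1 (diagnostic): lower the recorded HIGH_VALUE of M_TRN_EXP to 20111231.
declare
  l_distcnt number;
  l_density number;
  l_nullcnt number;
  l_avgclen number;
  l_srec    dbms_stats.statrec;
begin
  dbms_stats.get_column_stats(ownname => user, tabname => 'TH_DBF',
                              colname => 'M_TRN_EXP',
                              distcnt => l_distcnt, density => l_density,
                              nullcnt => l_nullcnt, srec    => l_srec,
                              avgclen => l_avgclen);
  l_srec.epc    := 2;          -- two end points: new low and high values
  l_srec.bkvals := null;
  dbms_stats.prepare_column_values(l_srec,
        dbms_stats.datearray(to_date('19810101','YYYYMMDD'),
                             to_date('20111231','YYYYMMDD')));
  dbms_stats.set_column_stats(ownname => user, tabname => 'TH_DBF',
                              colname => 'M_TRN_EXP',
                              distcnt => l_distcnt, density => l_density,
                              nullcnt => l_nullcnt, srec    => l_srec,
                              avgclen => l_avgclen);
end;
/

-- Step 2 (the fix that was kept): gather a histogram on M_TRN_EXP.
begin
  dbms_stats.gather_table_stats(ownname    => user,
                                tabname    => 'TH_DBF',
                                method_opt => 'FOR COLUMNS M_TRN_EXP SIZE 254',
                                cascade    => false);
end;
/

With the histogram in place, the optimizer could see how little of the data actually lies beyond the current date, and the nested loop plan came back on its own.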
