How Challenging is the task of a Performance Optimization Expert ? A real life example…
November 19, 2011 7 Comments
It is a well known fact that, most of the times, a Performance Optimization Expert is usually involved in Reactive mode. An Ongoing Performance Issue is easy to diagnose, but what if a severe performance issue was reported a day before and this caused a Crash of a production system ?
In case of an Issue that occured in the past, while most of the experts call for AWR reports and rely on ASH data available, but in my opinion, the analysis and recommendation based on AWR and ASH reports does not guarantee 100% accuracy. To prove this point, I have a real life scenario to share.
One of my customers added another node to an existing 2 node RAC production setup. Post completion of this activity, the system was released to the end users. In this case, end users are the customers across the globe. In the afternoon, severe slowness was reported and before anyone could analyze the issue, all the Application Servers / Web Servers Crashed and had to be restarted. There was a severe escalation and everyone pointed the finger of blame on the addition of a 3rd Node, as this was the only change in the entire production system. The Database Administrators also mentioned that while investigating the issue, they could observe severe “Cursor:pin” related waits. There were around 400+ sessions waiting for the “Cursor:pin” and the queries that were stuck were running successfully on other 2 nodes. Therefore, it was assumed to be a BUG or some issue related to 3rd Node. Since the Application architecture involved connections thru connection pool, when all the connections pool got exhausted, severe slowness and then a crash was reported.
A detailed investigation was initiated the next day. AWR report for the problem period and the ASH reports also revealed that almost 90% of the time the sessions were waiting on “Cursor:Pin” and the other 2 nodes the AWR was clean. While we were still investigating the issue, one of the DBA also mentioned that all other queries were working absolutely fine and, the problem only existed with the two tables as the queries on these 2 objects were waiting on the Mutexes and the executions of these queries were ZERO. The executions of the queries on these 2 objects were 1000+. Further, he also mentioned that they tried running the queries on these 2 objects from SQL*Plus, even these hung and had to be cancelled using CTRL-C. This raised some suspicion, as why the problem was only limited to 2 Objects and therefore, we started concentrating more these 2 Objects.
Analysis
These 2 Objects were actually a Synonym to 2 remote tables and 90% of the application modules get data from the remote database, using Database Link. Does it mean that during the problem period, there was some connectivity issue between the two Servers ? Let us simulate this with an example.
To simulate this, I created a database link between my 11g and 10g instance and then created a Synonym on EMP table, which resides in 10g Database. Further, to simulate production load, I created a database procedure that selects a row from this remote table and scheduled around 1000 sessions using DBMS_JOB. With the 10g Listener UP and Running, the job runs perfectly fine, therefore, I stopped the listener of 10g Database, as if there was some problem with the connectivity and re-scheduled the job.
connect vivek/vivek
create database link o10g connect to vivek identified by vivek using 'orcl10g';
create synonym emp_remote for emp@o10g;
create or replace procedure test_load as
e_name varchar2(30);
l_eno number;
begin
l_eno:=round(dbms_random.value(7000,8000),0);
select ename into e_name from emp_remote where empno=l_eno;
end;
/
alter session set nls_date_format='Dd-MM-YYYY HH24:MI:SS';
select sysdate from dual;
SYSDATE
-------------------
15-11-2011 00:45:42
## I Scheduled the Jobs to run at 00 Hours & 47 Minutes
declare
l_job number;
begin
for i in 1..1000
loop
dbms_job.submit(l_job,'test_load;',trunc(sysdate)+00/24+47/(24*60));
commit;
end loop;
end;
/
Once the Jobs started, the wait events and the execution of the query hung on “Cursor:Pin” are as under :
SID EVENT P1 P2 P3 SQL_ID
---------- ------------------------------ ---------- ---------- ---------- -------------
1 pmon timer 300 0 0
4 DIAG idle wait 5 1 1000
10 Streams AQ: qmn coordinator id 0 0 0
13 Streams AQ: qmn slave idle wai 1 0 0
15 Streams AQ: waiting for time m 0 0 0
26 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
28 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
30 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
32 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
34 library cache lock 1269367872 1235709864 2 f20gyq4m5fh3c
35 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
42 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
47 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
49 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
51 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
52 library cache lock 1269367872 1235628808 2 f20gyq4m5fh3c
54 library cache lock 1269367872 1269117072 2 f20gyq4m5fh3c
125 VKRM Idle 0 0 0
126 VKTM Logical Idle Wait 0 0 0
127 DIAG idle wait 5 1 1000
131 smon timer 300 0 0
136 Space Manager: slave idle wait 0 0 0
139 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
142 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
143 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
149 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
153 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
154 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
155 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
156 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
170 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
172 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
175 library cache lock 1269367872 1234978972 2 f20gyq4m5fh3c
176 cursor: pin S wait on X 643252332 10682368 327680 f20gyq4m5fh3c
SQL> @sql_sqlid
Enter value for sql: f20gyq4m5fh3c
old 1: select sql_id, sql_text, executions, buffer_gets from v$sqlarea where sql_id='&sql'
new 1: select sql_id, sql_text, executions, buffer_gets from v$sqlarea where sql_id='f20gyq4m5fh3c'
SQL_ID SQL_TEXT EXECUTIONS BUFFER_GETS
------------- -------------------------------------------------- ---------- -----------
f20gyq4m5fh3c SELECT ENAME FROM EMP_REMOTE WHERE EMPNO=:B1 0 0
This was the exact behaviour that was reported on the 3rd Node production setup. I compared the alert.log of my Database and the production 3rd Node and the common observation was an entry in both the environment (this is from my database). In production, the error was related to tns timeout.
Fatal NI connect error 12541, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost.localdomain)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl10g)(CID=(PROGRAM=oracle)(HOST=localhost.localdomain)(USER=oracle))))
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.1.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.1.0 - Production
Time: 15-NOV-2011 01:01:30
Tracing not turned on.
Tns error struct:
ns main err code: 12541
TNS-12541: TNS:no listener
ns secondary err code: 12560
nt main err code: 511
TNS-00511: No listener
nt secondary err code: 111
nt OS err code: 0
At this point, when we discussed this with the customer, one of the Application team member admitted that as a part of the checklist, they missed out on Opening of a Port from the database server to the Server hosting the remote database and this was realized in the afternoon. They were not aware of the implication it had on the production setup.
In this case, the AWR & ASH reports revealed “Cursor:Pin” waits but were not sufficient in revealing the actual cause of the issue. It required further analysis, therefore, I usually insist on production access to diagnose a Performance Issue whether it was in the past or is an ongoing issue.
Good one !!!
Excellent diagnosis of the performance problem.
Very useful!
Thanks,
Marko
Really nice, Vivek. One request to you, you’ve decreased the frequency of blogging real-life performance problems
, please share more problems. These real-time scenarios make us learn how experts think when a problem arises.
Hi Santosh,
I understand. Since I write only on Real Life problems, and mostly on Interesting issues, there are times when you are working on production issues, but are not that interesting to blog. I refrain from writing on generic topics that are usually well covered in Oracle Documentation and wait for something, that would be of interest for the readers. Another reason for the delay is hectic schedule, which sometimes don’t give you enough time to write
But yes, blogging is my passion that I like the most and will continue to blog, atleast once a month.
Regards
Vivek
Hello Vivek,
First of all Merry Christmas, and thanks for the drop from your entire knowledge of ocean
Happy to see you being active on this blog again. Your blogs are always helpful and yes we are really grateful to you for removing sometime from your busy schedule.
Regards
Arun
Hi Vivek,
a very interesting example. What makes it even more interesting for me, is that I was asked to look at a case with very similar symptoms. Unfortunately, they only came to me after the retention period for ASH was over so I was unable to identify specific SQL hanging on cursor: pin S wait on X. However, indirect evidence is pointing to distributed queries, because they have lots of feeds working through db links.
Do you have any ideas as to the exact mechanism of such symptoms appearing as a result of db link inaccessibility? What does it appear like library cache contention?
Best regards,
Nikolay
Hi Nikolay,
Thanks for the nice words and good to know that Oracle Community can relate my blogs to their own examples. This serves the purpose of blogging.
In case of db link inaccessibility, the Query accessing remote tables have to be parsed in a local Shared Pool. Therefore, these queries, during parsing phase acquire library cache related latches / mutexes and wait for the other pending informations, like object privileges, object_name etc from the remote database. Since the remote database is not accessible, the latches are not released for a specific duration of time. In an concurrent environment, this is enough to cause huge waits. In my case, since this was a Banking Application and was accessed over the net, the waits were very prominent.
Regards
Vivek