How Challenging is the task of a Performance Optimization Expert ? A real life example…

It is a well known fact that, most of the times, a Performance Optimization Expert is usually involved in Reactive mode. An Ongoing Performance Issue is easy to diagnose, but what if a severe performance issue was reported a day before and this caused a Crash of a production system ?

In case of an Issue that occured in the past, while most of the experts call for AWR reports and rely on ASH data available, but in my opinion, the analysis and recommendation based on AWR and ASH reports does not guarantee 100% accuracy. To prove this point, I have a real life scenario to share.

One of my customers added another node to an existing 2 node RAC production setup. Post completion of this activity, the system was released to the end users. In this case, end users are the customers across the globe. In the afternoon, severe slowness was reported and before anyone could analyze the issue, all the Application Servers / Web Servers Crashed and had to be restarted. There was a severe escalation and everyone pointed the finger of blame on the addition of a 3rd Node, as this was the only change in the entire production system. The Database Administrators also mentioned that while investigating the issue, they could observe severe “Cursor:pin” related waits. There were around 400+ sessions waiting for the “Cursor:pin” and the queries that were stuck were running successfully on other 2 nodes. Therefore, it was assumed to be a BUG or some issue related to 3rd Node. Since the Application architecture involved connections thru connection pool, when all the connections pool got exhausted, severe slowness and then a crash was reported.

A detailed investigation was initiated the next day. AWR report for the problem period and the ASH reports also revealed that almost 90% of the time the sessions were waiting on “Cursor:Pin” and the other 2 nodes the AWR was clean. While we were still investigating the issue, one of the DBA also mentioned that all other queries were working absolutely fine and, the problem only existed with the two tables as the queries on these 2 objects were waiting on the Mutexes and the executions of these queries were ZERO. The executions of the queries on these 2 objects were 1000+. Further, he also mentioned that they tried running the queries on these 2 objects from SQL*Plus, even these hung and had to be cancelled using CTRL-C. This raised some suspicion, as why the problem was only limited to 2 Objects and therefore, we started concentrating more these 2 Objects.

Analysis

These 2 Objects were actually a Synonym to 2 remote tables and 90% of the application modules get data from the remote database, using Database Link. Does it mean that during the problem period, there was some connectivity issue between the two Servers ? Let us simulate this with an example.

To simulate this, I created a database link between my 11g and 10g instance and then created a Synonym on EMP table, which resides in 10g Database. Further, to simulate production load, I created a database procedure that selects a row from this remote table and scheduled around 1000 sessions using DBMS_JOB. With the 10g Listener UP and Running, the job runs perfectly fine, therefore, I stopped the listener of 10g Database, as if there was some problem with the connectivity and re-scheduled the job.


connect vivek/vivek
create database link o10g connect to vivek identified by vivek using 'orcl10g';
create synonym emp_remote for emp@o10g;

create or replace procedure test_load as
e_name  varchar2(30);
l_eno	number;
begin
  l_eno:=round(dbms_random.value(7000,8000),0);
  select ename into e_name from emp_remote where empno=l_eno;
end;
/

alter session set nls_date_format='Dd-MM-YYYY HH24:MI:SS';

select sysdate from dual;

SYSDATE
-------------------
15-11-2011 00:45:42

## I Scheduled the Jobs to run at 00 Hours & 47 Minutes

declare
l_job	number;
begin
   for i in 1..1000
     loop

       dbms_job.submit(l_job,'test_load;',trunc(sysdate)+00/24+47/(24*60));
       commit;
     end loop;
end;
/

Once the Jobs started, the wait events and the execution of the query hung on “Cursor:Pin” are as under :

       SID EVENT                                  P1         P2         P3 SQL_ID
---------- ------------------------------ ---------- ---------- ---------- -------------
         1 pmon timer                            300          0          0
         4 DIAG idle wait                          5          1       1000
        10 Streams AQ: qmn coordinator id          0          0          0
        13 Streams AQ: qmn slave idle wai          1          0          0
        15 Streams AQ: waiting for time m          0          0          0
        26 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        28 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        30 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        32 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        34 library cache lock             1269367872 1235709864          2 f20gyq4m5fh3c
        35 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        42 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        47 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        49 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        51 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
        52 library cache lock             1269367872 1235628808          2 f20gyq4m5fh3c
        54 library cache lock             1269367872 1269117072          2 f20gyq4m5fh3c
       125 VKRM Idle                               0          0          0
       126 VKTM Logical Idle Wait                  0          0          0
       127 DIAG idle wait                          5          1       1000
       131 smon timer                            300          0          0
       136 Space Manager: slave idle wait          0          0          0
       139 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       142 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       143 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       149 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       153 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       154 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       155 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       156 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       170 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       172 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c
       175 library cache lock             1269367872 1234978972          2 f20gyq4m5fh3c
       176 cursor: pin S wait on X         643252332   10682368     327680 f20gyq4m5fh3c

SQL> @sql_sqlid
Enter value for sql: f20gyq4m5fh3c
old   1: select sql_id, sql_text, executions, buffer_gets from v$sqlarea where sql_id='&sql'
new   1: select sql_id, sql_text, executions, buffer_gets from v$sqlarea where sql_id='f20gyq4m5fh3c'

SQL_ID        SQL_TEXT                                           EXECUTIONS BUFFER_GETS
------------- -------------------------------------------------- ---------- -----------
f20gyq4m5fh3c SELECT ENAME FROM EMP_REMOTE WHERE EMPNO=:B1                0           0

This was the exact behaviour that was reported on the 3rd Node production setup. I compared the alert.log of my Database and the production 3rd Node and the common observation was an entry in both the environment (this is from my database). In production, the error was related to tns timeout.


Fatal NI connect error 12541, connecting to:
 (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost.localdomain)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl10g)(CID=(PROGRAM=oracle)(HOST=localhost.localdomain)(USER=oracle))))

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.1.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.1.0 - Production
  Time: 15-NOV-2011 01:01:30
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12541

TNS-12541: TNS:no listener
    ns secondary err code: 12560
    nt main err code: 511

TNS-00511: No listener
    nt secondary err code: 111
    nt OS err code: 0

At this point, when we discussed this with the customer, one of the Application team member admitted that as a part of the checklist, they missed out on Opening of a Port from the database server to the Server hosting the remote database and this was realized in the afternoon. They were not aware of the implication it had on the production setup.

In this case, the AWR & ASH reports revealed “Cursor:Pin” waits but were not sufficient in revealing the actual cause of the issue. It required further analysis, therefore, I usually insist on production access to diagnose a Performance Issue whether it was in the past or is an ongoing issue.

Advertisements

About Vivek Sharma
I am an Oracle Professional from Mumbai (India). I blog on the issues that I feel is Interesting for my readers. Some of these are my real life examples, which I hope, you would find interesting. Comments are always a welcome. The Technical Observations & Views here are my own and not necessarily those of Oracle or its affiliates. These are purely based on my understandings, learnings and resolutions of various customer issues.

8 Responses to How Challenging is the task of a Performance Optimization Expert ? A real life example…

  1. Santosh Kumar says:

    Good one !!!

    Like

  2. Marko Sutic says:

    Excellent diagnosis of the performance problem.

    Very useful!

    Thanks,
    Marko

    Like

  3. Really nice, Vivek. One request to you, you’ve decreased the frequency of blogging real-life performance problems 🙂 , please share more problems. These real-time scenarios make us learn how experts think when a problem arises.

    Like

    • Vivek says:

      Hi Santosh,

      I understand. Since I write only on Real Life problems, and mostly on Interesting issues, there are times when you are working on production issues, but are not that interesting to blog. I refrain from writing on generic topics that are usually well covered in Oracle Documentation and wait for something, that would be of interest for the readers. Another reason for the delay is hectic schedule, which sometimes don’t give you enough time to write 🙂

      But yes, blogging is my passion that I like the most and will continue to blog, atleast once a month.

      Regards
      Vivek

      Like

  4. Arun says:

    Hello Vivek,

    First of all Merry Christmas, and thanks for the drop from your entire knowledge of ocean 🙂
    Happy to see you being active on this blog again. Your blogs are always helpful and yes we are really grateful to you for removing sometime from your busy schedule.

    Regards
    Arun

    Like

  5. savvinov says:

    Hi Vivek,

    a very interesting example. What makes it even more interesting for me, is that I was asked to look at a case with very similar symptoms. Unfortunately, they only came to me after the retention period for ASH was over so I was unable to identify specific SQL hanging on cursor: pin S wait on X. However, indirect evidence is pointing to distributed queries, because they have lots of feeds working through db links.

    Do you have any ideas as to the exact mechanism of such symptoms appearing as a result of db link inaccessibility? What does it appear like library cache contention?

    Best regards,
    Nikolay

    Like

    • Vivek says:

      Hi Nikolay,

      Thanks for the nice words and good to know that Oracle Community can relate my blogs to their own examples. This serves the purpose of blogging. 🙂

      In case of db link inaccessibility, the Query accessing remote tables have to be parsed in a local Shared Pool. Therefore, these queries, during parsing phase acquire library cache related latches / mutexes and wait for the other pending informations, like object privileges, object_name etc from the remote database. Since the remote database is not accessible, the latches are not released for a specific duration of time. In an concurrent environment, this is enough to cause huge waits. In my case, since this was a Banking Application and was accessed over the net, the waits were very prominent.

      Regards
      Vivek

      Like

  6. Jamsher says:

    Wonderful example of performance tuning.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s