“Library Cache : Mutex X” ! An Interesting Optimization
May 28, 2013 7 Comments
One of my Customer is migrating from an Non-Oracle Database to Oracle Database and this being a critical Application, the migration is planned in a phased manner. Migration happens every fortnight (a city or set of cities at a time) and is enough to increase the load as a Vertical line (if plotted in a graph). In a simple term, the increase in the load is not linear. With such a dramatic increase, some or the other issues can be expected and this is what triggered post one such migration. The setup is a 2 Node RAC with Database version is 220.127.116.11.8. Connection Pooling is used at the Apps Layer.
During one of the migration the customer increased CPU’s from 80 to 148 on each of the Node. This was based on the recommendation from the Application Team. Immediately, post the migration the top most wait event was “library cache : mutex x”. While this was not impacting the performance but was a cause of concern as the migration was only 40% and next migration would add the load further by 20%.
We had two dramatic changes. These are : (1) The CPU’s Increased from 80 to 148, and (2) The load increased many fold because of the migration of new branches. It was critical to investigate the issue and therefore, we started monitoring the system during peak load. An interesting observation that came out was a function, which was on top, most of the time. It was BEGIN :1 := PA_MC_ENTITY_CONTEXT_FN(:2); END;. The number of executions of this function were also around 600+ per second. For more understanding on this function, we scheduled a discussion with the application team and the outcome of this discussion was very interesting.
During the discussion, it came out that the application has a functionality to use Virtual Private Database (VPD) and this function is used to set the application context, which is then used by each of the application queries to filter out the rows that do not match the context. Usually, VPD is used as a row level security, which is implemented by way of a function to generate WHERE predicate depending upon the privileges or profile of the application user. These predicates can be different for different set of users. All this is generated at run time. Interestingly, in this case, the application context is not generated dynamically based on privileges, but is set to a fix value, ’4000′, for all the database session. Technically, there was nothing wrong in the implementation and therefore, it was not important to drill down further to relate the link between this function and mutex waits. The VPD Implementation and function code all looked fine
The interesting fact that we came to know was the way this function was called. Further investigation and discussion with the Application team revealed that for each database call, this function is called to set the application context. This means, if a session executes 5 queries, as a part of a transaction, the function to set the application context will be executed 5 times. As per them, the reason behind this is that as the Apps Layer use Connection Pool, there is no way to know whether the connection picked up by the application user is one of the existing connection from the pool or a newly spawned (once all the initial connections are exhausted). Clearly, executing a function for each database calls and setting a context on each call seemed to be a problem, as for each execution of this function, the mutex was requested in an exclusive mode to write to a memory location.
As an immediate optimization, in order to reduce the number of execution and exclusive gets, it was suggested to check for the existence of application context before setting the context. This was achieved by way of a query to first check whether the context is set to desired value. If it is, this means, it is one of existing connection and no need to set the application context. If not, this means, this is a new connection and therefore, the application context is set to ’4000′. With this implementation, the number of executions of the function reduced drastically, thereby reducing the waits on Library Cache Mutex.
In this case, everyone suspected this to be an Oracle Database BUG. In order to eliminate or reduce the contention, it is imperative to optimize the amount of work done. Unwanted work is always an overhead, as was the case in this optimization.