Latch Contention Troubleshooting
Systematic Latch Contention Troubleshooting in Oracle
By Tanel Poder
Thanks to IOUG team for giving me the IOUG Select Journal Editor's Choice award (2011) for this article!!!
As an Oracle DBA, developer or performance analyst, you may have run into what is termed "latch contention" at various points. So what exactly is a "latch" and why do we have contention on this "latch". In this article, we will take an in-depth look at latches and how we determine and resolve such contention. Whether you are a newbie or an experienced old-timer, we hope this article will cast a little more light on this ill-understood subject.
What is a latch?
I will put it very simply: Latch is a lock. In fact, a latch is just another special type of lock in Oracle. Oracle uses many different types of locks internally, for example enqueue locks, library cache locks, library cache pins, buffer cache pins, mutexes and latches and these different lock types are useful for different purposes. Enqueue locks are sophisticated, provide ordered queuing capabilities and are used for cases where the locks are held for relatively long durations. Latches on the other hand are much less sophisticated, more lightweight and are used for protecting very short operations on memory structures such various internal linked list modifications, shared pool memory allocation, library cache object lookups and so on. Latches are very low-level locks managed deep inside Oracle kernel code and users or applications can not directly acquire nor release them.
So, how is a latch physically implemented? Every latch is just a memory structure in SGA, usually 100-200 bytes in size, depending on your Oracle version, hardware platform and whether you are running 32 or 64-bit Oracle.
Contrary to a common misconception, latches are in no way related to OS synchronization mechanisms such as mutexes or semaphores! Latches are just plain memory structures in SGA and the OS doesn't even have an idea that latches exist. Taking a latch does not require any system calls or OS interaction. If that were the case, latch operations would slow down considerably as such system calls are expensive. If a process wants to take a latch, it just reads latch memory to check whether the latch value is zero (i.e. latch is not taken) and if it is zero, then it changes the value to indicate that the latch is taken. For "exclusive-get-only" latches, the value will be hardcoded, such as "0xFF" on Intel platforms, but for "shared-get-capable" latches, the acquiring process will put its own Oracle PID in there if it takes that latch in exclusive mode. If a shared latch is held by many processes simultaneously then the latch value will just show the reference count (the number of shared mode holders) and tracking individual processes is done via state objects memory structures.
So, if a latch is just some memory content in SGA, modified by processes at will, who then guarantees that two processes (running on different CPUs) don't accidentally take the same latch simultaneously, without knowing about each other? This would be a race condition, and if allowed, the Oracle instance could quickly end up in a crash or a hang. The solution lies in atomic "test-and-set" and "compare-and-swap" (CAS) instructions which all modern hardware supports. This allows the Oracle kernel code to perform latch value checking and changing in a single atomic instruction. On some platforms (such as Intel), this needs to be combined with a special flag which locks the memory line where the latch lives for a single CPU access only for the duration of the latch get. This is done with LOCK CMPXCHG instruction on Intel platform. CMPXCHG means Compare and Exchange and the LOCK prefix guarantees that only one CPU at a time can access that latch (remember, latches are usually held for very short periods, measured as low as just microseconds).
Since the Oracle SGA holds lots of information in various types of memory structures, it is not enough to have just one latch for each type of memory structure to be protected. For example, if you have an 8 GB buffer cache for 8 KB blocks, then you have one million buffers in the buffer cache. Reasonable concurrent access to all these million buffers can't be provided by a single latch (in this case a "cache buffers chains" latch). As well, the library cache holds thousands if not hundreds of thousands of cursors and objects in it, reasonable access to all of them would not be possible with a single "library cache" latch, especially if you have lots of CPUs in your database server box. Therefore, Oracle has split the memory structures that need protection into logical sub-pieces and each piece is protected by a separate sub-latch known as a child latch. Some structures such redo log buffer are physically split into ranges and each range is protected by (and accessed using) a separate latch (redo allocation latch for log buffer). In some cases, an algorithm such as a hash or modulus function is used for determining which exact child latch should be used when accessing an object. For example, if you run a database with large buffer cache (100GB+), you may have millions of cache buffers chains child latches in use. When accessing a data buffer block, Oracle will run a modulus function on the data block address (DBA) of the block to determine which cache buffers chains latch to get to satisfy that access. The right library cache latch is also determined by running modulus function on the hash value of library cache object name (the hash_value column in V$SQL for example).
Since latches are held for very short periods, if a latch is busy due some other process, then the acquiring process doesn't give up immediately when it cannot get the latch on the first try. If a latch is busy, the acquiring process tries again few thousand times instead of going to sleep (and wait) immediately. The reasoning here is that if a latch is currently busy, it normally will be freed very quickly, so instead of taking the trouble of going to sleep (requires a syscall and context switch) we "spin" and burn some more CPU and test with the hope of getting the latch some (hundreds) of microseconds later. In other words, some CPU time is sacrificed but ultimately this will save time as it may avoid the needing for the process to "sleep" or suspend itself. This is called spinning (busy-waiting). Oracle registers a "latch free" wait event only after the process has failed to acquire the latch after the initial spinning and goes to sleep.
Note that some latch get operations explicitly don't perform a "spin-get" and hence the latch get will fail immediately after the first attempt to get it didn't succeed. This technique is used for some latches where Oracle doesn't care which exact child latch to take and is called immediate latch gets or "not willing to wait" latch gets. The "redo copy" latch is one such example as this helps the LogWriter (LGWR) process to know whether anyone is currently copying redo data to log buffer, but not who exactly is copying and where, as this does not matter to LGWR.
Systematic troubleshooting of latch contention
So after this brief introduction to latching in Oracle, let's see why latch contention occurs and how to troubleshoot it. Remember – essentially, a latch is a type of lock. It allows controlling concurrent access to some SGA memory structures so we wouldn't end up with race conditions, SGA corruptions and crashes. So, take note of this simple, but important concept in latch (or any lock) contention: Such latch contention occurs only when there is some other process already holding the latch which our process wants to acquire!
In other words. latch contention can only happen when these two conditions are satisfied:
1) Someone must try to get the latch
2) Someone else must be already holding the latch
Latch contention means that some process is forced to sleep as it cannot acquire the latch it wants due someone else already holding it. There will be no latch contention if both of the above conditions aren't met! Therefore, whenever we see latch contention, we need to find out two things:
1) Who is trying to get the latch (and why)?
2) Who is already holding the latch (and why)?
These two questions need to be asked for any lock contention and is applicable here as well. The following sections explain how to answer these questions in Oracle.
Question 1 - Who is trying to get the latch and why?
The first question can be answered with the help of any session-level tool based on the Oracle Wait Interface, such as the V$SESSION_WAIT view, Active Session History or SQL_TRACE. All these tools would let you know for which exact (child) latch address the wait occurred.
Note that I did not mention instance-level performance data collection tools such as Statspack or AWR reports in the above list. The problem with these tools is that they only gather and store system-wide performance data such V$SYSSTAT or V$SQL snapshots, and thus they don't know anything about which exact sessions are experiencing the latch contention and waits. These tools would show that someone is experiencing latch contention problems in the database instance, but finding out who and exactly why would involve a significant amount of guesswork, previous experience and luck. Thanks to this information gap, there currently are no systematic approaches to latch contention troubleshooting published and hence such troubleshooting is still perceived as a very complex task. This article aims to change this as there are ways to fill this information gap, which allows us to become systematic and not depend on luck.
There are two common trigger points for latch contention troubleshooting. One is that a DBA sees (from Statspack or AWR report or other monitoring tools) that a significant amount of instance-wide response time is spent waiting for some latch (from the "Top 5 Timed events" section for example). The other trigger point would start from the end user (who's complaining) or some measured business task which takes too long. In this case the DBA would identify the session(s) used by that user or task and use some session-level tool (as mentioned above) for quantifying where the response time is spent.
Question 2 - Who is holding the latch and why?
The second question can be easily answered (starting in Oracle 8.0) by querying the V$LATCHHOLDER view which lists any process/session currently holding a latch. A description is shown below:
Although little known, this is an extremely useful view for systematic latch contention troubleshooting as it provides one important bit of information which no other Oracle view or trace provides - the session ID of the latch holder. Once you know the holder's SID, you can focus on that session and see exactly what it is doing that it needs to hold the latch for so long.
Columns of interest are SID (Session ID), NAME (Latch name), LADDR (Latch address in memory). LADDR is the latch's unique identifier in an instance and helps to determine which exact child latch was held in case the latch has multiple child latches.
As I said earlier, latches are usually held for very short durations, so manually querying this view may not return anything at all even after querying it multiple times. For this reason, I have written a tool for high-frequency profiling of latch holders which I will introduce now. It's called LatchProf and is a plain SQL script which doesn't require any modifications of the database or schema. You can download it for free from my website at http://blog.tanelpoder.com – just search for "latchprof" using the search button on the right side of the page.
Using LatchProf to find who are holding latches the most
Let's assume that we have noticed shared pool and library cache latch contention, reported by a Statspack/AWR report or some monitoring tool. Remember what I mentioned earlier - latch and any other lock contention occurs only if someone is already holding the latch while I'm trying to get the same. Instead of guessing who are the troublemakers, I can just run LatchProf which samples V$LATCHHOLDER at high frequency and reports the top latch holders.
I could start by looking which latch types are the busiest in the system. Note that as there are many latch types in Oracle and they also have lots of child latches then the aggregate "busyness" of a latch does not necessarily reflect the (child) latch with the highest contention. While the Oracle Wait Interface is the only correct source for detecting contention, LatchProf should be used for drilldown once the latch experiencing contention has been detected.
To illustrate the capabilities of LatchProf, I start from a simple example which shows an overview of how busy different latch types are. By "busy" I mean how much of the time these latches are held by somebody.
LatchProf takes four parameters:
Parameter 1 specifies which columns from V$LATCHHOLDER to report and group by. In the case below I just want to report latch holds by latch name (and not even break it down by SID for starters).
Parameter 2 specifies which SIDs to monitor. In the case below, I am interested in any SID which holds a latch (%).
Parameter 3 specifies which latches to monitor. This can be set either to latch name or latch address in memory. In the case below, I monitor all latches (%).
Parameter 4 specifies how many times to sample V$LATCHHOLDER. I use 100000 samples below, which completed in a couple of seconds on my test database. The sampling speed depends on your server CPU/memory bus speed and the value of processes parameter. You should start from lower number like 1000 and adjust it so that LatchProf would complete its sampling in a couple of seconds, and that is usually enough for diagnosing ongoing latch contention problems. You shouldn't keep sampling for long periods since LatchProf runs constantly on the CPU.
So, let's run LatchProf now. After few seconds of sampling, LatchProf returns the above profile:
The output column meanings are as follows:
Name - Latch name
Held - During how many samples out of total samples (100000) the particular latch was held by somebody
Gets - How many latch gets against that latch were detected during LatchProf sampling
Held % - How much % of time was the latch held by somebody during the sampling. This is the main column you want to be looking at in order to see who/what holds the latch the most (the latchprof output is reverse-ordered by that column)
Held ms - How many milliseconds in total was the latch held during the sampling
Avg hold ms - Average latch hold time in milliseconds (normally latches are held from a few to few hundred microseconds)
Note that LatchProf works on Oracle versions 9.2 and higher, but the Gets and Avg hold ms columns are shown only on Oracle 10g and higher as the required data isn't externalized in the V$LATCHHOLDER until 10g.
So, from the above output we can see that the library cache latches were the busiest during LatchProf run (3.32% of the sampling time there was some library cache latch held by some session).
When a latch is busy, one obvious question would be: "which sessions are holding the latch the most?" Sometimes all the sessions in an instance are both victims and troublemakers (each contributing to the latch contention problem just a little), but sometimes there are one (or a few) heavy troublemakers and most other sessions end up as victims.
Luckily LatchProf allows us to break down the latch holder statistics by SID and since I am interested only in the library cache latches for now, I can monitor only this latch, as shown in bold below:
In the listing above, we have just found out the SID of the main "offender", i.e. SID 120, which apparently holds some library cache latch(es) over an order of magnitude more (5.06% of time) than the next most offending holder sessions (139,113).
Since we know the SID of the troublemaker, we can now see what this session is doing, determine which SQL its running by querying V$SESSION, ASH or by enabling SQL trace. In fact, the easiest way for getting an idea what kind of work a session is doing is running my Snapper script. This script is also available on my site (and doesn't make any changes to the database, by the way!) Snapper takes 4 parameters (out|trace - where to show the output, either DBMS_OUTPUT or tracefile, 5 - how many seconds to sleep between session performance snapshots, 1 - how many samples to take, 120 - SID to monitor):
The first thing that immediately jumps out from SID 120's execution profile is that it does heavy hard parsing, 890 hard parses per second! (shown in bold). That's a pretty evident reason for the frequent library cache latch holding as reported by LatchProf. We can be almost sure that the application code is not using bind variables where it should use them. The next step would be to identify couple of SQL statements from this session, either by briefly enabling SQL_TRACE or just by querying V$SESSION.SQL_ID and mapping this to V$SQL, and then call the developer(s) who wrote that code and get them to fix it!
Let's assume that we could get this code fixed, but our monitoring system still reports some ongoing library cache latch contention and we don't see an obvious troublemaker anymore (all sessions reported are holding the monitored latches roughly equal amount of time) as seen below:
So, we now have two ways to move forward:
1) Take one of the reported sessions and see what it is doing, by enabling SQL trace or similar. However this approach may not tell you which of the SQL statements is causing the latches to be held
2) Run LatchProf with SQLID option, which also reports the SQLIDs (or hash values in 9i) of the statements causing the latches to be held
I will proceed with option 2, and will just specify SQLID (or SQLHASH in 9i) in the first parameter to LatchProf:
This shows me that the main statement causing library cache latches to be held has SQL_ID '5csdgx2jquqvj'.
Using the SQL ID I can find the text of the query as shown below:
Apparently someone is running the above query really frequently! (Of course I should see how much the executions increases right now instead of just looking into the total number of executions since that cursor was loaded into the library cache).
The common reasons for such frequently re-executed, but useless SQL can be either someone's really bad piece of code or bad connection pool design feature, which constantly polls DUAL table to verify if the connection is still alive.
Note that it's possible to break the latch holders down by both SID and SQLID to get a more detailed view of latch holders:
The above example was meant as an introduction to the capabilities of LatchProf. I used library cache latches as an example, but LatchProf can show the same information for any other type of latch (there are hundreds of different types of latches in Oracle!)
The key difference between LatchProf and other latch contention troubleshooting approaches are:
1) LatchProf shows you who is holding the latches (who is causing the contention) instead of just showing who's waiting for them
2) LatchProf shows you which actual sessions are holding the latches (not some system-wide ambiguous metrics), thus you can simply focus on the troublemaking sessions and see what are they doing
3) LatchProf shows you which SQL statement caused taking the taking of the latch, thus narrowing down the root cause even more
Having established the basics, let's now dig a bit deeper with another example.
Using LatchProfX for finding the hot block causing cache buffers chains latch contention
Let's say your monitoring system reported significant cache buffers chains latch contention and you want to diagnose it. Cache buffers chains (CBC) latch contention can occur due a number of reasons. One of the most common reasons is that you have an extremely hot block in the buffer cache. Of course we don't know the root cause in advance, so let's systematically troubleshoot this problem.
First, there are many CBC latches in an Oracle instance. In my test server with a small buffer cache, I have 4096 CBC child latches. A big production database with a large buffer cache can easily have hundreds of thousands of child latches! You can check that using the following command:
So, as a first step we should identify which CBC latches are experiencing the contention – i.e. is the contention lightly spread across many CBC latch children or is there a single child latch experiencing much heavier contention compared to others?
We will need to use the Oracle Wait Interface since it can report the contention (waits) for latches. This will require us to use either SQL_TRACE or V$SESSION_WAIT/ASH or calculate deltas of V$LATCH_CHILDREN.WAIT_TIME to see which child latch experiences the most sleeps. I do like the first two approaches more as they allow to tie the latch waits together with a session and with whatever it was doing. For example, SQL_TRACE reports the following lines when a traced session is experiencing CBC latch contention:
The "address" part is important - it shows you exactly which child latch is experiencing the contention (every child latch in the instance has an unique memory address). I took the most prevalent address (in decimal) from the tracefile, converted it to hex and queried V$LATCH_CHILDREN with it as seen below:
Alternatively you can sample V$SESSION or query ASH (if you have the Diagnostics Pack licenses!) as I've shown below:
From above query I get the same child latch address (in hex) as reported by SQL_TRACE. Now that I know the actual CBC child latch address, I can run LatchProf with filtering by individual child latch address! That way, I can monitor the holders of this particular child latch experiencing worst contention.
The top SQLID causing holding of the CBC child latch at address 3AA4F6F08 is "f0cxkf0q803f8".
From this, I know both the top sessions causing the contention and also the top SQL's causing this latch contention! Since we now know the SQLID, the next step could be to check the execution plan of the given SQL and see why is it accessing the datablock protected by our latch so frequently. Sometimes, the contention comes from many frequent revisits of the block(s) due a bad execution plan, and may involve nested loop joins in improper locations, looping over many rows. However, in other situations, the problem may just occur due a hot block, such an index root block, visited by large number of concurrent sessions.
Still, it's useful to know what kind of hot blocks are causing the latch contention.
The good news is that we can exactly identify the exact data block which is so hot that it causes the contention. There are few approaches for identifying the hot block out there, but the most accurate of them is the LatchProfX script. The LatchProfX is available on my website, again for free. However, note the X in the end of script name - it means LatchProf eXtended or LatchProf which requires X$ tables, which means extended privileges.
LatchProfX shows an additional column, called "object". This column shows information about the object protected by a given latch. For cache buffers chains latches, this object shows the Data Block Address (DBA) of the block that we accessed, causing the latch get! See the example below:
It looks like the block at data block address 1C0005A (in hex) is the troublemaker!
Lets check into which segment does this block belong:
As seen above, this block belongs into an index partition of an index SOE.CUSTOMERS_PK. The block itself lives in relative file number 7 and block number 90 in that file. It is possible to dump the contents of the block using ALTER SYSTEM DUMP DATAFILE 7 BLOCK 90 command (note that this ALTER SYSTEM command takes absolute file ID as parameter, but the above number is a relative file ID and should be converted to absolute figure using DBA_DATA_FILES).
However, the data_obj# column shows the data_object_id of the index partition segment and using this information, I can query and find out where this index partition segment's header block resides. Keep in mind that Index root blocks are always stored immediately next to index segment header so that Oracle would know where to physically start traversing the index when accessing it.
So from the above, the segment header block (7,89) resides immediately before our hot block (7,90) so we know that the hot block is an index root block. It is normal to have some contention on very commonly accessed blocks such index root and first level branch blocks (especially of primary key indexes and so on). Oracle has come up with many optimizations for relieving such contention - for example, it allows shared CBC latch access for examination of the index root and branch blocks on most platforms.
Note that sometimes the latch contention shows inefficient physical design for a given concurrency requirement such as not spreading the contention points across multiple blocks/latches by partitioning the object to spread the "hot" data to multiple blocks in various partitions.
Often though, the CBC latch contention is merely a symptom of an incorrect execution plan, such as a nested loop join operation that revisits the same data blocks again and again many times in a tight loop. For example the plan below would work well when the number of matching orders returned from ORDERS table is low, so we need to iterate through the CUSTOMERS table (and the CUSTOMERS_PK index which had contention) relatively small number of times.
Limitations of V$LATCHHOLDER and LatchProf(X)
There's a limitation in the V$LATCHHOLDER view. While it is enough for troubleshooting 99.9% of latch contention problems, there are couple of cases where V$LATCHHOLDER view does not show the latch holder even if it's actually held.
If your STATISTICS_LEVEL = BASIC then V$LATCHHOLDER does not maintain the latch holder information for ultrafast latches. Ultrafast latches are a small subset of latches (like cache buffers chains latches) which can be "even faster" by not maintaining some instrumentation, stats and counters. For all other latches (in practice everything else than CBC latches) will still maintain the latch holder info even if the statistics_level = basic (the undocumented parameter for setting this independently from statistics level is named _ultrafast_latch_statistics). Note that even when the ultrafast latch statistics info is not gathered, it's still possible to extract the latch holder info directly from SGA memory, thanks to state objects which Oracle has to maintain for process recovery.
V$LATCHHOLDER scans through the process state object array (V$PROCESS/X$KSUPR) and looks into a field there which points to the latch held by a process. However, sometimes a process can hold multiple latches at the same time (library cache latch + shared pool latch for example). When a process holds multiple latches, then unfortunately V$LATCHHOLDER only reports the first one taken by that process. This may lead to cases where you see sessions waiting significantly for (let say) shared pool latch, but nobody seems to be holding it too much. However, as said in the beginning of this article, you should only start troubleshooting latch contention whether you see sessions actually wait for that latch! So if you see sessions waiting (the contention exists!), but nobody apparently holding the latch much in V$LATCHHOLDER, then you may be hitting this shortcoming of that V$ view. This is the case when you may need to fall back to old latch contention troubleshooting techniques (which do require lots of previous experience with this and involve quite an amount of guesswork ). Or alternatively you can hire me as a consultant, I have tools for reading the truth out of the state objects in SGA memory ;-)
There's a huge variety of reasons for latch contention in Oracle (and there are hundreds of different types of latches in Oracle too) and I hope to discuss some of the most common reasons in a future article. However it helps to keep in mind two main facts about latch contention:
1) Latch contention and waits happen only if someone tries to get a latch which is already held by someone else! Therefore, after finding out which sessions are waiting and for which latches, it's good to find out which sessions are holding these latches and why (and that's why I wrote LatchProf and LatchProfX!)
2) Latch contention is usually a symptom of some other problem, such bad connection / cursor management, inefficient execution plan, physical design issues or even CPU starvation.
P.S. If you want to take on an interesting SQL challenge, then look into LatchProf or LatchProfX source code and understand how it manages to sample the V$LATCHHOLDER view up to a hundred thousand times per second while being just a single SQL statement!