Troubleshooting Exadata v2 Smart Scan Performance
Introduction
In a recent stress-testing exercise of a large DW on Oracle Exadata v2 I found an interesting performance issue and thought to share it.
Note that the problem happens due a bug in Oracle database layer code, not because of an Exadata design limitation. This bug should be fixed on Oracle 11.2 Exadata bundle patch update4 (BP4), but more about this later.
I publish this article to show that systematic Exadata performance troubleshooting techniques aren't fundamentally different from existing non-Exadata techniques, especially the starting point of troubleshooting. After all, Exadata is just an Oracle Database which happens to be able to intelligently communicate with its storage to avoid unnecessarily moving too much data around. Also, hopefully this article helps someone to avoid this issue in the future or at least diagnose it faster when it happens.
Test Environment
So, I was stress-testing a Sun Oracle Database Machine (a full rack Exadata v2 with SAS disks) to the extreme, to know it's limitations and see where it breaks. Any piece of software/hardware is breakable, including the Oracle Unbreakable Linux based Exadata cluster ;-). You want to discover these issues, so you could either fix these, work around them or ensure that you never drive your workload to such extremes or situations where things break.
The database was loaded with data from an existing Oracle DW getting close to 100TB in size, the nature of data was telecom CDRs, etc. Extracting (from SPARC platform), transferring, compressing (!) and loading the data fast enough was a challenge of its own, I'll write about that in the future.
Fact tables were date-range partitioned and hash sub-partitioned.
No fact tables had any indexes (not even in the pre-Exadata configuration) as the storage subsystem had lots of bandwidth (including more than just two HBAs) and the schema was designed to utilize parallel full scans together with partition pruning to reduce the amount of data to scan.
One of the main fact tables had over 400 columns in it, that plays a role with the bug I hit.
The Problem
I noticed that one of the queries got much slower when some columns were added to the select (projection) list. When I removed these columns, the query completed fast again. Sometimes such behavior happens if with fewer columns selected, Oracle may be able to get all these columns from an index and avoid going to the table after all. In our case, the table didn't have any indexes anyway and the execution plan didn't change either.
I reproduced the same problem with serial execution and using a test table I created (this test table had also over 400 columns).
Check the response times below:
SQL> select sum(col_254) from t where padding = 'x';
SUM(COL_254)
------------
0
Elapsed: 00:00:00.46
SQL> select sum(col_255) from t where padding = 'x';
SUM(COL_255)
------------
0
Elapsed: 00:00:00.48
SQL> select sum(col_256) from t where padding = 'x';
SUM(COL_256)
------------
0
Elapsed: 00:00:55.87
Wow, there was over 100x difference in the response time when all I did was just select the 256th column of the table compared to selecting an "earlier" column in the table!
The execution plan in both cases was the same and showed the storage predicate, which indicates that the database attempts to do a smart scan (by offloading a predicate to the storage cell):
----------------------------------------------
| Id | Operation | Name |
----------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT AGGREGATE | |
|* 2 | TABLE ACCESS STORAGE FULL| T |
----------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("PADDING"='x')
filter("PADDING"='x')
The logical next step was to measure where the extra time was spent and I did it with my Snapper tool:
SQL> @snapper all 5 1 869
Sampling SID 869 with interval 5 seconds, taking 1 snapshots...
-- Session Snapper v3.13 by Tanel Poder @ E2SN ( http://tech.e2sn.com )
----------------------------------------------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------
869, DWH , STAT, session logical reads , 43871, 8.77k,
869, DWH , STAT, cluster wait time , 84, 16.8,
869, DWH , STAT, user I/O wait time , 302, 60.4,
869, DWH , STAT, non-idle wait time , 387, 77.4,
869, DWH , STAT, non-idle wait count , 12796, 2.56k,
869, DWH , STAT, physical read total IO requests , 3714, 742.8,
869, DWH , STAT, physical read total multi block requests, 44, 8.8,
869, DWH , STAT, physical read requests optimized , 2396, 479.2,
869, DWH , STAT, physical read total bytes , 56926208, 11.39M,
869, DWH , STAT, cell physical IO interconnect bytes , 52874336, 10.57M,
869, DWH , STAT, gcs messages sent , 3035, 607,
869, DWH , STAT, consistent gets , 43827, 8.77k,
869, DWH , STAT, consistent gets from cache , 40551, 8.11k,
869, DWH , STAT, consistent gets from cache (fastpath) , 36889, 7.38k,
869, DWH , STAT, consistent gets direct , 3281, 656.2,
869, DWH , STAT, physical reads , 6949, 1.39k,
869, DWH , STAT, physical reads cache , 3667, 733.4,
869, DWH , STAT, physical reads direct , 3281, 656.2,
869, DWH , STAT, physical read IO requests , 3713, 742.6,
869, DWH , STAT, physical read bytes , 56918016, 11.38M,
869, DWH , STAT, free buffer requested , 3662, 732.4,
869, DWH , STAT, gc local grants , 632, 126.4,
869, DWH , STAT, gc remote grants , 3028, 605.6,
869, DWH , STAT, gc CPU used by this session , 53, 10.6,
869, DWH , STAT, cell physical IO bytes eligible for pred, 26877952, 5.38M,
869, DWH , STAT, cell physical IO interconnect bytes retu, 22826080, 4.57M,
869, DWH , STAT, cell session smart scan efficiency , -35, -7,
869, DWH , STAT, no work - consistent read gets , 40425, 8.09k,
869, DWH , STAT, table scan rows gotten , 68618, 13.72k,
869, DWH , STAT, table scan blocks gotten , 3119, 623.8,
869, DWH , STAT, table fetch continued row , 3660, 732,
869, DWH , STAT, cell blocks processed by cache layer , 5428, 1.09k,
869, DWH , STAT, cell blocks processed by txn layer , 5428, 1.09k,
869, DWH , STAT, cell blocks processed by data layer , 3625, 725,
869, DWH , STAT, cell simulated session smart scan effici, 29696000, 5.94M,
869, DWH , STAT, cell IO uncompressed bytes , 29696000, 5.94M,
869, DWH , STAT, buffer is not pinned count , 40164, 8.03k,
869, DWH , STAT, cell flash cache read hits , 2374, 474.8,
869, DWH , TIME, DB CPU , 1036843, 207.37ms, 20.7%, |@@@ |
869, DWH , TIME, sql execute elapsed time , 3903062, 780.61ms, 78.1%, |@@@@@@@@ |
869, DWH , TIME, DB time , 3903062, 780.61ms, 78.1%, |@@@@@@@@ |
869, DWH , WAIT, gc cr grant 2-way , 829611, 165.92ms, 16.6%, |@@ |
869, DWH , WAIT, gc cr grant congested , 363, 72.6us, .0%, | |
869, DWH , WAIT, cell smart table scan , 1503, 300.6us, .0%, | |
869, DWH , WAIT, cell single block physical read , 3008521, 601.7ms, 60.2%, |@@@@@@ |
869, DWH , WAIT, events in waitclass Other , 8141, 1.63ms, .2%, | |
-- End of Stats snap 1, end=2010-07-01 14:35:58, seconds=5
-----------------------------------------------------------------------
Active% | SQL_ID | EVENT | WAIT_CLASS
-----------------------------------------------------------------------
57% | fvq6t9hy7dts3 | cell single block physica | User I/O
26% | fvq6t9hy7dts3 | gc cr request | Cluster
17% | fvq6t9hy7dts3 | ON CPU | ON CPU
-- End of ASH snap 1, end=2010-07-01 14:35:58, seconds=5, samples_taken=46
What the hell? We are spending over 50% of response time doing single block IOs? (That's what a cell single block physical read wait event means). Remember the execution plan above, we are doing a single full table scan, no joins, no indexes, no other fancy stuff.
We are running a smart scan as we see a tiny amount of time spent waiting for cell smart table scan wait event (and an earlier snapper snapshot, which captured the start of the query also reported that the cell scans counter go up.
I have highlighted few more counters of interest in bold, but what gives us a good clue of what's happening is the table fetch continued row statistic about which I've written before. The statistic says my session was looking up next row piece of a chained row it had hit 732 times per second. Now when you look into the physical read total IO requests statistics above, they confirm that most of the IO requests done are single block IOs (physical read total IO requests - physical read total multi block requests = single block IO requests). In fact, 742.8 - 8.8 = 736 single block IOs were done by second, which is suspiciously close to the 732 table fetch continued row fetches per second.
This could be a coincidence but the table fetch continued row statistic which always appeared with the problematic single block IOs was a very good indicator of what was happening - before I explain it, see a snapper run from a test case where I selected column number 255 (instead of 256 as above):
SQL> @snapper all 30 1 584
Sampling SID 584 with interval 30 seconds, taking 1 snapshots...
-- Session Snapper v3.13 by Tanel Poder @ E2SN ( http://tech.e2sn.com )
----------------------------------------------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------
... some output snipped...
584, DWH , TIME, DB CPU , 1672745, 55.76ms, 5.6%, |@ |
584, DWH , TIME, sql execute elapsed time , 30056983, 1s, 100.2%, |@@@@@@@@@@|
584, DWH , TIME, DB time , 30056983, 1s, 100.2%, |@@@@@@@@@@|
584, DWH , WAIT, enq: KO - fast object checkpoint , 257277, 8.58ms, .9%, |@ |
584, DWH , WAIT, cell smart table scan , 26442350, 881.41ms, 88.1%, |@@@@@@@@@ |
584, DWH , WAIT, events in waitclass Other , 1163764, 38.79ms, 3.9%, |@ |
-- End of Stats snap 1, end=2010-06-28 16:38:12, seconds=30
-----------------------------------------------------------------------
Active% | SQL_ID | EVENT | WAIT_CLASS
-----------------------------------------------------------------------
87% | bapnmnkpqjn71 | cell smart table scan | User I/O
9% | bapnmnkpqjn71 | ON CPU | ON CPU
2% | bapnmnkpqjn71 | reliable message | Other
2% | bapnmnkpqjn71 | gcs drm freeze in enter s | Other
-- End of ASH snap 1, end=2010-06-28 16:38:12, seconds=30, samples_taken=97
This is how normal smart scan waits should look like, no single block IOs and most of time is spent waiting for cell smart table scan or of course running on CPU. Note that for simplicity I ran this test case in serial execution mode, which still is able to use serial direct path reads and smart scan.
The Resolution
After seeing the table fetch continued rows combined with unexpected single block reads, it was clear that this problem happened due row chaining as that statistic is about chained rows (by the way, migrated rows are just one variation of chained rows). However, migrated rows don't cause that statistic to increase if you are using a full table scan to access the table. And I had just created that table with CTAS, so there couldn't be any migrated rows anyway.
Such unexpected single block reads due chained rows can easily show up in non-exadata environments too when running table scans, but the rows both in the original fact table and my narrowed-down test case were much smaller than 8kB, so they all did fit into their blocks entirely, without need for physical chaining to other blocks.
This is where the number of columns in the table comes into play:
The fact table had over 400 columns in it and my test table had 260 columns.
When I selected any column from column 1 up to column 255 the smart scan performance was ok.
When I included any column at position higher than 255, the performance got bad
Hmm, 255 is a very familiar number :-)
Oracle stores the number of physical columns present in a row in a single byte in the row header. Therefore a physical row piece can have only up to 255 columns in it!
Wait, but how come I can create a table with over 400 columns then? And how can Oracle support up to 1000 columns in a table?
Well, up to Oracle 7.3 you couldn't have more than 255 columns per table. The 1000 column support was added in Oracle 8.0 and Oracle engineers didn't want to change the row structure just for that (reasons: backward compatibility and migrations without having to update all your data).
So, the clever solution was to just chain that that 400-column row into 2 row pieces, just as it's done when a single row is too big to fit into one block. But in our cases the chaining into two row pieces was done so that we could put 256 columns into first row piece and the rest into the next row piece. Every row piece header has its own column count byte! So, if you need to store a 260 column row, you'll need 2 row pieces. If you want to store a 1000 column row, you'll need 4 row pieces.
So if the total length of the row actually does fit into a single block, but multiple row pieces are needed for supporting over 255 columns in the table, then Oracle doesn't need to chain the next rowpiece(s) into different blocks from original, it will just put the next row piece into the same block with the head piece - this is called intra-block row chaining.
And the rest is simple, apparently Exadata smart scan can't handle tables with more than 255 columns well, right? My test case clearly showed that when selecting a column at position 256 or higher, Oracle database started performing lots of single block IOs, instead of offloading everything to the Exadata cell. Apparently the database software (or exadata cell server) did not realize that the trailing row pieces were chained into the same blocks as the head pieces and could have been just transparently read and concatenated in the cell during the smart scan, instead of resorting to the usual fetch-the-next-row-piece-with-a-separate-single-block-read-whenever-chained-row-found approach executed in the database layer. For every row piece returned by the smart scan, the database layer had to do an additional logical IO for fetching the remaining row piece and when the required buffer didn't happen to be in the buffer cache, a cell single block read request had to be made.
So, instead of large IOs and full in-cell smart scan processing, we got lots of single block IOs throttling the smart scan performance, which resulted in over 100x worse response time compared to the ideal smart scan.
NB! As I said in the beginning, this is just a software bug. It would be suspicious if this was some sort of a design limitation as it's common to have DW fact tables with over 255 columns, so someone at Oracle would have noticed very early in the testing.
I did some research and asked around and it looks like it's this bug, fixed in Oracle Database Machine Bundle Patch 4:
9291589 - QUERIES ON COMPRESSED TABLES WITH > 255 COLUMNS SLOW WHEN OFFLOAD IS ENABLED
Note that, I haven't had a chance to test this behavior on BP4 yet (the client happened to be on BP3 at the time of stress testing). Regardless the bug description, the performance problem happened both on uncompressed tables and tables compressed with regular table compression and OLTP compression.
The problem didn't happen when running queries on tables compressed with Exadata Hybrid Columnar Compression (EHCC) as in EHCC the row storage format is completely different from the regular row format. (One could say that in EHCC there's column data chaining over multiple blocks, instead of row chaining, but more about this in a future article).
Conclusion
This was an intro to Exadata troubleshooting. Note that I started from measuring what matters, and drilled down from there:
I measured the response time of my "business task" and noticed the difference when selecting 256th column
I then broke down the response time of the session running my business task using Oracle wait interface and noticed the unexpected waits for cell single block reads while smart scan was in use
I gathered additional evidence related to the cell single block reads wait event using Snapper - which in this case was the table fetch continued row statistic which happens
Additionally I was aware of the related details like execution plan which didn't use indexes and the schema physical design (I actually described that fact table and noticed that it had over 255 columns etc)
This was enough to conclude what was the cause for the performance problem and find the bug causing it.
Note that I did not start from measuring "advanced" things like cell CPU or interconnect utilization, querying V$CELL_% views or running CELLCLI stats commands. The performance troubleshooting work should be done from top down also on Exadata and in this case I didn't even have to dig deep into the smart storage stack. Also, Kudos to the Exadata developer or architect who decided to make the storage cells send some of their performance stats back to the database layer during smart scan execution - so that much of the troubleshooting can be done using the top down approach like demonstrated above.
There are couple of more things to note:
If you see TABLE ACCESS STORAGE FULL in an execution plan, it doesn't mean that you're automatically using smart scan, it merely means your table (partitions) accessed reside on Exadata storage
If you see storage() predicate in an execution plan, this means that the execution plan is going to try to take advantage of the smart scan. But this does not mean that this will definitely succeed and give you the intended efficiency and speedup. As you saw above, the smart scan was used in principle, but it was heavily throttled by additional single block reads caused by the database layer when it searched for trailing rowpieces.
You still need to measure where the response time of your task/operation is going and this can reliably be done only with the top-down approach
And last, but not least, you should be more aggressive with patching your Exadata cell & Database software as there are lots of bugs out there, some much more evil than the performance bug I discussed above! ;-)
Questions, comments to:
Tanel Poder