Troubleshooting Exadata v2 Smart Scan Performance

Introduction

In a recent stress-testing exercise of a large DW on Oracle Exadata v2 I ran into an interesting performance issue and thought I'd share it.

Note that the problem happens due to a bug in the Oracle database layer code, not because of an Exadata design limitation. This bug should be fixed in the Oracle 11.2 Exadata Bundle Patch 4 (BP4), but more about this later.

I'm publishing this article to show that systematic Exadata performance troubleshooting techniques aren't fundamentally different from existing non-Exadata techniques, especially the starting point of troubleshooting. After all, Exadata is just an Oracle Database which happens to be able to communicate intelligently with its storage to avoid moving unnecessary amounts of data around. Hopefully this article also helps someone avoid this issue in the future, or at least diagnose it faster when it happens.

Test Environment

So, I was stress-testing a Sun Oracle Database Machine (a full rack Exadata v2 with SAS disks) to the extreme, to learn its limitations and see where it breaks. Any piece of software/hardware is breakable, including the Oracle Unbreakable Linux based Exadata cluster ;-). You want to discover these issues so you can either fix them, work around them, or ensure that you never drive your workload to such extremes or situations where things break.

The database was loaded with data from an existing Oracle DW getting close to 100TB in size; the data consisted of telecom CDRs and the like. Extracting (from a SPARC platform), transferring, compressing (!) and loading the data fast enough was a challenge of its own; I'll write about that in the future.

Fact tables were date-range partitioned and hash sub-partitioned.

No fact tables had any indexes (not even in the pre-Exadata configuration) as the storage subsystem had lots of bandwidth (including more than just two HBAs) and the schema was designed to utilize parallel full scans together with partition pruning to reduce the amount of data to scan.
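To give an idea of the layout, here is a minimal sketch of such a fact table definition, relying purely on partition pruning and full scans. The table name, columns and partition bounds below are made-up examples for illustration, not the client's actual schema:

-- Hypothetical example: date-range partitioned, hash sub-partitioned fact table with no indexes
CREATE TABLE cdr_fact (
  call_date     DATE          NOT NULL,
  subscriber_id NUMBER        NOT NULL,
  duration_sec  NUMBER,
  padding       VARCHAR2(100)
)
PARTITION BY RANGE (call_date)
SUBPARTITION BY HASH (subscriber_id) SUBPARTITIONS 16
(
  PARTITION p_2010_06 VALUES LESS THAN (DATE '2010-07-01'),
  PARTITION p_2010_07 VALUES LESS THAN (DATE '2010-08-01')
);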

One of the main fact tables had over 400 columns in it, that plays a role with the bug I hit.

The Problem

I noticed that one of the queries got much slower when some columns were added to the select (projection) list.  When I removed these columns, the query completed fast again. Sometimes such behavior happens if with fewer columns selected, Oracle may be able to get all these columns from an index and avoid going to the table after all. In our case, the table didn't have any indexes anyway and the execution plan didn't change either.

I reproduced the same problem with serial execution, using a test table I created (this test table also had over 400 columns).
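I'm not reproducing the exact test table DDL here, but a minimal sketch like the following builds a similar wide table with CTAS (the 400-column count, the zero column values and the 100k row count are my assumptions for illustration):

-- Hypothetical sketch: CTAS a table with 400 numeric columns plus a padding column,
-- so every row has more than 255 columns and must be stored as multiple row pieces.
DECLARE
  v_cols VARCHAR2(32767);
BEGIN
  FOR i IN 1 .. 400 LOOP
    v_cols := v_cols || '0 AS col_' || i || ', ';
  END LOOP;
  EXECUTE IMMEDIATE
    'CREATE TABLE t AS SELECT ' || v_cols ||
    ' CAST(''x'' AS VARCHAR2(30)) AS padding' ||
    ' FROM dual CONNECT BY level <= 100000';
END;
/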

Check the response times below:

SQL> select sum(col_254) from t where padding = 'x';

SUM(COL_254)
------------
           0

Elapsed: 00:00:00.46

SQL> select sum(col_255) from t where padding = 'x';

SUM(COL_255)
------------
           0

Elapsed: 00:00:00.48

SQL> select sum(col_256) from t where padding = 'x';

SUM(COL_256)
------------
           0

Elapsed: 00:00:55.87

Wow, there was an over 100x difference in response time when all I did was select the 256th column of the table instead of an "earlier" column!

The execution plan in both cases was the same and showed the storage predicate, which indicates that the database attempts to do a smart scan (by offloading a predicate to the storage cell):

----------------------------------------------
| Id  | Operation                  | Name    |
----------------------------------------------
|   0 | SELECT STATEMENT           |         |
|   1 |  SORT AGGREGATE            |         |
|*  2 |   TABLE ACCESS STORAGE FULL| T       |
----------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("PADDING"='x')
       filter("PADDING"='x')
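By the way, you don't need anything special to see these storage predicates; a plain DBMS_XPLAN call against the cursor shows them (a sketch, run right after executing the query in the same session):

-- Show the plan of the last statement executed in this session, including the
-- storage() predicates that are candidates for offloading to the cells.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);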

The logical next step was to measure where the extra time was spent and I did it with my Snapper tool:

SQL> @snapper all 5 1 869
Sampling SID 869 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v3.13 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

----------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                               ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------
    869, DWH       , STAT, session logical reads                   ,         43871,      8.77k,
    869, DWH       , STAT, cluster wait time                       ,            84,       16.8,
    869, DWH       , STAT, user I/O wait time                      ,           302,       60.4,
    869, DWH       , STAT, non-idle wait time                      ,           387,       77.4,
    869, DWH       , STAT, non-idle wait count                     ,         12796,      2.56k,
    869, DWH       , STAT, physical read total IO requests         ,          3714,      742.8,
    869, DWH       , STAT, physical read total multi block requests,            44,        8.8,
    869, DWH       , STAT, physical read requests optimized        ,          2396,      479.2,
    869, DWH       , STAT, physical read total bytes               ,      56926208,     11.39M,
    869, DWH       , STAT, cell physical IO interconnect bytes     ,      52874336,     10.57M,
    869, DWH       , STAT, gcs messages sent                       ,          3035,        607,
    869, DWH       , STAT, consistent gets                         ,         43827,      8.77k,
    869, DWH       , STAT, consistent gets from cache              ,         40551,      8.11k,
    869, DWH       , STAT, consistent gets from cache (fastpath)   ,         36889,      7.38k,
    869, DWH       , STAT, consistent gets direct                  ,          3281,      656.2,
    869, DWH       , STAT, physical reads                          ,          6949,      1.39k,
    869, DWH       , STAT, physical reads cache                    ,          3667,      733.4,
    869, DWH       , STAT, physical reads direct                   ,          3281,      656.2,
    869, DWH       , STAT, physical read IO requests               ,          3713,      742.6,
    869, DWH       , STAT, physical read bytes                     ,      56918016,     11.38M,
    869, DWH       , STAT, free buffer requested                   ,          3662,      732.4,
    869, DWH       , STAT, gc local grants                         ,           632,      126.4,
    869, DWH       , STAT, gc remote grants                        ,          3028,      605.6,
    869, DWH       , STAT, gc CPU used by this session             ,            53,       10.6,
    869, DWH       , STAT, cell physical IO bytes eligible for pred,      26877952,      5.38M,
    869, DWH       , STAT, cell physical IO interconnect bytes retu,      22826080,      4.57M,
    869, DWH       , STAT, cell session smart scan efficiency      ,           -35,         -7,
    869, DWH       , STAT, no work - consistent read gets          ,         40425,      8.09k,
    869, DWH       , STAT, table scan rows gotten                  ,         68618,     13.72k,
    869, DWH       , STAT, table scan blocks gotten                ,          3119,      623.8,
    869, DWH       , STAT, table fetch continued row               ,          3660,        732,
    869, DWH       , STAT, cell blocks processed by cache layer    ,          5428,      1.09k,
    869, DWH       , STAT, cell blocks processed by txn layer      ,          5428,      1.09k,
    869, DWH       , STAT, cell blocks processed by data layer     ,          3625,        725,
    869, DWH       , STAT, cell simulated session smart scan effici,      29696000,      5.94M,
    869, DWH       , STAT, cell IO uncompressed bytes              ,      29696000,      5.94M,
    869, DWH       , STAT, buffer is not pinned count              ,         40164,      8.03k,
    869, DWH       , STAT, cell flash cache read hits              ,          2374,      474.8,
    869, DWH       , TIME, DB CPU                                  ,       1036843,   207.37ms,    20.7%, |@@@       |
    869, DWH       , TIME, sql execute elapsed time                ,       3903062,   780.61ms,    78.1%, |@@@@@@@@  |
    869, DWH       , TIME, DB time                                 ,       3903062,   780.61ms,    78.1%, |@@@@@@@@  |
    869, DWH       , WAIT, gc cr grant 2-way                       ,        829611,   165.92ms,    16.6%, |@@        |
    869, DWH       , WAIT, gc cr grant congested                   ,           363,     72.6us,      .0%, |          |
    869, DWH       , WAIT, cell smart table scan                   ,          1503,    300.6us,      .0%, |          |
    869, DWH       , WAIT, cell single block physical read         ,       3008521,    601.7ms,    60.2%, |@@@@@@    |
    869, DWH       , WAIT, events in waitclass Other               ,          8141,     1.63ms,      .2%, |          |
--  End of Stats snap 1, end=2010-07-01 14:35:58, seconds=5

-----------------------------------------------------------------------
Active% | SQL_ID          | EVENT                     | WAIT_CLASS
-----------------------------------------------------------------------
    57% | fvq6t9hy7dts3   | cell single block physica | User I/O
    26% | fvq6t9hy7dts3   | gc cr request             | Cluster
    17% | fvq6t9hy7dts3   | ON CPU                    | ON CPU
--  End of ASH snap 1, end=2010-07-01 14:35:58, seconds=5, samples_taken=46

What the hell? We are spending over 50% of the response time doing single block IOs? (That's what the cell single block physical read wait event means.) Remember the execution plan above: we are doing a single full table scan, with no joins, no indexes and no other fancy stuff.

We are still running a smart scan, as we see a small amount of time spent waiting for the cell smart table scan wait event (and an earlier snapper snapshot, which captured the start of the query, also reported the cell scans counter going up).

A few more counters are of interest here, but what gives us a good clue about what's happening is the table fetch continued row statistic, about which I've written before. It says my session was looking up the next row piece of a chained row it had hit, 732 times per second. Now when you look at the physical read total IO requests statistics above, they confirm that most of the IO requests done are single block IOs (physical read total IO requests - physical read total multi block requests = single block IO requests). In fact, 742.8 - 8.8 = 734 single block IOs were done per second, which is suspiciously close to the 732 table fetch continued row fetches per second.
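If you don't have Snapper at hand, the same comparison can be made with a quick query against V$SESSTAT (a sketch; :sid is the session under investigation):

-- Compare single block read requests with chained-row piece fetches for one session.
SELECT sn.name, st.value
FROM   v$sesstat st
JOIN   v$statname sn ON sn.statistic# = st.statistic#
WHERE  st.sid = :sid
AND    sn.name IN ('physical read total IO requests',
                   'physical read total multi block requests',
                   'table fetch continued row');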

This could be a coincidence, but the table fetch continued row statistic, which always appeared together with the problematic single block IOs, was a very good indicator of what was happening. Before I explain it, see a snapper run from a test case where I selected column number 255 (instead of 256 as above):

SQL> @snapper all 30 1 584
Sampling SID 584 with interval 30 seconds, taking 1 snapshots...

-- Session Snapper v3.13 by Tanel Poder @ E2SN ( http://tech.e2sn.com )

----------------------------------------------------------------------------------------------------------------------
    SID, USERNAME  , TYPE, STATISTIC                               ,         DELTA, HDELTA/SEC,    %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------
... some output snipped...
    584, DWH       , TIME, DB CPU                                  ,       1672745,    55.76ms,     5.6%, |@         |
    584, DWH       , TIME, sql execute elapsed time                ,      30056983,         1s,   100.2%, |@@@@@@@@@@|
    584, DWH       , TIME, DB time                                 ,      30056983,         1s,   100.2%, |@@@@@@@@@@|
    584, DWH       , WAIT, enq: KO - fast object checkpoint        ,        257277,     8.58ms,      .9%, |@         |
    584, DWH       , WAIT, cell smart table scan                   ,      26442350,   881.41ms,    88.1%, |@@@@@@@@@ |
    584, DWH       , WAIT, events in waitclass Other               ,       1163764,    38.79ms,     3.9%, |@         |
--  End of Stats snap 1, end=2010-06-28 16:38:12, seconds=30

-----------------------------------------------------------------------
Active% | SQL_ID          | EVENT                     | WAIT_CLASS
-----------------------------------------------------------------------
    87% | bapnmnkpqjn71   | cell smart table scan     | User I/O
     9% | bapnmnkpqjn71   | ON CPU                    | ON CPU
     2% | bapnmnkpqjn71   | reliable message          | Other
     2% | bapnmnkpqjn71   | gcs drm freeze in enter s | Other
--  End of ASH snap 1, end=2010-06-28 16:38:12, seconds=30, samples_taken=97

This is what normal smart scan waits should look like: no single block IOs, and most of the time spent either waiting for cell smart table scan or, of course, running on CPU. Note that for simplicity I ran this test case in serial execution mode, which is still able to use serial direct path reads and smart scans.

The Resolution

After seeing the table fetch continued row statistic combined with the unexpected single block reads, it was clear that this problem happened due to row chaining, as that statistic is about chained rows (by the way, migrated rows are just one variation of chained rows). However, migrated rows don't cause that statistic to increase if you are using a full table scan to access the table. And I had just created that table with CTAS, so there couldn't be any migrated rows anyway.

Such unexpected single block reads due to chained rows can easily show up in non-Exadata environments too when running table scans, but the rows both in the original fact table and in my narrowed-down test case were much smaller than 8kB, so they all fit into their blocks entirely, without any need for physical chaining to other blocks.

This is where the number of columns in the table comes into play: the slowdown appeared exactly when crossing from column 255 to column 256. Hmm, 255 is a very familiar number :-)

Oracle stores the number of physical columns present in a row in a single byte in the row header. Therefore a physical row piece can have only up to 255 columns in it!

Wait, but how come I can create a table with over 400 columns then? And how can Oracle support up to 1000 columns in a table?

Well, up to Oracle 7.3 you couldn't have more than 255 columns per table. Support for 1000 columns was added in Oracle 8.0, and Oracle engineers didn't want to change the row structure just for that (the reasons being backward compatibility and allowing migrations without having to update all your existing data).

So, the clever solution was to just chain that 400-column row into two row pieces, just as is done when a single row is too big to fit into one block. But in our case the chaining into two row pieces was done so that 255 columns went into the first row piece and the rest into the next row piece. Every row piece header has its own column count byte! So, if you need to store a 260-column row, you'll need 2 row pieces, and if you want to store a 1000-column row, you'll need 4 row pieces.

So if the total length of the row actually does fit into a single block, but multiple row pieces are needed to support more than 255 columns, then Oracle doesn't need to chain the next row piece(s) into different blocks from the original one; it will just put the next row piece into the same block as the head piece. This is called intra-block row chaining.
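You can see this for yourself with a block dump (a sketch; substitute your own file and block numbers, which you can get from a sample ROWID):

-- Find the datafile and block of one row of the test table...
SELECT dbms_rowid.rowid_relative_fno(rowid) AS file_no,
       dbms_rowid.rowid_block_number(rowid) AS block_no
FROM   t
WHERE  rownum = 1;

-- ...then dump that block to a trace file. Every row piece in the dump has its own
-- header with a flag byte (fb:) and column count (cc:), so a row with more than 255
-- columns shows up as multiple row pieces even when they all live in the same block.
ALTER SYSTEM DUMP DATAFILE &file_no BLOCK &block_no;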

And the rest is simple: apparently Exadata smart scan can't handle tables with more than 255 columns well, right? My test case clearly showed that when selecting a column at position 256 or higher, the Oracle database started performing lots of single block IOs instead of offloading everything to the Exadata cells. Apparently the database software (or the Exadata cell server) did not realize that the trailing row pieces were chained into the same blocks as the head pieces and could have been transparently read and concatenated in the cell during the smart scan, instead of resorting to the usual fetch-the-next-row-piece-with-a-separate-single-block-read-whenever-a-chained-row-is-found approach executed in the database layer. For every row piece returned by the smart scan, the database layer had to do an additional logical IO to fetch the remaining row piece, and when the required buffer didn't happen to be in the buffer cache, a cell single block read request had to be made.

So, instead of large IOs and full in-cell smart scan processing, we got lots of single block IOs throttling the smart scan performance, which resulted in over 100x worse response time compared to the ideal smart scan.
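If you want to check whether any of your own tables could run into this kind of intra-block chaining, a simple dictionary query is enough (a sketch using DBA_TAB_COLUMNS; adjust the owner filter for your environment):

-- List tables (and views) with more than 255 columns; rows in such tables are always
-- stored as multiple row pieces, even when the whole row fits into a single block.
SELECT owner, table_name, COUNT(*) AS column_count
FROM   dba_tab_columns
WHERE  owner NOT IN ('SYS', 'SYSTEM')
GROUP  BY owner, table_name
HAVING COUNT(*) > 255
ORDER  BY column_count DESC;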

NB! As I said in the beginning, this is just a software bug. It would be suspicious if this were some sort of design limitation, as it's common to have DW fact tables with over 255 columns, so someone at Oracle would have noticed this very early in testing.

I did some research and asked around, and it looks like it's a known bug, fixed in Oracle Database Machine Bundle Patch 4.

Note that I haven't had a chance to test this behavior on BP4 yet (the client happened to be on BP3 at the time of the stress testing). Regardless of the bug description, the performance problem happened both on uncompressed tables and on tables compressed with regular table compression or OLTP compression.

The problem didn't happen when running queries on tables compressed with Exadata Hybrid Columnar Compression (EHCC), as in EHCC the row storage format is completely different from the regular row format. (One could say that in EHCC there's column data chaining over multiple blocks instead of row chaining, but more about this in a future article.)
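For reference, the different compression variants can be created with something along these lines (a sketch; the table names are made up and QUERY HIGH is just one of the EHCC levels):

-- The basic and OLTP compressed copies of the wide table still showed the problem;
-- the EHCC copy did not, as its row storage format is completely different.
CREATE TABLE t_basic COMPRESS                AS SELECT * FROM t;
CREATE TABLE t_oltp  COMPRESS FOR OLTP       AS SELECT * FROM t;
CREATE TABLE t_ehcc  COMPRESS FOR QUERY HIGH AS SELECT * FROM t;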

Conclusion

This was an intro to Exadata troubleshooting. Note that I started from measuring what matters, the response time and where it was spent, and drilled down from there.

This was enough to conclude what the cause of the performance problem was and to find the bug causing it.

Note that I did not start by measuring "advanced" things like cell CPU or interconnect utilization, querying V$CELL_% views or running CellCLI stats commands. Performance troubleshooting work should be done top down on Exadata as well, and in this case I didn't even have to dig deep into the smart storage stack. Also, kudos to the Exadata developer or architect who decided to make the storage cells send some of their performance stats back to the database layer during smart scan execution, so that much of the troubleshooting can be done using the top down approach demonstrated above.

There are a couple more things to note:

And last, but not least, you should be more aggressive with patching your Exadata cell & Database software as there are lots of bugs out there, some much more evil than the performance bug I discussed above! ;-)

Questions, comments to:

Tanel Poder

tanel@tanelpoder.com