Tips and wisdom from 15 years of DBA experience

Monday, November 10, 2008

11g OCI client query cache

If you are an OLTP DBA, you know the value of what this feature can be. We have all worked with developers to reduce repetitive select queries in order to keep them off of our CPUs, and keep are eyes off of those mounting cache buffers chains latches. For custom apps, working with developers to develop a caching layer can be very rewarding if you have the luxury of time to engage in a project like this. For Java apps using Hibernate, details are hidden away from both you and the app developer, and chances are pretty good that if you want a Java developer to dig into how to control that caching layer, they will not be as enthusiastic about exploring it's innards as you. After all, the entire purpose of a caching layer is to prevent having to deal with data persistence directly. It is my experience that most web developers don't fully appreciate that their applications really are database applications, so concepts like locking, transaction isolation levels, deadlock protection, excess data retrival (wastage) etc. are not given sufficent attention, but I digress. For packaged apps, we are left to throw hardware and/or RAC licenses at the problem - not an easy expenditure to explain to the CEO, especially on top of the licenses for that packaged app.

Enter the promise of 11g OCI client query cache. Whether an app is using native OCI calls, JDBC, or ODBC, or some other driver interface layer, the OCIStmtExecute() and OCIStmtFetch() calls are made at some point during query processing. The OCI client query cache is implemented "below" those calls, so existing apps will continue to function with no changes. I can already hear you saying - "Yeah, that's great but we all know that app cache = db triggers and well...not on my database..." In this new client cache mechanism, Oracle hides the implementation from you, so you do not need to know whether it is implemented using triggers, interprocess communication, or some other proprietary mechanism. All you need to know is that the cache is kept consistent, and your apps are more efficient. The details are probably exposed someplace in the OCI docs, but my suspicion is that they borrowed a page from the Postgres book and implemented something using process signals such as SIGUSR2 as used in Postgres's LISTEN/NOTIFY architecture.

The init.ora param result_cache_mode controls whether all queries are to be cached on the client or only those with the /* + result_cache */ hint added. There are other init.ora parameters that control how large the cache should be, The maximum lag that the cache can have behind any database changes that have occurred, as well as parameters in the client configuration file specific to each client installation.

Only production road-time will determine whether the cache works as well as we hope it does - and if it doesn't, we will submit the appropriate bugs and hopefully they will get it patched in the next release.

"Snapshot too old" is getting old.

ORA-01555: snapshot too old. We have all been there, we have all wrestled with it. We have all tried to solve the problem the wrong way at some point in our careers...admit it. It does not speak well for my constituency of Oracle DBAs that one of the most common Oracle errors is also one of the least well-understood. My theory is that the error is just rare enough that many DBAs do not feel it is necessary to dig into how Oracle's read consistency mechanism works in order to solve a problem that will probably go away the next time the batch job runs anyway. I could go into an explanation, but that would be re-inventing the wheel. Folks, if you manage an Oracle database, take 30 minutes to read about Ora-01555 on - it will save you a lot of time and improve your credibility because you will fully understand why it happens.

"SLRP" - the 4 major OLTP scalability inhibitors

OLTP databases have particular behavior that needs to be tended to. Below I will outline the four major categories that in my experience have been the greatest inhibitors to scalability. This is a holistic approach that will be useful when thinking big picture for your OLTP systems. It is not intended to provide a performance tuning method - that is a solved problem: read Cary Millsap's "Optimizing Oracle Performance."

Transactional databases capture data - sometimes LOTS of data. If your OLTP system is storing transactional data that would best be archived off to a data warehouse, do so. It will improve your mean time to recover during an emergency, and keep your sysadmin happy - or less miserable, at least.

If you often see "latch free" waits in V$SESSION_WAIT, check to see if it is latch# 66. (in Oracle 8i - number changed in 10g and possibly 9i). If so, that is most likely contention for one or more child latches of the "cache buffers chains" latch. Abusing these latches is the fastest way to spike the cpu, because each waiter for the latch spins the cpu in a well-defined algorithm. For high-transaction volume systems, even small reductions in the number of block gets can produce dramatic reductions in cpu load due to the exponential nature of queueing.

You may be over-commiting, or your system may just do a lot of updates, which generate redo. Explore ways to do operations that commonly generate redo as NOLOGGING operations - Inserting /*+ APPEND */ into a temp table, then using some exchange partition trickery to "flip" the tables. If you ever have to recover and roll through a lot of archive logs, you know how important it is to reduce redo production. It is a critical factor in reducing MTTR.

Hard parsing is bad. Use bind variables - I think if Tom Kyte needs to say this once more, he will burst a vein. Soft parsing is better...but the "softer soft parse" that Tom speaks of is the best. What is it? It is a session cursor cache hit. How do you achieve it? execute the same SQL through the *same* Oracle session. That means the same V$SESSION entry, not some other definition that the application thinks a session is. You may want to try re-architecting things to send particular statements through the same Oracle sessions. Using stored procedures/functions can help immensely also when done right.The next time you hit a scalability barrier, chances are it falls into one of these categories.

Treat your archive logs right

Back 'em up
Although this should be part of every backup strategy, backing up your archive logs may require careful thought to create the correct amount of coverage. Let's say you do a Full on Monday then incrementals throughout the week. You store a year of backups, but nightly you remove all archive logs older than 7 days. Disaster strikes on Tuesday, and your most recent full is corrupt. You go back to your prior week's full, apply the incrementals up through Sunday evening. You have deleted the archive logs to bridge you from Sunday to sysdate-7, so that is as far as you can go forward. More careful scrutiny of your first and second lines of defense would have revealed the importance of archive logs as a second line of defense.

Reduce redo, increase security

Audit the processes that generate the most redo on your systems. If you reduce the amount of redo, you will increase the likelyhood of making archive log backups a viable option. You also decrease your mean time to recover. Any DBA who has not been paged in the middle of the night due to a disk hosting archive logs about to fill has not lived - this is another reason for auditing your processes that generate redo.

Export NOLOGGING tables

Make sure that any table marked NOLOGGING truly needs to be. If so, and the data in that table cannot be quickly derived from other data in the database, export the table. That way, even after rolling through archive logs, you will be able to import the tables. If any NOLOGGING operations are performed on a table, after rolling archive across those operations, the "recovered" blocks will be marked corrupt. You will need to truncate that table, then import.