Reprinted with Permission by Quest Software April  2004


High Availability Using Transparent Application
  Failover On Oracle Real Application Clusters

Murali Vallath, Quest Software, Inc

94% availability of a football player indicates that the player has missed one game of a season.  However, 99.97% availability of a system indicates a downtime of 2.5 hours in a year. Today’s business requirements are to meet 99.99% or 99.999% availability, which indicates .5 hours or no downtime. To meet these high availability requirements with no downtime in a year, the factor critical for success is for systems to provide for auto failover when one participating system fails. If this does not occur, when a system or a participating node fails from a cluster, then a considerable amount of time will be used in migrating the user from the failed node to another, (reestablishing the database connection, re-executing the query, user having to browse through the screen to his/her last view etc).  Oracle Real Application Cluster (RAC) provides multiple systems to participate in a clustered configuration, providing continuous availability, and when one of the participating systems fail, the users are migrated to the other system, providing a failover mechanism. With the failover process using the transparent application failover (TAF) option, database connections and processes that had lost connection are reconnected and the failover is transparent to the user.

ORACLE REAL APPLICATION CLUSTERS

Figure 1: Oracle Real Application Clusters

RAC allows multiple Oracle Instances residing on different nodes to access the same physical database. As illustrated in Figure 1, you can access the same database from any of the available participating instances, look at the same set of data and share data between instances via the cluster interconnect. The Global Cache Service (GCS) and Global Enqueue Service (GES) maintain consistency across the caches of the different nodes. RAC protects against either node failure or communication failure to a subset of the nodes. With an RAC implementation, two different setups for availability are possible:

Failover and scalability are very important factors in today’s Internet based economy and business. These two features of the RAC support the reliability/availability factor of the database tier of the enterprise system.

HOW DOES THE FAILOVER MECHANISM WORK?

RAC relies on the cluster manager (CM) of the clustered operating system for failure detection because the CM maintains the heartbeat functions. The CM is a distributed kernel component that monitors whether cluster members can communicate with each other, and enforces the rules of cluster membership. Among the various functions in the management of the cluster, the CM also maintains the heartbeat functions. Using the heartbeat mechanism the CM allows the nodes to communicate with the other nodes that are available on a continuous basis at preset intervals, e.g. 2 seconds on Sun and Linux clusters. When a node, or the communication to a node, fails, the heartbeat from the node does not get through and after a time-out period (configurable on most operating systems), the remaining nodes detect the failure and attempt to reform the cluster. (The heartbeat timeout parameter, like the heartbeat interval, varies from operating system to operating system, while the default heartbeat timeout parameter on Sun clusters is 12 seconds and the default on Linux clusters is 10 seconds). If the remaining nodes form a quorum, the other nodes reorganize the cluster membership. The reorganization process regroups the nodes that are accessible and removes the nodes that have failed. For example, in a 4-node cluster, if one node fails, the CM will re-group among the remaining 3 nodes. The CM performs this step when a node is added to a cluster or a node is removed from a cluster. This information is exposed to the respective Oracle Instances by the LMON process running on each cluster node.

The next step in the failover process is the database recovery operation. During the database recovery process, re-mastering of GCS from the failed Instance, cache recovery (rolling forward all transactions) from the redo logs of the failed Instances and transactions recovery (rolling back all uncommitted transactions) are performed by the remaining nodes. With the redo logs located on a shared disk system (in RAC implementations), the recovery process is even smoother. The Instance that first deducted the failure reads from the redo logs and applies them to the database.

After detection and during the remastering of GCS of the failed Instance and cache recovery, most work in the surviving Instance is paused and while transaction recovery takes place, work occurs at a slower pace. This point is considered full database availability, as now all data is accessible, including that which resided on the failed Instance. The application is now responsible for reconnecting the users and repeating any uncommitted work they have done.

While RAC provides for high availability of the systems, the database servers and the applications that use it, there are various features and options under the RAC umbrella that provides even more support towards achieving the 99.999% availability of today’s Internet based business requirements. RAC allows for multiple systems to participate in a clustered configuration providing continuous availability. When one of the participating systems fails, the users are migrated to the other system, thus providing a failover mechanism.

The best failover is the one that no one notices. Unfortunately, even though Oracle has been architected to recover very quickly, failures can severely disrupt users by dropping connections to the database. Work in progress at the time of the failure is most likely lost. If the user queried 1000 rows from the database and a failure on the node occurred while the user was scrolling through these rows on his/her browser, the failure would cause the user to re-execute the query and browse through these rows again. This disruption could be eliminated for most situations by masking the failure with the transparent application failover (TAF) option.

TRANSPARENT APPLICATION FAILOVER

RAC provides near-continuous availability by hiding failures from end-user clients and application server clients. This provides continuous, uninterrupted data access. TAF reroutes application clients to an available database node in the cluster when the connected node fails. Application clients do not see error messages describing loss of service.

Figure 2: Oracle Transparent Application Failover

In the above configuration, if the users connection to Node 1 dies, their transaction is rolled back but they can continue work without having to manually reconnect. To get a good understanding of how the TAF architecture works, it is helpful to walk through a failover scenario using the earlier example, where a user is querying the database to retrieve 1000 rows from the database. Assume that the user is connected to Node 1/ Instance 1. By following the steps identified in Figure 2:

  1. The heartbeat mechanism between the various nodes in the cluster checks to see if the other node is available and is participating in the cluster configuration.
  2. Let’s assume that a user on his/her terminal performs a query against the database to retrieve 1000 rows from the database via Instance 1.
  3. The initial 500-rows are retrieved from the database via Instance 1 and returned to the user for browsing through his/her interface.
  4. While the user is browsing through the first 500 rows through his browser, Node 1 fails and does not participate in the clustered configuration.
  5. Node 2 checks for the heartbeat of the other participating nodes deducts that Node 1 is not available.
  6. While the user is unaware of the failure and scrolls through the remaining 500 rows on the window. The process try’s to connect to Instance 1 but detects that the Instance is not available.
  7. Using the entries in the tnsnames.ora file the user connection to the other available node is established.
  8. User is now connected to instance 2 on Node 2.
  9. Oracle re-executes the query using the connection on Instance 2 and displays the remaining rows to the user. If the data was available in the buffer cache, the rows are returned back to the user instantaneously. However, if the rows are not available, Oracle has to perform an I/O operation. This would be delayed until the recovery process has completed.

In Figure 2 when Node 1 fails, any SELECT statements that had partially executed on Instance 1 are migrated as part of the failover process and are displayed through Instance 2 when the user process fails over to Node 2. All this happens transparently without any interruption to the user. It should be noted that along with the ‘SELECT’ statement, the following are also failed over:

The benefits that such a feature adds in meeting the high availability requirements in today’s machine critical applications are overwhelming and the first question that arises is this, why did Oracle not introduce such a feature before or why is this feature not available among other databases? Though the mechanism is very useful in meeting today’s high availability requirements, implementing such a feature is complex, basically because the database connections are not stateless. This means that during the moment of failure, the database, the users and the transactions are in a specific state of operation, such as:

It should be noted from the scenarios above that only SELECT statements are failed over from one node to another; transactional statements are not failed over by configuration of TAF. However, transactional statements can programmatically be transferred from Node 1 to Node 2 by proper validation of Oracle returned error messages and taking appropriate actions. This should be a preferred method to avoid any user interruptions as well as keeping the database or system failures transparent to the user. Among the transactional statements, the following do not failover or are not protected when a node fails:

While the failover is in process, it is user friendly to inform the user via the application interface that the activity or command issued may take some time. This information could be communicated by validating the various error messages returned by Oracle as part of the node and connection failure. Some of the common Oracle error codes that should be handled by the application to track and transfer transactional statements include:

TAF CONFIGURATION

Configuring the TAF option involves adding Oracle Net parameters to the tnsnames.ora and, when one of the participating nodes encounters failure, the use of parameter values to ascertain the next step in the failover process. The parameter that drives the TAF option is the FAILOVER_MODE under the CONNECT_DATA section of a connect descriptor. By using one or more the following sub parameters, the full functionality of TAF can be obtained:

 
Parameter Description
BACKUP Specifies a different net service name to be used to establish the backup connection. A backup should be specified when using PRECONNECT to pre-establish connections. Specifying a BACKUP is strongly recommended for BASIC methods; otherwise, reconnection may first attempt the Instance that has just failed, adding additional delay until the client reconnects.
TYPE Specifies the type of failover. Three types of Oracle Net failover functionality are available by default to the Oracle Call Interface.
  • SESSION – Fails over the session. With this option only connection is established no work in progress is transferred from the failed Instance to the available Instance.
  • SELECT – Enables user with open cursors to continue fetching on them after failure. Oracle Net keeps track of any SELECT statements issued in the current transaction. It also keeps track of how many rows have been fetched back to the client for each cursor associated with a SELECT statement. If connection to the Instance is lost, Oracle Net establishes a connection to a backup Instance re-executes the SELECT statements and positions the cursors so the client can continue fetching rows as if nothing had happened. However, no DML operations are transferred.
  • NONE – Default, where no failover functionality is implemented.
METHOD Determines the speed of the failover from the primary to the secondary or backup node.
  • BASIC – Establishes connections at failover time.
  • PRECONNECT – Pre-establishes connections. In this parameter is used, connection to the backup Instance is made at the same time as the connection to the primary Instance.
RETRIES Specifies the number of times to attempt to connect to the BACKUP node after a failure before giving up.
DELAY Specifies the amount of time in seconds to wait between attempts to connect to the BACKUP node after a failure before giving up.

Another important parameter, or value, that should not be configured manually is the GLOBAL_DBNAME parameter in the SID_LIST_listener_name section of the listener.ora. Configuring this parameter in listener.ora disables TAF. If the GLOBAL_DBNAME parameter has been defined, the parameter should be deleted and the database should be allowed to dynamically register the global database names automatically.

TAF IMPLEMENTATION

The TAF option can be implemented in one of three ways:

Through the examples below the various implementations options are explained.

CONNECT-TIME FAILOVER AND CLIENT LOAD BALANCING

Oracle Net connects randomly to one of the listener addresses on node ORADB3 or ORADB4. If the Instance fails after the connection, Oracle Net fails over to the other node’s Instance, preserving any SELECT statements in progress.

SUMSKYDB =
 (DESCRIPTION =
  (ADDRESS_LIST =
   (LOAD_BALANCE=ON)
   (FAILOVER=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=ORADB3)(PORT=1521))  
    (ADDRESS=(PROTOCOL=TCP)(HOST=ORADB4)(PORT=1521))
   )
  (CONNECT_DATA =
     (SERVICE_NAME  = SUMSKYDB.SUMMERSKYUS.COM)
     (FAILOVER_MODE = (TYPE=SELECT)(METHOD=BASIC)
      )
   )
)
RETRYING A CONNECTION

With the RETRIES and DELAY parameters as part of the failover node sub parameter, the connections to the nodes are automatically retried to the number of times specified by the parameter. In this scenario the connection is retried 20 times with a delay of 15 seconds between every retry. Unlike the other option where one node in the cluster fails and the connection is reestablished on one of the other surviving nodes, under this option, the connection is retried on the same node and no backup node is defined as part of the configuration. Similarly there is no significance to the load-balancing parameter and has been set to off.

SUMSKYDB =
 (DESCRIPTION =
  (ADDRESS_LIST =
   (LOAD_BALANCE=OFF)
   (FAILOVER=ON)
   (ADDRESS=(PROTOCOL=TCP)(HOST=ORADB3)(PORT=1521))
  )
 (CONNECT_DATA =
     (SERVICE_NAME = SUMSKYDB.SUMMERSKYUS.COM)
     (FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15)
      )
  )
 )

PRE-ESTABLISHING A CONNECTION

Another implementation option available under the TAF configuration is to setup a pre-established connection to the backup or secondary Instance. One of the potential performance issues is the time required to re-establish a connection after the primary, or the first, node has failed to the backup or secondary node. This could be resolved by pre-establishing connections, which means that the initial and backup connections are explicitly specified. While there is a great advantage in pre-establishing the connection, this is not without any drawbacks. Pre-established connections consume resources. During testing processes experienced by the author, up to 2% of additional CPU usage when using pre-established connections has been noted. This is because the process always validates the connection throughout its activity. In the following example, the Oracle Net connects to the listener on ORADB3. If ORADB3 fails after the connection, Oracle Net fails-over to ORADB4, preserving any SELECT statements in progress. Having the backup connection already in place can reduce the time needed for a failover in the event that one needs to take place. However, if the connect to the failed Instance does not succeed at connect time, then fail back to the original Instance is not possible. This option increases resource overhead in that the backup connection is always maintained.

SSRAC3 =
 (DESCRIPTION =
  (ADDRESS =(PROTOCOL=TCP)(HOST=ORADB3)(PORT=1521))
  (CONNECT_DATA =
     (SERVICE_NAME = SUMSKYDB.SUMMERSKYUS.COM) 
     (INSTANCE_NAME = RAC3)
     (FAILOVER_MODE = 
  (BACKUP=SSRAC4)(TYPE=SELECT)(METHOD=PRECONNECT)
      )
    )
  )
SSRAC4=
 (DESCRIPTION =
  (ADDRESS =(PROTOCOL=TCP)(HOST=ORADB4)(PORT=1521))
   (CONNECT_DATA =
     (SERVICE_NAME = SUMSKYDB.SUMMERSKYUS.COM)
     (INSTANCE_NAME = RAC4)
     (FAIOVER_MODE =
 (BACKUP=SSRAC3)(TYPE=SELECT)(METHOD=PRECONNECT)
)
    ) 
  )

TAF VERIFICATION

Implementation of TAF could be verified by querying the data dictionary views. V$SESSION has three columns FAILOVER_MODE, FAILOVER_TYPE and FAILED_OVER that provides information pertaining to TAF implementation and verification of results when the node in the cluster crashes and fails over to one of the available nodes.

SELECT MACHINE, FAILOVER_TYPE, FAILOVER_METHOD, FAILED_OVER
FROM V$SESSION

The above query provides the details and status of the failover operation. If there were many sessions on Node 1 it would be good to look at the details by grouping the results. The following query gives a consolidated count on the operation.

SELECT MACHINE, FAILOVER_TYPE, FAILOVER_METHOD, FAILED_OVER, COUNT (*)
FROM V$SESSION 
GROUP BY MACHINE, FAILOVER_TYPE, FAILOVER_METHOD, FAILED_OVER

CONCLUSION

With hardware level clustering and the proper usage of high availability solutions such as Oracle Real Application Clusters, users have a minimum impact when a system in the cluster is lost. However, with the usage of TAF option, a higher availability of the system is possible to meet today’s business requirements. This option provides the smooth transition from the primary Instance to the secondary Instance with minimal or no interruption to the user when connections to the primary Instance is lost.


ABOUT THE AUTHOR

Murali Vallath has over 17 years of experience designing and developing databases. He is an Oracle Certified Database Administrator. He currently works as a Senior Oracle Technologist for Quest Software, USA. His specialty area is with Oracle Parallel Server/Oracle Real Application Clusters. He has presented at various conferences (Oracle World, EOUG, UKOUG, IOUG) and has published papers in leading publications such as SELECT and Oracle Scene.

He is the author of the book titled ‘Oracle Real Application Clusters’ (Publisher: Digital Press). Murali is the President of the new formed RAC –SIG and the Charlotte Oracle Users Group (www.cltoug.org). Murali can be reached at murali.vallath@quest.com

REFERENCES

  1. High availability using transparent application failover on Oracle Real Application Clusters – Murali Vallath – Oracle Open World 2001
  2. Oracle 9i and Oracle 8i Documentation Set