Cluster Restart Crib


Restarting the Cluster

Emergency Restart

Use the following method when something has gone wrong with your RAC environment and the decision has been taken to restart everything (i.e. database, ASM, the whole cluster) ASAP.

Down

Log onto node 03 as root (03 is used as an example).

Run standard checks beforehand:

/u01/app/12.1.0/grid/bin/crsctl check cluster

/u01/app/12.1.0/grid/bin/crsctl status resource -t
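On a healthy node the check reports something along these lines (add -all to get one such block per node; the CRS-45xx message numbers are standard):

CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

Anything not reporting "online" points you at the daemon in trouble before you start.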

Next, bring down the cluster as follows:

If you want to shut down the cluster on both nodes, run the following:

/u01/app/12.1.0/grid/bin/crsctl stop cluster -all -f

If you want to shut down the cluster on a single node (03, in this example), run the following:

/u01/app/12.1.0/grid/bin/crsctl stop cluster -n db03 -f

(NOTE: we are using the 'force' option because there is something wrong with the environment, and we need to bring it down ASAP.)

Confirm everything has come down by running standard checks again:

/u01/app/12.1.0/grid/bin/crsctl check cluster -all

/u01/app/12.1.0/grid/bin/crsctl status resource -t

It doesn't hurt to confirm the DB and ASM are definitely down (especially if we are restarting the cluster because of issues):

ps -ef | grep smon
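A slightly wider net (a sketch; instance names will vary per environment) catches both the database and ASM background processes in one go:

ps -ef | grep -E 'smon|pmon' | grep -v grep

No output means the instances are down; lines like asm_pmon_+ASM1 or ora_smon_<SID> mean something is still up.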

You can also bring down HA (Oracle High Availability Services) at this point:

# crsctl stop has

NOTE: this command will stop HA on the local node only.

Therefore, depending on your situation (you might be doing the restart on one node only, for example), you will need to log onto node 04 to stop it there.
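If root ssh between the nodes is allowed (an assumption; it often isn't), a small loop from one session saves logging onto each node in turn:

for node in db03 db04; do
  ssh root@${node} /u01/app/12.1.0/grid/bin/crsctl stop has
done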

Up

Start everything back up again:

If you took down HA, then simply restarting this will bring up the rest of the cluster:

# crsctl start has

If you didn’t take down HA, then run this command to bring up the clusterware:

/u01/app/12.1.0/grid/bin/crsctl start cluster -all

Confirm everything has come up again:

crsctl check cluster -all

crsctl status resource -t
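If you would rather script the wait than re-run the checks by hand, a rough poll (assuming the standard CRS-4537 message text; adjust the path and sleep to taste):

while ! /u01/app/12.1.0/grid/bin/crsctl check cluster -all 2>/dev/null | grep -q 'CRS-4537: Cluster Ready Services is online'; do
  echo "waiting for cluster..."
  sleep 30
done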

Alternative Commands


NOTE: You can also use the following commands to check/stop/start CRS. Unlike the "crsctl start/stop cluster" commands above, which can act cluster-wide with -all, these work on the local node only, so run them on each node in turn (and "crsctl stop crs" also brings down the High Availability Services stack on that node):

crsctl check crs

crsctl stop crs

crsctl start crs

Controlled Restart

If you are taking down the cluster as part of some maintenance activity (i.e. not an emergency scenario), then it’s better to take the cluster components down using srvctl commands as described in this section.

From [1] – taking the whole cluster down in one go using CRSCTL as in the previous section "can lead to the database instances being stopped similar to shutdown abort, which requires an instance recovery on startup. If you use SRVCTL to stop the database instances manually before stopping the cluster, then you can prevent a shutdown abort, but this requires that you manually restart the database instances after restarting Oracle Clusterware."

Therefore, it is better to use the method below:

Down

Database

Log onto node 03 as user ‘oracle’.

srvctl status database -db ism_dbname

srvctl stop database -db ism_dbname

srvctl status database -db ism_dbname
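srvctl stops the database with the immediate option by default; you can be explicit about the shutdown mode if you prefer (normal, immediate and transactional are the non-abort choices):

srvctl stop database -db ism_dbname -stopoption immediate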

Alternatively, you can do this one instance at a time (e.g. if one node is already down, as we currently have with rpap2):

srvctl status instance -db ism_dbname -instance rpap1

srvctl stop instance -db ism_dbname -instance rpap1

srvctl status instance -db ism_dbname -instance rpap1

-- only if you're doing both nodes:

srvctl status instance -db ism_dbname -instance rpap2

srvctl stop instance -db ism_dbname -instance rpap2

srvctl status instance -db ism_dbname -instance rpap2

ASM

Log onto node 03 as user ‘grid’.

srvctl status asm -n db03

srvctl stop asm -n db03

srvctl status asm -n db03

-- only if you’re doing both nodes:

srvctl status asm -n db04

srvctl stop asm -n db04

srvctl status asm -n db04
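Before stopping ASM it can be worth eyeballing what the instance is serving; asmcmd (run as grid, with the ASM environment set) lists the mounted disk groups:

asmcmd lsdg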

Nodeapps

Still logged on as ‘grid’:

srvctl status nodeapps -node db03

srvctl stop nodeapps -node db03

srvctl status nodeapps -node db03

-- only if you’re doing both nodes:

srvctl status nodeapps -node db04

srvctl stop nodeapps -node db04

srvctl status nodeapps -node db04
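If you are not sure what the nodeapps comprise on your cluster (VIP, network, ONS), you can list the configuration first:

srvctl config nodeapps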

CRS

Log onto node 03 as ‘root’.

/u01/app/12.1.0/grid/bin/crsctl check cluster -all

/u01/app/12.1.0/grid/bin/crsctl status resource -t

If you are taking down the clusterware on both nodes:

/u01/app/12.1.0/grid/bin/crsctl stop cluster -all

If you are taking down the clusterware on one node only:

/u01/app/12.1.0/grid/bin/crsctl stop cluster -n db03

Check everything is now down.

/u01/app/12.1.0/grid/bin/crsctl check cluster -all

/u01/app/12.1.0/grid/bin/crsctl status resource -t

HA

You can also bring down HA (Oracle High Availability Services) at this point:

# crsctl stop has

NOTE: this command will stop HA on the local node only.

Therefore, depending on your situation (you might be doing the restart on one node only, for example), you will need to log onto node 04 to stop it there.

Up

To bring everything up again, follow the steps in the "Emergency Restart" (Up) section; the steps are the same.

Reference

[1] Oracle® Real Application Clusters Administration and Deployment Guide, 12c Release 1 (12.1), E48838-09.



Additional - shutting down one node.

ps -ef | grep pmon

This will show the instances running on this machine/node.

srvctl status database -db datadb -v

This will show the status of the named database, the instances associated with it, and which server each instance is running on.

srvctl status instance -db datadb -instance datadb1 -v

This will show the status of the named instance of that database (an instance name, e.g. datadb1, is required) and any services associated with it.

srvctl stop service -db ods -service ods_1 -instance ods1

This will stop the service named ods_1 running on instance ods1 of database ods.
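When the work is done, the same service can be brought back on that instance (same names as above):

srvctl start service -db ods -service ods_1 -instance ods1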

srvctl stop service -db datadb -service "prodcat,datadb_1" -instance datadb1

This will stop the listed services linked to instance datadb1 of database datadb (the service list is comma-separated, with no spaces).

NB - you may want to disable 'has' (e.g. for patching), otherwise the services will come back up following a reboot.
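A minimal sketch of that disable/enable step (run as root; 'crsctl disable has' only stops the stack auto-starting at boot, it does not bring anything down):

# crsctl disable has

(patch / reboot as required)

# crsctl enable has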





ASM on Linux - quick guide

Using ASM on Linux - a very quick and dirty guide; as always, errors and omissions excepted.

I put this together what seems like ages ago, but it's still useful.

Oracle ASM on 12c seems to be very popular, so here are a few notes if you are interested and want to mess about with it.

Download the Oracle 12c database for your VM (Oracle VirtualBox; this presumes you are already running Linux Release 6.5):

http://www.oracle.com/technetwork/database/enterprise-edition/downloads/database12c-linux-download-1959253.html

linuxamd64_12102_database_1of2.zip
linuxamd64_12102_database_2of2.zip

Don't worry about these just yet; download the grid infrastructure software as well.

Use the 12102 grid release, not the one on the same link as the database download.

No:
linuxamd64_12c_grid_1of2.zip
linuxamd64_12c_grid_2of2.zip

(If you do use the above 12c grid files, you end up with compatibility issues; I never resolved them and ended up upgrading the 12.1.0.1 grid to 12.1.0.2 anyway.)

http://www.oracle.com/technetwork/database/enterprise-edition/downloads/database12c-linux-download-2240591.html

Yes:

linuxamd64_12102_grid_1of2.zip
linuxamd64_12102_grid_2of2.zip

For this example I added two virtual disks in VirtualBox, which I used for the ASM storage. I then used the following web site to create the ASM packages and disks:

** http://pierreforstmanndotcom.wordpress.com/2013/08/15/how-to-install-asmlib-on-oracle-linux-6/
** this link is now "dead" see ASMLIB on Linux 6

Follow the link, and once you get to the point below you can install the grid infrastructure.

[oracle@mydbhost grid]$ oracleasm listdisks

ASM1
ASM2

If you hit the 'resource busy' error:

[root@ora01 /]# oracleasm createdisk DATA /dev/sde1
Unable to open device "/dev/sde1": Device or resource busy

[root@ora01 /]# /usr/sbin/asmtool -C -l /dev/oracleasm -n DATA -s /dev/sde1 -a force=yes

[root@ora01 /]# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
[root@ora01 /]# oracleasm listdisks
DATA



Unzip the grid zip files. I would suggest creating a "grid" user before you run the grid runInstaller, but you can leave it as oracle; up to you.

I simply used the oracle user, but you then need to be careful to set the right Oracle home:

[oracle@mydbhost grid]$ cat /etc/oratab

+ASM:/media/u03/oracle/product/12.1.0/grid_1:N # line added by Agent

MADRID:/media/u02/oracle/product/12.1.0/dbhome_1:N # line added by Agent
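One way to make sure the right home is set before connecting is oraenv, which reads /etc/oratab and prompts for the SID:

[oracle@mydbhost ~]$ . oraenv
ORACLE_SID = [oracle] ? +ASM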




You can change the "Disk Group Name" to anything you want.

I needed to change the "Change Discovery Path" to /dev/oracleasm/disks.

When you have the grid installation working, you need to set up a grid instance (with the environment set to the grid home):


[oracle@mydbhost bin]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on Wed Oct 22 11:26:33 2014

Copyright (c) 1982, 2014, Oracle. All rights reserved.

Connected to an idle instance.

SQL> startup nomount;

ORA-01078: failure in processing system parameters

ORA-29701: unable to connect to Cluster Synchronization Service

SQL> select status from v$instance;

select status from v$instance

*

ERROR at line 1:

ORA-01034: ORACLE not available

Process ID: 0

Session ID: 0 Serial number: 0

You need to run this to get the cluster services working, even though it is a single node:

[root@mydbhost bin]# crsctl start resource -all

……

Start as below.

[oracle@mydbhost ~]$ sqlplus "/as sysasm"

SQL*Plus: Release 12.1.0.2.0 Production on Wed Oct 22 11:47:47 2014

Copyright (c) 1982, 2014, Oracle. All rights reserved.

Connected to:

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

With the Automatic Storage Management option

SQL> shutdown;

ASM instance shutdown

SQL> startup nomount;

ASM instance started

Total System Global Area 1140850688 bytes

Fixed Size 2933400 bytes

Variable Size 1112751464 bytes

ASM Cache 25165824 bytes

ASM diskgroups mounted

SQL>
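At this point you can also double-check that the disk groups really are mounted, with a quick query against v$asm_diskgroup:

SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;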

So once you have this running, when you run dbca for 12c you can pick up the +DATA disk group.

Et voilà...

SQL> select file_name from dba_data_files;

FILE_NAME
--------------------------------------------------------------------------------

+DATA/MADRID/DATAFILE/system.258.861232551
+DATA/MADRID/DATAFILE/sysaux.257.861232347
+DATA/MADRID/DATAFILE/undotbs1.260.861232777
+DATA/MADRID/DATAFILE/users.259.861232775
+DATA/MADRID/DATAFILE/matrix_clob01.db
+DATA/MADRID/DATAFILE/sara.285.861707437



Like I said, very quick and dirty.


Let me know if anything is wrong with this, or of any way you can think of to improve it.





Disaster Recovery

I was asked by a friend the other day if I had ever had a disaster recovery situation in my career.
Well, the answer is not really, but I have been close on a couple of occasions.

On the 12th of April 2002 (I remember clearly it was a Friday afternoon), the Distillex factory in North Shields, about 500 metres from where I worked and about the same distance from the server room, caught fire.

The site had been on fire previously, so we really did not think it was that bad. Well, it was that bad, and when the police declared a major incident we had to leave within minutes, especially when the gas tanks started flying.


It was only when I was driving home, with the specialist chemical firefighters from Middlesbrough on their way up the A19, that I thought, "oh no, the ... server room". With a slightly different wind direction and different gas tanks, I might have been making a lot of phone calls.

A few years later, the Buncefield oil terminal went up near the M1 (right next to a massive business park, great idea), and there must have been a few disaster recovery plans put into action that day. I'm sure it still holds the record for Britain's costliest industrial accident, at around a billion pounds. So if you think it won't happen to you, think again.

Fast forward ten years, and I got a phone call on a Sunday morning: nothing was working. So I logged in and, sure enough, nothing was up except the intranet and the internet site; all a bit strange. At this point my spider sense started tingling, and I asked the person who made the call if he could get hold of the lad on security to walk down and check the server room. Ten minutes later: "can you get yourself in to work?"

I managed to get into work thirty minutes later (fortunately it was a Sunday); the server room was like walking into a greenhouse, with every server beeping away. Despite having the latest server room technology, ultra-sensitive smoke detectors, and massive halon tanks for fire suppression, nobody had thought about heat monitoring. The air conditioning units had failed (long story), and the only server that had stayed up was the one serving the internet/intranet sites. All I can say is we got lucky that day.

So if you have fire suppression in your server room, check you have heat monitoring too, or you might need to smile politely when, months later, your baked disks all start to fail and the engineers are scratching their heads wondering why they are seeing so many failures.