Disaster Recovery Planning

Web server failure: Upon failure of a web server, the server will be removed from the load balancer and reassigned from the forward-facing security group to a restricted security group (preventing access to the private subnet and outbound internet connections). The server will then be troubleshot through a review of the event logs to determine the type of failure. Once the type of failure is determined, the Amazon Machine Images (AMIs) will be inspected to determine their vulnerability and susceptibility to that failure. Web servers using an affected image will be patched. A review of network ACLs and security groups will be performed, along with port scans and application vulnerability testing.
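The first two isolation steps above could be scripted roughly as follows. This is a minimal sketch, not the plan's actual tooling: it assumes a Classic Load Balancer and boto3-style `elb` and `ec2` clients passed in by the caller, and the function name and all identifiers (instance ID, load balancer name, security group ID) are hypothetical.

```python
def isolate_web_server(elb, ec2, instance_id, load_balancer_name, restricted_sg_id):
    """Take a failed web server out of rotation and lock it down.

    `elb` and `ec2` are assumed to be boto3 clients, e.g.
    boto3.client("elb") and boto3.client("ec2").
    """
    # Step 1: remove the instance from the load balancer so it receives no traffic.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=[{"InstanceId": instance_id}],
    )
    # Step 2: replace all of the instance's security groups with the
    # restricted group (assumed to block the private subnet and outbound internet).
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[restricted_sg_id],
    )
```

Passing the clients in as arguments keeps the function easy to exercise against stand-in objects before it is pointed at a live account.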

Database (RDS) failure: Upon failure of the database, a database snapshot and concurrent reboot will be initiated. An RDS reboot will incur approximately 2 minutes of downtime. During this time, Amazon security logs and database logs will be reviewed. If a security breach cannot be ruled out, the master password will be changed from within AWS RDS, and all other security credentials will be revoked and re-issued. Affected functions, procedures, tables and views will be reviewed and restored from the latest RDS snapshot. If a snapshot is not available or is also affected, the affected areas will be restored from the offsite backup. A review of network ACLs and security groups will be performed, along with port scans and application vulnerability testing. Should the RDS failure persist beyond one hour, applications, services, and web servers will be redirected to the offsite database. The offsite database will be restored to the latest image if it does not already have it. Upon resolution of the Amazon RDS failure, the applications, services, and web servers will be redirected back to Amazon RDS no sooner than 24 hours later, on the following Monday between 00:00 EST and 04:00 EST, because the switch may interrupt service to clients a second time. Once the decision to move services back to Amazon has been made, each administrator will be updated on the plan.
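The failback timing rule above (at least 24 hours after resolution, then the following Monday between 00:00 and 04:00 EST) can be worked out mechanically. The sketch below is one reading of that rule, with assumptions: EST is a fixed UTC-5 offset with no daylight saving adjustment, the Monday window must begin at or after the end of the 24-hour wait, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset: the plan states the window in EST, so daylight
# saving time is deliberately ignored here (an assumption).
EST = timezone(timedelta(hours=-5), "EST")


def earliest_failback_window(resolved_at):
    """Return (start, end) of the first Monday 00:00-04:00 EST window that
    begins at least 24 hours after `resolved_at`.

    `resolved_at` must be a timezone-aware datetime.
    """
    earliest = resolved_at.astimezone(EST) + timedelta(hours=24)
    # Midnight of the day on which the 24-hour wait ends.
    midnight = earliest.replace(hour=0, minute=0, second=0, microsecond=0)
    days_to_monday = (7 - midnight.weekday()) % 7  # Monday is weekday 0
    start = midnight + timedelta(days=days_to_monday)
    if start < earliest:
        # That Monday's midnight falls before the wait ends; use the next Monday.
        start += timedelta(days=7)
    return start, start + timedelta(hours=4)
```

For example, an outage resolved on Wednesday 3 January 2024 at 10:00 EST yields a window starting Monday 8 January at 00:00 EST, while one resolved on Sunday 7 January at 12:00 EST is pushed to Monday 15 January, because the 24-hour wait ends after the 8 January window has opened.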

Applications/Services failure: Upon failure of an application or service (e.g., Flight Tracker coordinate gathering), the application/service will be stopped. Security logs will then be reviewed to determine the extent of the failure. Once the failure has been corrected, the application/service will be restarted. A review of network ACLs and security groups will be performed, along with port scans and application vulnerability testing.

Amazon failure: Upon complete failure of the Amazon Government resources, we will contact Amazon to determine the extent of the outage and their progress towards resolution. If the outage is likely to persist longer than one hour, we will redirect traffic to the offsite location via a DNS change. During the DNS TTL period, the offsite database will be restored to the latest image if it is not already current. Upon resolution of the Amazon failure, DNS will be pointed back to Amazon EC2 no sooner than 24 hours later, on the following Monday between 00:00 EST and 04:00 EST, because the switch may interrupt service to clients a second time. Once the decision to move services back to Amazon has been made, each administrator will be updated on the plan.
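The one-hour failover threshold above can be expressed as a simple check. This is an illustrative sketch only: the function name is hypothetical, and it assumes that "likely to persist longer than one hour" means the time already elapsed plus the remaining time estimated by Amazon.

```python
from datetime import timedelta

# "Longer than one hour" in the plan.
FAILOVER_THRESHOLD = timedelta(hours=1)


def should_fail_over(elapsed, estimated_remaining):
    """Return True when the outage is expected to exceed one hour in total."""
    return elapsed + estimated_remaining > FAILOVER_THRESHOLD
```

Twenty minutes of outage with a 45-minute estimate to resolution would trigger the DNS change, whereas an outage expected to total exactly one hour would not.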

Upon determination of a breach, all clients will be notified within 24 hours of the breach. All service administrators will be updated periodically with the cause (if known), the work being performed and the resolution of the issue.

 
