It’s been almost two months since Hurricane Sandy hit the Northeast and changed many people’s lives for years to come. About a year earlier, in 2011, we were all preparing for Sandy’s baby sister, Irene. I clearly remember the amount of hype the media produced at that time, as the majority of people in New York had never had to prepare for a hurricane before. Irene came and went with about the same effect as a strong rainstorm, and I remember hearing a lot of backlash toward the media for blowing this “hurricane” out of proportion. From what I remember, none of my clients suffered any outages, but Irene did prepare many IT departments for the real deal, which would come one year later.
October 26th, 2012 is the day that I started to take Sandy seriously. Just about all weather radar reports on TV started showing an incredible storm forming in the south Atlantic with a projected path of hitting NY and NJ sometime after Halloween. As the trusted IT advisors for our clients, our firm had the responsibility of preparing their IT infrastructures and DR plans as if we were 100% sure that this hurricane would hit us directly. For many years my clients in lower Manhattan had brought up the potential flood scenario; some took action and prepared, while others deferred it to worry about another time. At this point we started sending notices to our clients to prepare all the necessary tools and to make sure all respective DR/BCP plans were understood and ready for execution.
Just like Irene, Sandy came and went, but this time it left a trail of damage that the Northeast never expected or in any way prepared for. This time the media was careful not to overblow the storm but was firm that everyone should be prepared, which I think saved many lives. Many of the firms we consult for prepared to push the big red button but ultimately did not have to. Several clients proactively executed their DR plans, while a few sat dead in the water due to a total lack of investment in DR.
A particular client of mine took on major damage since their office is located in the last building on the tip of lower Manhattan. The building’s basement and lobby were flooded with millions of gallons of seawater, and the firm had to vacate its office. In fact, they still have not moved back in due to the lack of stable infrastructure services from the outside that the building depends on. A Financial Services firm always looks at risks in all areas of the operation, so they decided to continue operations at a temporary office space to ensure that no critical trading functions are disrupted while the building infrastructure is being repaired. The firm was prepared to utilize backup IT services via their Singapore office but encountered some struggles. The combination of latency and bandwidth limitations that comes by default when moving packets of data across the world hindered operations in a major way. Limited CPU horsepower also prevented critical jobs from completing within normal thresholds. This prompted us to physically pluck all of the technology from the 20th floor of their Manhattan office, carry it down the stairs, and load it into a car for delivery to and setup at a co-location facility in upstate New York.
Now that I have had some time to analyze and digest the procedures and outcomes from many different clients, I want to share some of the most important lessons that were reinforced and some that were newly learned.
- HAVING A FORMAL DR/BCP PLAN – Sounds like a no-brainer, right? Wrong! A majority of firms either have no plan whatsoever or have one that is only partially completed. Writing a plan can be a daunting task, and it is not something that should be learned on the fly. I recommend hiring an outside firm that has seen many of the hurdles and mistakes others have made.
- MAKE FRIENDS WITH THE BUILDING MANAGER – For my downtown client, many years of relationship building with the building’s management gave us heavy leverage even when the building was flooded and closed. When we asked the building team for permission to climb 20 flights of stairs to retrieve 10 servers, we hit a stone wall at first. If we had been injured, a lot of people could have been in serious trouble, but we then engaged the building manager directly and succeeded. This was the main reason we were able to get the client running in a co-location facility so fast. Relationships are key to getting things done when there is possible red tape.
- COMMUNICATION – I have seen IT managers get into serious hot water when they don’t communicate effectively with the executives. I feel communication to the higher-ups and end users is over 50% of the battle. If something doesn’t go right but you communicate it and let everyone know how you plan to prevent the same misstep going forward, you will have a happier audience.
- TESTING YOUR DR/BCP PLAN – Having a plan is great, but without constant testing and refinement, how can you be sure that things will work when it comes time to execute? I like to see my clients test their plans at least two times a year. These tests give you the chance to work out any kinks and keep the plan consistent as the business grows and changes.
- DR vs. BCP – Disaster Recovery and Business Continuity Planning are two different animals, especially for firms with large operations. To provide some contrast, DR is the ability to recover a failed or degraded IT service to a healthy state (example: performing an Exchange email switchover to another data center), whereas BCP is focused more on logistical functions (example: a strategic plan for where and how to relocate people when employees are not able to enter the office due to a power outage, fire, flood, etc.).
- TEMPORARY WORKPLACE – Once again, in the case of my client who was not able to enter the office due to the flooding in downtown Manhattan, they needed to find a temporary office for 40+ people for several months. If you wait to do this until an emergency, most spaces will have already become unavailable. There are many data center/co-location environments that offer pre-arranged hotel space for a monthly fee. Many firms can operate just fine with employees working remotely from home. A Financial Services firm is more likely to need a shared physical space due to the requirement of constant collaboration amongst employees and groups. You need to figure out which subset of users can work remotely and which users need to be together.
- DATACENTER PLACEMENT – Another seemingly easy one, but often overlooked. I was just meeting with a new client with about 600 employees that had production in Manhattan and DR in New Jersey. Sandy took down both datacenters since they were so close together, and that completely shut down the firm’s business. On the flip side, we once again look at my downtown client, whose DR site was on the other side of the world in Singapore. The distance was too great, so make sure to take bandwidth and latency into account. I usually recommend a regional backup facility within 60 miles for minimal latency and near-zero RPO capabilities. I usually like to see the DR facility placed on an opposite coast. (Example: New York for primary, New Jersey for backup, and Arizona for the DR facility.)
- ISP REDUNDANCY TO THE NTH DEGREE – Just because you have two different ISP vendors doesn’t mean you are resilient. In Manhattan, Verizon owns most of the copper infrastructure, which means a large single point of failure exists. Because of that monopoly, Verizon has been forced to offer its copper to resellers. So if a client of mine has T1 lines from both Verizon and Verizon Reseller X, odds are both runs are physically routed through the same conduits in the building, the same circuit boards in the basement, the same path in the street, and the same central office. Always stagger your lines using different mediums such as fiber or WiMAX. Ask the building to show you its point of presence (POP) diagram, which will show you the different entry points for the carriers. This is great information to have when writing up a scenario in which physical construction outside the building could sever all connections to it. Of course, multiple POPs can help avoid an outage. I also like to utilize 4G line-of-sight connectivity whenever possible from a vendor like Towerstream. This technology has matured greatly and offers speeds in excess of 20 Mb/sec. Since the infrastructure sits on the roof of the building, it is completely autonomous and not affected by any issues that occur at ground level. It usually has a dedicated UPS on the roof as well! If your building doesn’t offer this, banding together with other firms in the building and presenting the idea to the building manager will sometimes get them to allow the infrastructure install on the roof.
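To put rough numbers on the datacenter placement point above, here is a minimal back-of-the-envelope sketch in Python. It is my own illustration, not a tool we used during Sandy. It assumes that light in fiber travels at about 200,000 km/s (roughly two-thirds of the speed of light in a vacuum) and that traffic follows the straight great-circle path, which real circuits never do, so treat the results as a best-case floor. The city coordinates are approximate, and the function names are made up for this example.

```python
import math

# Assumption: signals in optical fiber cover roughly 200 km per millisecond.
# Real routes are longer than the great-circle path and add equipment delay,
# so these figures are lower bounds, not predictions.
FIBER_KM_PER_MS = 200.0
EARTH_RADIUS_KM = 6371.0
KM_PER_MILE = 1.609344


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km):
    """Best-case round-trip time in ms: there and back at fiber speed."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Approximate coordinates (latitude, longitude), for illustration only.
NEW_YORK = (40.71, -74.01)
SINGAPORE = (1.35, 103.82)

regional_km = 60 * KM_PER_MILE
singapore_km = great_circle_km(*NEW_YORK, *SINGAPORE)

regional_rtt = min_rtt_ms(regional_km)
singapore_rtt = min_rtt_ms(singapore_km)

print(f"Regional site ({regional_km:.0f} km): at least {regional_rtt:.1f} ms RTT")
print(f"Singapore ({singapore_km:.0f} km): at least {singapore_rtt:.0f} ms RTT")
```

Under these assumptions, a site 60 miles away costs about 1 ms per round trip, while New York to Singapore costs roughly 150 ms before any routing, queuing, or protocol overhead is added. Every synchronous write or chatty application call pays that price on each round trip, which is why a nearby regional facility suits near-zero RPO replication and a site on the other side of the world does not.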
Planning for a DR scenario and actually being involved in one are two different things. No matter how experienced you are as an IT professional, you will always learn something new when going through something of this magnitude. Documentation, checklists, communication, and proactive testing are the ingredients of a successful DR/BCP experience.
-Justin Vashisht (3cVguy)