We often hear when Australian businesses are hit by ransomware, but what happens next? The incident response, forensic investigation, and system recovery processes are rarely made public.
There are likely multiple reasons why this is the case. One is that recovery from these incidents is often gruelling, with one in four teams needing a month or more to get back to business as usual.
Around-the-clock efforts to get back online are often part and parcel of the post-incident period. It’s an experience security teams are likely to be in no hurry to retell or relive.
It is worth examining why recovery from a ransomware attack takes so long, and in particular, whether architectural changes and/or additional tooling at an infrastructure level might help businesses to get back on their feet faster.
From a local data storage perspective, many businesses have similar infrastructure set-ups, where production servers talk to primary storage, and that data is replicated elsewhere for backup purposes. The backups may be point-in-time snapshots, or the data may be actively replicated and synchronised between two sites that operate in an active-active configuration.
From a backup perspective, the most important thing is to have an immutable copy of the primary storage environment, with a retention period set so that the copy cannot be deleted for a specified length of time. This is the secure copy of data the business can restore from in the event of a cyber attack. For added safety, it’s also important to put some sort of air gap between the backup and the primary storage environment.
Immutability is an important principle to consider when looking at the cyber resiliency of data infrastructure. The idea is to take a volume of data and make it immutable in such a way that if the business is hit by ransomware, that data cannot be altered by anyone, under any circumstances.
Air gapping is another important security principle. An air gap can be logical or physical; in a traditional infrastructure set-up, point-in-time backups may be stored on tape, which acts as a physical air gap to the primary storage environment. However, tape has its own challenges, and it may be that a logical air gap offers more flexibility in the event that the backup needs to be called on.
Finally, the immutable store has to have a data retention period set, such that deletion is impossible during the retention period and the protected copy of data is securely maintained.
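To make the retention idea concrete, here is a minimal sketch of a retention lock in Python. It is an illustration only, not any vendor's API: the names `ImmutableSnapshot`, `SnapshotStore` and `RetentionLockError` are hypothetical, and a real storage platform would enforce the lock below the level an administrator or attacker can reach.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class RetentionLockError(Exception):
    """Raised when a delete is attempted inside the retention period."""


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ImmutableSnapshot:
    # Hypothetical record used for illustration only.
    snapshot_id: str
    created_at: datetime
    retention: timedelta

    def locked_until(self) -> datetime:
        return self.created_at + self.retention

    def can_delete(self, now: datetime) -> bool:
        return now >= self.locked_until()


class SnapshotStore:
    """Toy in-memory store: write-once entries, deletes gated by retention."""

    def __init__(self) -> None:
        self._snapshots: dict[str, ImmutableSnapshot] = {}

    def add(self, snap: ImmutableSnapshot) -> None:
        # Refuse to overwrite an existing snapshot.
        if snap.snapshot_id in self._snapshots:
            raise ValueError(f"{snap.snapshot_id} already exists")
        self._snapshots[snap.snapshot_id] = snap

    def delete(self, snapshot_id: str, now: datetime) -> None:
        snap = self._snapshots[snapshot_id]
        if not snap.can_delete(now):
            raise RetentionLockError(
                f"{snapshot_id} is locked until {snap.locked_until().isoformat()}"
            )
        del self._snapshots[snapshot_id]

    def __contains__(self, snapshot_id: str) -> bool:
        return snapshot_id in self._snapshots


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    store = SnapshotStore()
    store.add(ImmutableSnapshot("vm-001", created, timedelta(days=30)))
    try:
        store.delete("vm-001", now=created + timedelta(days=10))
    except RetentionLockError as err:
        print("delete refused:", err)
```

The design point is that the delete check depends only on the clock and the retention value fixed at creation time, not on who is asking.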
From a month to minutes
Immutability is key to preserving the integrity of backups, but the ability to recover from immutable backups in an agile, timely way also needs to be regularly tested. That’s because recovery is more than just having good backups; in reality, a lot of things need to fall into place, in a particular order, for a recovery scenario to proceed successfully. A misstep at any point can complicate the recovery and prolong downtime.
With immutable backups and a well-designed, tested and executed DR plan, the time to recovery can be significantly reduced.
Take the example of an organisation with roughly 600 VMs that was hit by ransomware and had its environment encrypted. The forensics team started their investigation but would not let the operations team begin backup and recovery operations until they had finished working with the infected primary storage environment. It took four days for the forensics team to complete their work. The operations team then took three days to isolate and recover the 600-odd VMs from immutable backups, reverting each VM to a pre-encryption version. It then took them a day to harden the environment and eradicate the malware, and a further two days to reprotect the environment before connecting it to production. In total, the recovery took 10 days, which is considerably better than the industry average.
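The timeline in this example can be tallied with a few lines of Python. The phase durations are the ones given above; the labels and the function names `total_days` and `share_of_total` are ours, chosen for illustration.

```python
# Phase durations in days, taken from the example above.
PHASES = [
    ("forensic investigation", 4),
    ("isolate and recover VMs", 3),
    ("harden and eradicate malware", 1),
    ("reprotect before reconnecting", 2),
]


def total_days(phases):
    """Sum the duration of every phase."""
    return sum(days for _, days in phases)


def share_of_total(phases, name):
    """Fraction of the total recovery time spent in one named phase."""
    by_name = dict(phases)
    return by_name[name] / total_days(phases)


if __name__ == "__main__":
    print(total_days(PHASES))  # 4 + 3 + 1 + 2 = 10
    print(share_of_total(PHASES, "forensic investigation"))
```

Laid out this way, it is clear that the forensic hold alone accounted for four of the 10 days, before any restore work could begin.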
Still, the recovery was complex and, in some respects, not ideal. Even with immutable backups, we estimate it would still have taken in excess of 3,000 mouse clicks to recover all 600 VMs in the example. I’d argue the chances of performing an operation that requires 3,000 different clicks and actually getting it right the first time without making mistakes are very low. This also shows why recovery can become prolonged: a simple mistake in the ordering can have complex flow-on consequences for the recovery.
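One way to think about the ordering problem is as a dependency graph that software, rather than a person clicking through a console, can resolve. The sketch below uses Python's standard-library `graphlib` module; the system names and their dependencies are hypothetical, and it is not a description of how any particular recovery product works.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each key can only be recovered
# after everything in its set has been recovered.
DEPENDS_ON = {
    "domain-controller": set(),
    "database": {"domain-controller"},
    "app-server": {"database", "domain-controller"},
    "web-frontend": {"app-server"},
}


def recovery_order(depends_on):
    """Return one valid recovery order that respects every dependency.

    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def clicks_per_vm(total_clicks, vm_count):
    """Average manual clicks per VM for a given estimate."""
    return total_clicks / vm_count


if __name__ == "__main__":
    for step, system in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(step, system)
    # The estimate above is "in excess of" 3,000 clicks for 600 VMs,
    # so this figure is a lower bound per VM.
    print(clicks_per_vm(3000, 600))
```

Encoding the order once and replaying it removes the opportunity for a mis-sequenced manual step, which is where a prolonged recovery often begins.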
But this is also a technical challenge that we are interested in. By experimenting with specific combinations of storage technologies, for example, we’ve been able to demonstrate in a lab environment the ability to recover 1,500 VMs in as little as 17 minutes.
This work is ongoing, but it shows that the problem of ransomware recovery is one that is firmly on the R&D radar and may ultimately be resolvable, tipping the balance of power away from attackers and back towards IT administrators.
Nathan Knight is the vice-president and managing director for Australia and New Zealand at Hitachi Vantara.