Design of database backup

I have been thinking about how I would like backup to work for my database software (rustdb). Currently it has the ability to replicate a database in "real time". This protects against the "master" server failing by keeping an up-to-date copy, but it doesn't protect against "disasters" such as accidentally deleting a table.

So my idea is to have a mode where updates are fetched and saved in "real time" but not immediately applied. Instead they would be applied after some time period, say 12 weeks. In the event of a disaster (e.g. the accidental deletion of a table, as above), the updates are selectively applied, omitting whatever caused the disaster. It still perhaps cannot cover every eventuality, for example some kind of stealth disaster that is not noticed within the 12-week period. I guess it could be argued: if the disaster is not noticed, is it really a disaster?
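To make the idea concrete, here is a minimal sketch of the delayed-apply approach in plain Rust, assuming a simple in-memory queue. The names and types are invented for illustration and are not part of rustdb (which would of course store the pending updates persistently):

    use std::collections::VecDeque;
    use std::time::{Duration, SystemTime};

    /// One fetched replication update, saved rather than applied immediately.
    struct PendingUpdate {
        fetched_at: SystemTime,
        payload: Vec<u8>, // the serialized update
    }

    /// Holds fetched updates and only applies those older than `delay`.
    struct DelayedApplier {
        delay: Duration, // e.g. 12 weeks
        pending: VecDeque<PendingUpdate>,
    }

    impl DelayedApplier {
        fn new(delay: Duration) -> Self {
            Self { delay, pending: VecDeque::new() }
        }

        /// Save an update as soon as it is fetched from the master.
        fn fetch(&mut self, payload: Vec<u8>) {
            self.pending.push_back(PendingUpdate { fetched_at: SystemTime::now(), payload });
        }

        /// Apply only the updates that have aged past the delay, leaving
        /// recent ones queued in case a disaster is discovered in the meantime.
        fn apply_due(&mut self, mut apply: impl FnMut(&[u8])) {
            let now = SystemTime::now();
            while let Some(front) = self.pending.front() {
                match now.duration_since(front.fetched_at) {
                    Ok(age) if age >= self.delay => {
                        let upd = self.pending.pop_front().unwrap();
                        apply(&upd.payload[..]);
                    }
                    _ => break, // everything newer is still inside the safety window
                }
            }
        }
    }

    fn main() {
        let twelve_weeks = Duration::from_secs(12 * 7 * 24 * 60 * 60);
        let mut applier = DelayedApplier::new(twelve_weeks);
        applier.fetch(b"UPDATE ...".to_vec());
        // Nothing is applied yet: the update is less than 12 weeks old.
        applier.apply_due(|payload| println!("applying {} bytes", payload.len()));
    }

The point is just that fetching and applying become two separate steps, with the delay in between acting as the safety window.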

Does this make sense? What kind of facilities would you like to see for protecting a database from disaster?

For long-term recovery scenarios like this, a common strategy is to have several tiers of backups with different frequencies and retention policies, for example (see the sketch after this list):

  • Daily backups kept for ~2 weeks
  • Weekly backups kept for ~2 months
  • Monthly backups kept for ~1.5 years
  • Annual backups kept for ~7-10 years (or indefinitely)
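As a very rough illustration, such a schedule could be expressed as configuration data; the Tier struct, the days helper and the exact durations below are just assumptions for the sketch, not an existing feature:

    use std::time::Duration;

    fn days(n: u64) -> Duration {
        Duration::from_secs(n * 24 * 60 * 60)
    }

    /// One backup tier: how often a backup is taken and how long it is kept.
    struct Tier {
        name: &'static str,
        every: Duration,
        keep_for: Duration,
    }

    /// The schedule from the list above, expressed as data (durations approximate).
    fn example_schedule() -> Vec<Tier> {
        vec![
            Tier { name: "daily", every: days(1), keep_for: days(14) },      // ~2 weeks
            Tier { name: "weekly", every: days(7), keep_for: days(61) },     // ~2 months
            Tier { name: "monthly", every: days(30), keep_for: days(548) },  // ~1.5 years
            Tier { name: "annual", every: days(365), keep_for: days(3650) }, // ~10 years
        ]
    }

    /// A backup belonging to a tier is pruned once it is older than that tier's retention.
    fn should_prune(tier: &Tier, age: Duration) -> bool {
        age > tier.keep_for
    }

    fn main() {
        let schedule = example_schedule();
        for t in &schedule {
            println!("{}: every {:?}, keep for {:?}", t.name, t.every, t.keep_for);
        }
        // A 15-day-old daily backup is due for pruning.
        assert!(should_prune(&schedule[0], days(15)));
    }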

The older backups aren't necessarily "live", and may even be written to some kind of archival media and stored in a vault. If you need them, things have gone really wrong somewhere, so security will be more important than convenience for those.

This specific schedule is probably overkill for most projects, but the capability to set up something like this is a good one to have.

That's not necessarily a given. If your system is infiltrated by a sophisticated hacker group, for example, you may need really old records for comparison to figure out how long they've had access once you detect them.


Yes, I was thinking longer-term backups can help with the situation where something is deleted but nobody notices for a year. I guess that is probably outside the scope of the software, but it could perhaps be automated.

What I think is important is keeping it simple: the more complex and difficult to understand it is, the higher the chance that it doesn't work (or that nobody has bothered to set it up) and isn't there when needed!

I have now implemented this. Instead of automatically applying updates as they are fetched, they are simply stored in a table. I also wrote a "Roll" stored procedure which applies the stored updates and can be run as appropriate. I had some trouble getting this to work reliably, due to a "submarine" bug in my code that detects Denial of Service attacks: when run in a null context, the check failed as if there was a DoS attack when there wasn't. It all seems quite nice and logical.
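In plain Rust, the shape of this is roughly as follows; SavedUpdate and roll are made-up names for illustration and just stand in for the actual table and the "Roll" stored procedure:

    /// A saved replication update, stored in a table instead of being applied
    /// as soon as it is fetched. (Illustrative layout, not the real rustdb table.)
    struct SavedUpdate {
        id: u64,      // the order in which updates were fetched
        data: String, // the update itself, e.g. serialized SQL
    }

    /// A plain-Rust stand-in for the "Roll" stored procedure:
    /// apply the saved updates in the order they were fetched.
    fn roll(updates: &[SavedUpdate], mut apply: impl FnMut(&SavedUpdate)) {
        for u in updates {
            apply(u);
        }
    }

    fn main() {
        let updates = vec![
            SavedUpdate { id: 1, data: "INSERT ...".to_string() },
            SavedUpdate { id: 2, data: "UPDATE ...".to_string() },
        ];
        roll(&updates, |u| println!("applying update {}: {}", u.id, u.data));
    }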

Perhaps the only tricky part is that when some "disaster" occurs, the Roll procedure needs to be modified to omit the mistake. I don't think I can really automate that; there has to be some manual intervention to identify what should not be applied.
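Continuing the sketch above (and reusing its SavedUpdate type), the manual intervention would boil down to supplying a list of update ids that Roll should skip, for example:

    /// The same roll, but with a manually supplied list of update ids to omit
    /// because they caused the "disaster".
    fn roll_skipping(updates: &[SavedUpdate], skip: &[u64], mut apply: impl FnMut(&SavedUpdate)) {
        for u in updates {
            if skip.contains(&u.id) {
                continue; // omit the mistake; everything else is still applied
            }
            apply(u);
        }
    }

    fn main() {
        let updates = vec![
            SavedUpdate { id: 1, data: "INSERT ...".to_string() },
            SavedUpdate { id: 2, data: "DROP TABLE ...".to_string() }, // the accident
            SavedUpdate { id: 3, data: "UPDATE ...".to_string() },
        ];
        // Someone has looked at the stored updates and decided id 2 must not be applied.
        roll_skipping(&updates, &[2], |u| println!("applying update {}: {}", u.id, u.data));
    }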

The code is here: