Risk analysis: data loss
Personal risk analysis and planned countermeasures.
Initially I thought it would be a good idea to buy hardware with ECC RAM support, but on consumer hardware this looks like an adventure. Reliable server components are both loud and expensive, so I had to give up this idea. Only AMD Threadripper seems to ensure that ECC works correctly, but even this depends on the mainboard, for example, and there are no tests available that cover the actual correction. In the end, only a single flipped bit per memory word can be detected and corrected; two flipped bits are detected but not corrected, and when RAM is defective you mostly face a situation where complete regions are unusable. So I finally decided to go for a system without ECC RAM (for both technical and commercial reasons). This decision is reflected in the analysis.
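To illustrate the "one bit corrected, two bits only detected" limit, here is a minimal sketch of a SECDED code (an extended Hamming code) on a toy 4-bit word. This is an assumption made for illustration only: real ECC modules typically protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword.

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    code with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits: 0 for a valid codeword,
    # otherwise the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips, assumed to be exactly one.
        # A syndrome of 0 means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(word))
    print(decode(one_flip))
    print(decode(two_flips))
```

With three or more flipped bits the decoder can even "correct" to a wrong value, which is one reason why ECC does not help much once whole regions of a module are defective.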
As the data travels from RAM to disk via the CPU, every component that carries the information can corrupt it.
Under disasters I consider risks coming from outside the system.
- water damage
- fire damage
- lightning damage
- theft of physical devices
- theft of information via network
- Virus / Malware / Ransomware
Under system failures I consider risks coming from the system itself when it does not perform as intended.
RAM
- Silent memory corruption
- Defective RAM cells
CPU / Software
- Wrong processing
- Bug in backup software
- Bug in filesystem drivers
Disk
- disk failure
- bad sectors
Cables / Bus
- connection cable not seated correctly
- Wrong communication (due to bugs, high load, ...)
- power failure
- heat/cooling trouble
- water cooling leak (if applicable)
Under accidents I consider risks caused by the user/operator making a mistake of some kind.
- delete/overwrite files accidentally
- wrong backup procedure/configuration
- dropping a backup disk or another handling error (similar to a disk failure)
Prevention / measures
- Use ZFS to detect bit rot and silent corruption on disk
- Scrub disks regularly so that sectors degraded by fading magnetisation on HDDs are detected (and repaired where redundancy exists)
- Use checksums or Parchive (PAR2) recovery files
- Always keep at least one backup offline
- Mount backup drives only while a backup is running
- Store at least one backup off-site
- Use 2 different backup software tools
- Use 2 different filesystems
- Use a password safe
- Always encrypt sensitive information
- Monitor hardware
- Check backups regularly
- Use snapshots and verify immutable data integrity
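
The checksum and backup-check measures above can be sketched as a file-level manifest. This is a minimal sketch under my own assumptions, not how ZFS or any particular backup tool works internally: the helper names `file_sha256`, `build_manifest` and `verify` are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not have to fit in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file below root (as a relative POSIX path) to its SHA-256."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Compare the current state of root with a previously stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            name
            for name in set(manifest) & set(current)
            if manifest[name] != current[name]
        ),
        "extra": sorted(set(current) - set(manifest)),
    }
```

A possible use: build the manifest once for a directory of immutable data (photos, archives), store it together with the backup, and run `verify` against both the original and the restored copy. A checksum cannot tell an intended change from corruption, so this only gives a clear signal for data that is not supposed to change.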