December 18, 2024

Why you do not need blockchain for archiving historical data

TLDR: You can take the useful part of blockchains - hash chains - and reject the resource-wasting overcomplexity.

I learnt recently that a prominent historical archive was an early adopter of blockchain technology, using it to continuously verify the integrity of its records, including written testimonies, photographs and historical accounts.

I think this is almost a good use for the technology - but there are simpler alternatives, including one which has proven its strength in the wild... of digital piracy.

I'm aiming to make this readable and convincing to a somewhat technical audience who may not fully understand blockchain. Blockchain is very complicated and badly mischaracterised (e.g. have you heard it referred to as a 'secure public ledger'?), so it's nearly impossible to explain to people without any technical background.

The main idea here is that one particular aspect of blockchains is actually good for maintaining historical records - hash chains. To understand what these are we need to start with checksums.

What are checksums?

To understand blockchain you first need to understand checksums. Checksums are like digital fingerprints of data. They work by running the data through a mathematical function that produces a fixed-length string of characters called a hash.

Checksum hashes look like this:

8d969eef6ecad3ca3a629280e686cf0c3f5d5a86aff3ca12020c923adc6c92

They are very cool and interesting because:

Even if one tiny change happens to the original data, the whole hash is different
You cannot feasibly reconstruct the original data from the checksum
In principle you can define your own checksum algorithm and make the output as long or short as you like, though in practice you should use a well-studied standard such as SHA-256.

You can think of checksums as a bit like very accurate but tiny photographs of data, or more accurately like a fingerprint, that you can use to verify identity.

You can see how these would be useful in digital archives: you can save a checksum for all your data, and if someone sneakily makes an edit, you will know it without having to go through the entire thing, as the checksum will be different!
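To make this concrete, here is a minimal sketch using Python's built-in hashlib module. I'm assuming SHA-256 as the checksum algorithm, and the two sentences are made-up sample data; the point is only that a one-character edit produces a completely different hash.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()

# Made-up sample records that differ by a single character.
original = b"The archive received this testimony on 3 May 1946."
edited = b"The archive received this testimony on 3 May 1947."

print(checksum(original))
print(checksum(edited))

# Comparing two short hashes is enough to know the data differs.
print(checksum(original) == checksum(edited))  # False
```

The same function works on a whole file's bytes, so an archive only needs to store (and compare) one short string per item.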

What are hash chains?

Hash chains build on checksums by linking them together, to create a chain of data and their checksums. The simple way to link them together is to just add the previous checksum hash to the new data before hashing it. This ensures the integrity of all the previous data (the history), as well as the latest data.

This works as hash chains have a very cool and useful property:

Changing a historical entry would break all the following hashes - so you can't do it without it being noticed!

It's a bit like creating a chain of sealed envelopes. Each new envelope contains today's letter PLUS a copy of the seal from yesterday's envelope. This way, if someone tries to tamper with any previous envelope, all the seals after it would be wrong - making it obvious something was changed.

This is why they are useful for proving that nothing in the history of a dataset has been tampered with: if the latest checksum matches a trusted copy, then (thanks to the maths) every previous entry must be unchanged too. So you get verification of your current data and your history of changes in one fell swoop - it's a very efficient way of getting a fingerprint of the entire history of a database, including all the edits, times of editing and everything!
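The sealed-envelope idea can be sketched in a few lines of Python. This is an illustration under my own assumptions rather than any particular system's format: the names (GENESIS, link_hash, build_chain, verify_chain) are hypothetical, SHA-256 is assumed, and each entry is hashed together with the previous entry's hash.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry

def link_hash(prev_hash: str, record: str) -> str:
    """Hash a record together with the hash of the entry before it."""
    return hashlib.sha256((prev_hash + "\n" + record).encode("utf-8")).hexdigest()

def build_chain(records):
    """Return a list of (record, hash) pairs, each hash covering all earlier entries."""
    chain = []
    prev = GENESIS
    for record in records:
        prev = link_hash(prev, record)
        chain.append((record, prev))
    return chain

def verify_chain(chain, trusted_head: str) -> bool:
    """Recompute every link and check the final hash against a trusted copy."""
    prev = GENESIS
    for record, stored_hash in chain:
        prev = link_hash(prev, record)
        if prev != stored_hash:
            return False
    return prev == trusted_head

# Made-up archive entries.
chain = build_chain(["letter 1", "letter 2", "letter 3"])
head = chain[-1][1]  # the one hash you need to keep safe

print(verify_chain(chain, head))  # True

# Tamper with the first entry while leaving its stored hash alone.
tampered = [("forged letter 1", chain[0][1])] + chain[1:]
print(verify_chain(tampered, head))  # False
```

Note that the check needs a head hash you trust; the chain on its own only proves that it is internally consistent.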

What uses this approach already?

It is used very commonly - wherever the integrity of a piece of data depends on its history.

Git uses a very similar concept - each commit has a checksum hash that covers both the contents and the previous commit's hash. This lets you verify the entire history of a codebase.
HTTPS uses it - when a certificate authority (CA) issues a new SSL/TLS certificate for a website, the certificate is recorded in a public append-only log (Certificate Transparency) built on a closely related structure, the Merkle tree. This makes sure that CAs can't issue secret certificates, and that if a CA itself is compromised, the malicious certificates will be visible
Many other applications where data's historical integrity is important:
- Software distribution systems e.g. apt
- ZFS and other filesystems
- Time stamping authorities
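As a small taste of the Git example above, this sketch reproduces how Git names a file's contents (a 'blob'): it hashes a short header plus the bytes. I'm assuming a repository using Git's default SHA-1 object format, and git_blob_id is my own helper name, not part of Git. Commits are hashed the same way, except their text includes the IDs of the file tree and of the parent commit(s), which is what chains the history together.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to file contents (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The empty file has a well-known ID in every SHA-1 Git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

You can compare the result against `git hash-object <file>` on your own machine.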

So what does blockchain do on top of this? There must be something more to it.

One of the main differences between a simpler hash chain based system and blockchain is the extra mechanism of distributed consensus. The idea of this is that instead of one trusted institution storing the data, you have multiple institutions, which makes it possible to compare the data to check it hasn't been tampered with.

For most data in our lives, a trusted institution stores it and protects it from attackers, but this has downsides. For example in a historical archive: if someone wanted to change the history of a dataset, they would just need to hack this one location, and then they would be able to simply rewrite a historical entry and compute and save new checksum hashes all the way forward to the present.
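Here is a rough sketch of that attack, with made-up records and a hypothetical head_hash helper (SHA-256 assumed). It shows why a single copy is not enough: once the attacker recomputes the chain, the forged archive passes its own checks, and only a head hash held somewhere else gives the game away.

```python
import hashlib

def head_hash(records):
    """Fold records into a hash chain and return the final (head) hash."""
    prev = "0" * 64  # placeholder hash before the first record
    for record in records:
        prev = hashlib.sha256((prev + "\n" + record).encode("utf-8")).hexdigest()
    return prev

# What the archive originally held (made-up records).
records = ["1946: testimony received", "1950: photograph added", "1975: account transcribed"]
published_head = head_hash(records)  # imagine a copy of this is also held elsewhere

# The attacker rewrites the first entry and recomputes everything after it.
forged = ["1946: testimony withdrawn"] + records[1:]
forged_head = head_hash(forged)

# Checked only against itself, the forged archive looks perfectly healthy...
print(head_hash(forged) == forged_head)  # True
# ...and only the head hash kept somewhere else reveals the rewrite.
print(forged_head == published_head)     # False
```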

Blockchains solve this issue by doing the following two key things:

Encouraging copies of the hash chain in multiple places (nodes)

This is good for our purposes! 👍 But it's also very easy to do without blockchain, in fact 'BitTorrent' is based on this! (more below)

Making it so anyone in the system can propose a change (i.e. send bitcoin to another account) and that all nodes will be updated with this change

This is NOT good for our purposes. 😱

What's interesting is that the Bitcoin whitepaper by Satoshi Nakamoto cites some of this earlier work, particularly Haber and Stornetta's hash-linked timestamping schemes and Merkle trees, as building blocks for Bitcoin's blockchain design. The innovation of blockchain wasn't really the chain of hashes - it was adding distributed consensus on top of that existing concept.

Without the consensus mechanism, you get something closer to BitTorrent - which is both efficient and battle-proven (e.g. no-one seems able to shut down The Pirate Bay!)

In bitcoin, consensus on whether the data should be changed is established by 'proof of work'. This is really difficult to explain, but one of the main ideas is that it makes editing the history economically infeasible while keeping allowed changes (sending coins) straightforward.
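A toy version may help with the intuition. This is a heavy simplification under my own assumptions, not Bitcoin's actual scheme (which double-hashes a binary block header against a numeric target): here 'work' just means finding a number (a nonce) that makes the hash start with a given count of zeros. Finding it takes many attempts, checking it takes one, and rewriting an old entry would mean redoing the work for it and for everything after it.

```python
import hashlib
from itertools import count

def pow_hash(block_data: str, nonce: int) -> str:
    """Hash the block data together with a candidate nonce."""
    return hashlib.sha256(f"{block_data}|{nonce}".encode("utf-8")).hexdigest()

def mine(block_data: str, difficulty: int) -> int:
    """Try nonces until the hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        if pow_hash(block_data, nonce).startswith(prefix):
            return nonce

def is_valid(block_data: str, nonce: int, difficulty: int) -> bool:
    """Verifying the work takes a single hash."""
    return pow_hash(block_data, nonce).startswith("0" * difficulty)

# Difficulty 4 needs about 65,000 attempts on average; each extra digit costs 16x more.
nonce = mine("alice pays bob 1 coin", difficulty=4)
print(nonce, pow_hash("alice pays bob 1 coin", nonce))
print(is_valid("alice pays bob 1 coin", nonce, difficulty=4))  # True
```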

But wait, does that mean blockchain's main feature is actually something we absolutely don't want when maintaining historical archives?

Yes!

In fact with historical archives we absolutely need and want trusted institutions to take care of the data; we don't want crazy radicals with powerful computers proposing changes to it. And the crux of this argument is: if you don't want to let unapproved actors propose changes, then there is no point in using blockchains.

So what's the sensible solution?

If you have multiple trusted institutions each keeping their own copies of the data (or even just the latest checksums), you absolutely can get all of the purported security benefits of blockchain without the complexity.

In fact, this is the old-fashioned way - it's how many important archival systems work, by keeping copies, e.g.:

Multiple universities and libraries archive books and papers and can cross-reference.
Courts, lawyers and everyone involved each maintain copies of documents
Banks and auditors keep separate copies of transactions that must match

This general approach is called LOCKSS (Lots of Copies Keep Stuff Safe). A brilliant article by Maxwell Neely-Cohen, which explores the challenging question of how to store digital information for 100 years or more, goes into lots of detail.

Checksums and hash chains just speed up the ability to compare digital data - you don't have to compare the entire thing, just the hashes - it's efficient. So the key is having:

A system of regular checksum verification between copies
Multiple independent parties storing copies
Enough copies that no single disaster/failure loses everything

If an attacker changes the hash chain at one institution, the other ones will be able to detect it. So an attacker would need to conduct an Ocean's Eleven-style heist on multiple institutions simultaneously to succeed.

As hinted at earlier, BitTorrent does this well, and multiple archives are already using torrents e.g.:
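The cross-check between institutions can be very small indeed. In this sketch each institution reports only the latest (head) hash of its copy; the institution names, the shortened hash strings and the find_outliers helper are all hypothetical, and 'the majority is right' is my simplifying assumption.

```python
from collections import Counter

def find_outliers(heads):
    """Given {institution: head_hash}, return the institutions that disagree with the majority."""
    if not heads:
        raise ValueError("no copies to compare")
    majority_hash, votes = Counter(heads.values()).most_common(1)[0]
    if votes * 2 <= len(heads):
        raise ValueError("no clear majority: the copies need manual investigation")
    return sorted(name for name, h in heads.items() if h != majority_hash)

# Shortened, made-up head hashes reported by each copy holder.
heads = {
    "university_library": "3f9a",
    "national_archive": "3f9a",
    "museum": "3f9a",
    "regional_office": "c071",  # this copy has been altered (or corrupted)
}

print(find_outliers(heads))  # ['regional_office']
```

Only a few bytes per institution need to be exchanged, however large the archive itself is.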

https://archive.org/details/BitTorrent
https://academictorrents.com/

Note that BitTorrent doesn't use hash chains, but you could easily store previous versions of the torrent hashes in subsequent torrents, if the history of the archive was particularly important. BitTorrent itself is more about enabling the distribution of data than archiving it.
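To sketch what 'storing the previous hash in the next release' could look like, here is a made-up JSON manifest format. This is not BitTorrent's own file format or API; the field names ("files", "previous") and the helpers are hypothetical, and SHA-256 is assumed. Each release lists its files' checksums plus the hash of the release before it, and the manifest would simply be shipped as one more file inside each torrent.

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """Hash a manifest in a canonical form (sorted keys, no extra whitespace)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def new_release(files: dict, previous=None) -> dict:
    """Build a manifest for {filename: bytes}, linked to the previous manifest if any."""
    return {
        "files": {name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())},
        "previous": manifest_hash(previous) if previous is not None else None,
    }

# Two made-up releases of an archive.
v1 = new_release({"testimony.txt": b"first version"})
v2 = new_release({"testimony.txt": b"first version", "photo.jpg": b"raw bytes"}, previous=v1)

# Anyone holding v2 can check that a claimed v1 is the genuine one.
print(v2["previous"] == manifest_hash(v1))  # True
```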

So is blockchain basically useless for archiving historical data?

Ahem. Unless you want to accidentally sell the rights to edit your archive for cash, I would say 'yes'.

Blockchains solve the more specific problem of 'how do we agree on new data changes when we don't trust each other?' But if you have a known set of trusted parties who can simply compare their copies, the complex consensus mechanism (e.g. proof of work) becomes completely unnecessary (climate-burning) overhead.

Moreover, blockchain becomes potentially problematic:

We don't want unknown actors being able to propose changes - this is fine in a cryptocurrency like bitcoin: we allow anyone to 'send money' because nobody objects to receiving it!
The computational overhead is wasteful - some newer consensus algorithms (e.g. proof of stake) are huge improvements, but they will still always be more wasteful than simply comparing hashes, and they introduce more weirdness of their own
Archives need to be able to make corrections when errors are indeed found
The complexity of blockchain systems might actually increase the risk of future inaccessibility

In the original proof of work blockchain, you also have the issue that someone could gain control over future changes to the archive simply by controlling more than half of the network's computing power (this is known as the 51% attack). This would be the equivalent of someone setting up multiple historical archives with fictitious data, and then the world consensus switching to those archives because there are simply more of them.

What's even worse is that blockchain projects in the wild often incorporate 'token-based governance' where:

Voting power is proportional to token ownership
Major decisions and changes are made by token holder votes
Anyone can acquire tokens and thus voting power

This would clearly be disastrous for historical archives, because truth shouldn't be determined by wealth, and professional expertise and academic credentials shouldn't be made irrelevant.

It would result in a crazy world where wealthy individuals or groups could gain control over historical narratives.

But what about 'Proof of Authority'?

To use blockchain sensibly in historical records, you might be able to use 'Proof of Authority' which by definition removes the main feature of blockchain - the ability to operate without authority. It's like building a submarine with wheels - if you're only going to drive it on roads, why not just use a car?

Blockchain only becomes (potentially) useful when you don't have trusted people/institutions to hold the data. If you don't trust people, you then need some mechanism (such as a reputation system) to coordinate activity and to prevent e.g. Sybil attacks.

Ok, I won't use blockchain, what else can I use?

It will depend quite a bit on your scenario, but if you're looking to implement secure historical archiving with data integrity verification, consider exploring PostgreSQL with the pgcrypto extension, which allows you to implement cryptographic hashing and maintain audit logs of all changes. For a more specialized solution, Immudb is also worth investigating - it's a lightweight, immutable database specifically designed for cryptographic verification of data changes using Merkle trees. Both options are open-source and can be combined with distributed storage solutions like BitTorrent for redundancy. The key is to focus on the core requirements: cryptographic verification, audit trails, and distributed copies.
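To show the shape of such an audit trail without installing anything, here is a sketch that uses Python's built-in sqlite3 module as a stand-in. It is not PostgreSQL, pgcrypto or Immudb, and the table layout and function names are my own assumptions. Each row stores the hash of the row before it, and corrections are appended as new rows rather than edits, so the archive can still fix errors without hiding that it did so.

```python
import hashlib
import sqlite3

GENESIS = "0" * 64  # placeholder "previous hash" for the first row

def open_log(path=":memory:"):
    """Open (or create) a hash-chained audit log in a SQLite database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " entry TEXT NOT NULL,"
        " prev_hash TEXT NOT NULL,"
        " entry_hash TEXT NOT NULL)"
    )
    return conn

def append(conn, entry: str) -> str:
    """Append an entry linked to the previous row and return the new head hash."""
    row = conn.execute("SELECT entry_hash FROM audit_log ORDER BY id DESC LIMIT 1").fetchone()
    prev_hash = row[0] if row else GENESIS
    entry_hash = hashlib.sha256((prev_hash + "\n" + entry).encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT INTO audit_log (entry, prev_hash, entry_hash) VALUES (?, ?, ?)",
        (entry, prev_hash, entry_hash),
    )
    conn.commit()
    return entry_hash

def verify(conn) -> bool:
    """Recompute the whole chain and report whether every row still matches."""
    prev = GENESIS
    rows = conn.execute("SELECT entry, prev_hash, entry_hash FROM audit_log ORDER BY id")
    for entry, prev_hash, entry_hash in rows:
        expected = hashlib.sha256((prev + "\n" + entry).encode("utf-8")).hexdigest()
        if prev_hash != prev or entry_hash != expected:
            return False
        prev = entry_hash
    return True

log = open_log()
append(log, "added: testimony #1")
head = append(log, "correction: testimony #1 date fixed")  # share this with other institutions
print(verify(log))  # True

# A sneaky in-place edit is caught on the next verification.
log.execute("UPDATE audit_log SET entry = 'added: forged testimony' WHERE id = 1")
print(verify(log))  # False
```

As discussed earlier, the head hash returned by append is what you would publish or send to the other copy holders, since a database that only checks itself can still be rewritten wholesale.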