Blog posts

Welcome to my personal programming blog. Most of the topics are about programming, so if that's not interesting to you, you might want to skip this one! If you're really keen, you can see an archive of all my blog posts.

I also post regularly on my stream which is a microblog of thoughts, photos, reminders etc.

Why you do not need blockchain for archiving historical data

TLDR: You can take the useful part of blockchains - hash chains - and reject the resource-wasting overcomplexity.

I recently learnt that a prominent historical archive was an early adopter of blockchain technology, using it to continuously verify the integrity of their records, including written testimonies, photographs and historical accounts.

I think this is almost a good use for the technology - but there are simpler alternatives - including one which has proven its strength in the wild... of digital piracy.

I'm aiming to make this readable and convincing to a somewhat technical audience who may not fully understand blockchain. Blockchain is very complicated and badly mischaracterised (e.g. have you heard it referred to as a 'secure public ledger'?), which makes it nearly impossible to explain to people without any technical background.

The main idea here is that one particular aspect of blockchains is actually good for maintaining historical records - hash chains. To understand what these are we need to start with checksums.

What are checksums?

To understand blockchain you first need to understand checksums. A checksum is like a digital fingerprint of data: you run the data through a mathematical function that produces a fixed-length string of characters called a hash.

Checksum hashes look like this:

8d969eef6ecad3c29a3a629280e686cf0c3f5d5a86aff3ca12020c923adc6c92

They are very cool and interesting because:

  • Even a tiny change to the original data produces a completely different hash
  • You cannot reconstruct the original data from the checksum
  • You can choose from different algorithms (or define your own) to make hashes as long or short as you like.

You can think of a checksum as a bit like a very accurate but tiny photograph of data - or, more accurately, like a fingerprint you can use to verify identity.

You can see how these would be useful in digital archives: save a checksum for all your data, and if someone sneakily makes an edit, you will know without having to go through the entire archive, because the checksum will be different!
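
As a quick illustration, here's what that looks like from the command line (using the standard sha256sum tool; on a Mac you'd use shasum -a 256 instead):

# Hash a piece of 'archive data'
echo "The treaty was signed in 1921." | sha256sum
# -> prints one 64-character hex hash

# Change a single character and hash it again
echo "The treaty was signed in 1931." | sha256sum
# -> prints a completely different 64-character hash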

What are hash chains?

Hash chains build on checksums by linking them together to create a chain of data and checksums. The simplest way to link them is to include the previous checksum hash alongside the new data when computing the next checksum. This ensures the integrity of all the previous data (the history) as well as the latest data.

This works as hash chains have a very cool and useful property:

  • Changing a historical entry would break all the following hashes - so you can't do it!

It's a bit like creating a chain of sealed envelopes. Each new envelope contains today's letter PLUS a copy of the seal from yesterday's envelope. This way, if someone tries to tamper with any previous envelope, all the seals after it would be wrong - making it obvious something was changed.

This is why they are useful for proving that nothing in the history of a dataset has been tampered with: you can verify just the latest checksum, and that instantly (with maths) checks all the previous checksums too. You get verification of your current data and your entire history of changes in one fell swoop - a very efficient way of getting a fingerprint of the whole history of a database, including every edit and when it was made.
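
To make that concrete, here's a minimal shell sketch of a hash chain (the entry filenames are made up for illustration):

# Start the chain by hashing the first archive entry
sha256sum entry-001.txt | awk '{print $1}' > chain.txt

# For each new entry, hash the new data TOGETHER WITH the previous hash,
# and append the result to the chain
prev=$(tail -n 1 chain.txt)
{ cat entry-002.txt; echo "$prev"; } | sha256sum | awk '{print $1}' >> chain.txt

# Re-running these steps over the stored entries must reproduce the final
# line of chain.txt - if any historical entry was edited, it won't.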

What uses this approach already?

It is used very commonly - wherever the integrity of a piece of data depends on its history.

  1. Git uses a very similar concept - each commit has a checksum hash that covers both the changes and the previous commit's hash. This lets you verify the entire history of a codebase (see the short example after this list).
  2. HTTPS certificates use it - Certificate Transparency logs record every SSL/TLS certificate a certificate authority (CA) issues, using this approach so that CAs can't issue certificates in secret, and so that if a CA is compromised, any malicious certificates become publicly visible
  3. Many other applications where data's historical integrity is important:
    • Software distribution systems e.g. apt
    • ZFS and other filesystems
    • Time stamping authorities
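
For example, you can see the chain directly in any git repository:

# Every commit object records the hash of its parent commit,
# so the latest commit hash effectively covers the whole history
git cat-file -p HEAD
# (the output includes a 'tree <hash>' line and a 'parent <hash>' line)

# git can also verify the integrity of every object it stores
git fsck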

So what does blockchain do on top of this? There must be something more to it.

One of the main differences between a simpler hash-chain-based system and a blockchain is the extra mechanism of distributed consensus. The idea is that instead of one trusted institution storing the data, you have multiple institutions, which makes it possible to compare copies and check the data hasn't been tampered with.

For most data in our lives, a trusted institution stores it and protects it from attackers, but this has downsides. For example, in a historical archive: if someone wanted to change the history of a dataset, they would just need to hack that one location, rewrite a historical entry, and then compute and save new checksum hashes all the way forward to the present.

Blockchains solve this issue by doing the following two key things:

  1. Encouraging copies of the hash chain in multiple places (nodes)

This is good for our purposes! 👍 But it's also very easy to do without blockchain - in fact, BitTorrent is based on this! (more below)

  2. Making it so anyone in the system can propose a change (e.g. send bitcoin to another account) and that all nodes will be updated with this change

This is NOT good for our purposes. 😱

What's interesting is that Satoshi Nakamoto cited some of these earlier systems - notably the linked timestamping schemes of the 1990s - as inspiration for Bitcoin's blockchain design. The innovation of blockchain wasn't really the chain of hashes - it was adding distributed consensus on top of that existing concept.

Without the consensus mechanism, you get something much closer to BitTorrent - efficient and proven in battle (e.g. no-one seems able to shut down The Pirate Bay!).

In Bitcoin, consensus on whether the data should be changed is established by 'proof of work'. This is really difficult to explain, but one of the main ideas is that it becomes economically infeasible to edit the history, while staying straightforward to make allowed changes (sending coins).
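
If you're curious, here's a toy (and deliberately simplistic) shell sketch of the proof-of-work idea: keep guessing nonces until the block's hash starts with enough zeros. Real networks tune the difficulty so that this guessing costs serious hardware and electricity.

block="hash-of-previous-block plus new transactions"
nonce=0
# Keep guessing until the hash starts with three zeros (the 'difficulty')
while true; do
    hash=$(echo "$block $nonce" | sha256sum | awk '{print $1}')
    case "$hash" in
        000*) echo "Found nonce $nonce -> $hash"; break ;;
    esac
    nonce=$((nonce + 1))
done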

But wait, does that mean blockchain's main feature is actually something we absolutely don't want when maintaining historical archives?

Yes!

In fact, with historical archives we absolutely need and want trusted institutions to take care of the data; we don't want crazy radicals with powerful computers proposing changes to it. And the crux of this argument is: if you don't want to let unapproved actors propose changes, then there is no point in using a blockchain.

So what's the sensible solution?

If you have multiple trusted institutions each keeping their own copies of the data (or even just the latest checksums), you absolutely can get all of the purported security benefits of blockchain without the complexity.

In fact, this is the old-fashioned way - it's how many important archival systems already work, by keeping copies, e.g.:

  • Multiple universities and libraries archive books and papers and can cross-reference.
  • Courts, lawyers and everyone involved each maintain copies of documents
  • Banks and auditors keep separate copies of transactions that must match

This general approach is called LOCKSS (Lots Of Copies Keep Stuff Safe). A brilliant article by Maxwell Neely-Cohen, which explores the challenging question of how to store digital information for 100 years or more, goes into lots of detail.

Checksums and hash chains just speed up the ability to compare digital data - you don't have to compare the entire thing, just the hashes - it's efficient. So the key is having:

  • A system of regular checksum verification between copies
  • Multiple independent parties storing copies
  • Enough copies that no single disaster/failure loses everything

If an attacker changes the hash chain at one institution, the others will be able to detect it. So an attacker would need to conduct an Ocean's Eleven style heist on multiple institutions simultaneously to succeed.
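
As a minimal sketch of what that 'regular checksum verification between copies' could look like (the hostname and paths here are made up):

# Each institution publishes a checksum manifest of its copy of the archive
sha256sum archive/* > our-manifest.sha256

# Periodically fetch another institution's manifest and compare
curl -s https://other-archive.example.org/manifest.sha256 -o their-manifest.sha256
diff our-manifest.sha256 their-manifest.sha256 && echo "Copies agree" || echo "Discrepancy - investigate!"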

As hinted at earlier, BitTorrent does this well, and multiple archives are already using torrents, e.g.:

https://archive.org/details/BitTorrent https://academictorrents.com/

Note that BitTorrent doesn't use hash chains, but you could easily store previous versions' torrent hashes in subsequent torrents if the history of the archive were particularly important. BitTorrent itself is more about enabling the distribution of data than archiving it.

So is blockchain basically useless for archiving historical data?

Ahem. Unless you want to accidentally sell the rights to edit your archive for cash, I would say 'yes'.

Blockchains solve the more specific problem of 'how do we agree on new data changes when we don't trust each other?'. But if you have a known set of trusted parties who can simply compare their copies, the complex consensus mechanism (e.g. proof of work) becomes completely unnecessary (climate-burning) overhead.

What's more, blockchain becomes potentially problematic:

  • We don't want unknown actors being able to propose changes - this is fine in a cryptocurrency like bitcoin, where anyone is allowed to 'send money' because the others never object to receiving it!
  • The computational overhead is wasteful - newer consensus algorithms (e.g. proof of stake) are huge improvements, but they will always be more wasteful than a plain hash chain, and they introduce more weirdness of their own
  • Archives need to be able to make corrections when errors are indeed found
  • The complexity of blockchain systems might actually increase the risk of future inaccessibility

In the original proof-of-work blockchain, you also have the issue that someone could get control over future changes to the archive simply by buying enough computing power and mining nodes (this is known as the 51% attack). It would be the equivalent of someone setting up multiple historical archives full of fictitious data, and the world's consensus switching to those archives because there are simply - more of them.

What's even worse is that some of the crazier in-the-wild blockchain projects incorporate 'token-based governance', where:

  • Voting power is proportional to token ownership
  • Major decisions and changes are made by token holder votes
  • Anyone can acquire tokens and thus voting power

This would clearly be disastrous for historical archives: truth shouldn't be determined by wealth, and professional expertise and academic credentials shouldn't become irrelevant.

It would result in a crazy world where wealthy individuals or groups could gain control over historical narratives.

But what about 'Proof of Authority'?

To use blockchain sensibly for historical records, you might be able to use 'Proof of Authority', which by definition removes the main feature of blockchain - the ability to operate without an authority. It's like building a submarine with wheels - if you're only going to drive it on roads, why not just use a car?

Blockchain only becomes (potentially) useful when you don't have trusted people/institutions holding the data. And if you don't trust people, you then need a reputation system to coordinate activity and to prevent e.g. Sybil attacks.

Ok, I won't use blockchain, what else can I use?

It will depend quite a bit on your scenario, but if you're looking to implement secure historical archiving with data integrity verification, consider exploring PostgreSQL with the pgcrypto extension, which allows you to implement cryptographic hashing and maintain audit logs of all changes. For a more specialized solution, Immudb is also worth investigating - it's a lightweight, immutable database specifically designed for cryptographic verification of data changes using Merkle trees. Both options are open-source and can be combined with distributed storage solutions like BitTorrent for redundancy. The key is to focus on the core requirements: cryptographic verification, audit trails, and distributed copies.

Combining Ancestry and 23andMe Data with AWK

I'm a big fan of learning about health and genetics, and for me there is no better site than Genetic Lifehacks, which has a ton of well-written, accessible and detailed articles. With membership, it also allows you to connect your own genetic data to see whether various genetic factors apply to you.

I originally just had 23andMe data, but in order to get more SNPs, I also took the AncestryDNA test. There is overlap in coverage between the two tests, but each also has unique SNPs (single nucleotide polymorphisms) for various traits.

The first thing I wanted to do on Genetic Lifehacks was to combine the files from both services for a more comprehensive coverage in the articles. However, I encountered a few issues:

  1. Sometimes, the site would have an entry for an SNP, but the data would be -- or 00, indicating missing data. In some cases, the data was only missing in one dataset, but present in the other.
  2. There were some SNPs that had different values in the two datasets. In fact, I found 27 differences out of a possible 165,406 SNPs! While that's a small percentage, I wanted to catch them, especially if an important SNP is involved.

I've developed a slightly more in-depth process to combine the two datasets. For this project, I renamed my AncestryDNA data to ancestry.txt and my 23andMe data to 24andme.txt, and removed the comments and header from each file.


Step 1: Convert the AncestryDNA File

The AncestryDNA file had each allele in a separate column, so following the main guide I used this awk command to combine those columns into a single one for easier combination with 23andMe:

awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' ancestry.txt > ancestrycombined.txt

This command prints the first five columns (rsID, chromosome, position and the two alleles), combining the two allele columns (4th and 5th) into one and producing a clean four-column output in ancestrycombined.txt.
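
To illustrate with a made-up row (the rsID and genotype here are invented, not real data), a tab-separated AncestryDNA line with two allele columns:

rs0000001	1	123456	A	G

becomes a single-genotype line matching the 23andMe layout:

rs0000001	1	123456	AG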


Step 2: Clean Up the Data (Line Endings and Missing SNPs)

Next, I had to convert both files to Unix line-end format (since they contained Windows-style carriage return characters) and strip out any rows where the alleles were missing (-- or 00):

tr -d '\r' < ancestrycombined.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > ancestryclean.txt
tr -d '\r' < 24andme.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > 24andmeclean.txt
  • tr -d '\r' removes any carriage return (\r) characters from the file.
  • The awk command filters out any lines where the fourth column contains -- or 00.

This left me with two clean datasets: ancestryclean.txt and 24andmeclean.txt.


Step 3: Combine the Cleaned Datasets

I then combined the two cleaned datasets with cat:

cat 24andmeclean.txt ancestryclean.txt > combined.txt

This produced a single file (combined.txt) containing data from both services.


Step 4: Sort Alleles in Alphabetical Order

In some cases, the alleles were listed in different orders between the two datasets (e.g., one dataset might list CT while the other lists TC). To make sure I wouldn't mistakenly label these as different, I sorted the alleles alphabetically using awk:

awk -F'\t' '{
    split($4, chars, "");  # Split the 4th column (alleles) into an array of characters
    n = length(chars);     # Get the number of characters
    for (i = 1; i < n; i++) {      # Bubble sort to order the characters alphabetically
        for (j = i + 1; j <= n; j++) {
            if (chars[i] > chars[j]) {
                temp = chars[i];
                chars[i] = chars[j];
                chars[j] = temp;
            }
        }
    }
    sorted = "";            # Join the sorted characters back together
    for (i = 1; i <= n; i++) {
        sorted = sorted chars[i];
    }
    $4 = sorted;            # Replace the 4th column with the sorted string
    print $0;
}' OFS='\t' combined.txt > sorted_combined.txt

This ensured that alleles like CT and TC were treated the same across both datasets, producing a file sorted_combined.txt.


Step 5: Remove Duplicate Rows

Next, I needed to remove any duplicate SNP entries where the rsID and alleles were identical:

awk '!seen[$1$4]++' sorted_combined.txt > deduplicated.txt

This command checks for duplicates based on the rsID ($1) and allele ($4), printing each unique combination only once.


Step 6: Identify and Combine Mismatches

The final step was to identify any SNPs where the rsID was the same but the alleles differed between the two datasets. I also wanted to combine these mismatched alleles into a single field with both values separated by a |, like CT|TT. This is not an ideal solution, and for my Genetic Lifehacks file I might just remove these or replace them with ??:

awk -F'\t' '{
    key = $1;              # Set key as the rsID (1st column)
    if (key in seen) {      # If we've seen this rsID before
        if (seen[key] != $4) {  # And the allele differs
            print $0"\t"seen[key]"|"$4;  # Print both alleles, separated by a pipe
        }
    } else {
        seen[key] = $4;     # Otherwise, store the first allele
    }
}' deduplicated.txt > mismatches.txt

This script finds SNPs with different alleles for the same rsID and prints both versions, saving the result to mismatches.txt.


Results

After following this process, I found 27 mismatches between the two datasets out of 165,406 SNPs. While that’s a tiny percentage, it does show the variability in raw data between testing services and the importance of verifying your data if there is something very important coming up.


Bash script that does everything

#!/bin/bash

# Check if two arguments are provided
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 24andme.txt ancestry.txt"
    exit 1
fi

# Input files
file_24andme="$1"
file_ancestry="$2"

# Output files
combined="combined.txt"
sorted_combined="sorted_combined.txt"
deduplicated="deduplicated.txt"
mismatches="mismatches.txt"

# Step 1: Combine the two files after converting to Unix format and removing missing data
echo "Converting and cleaning files..."

awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' "$file_ancestry" > ancestrycombined.txt
tr -d '\r' < ancestrycombined.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > ancestry_clean.txt
tr -d '\r' < "$file_24andme" | awk -F'\t' '{if (length($4) == 1) $4 = $4$4; print $0}' OFS='\t' | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > 24andme_clean.txt

# Combine the two cleaned files
echo "Combining files..."
cat 24andme_clean.txt ancestry_clean.txt > "$combined"

# Step 2: Sort alleles in alphabetical order
echo "Sorting alleles in alphabetical order..."
awk -F'\t' '{
    split($4, chars, "");  # Split the 4th column into an array of characters
    n = length(chars);     # Get the number of characters
    for (i = 1; i < n; i++) {      # Bubble sort to order the characters alphabetically
        for (j = i + 1; j <= n; j++) {
            if (chars[i] > chars[j]) {
                temp = chars[i];
                chars[i] = chars[j];
                chars[j] = temp;
            }
        }
    }
    sorted = "";            # Join the sorted characters back together
    for (i = 1; i <= n; i++) {
        sorted = sorted chars[i];
    }
    $4 = sorted;            # Replace the 4th column with the sorted string
    print $0;
}' OFS='\t' "$combined" > "$sorted_combined"

# Step 3: Remove duplicate SNPs where both the rsID and alleles are the same
echo "Removing duplicates..."
awk '!seen[$1$4]++' "$sorted_combined" > "$deduplicated"

# Step 4: Identify mismatches and combine alleles into one field (e.g., CT|TT)
echo "Identifying mismatches..."
awk -F'\t' '{
    key = $1;              # Set key as the rsID (1st column)
    if (key in seen) {      # If we have seen this rsID before
        if (seen[key] != $4) {  # And the allele differs
            print $0"\t"seen[key]"|"$4;  # Print both alleles, separated by a pipe
        }
    } else {
        seen[key] = $4;     # Otherwise, store the first allele
    }
}' "$deduplicated" > "$mismatches"

# Final output
echo "Process complete!"
echo "Results:"
echo " - Deduplicated file: $deduplicated"
echo " - Mismatches file: $mismatches"

rm ancestrycombined.txt ancestry_clean.txt 24andme_clean.txt "$combined" "$sorted_combined"

cat "$mismatches"

The deduplicated.txt file is the one I now use in geneticlifehacks. Let me know if you find any errors or if this helped you!

When to use React... maybe never?

I've used React for many projects in my career; it's a framework I'm quite comfortable with, can develop in quickly, and I do like building things with it! But in every single case I've doubted whether it was the right choice, and I've often ended up believing it slowed development down by making the project tooling over-complex.

Recently at work I have been debating with my team about whether to use React for a new project. I've been arguing against it, and have compiled a list of links to articles that make the case well:

https://adactio.com/journal/20837

https://joshcollinsworth.com/blog/self-fulfilling-prophecy-of-react

https://johan.hal.se/wrote/2024/01/24/concatenating-text/

https://begin.com/blog/posts/2024-01-26-removing-react-is-just-weakness-leaving-your-codebase

https://cassidoo.co/post/annoyed-at-react/

https://macwright.com/2024/01/03/miffed-about-react

https://dev.to/matfrana/react-where-are-you-going-5284

So it turns out PHP rules the web?

PHP is often the butt of jokes online, but I've always thought that quite unfair. It turns out I am certainly not the only one, as outlined in the article 'The Internet of PHP', where I learnt that PHP powers 77.2% of the top 10 million website backends, with the next contender being ASP.NET at a measly 6.9%. A large amount of this is WordPress (obviously), but either way it's a huge amount of the web, and to me the article's summary captures why:

There are languages that are even faster (Rust), have an even larger community (Node.js), or have more mature compilers (Java); but that tends to trade other values.

PHP hits a certain Goldilocks sweetspot. It is pretty fast, has a large community for productivity, features modern syntax, is actively developed, easy to learn, easy to scale, and has a large standard library. It offers high and safe concurrency at scale, yet without async complexity or blocking a main thread. It also tends to carry low maintenance cost due to a stable platform, and through a community that values compatibility and low dependency count. You will have different needs at times, of course, but for this particular sweetspot, PHP stands among very few others.

There is another thing unmentioned here: because of this large community, stability and ease of use, PHP is also well covered by ChatGPT and GitHub Copilot. The training data is clearly well stocked with popular frameworks like Laravel and even libraries such as Filament. I've found the AI pair-programming fluency with PHP to be of higher quality than with Node or Python - perhaps because the language has undergone fewer changes in recent years.

For me, PHP is a good choice for static sites requiring a CMS, blogs, and enterprise-style software (e.g. anything involving dashboards, tables etc). It's not always the best choice for these projects (h/t Python with Django), but it's certainly a worthy contender. That being said, I was still slightly surprised at just how insanely popular it is!

The way things are going...

I have now been using ChatGPT and GitHub Copilot for coding for three months, and I just have to say, some days I am breezing through tasks that would otherwise take me ages. For me it's perfect, as they excel at the tasks I actually struggle with - boring boilerplate, repetitive, data-entry type things - leaving me more mental space to think about the big picture and stay organised.

The thing is, in many cases I still have to tweak or alter the code - both of the above tools are essentially super-charged autocomplete, which mostly just speeds up my typing. They still need me to direct how the code fits together, and to read and correct mistakes.

Soon I may have a project using the low-code framework Oracle APEX, and I've recently spent some time watching tutorials and examples of projects built with it. I'm pretty excited about this, as I can definitely see how quickly certain types of project could be built with it. Still, it's not 'no-code' - a decent knowledge of REST APIs, SQL and CSS seems to be a must for anything but the most basic app.

Sometimes I still wonder (and slightly worry), as primarily a web developer, how much actual coding I will be doing on a daily basis in, say, 2-3 years' time. The way things are going, it might be very little, and at some point there may be professional app builders who know very little code - but I still find it hard to believe that they wouldn't be at a disadvantage in some way. Coding is such a rewarding activity that it would be sad for the skill to become obsolete, but building stuff quickly and well is also great - I guess we will see!