Blog posts

Welcome to my personal programming blog. If programming isn't interesting to you, you might want to skip this one! If you are really keen, you can see an archive of all my blog posts.

I also post regularly on my stream, which is a microblog of thoughts, photos, reminders, etc.

Combining Ancestry and 23andMe Data with AWK

I'm a big fan of learning about health and genetics, and for me, there is no better site than Genetic Lifehacks, which has a ton of well-written, accessible, and detailed articles. With membership, you can also connect your own genetic data to see whether various genetic factors apply to you.

I originally just had 23andMe data, but in order to get more SNPs, I also took the AncestryDNA test. There is overlap in coverage between the two tests, but each also has unique SNPs (single nucleotide polymorphisms) for various traits.

The first thing I wanted to do on Genetic Lifehacks was to combine the files from both services for more comprehensive coverage in the articles. However, I encountered a few issues:

  1. Sometimes, the site would have an entry for an SNP, but the data would be -- or 00, indicating missing data. In some cases, the data was only missing in one dataset, but present in the other.
  2. There were some SNPs that had different values in the two datasets. In fact, I found 27 differences out of a possible 165,406 SNPs! While that's a small percentage, I wanted to catch them, especially if an important SNP is involved.

I've developed a slightly more in-depth process to combine the two datasets. For this project, I renamed my Ancestry data as ancestry.txt and my 23andMe data as 24andme.txt, and removed the comments and headers from each file.


Step 1: Convert the AncestryDNA File

The AncestryDNA file had each allele in a separate column, so, following the main guide, I used this awk command to merge those columns into one for easier comparison with the 23andMe data:

awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' ancestry.txt > ancestrycombined.txt

This command keeps all five columns (rsID, chromosome, position, and the two alleles) and merges the two allele columns (4th and 5th) into one, producing a clean four-column output in ancestrycombined.txt.
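For example (with a made-up row, since the real files contain personal data), a tab-separated line like this in ancestry.txt:

rs4477212	1	82154	A	A

becomes this in ancestrycombined.txt:

rs4477212	1	82154	AA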


Step 2: Clean Up the Data (Line Endings and Missing SNPs)

Next, I had to convert both files to Unix line endings (they contained Windows-style carriage return characters) and strip out any rows where the alleles were missing (-- or 00):

tr -d '\r' < ancestrycombined.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > ancestryclean.txt
tr -d '\r' < 24andme.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > 24andmeclean.txt
  • tr -d '\r' removes any carriage return (\r) characters from the file.
  • The awk command filters out any lines where the fourth column contains -- or 00.

This left me with two clean datasets: ancestryclean.txt and 24andmeclean.txt.
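As an optional sanity check, comparing line counts before and after cleaning shows how many no-call rows were dropped:

wc -l ancestrycombined.txt ancestryclean.txt
wc -l 24andme.txt 24andmeclean.txt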


Step 3: Combine the Cleaned Datasets

I then combined the two cleaned datasets with cat:

cat 24andmeclean.txt ancestryclean.txt > combined.txt

This produced a single file (combined.txt) containing data from both services.


Step 4: Sort Alleles in Alphabetical Order

In some cases, the alleles were listed in different orders between the two datasets (e.g., one dataset might list CT while the other lists TC). To make sure I wouldn't mistakenly label these as different, I sorted the alleles alphabetically using awk:

awk -F'\t' '{
    split($4, chars, "");  # Split the 4th column (alleles) into an array of characters
    n = length(chars);     # Get the number of characters
    for (i = 1; i < n; i++) {      # Bubble sort to order the characters alphabetically
        for (j = i + 1; j <= n; j++) {
            if (chars[i] > chars[j]) {
                temp = chars[i];
                chars[i] = chars[j];
                chars[j] = temp;
            }
        }
    }
    sorted = "";            # Join the sorted characters back together
    for (i = 1; i <= n; i++) {
        sorted = sorted chars[i];
    }
    $4 = sorted;            # Replace the 4th column with the sorted string
    print $0;
}' OFS='\t' combined.txt > sorted_combined.txt

This ensured that alleles like CT and TC were treated the same across both datasets, producing a file sorted_combined.txt.
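As an aside, since a genotype here is at most two characters, a full sort is arguably overkill. A minimal alternative (a sketch, assuming every allele field is one or two characters) would simply swap the pair when it is out of order:

awk -F'\t' '{
    if (length($4) == 2 && substr($4, 1, 1) > substr($4, 2, 1)) {
        # Swap the two characters so the pair is in alphabetical order
        $4 = substr($4, 2, 1) substr($4, 1, 1);
    }
    print $0;
}' OFS='\t' combined.txt > sorted_combined.txt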


Step 5: Remove Duplicate Rows

Next, I needed to remove any duplicate SNP entries where the rsID and alleles were identical:

awk '!seen[$1$4]++' sorted_combined.txt > deduplicated.txt

This command checks for duplicates based on the rsID ($1) and allele ($4), printing each unique combination only once.
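One caveat worth knowing about: concatenating $1 and $4 with no separator means two different pairs could, in theory, produce the same key. That is unlikely with numeric rsIDs, but using awk's multi-dimensional subscript syntax inserts a separator (SUBSEP) and makes the key unambiguous:

awk '!seen[$1, $4]++' sorted_combined.txt > deduplicated.txt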


Step 6: Identify and Combine Mismatches

The final step was to identify any SNPs where the rsID was the same but the alleles differed between the two datasets. I also wanted to combine these mismatched alleles into a single field with both values separated by a |, like CT|TT. This is not an ideal solution, and for the file I upload to Genetic Lifehacks, I might just remove these or replace them with ??:

awk -F'\t' '{
    key = $1;              # Set key as the rsID (1st column)
    if (key in seen) {      # If we've seen this rsID before
        if (seen[key] != $4) {  # And the allele differs
            print $0"\t"seen[key]"|"$4;  # Print both alleles, separated by a pipe
        }
    } else {
        seen[key] = $4;     # Otherwise, store the first allele
    }
}' deduplicated.txt > mismatches.txt

This script finds SNPs with different alleles for the same rsID and prints both versions, saving the result to mismatches.txt.
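For illustration (with made-up values), a line in mismatches.txt is the second-seen row with both conflicting genotypes appended in a new column:

rs1042522	17	7676154	CG	GG|CG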


Results

After following this process, I found 27 mismatches between the two datasets out of 165,406 SNPs. While that's a tiny percentage, it does show the variability in raw data between testing services, and the importance of verifying your data when an important SNP is involved.


Bash script that does everything

#!/bin/bash

# Check if two arguments are provided
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 24andme.txt ancestry.txt"
    exit 1
fi

# Input files
file_24andme="$1"
file_ancestry="$2"

# Output files
combined="combined.txt"
sorted_combined="sorted_combined.txt"
deduplicated="deduplicated.txt"
mismatches="mismatches.txt"

# Step 1: Combine the two files after converting to Unix format and removing missing data
echo "Converting and cleaning files..."

awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' $file_ancestry > ancestrycombined.txt
tr -d '\r' < ancestrycombined.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > ancestry_clean.txt
tr -d '\r' < "$file_24andme" | awk -F'\t' '{if (length($4) == 1) $4 = $4$4; print $0}' OFS='\t' | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > 24andme_clean.txt

# Combine the two cleaned files
echo "Combining files..."
cat 24andme_clean.txt ancestry_clean.txt > "$combined"

# Step 2: Sort alleles in alphabetical order
echo "Sorting alleles in alphabetical order..."
awk -F'\t' '{
    split($4, chars, "");  # Split the 4th column into an array of characters
    n = length(chars);     # Get the number of characters
    for (i = 1; i < n; i++) {      # Bubble sort to order the characters alphabetically
        for (j = i + 1; j <= n; j++) {
            if (chars[i] > chars[j]) {
                temp = chars[i];
                chars[i] = chars[j];
                chars[j] = temp;
            }
        }
    }
    sorted = "";            # Join the sorted characters back together
    for (i = 1; i <= n; i++) {
        sorted = sorted chars[i];
    }
    $4 = sorted;            # Replace the 4th column with the sorted string
    print $0;
}' OFS='\t' "$combined" > "$sorted_combined"

# Step 3: Remove duplicate SNPs where both the rsID and alleles are the same
echo "Removing duplicates..."
awk '!seen[$1$4]++' "$sorted_combined" > "$deduplicated"

# Step 4: Identify mismatches and combine alleles into one field (e.g., CT|TT)
echo "Identifying mismatches..."
awk -F'\t' '{
    key = $1;              # Set key as the rsID (1st column)
    if (key in seen) {      # If we have seen this rsID before
        if (seen[key] != $4) {  # And the allele differs
            print $0"\t"seen[key]"|"$4;  # Print both alleles, separated by a pipe
        }
    } else {
        seen[key] = $4;     # Otherwise, store the first allele
    }
}' "$deduplicated" > "$mismatches"

# Final output
echo "Process complete!"
echo "Results:"
echo " - Deduplicated file: $deduplicated"
echo " - Mismatches file: $mismatches"

rm ancestrycombined.txt ancestry_clean.txt 24andme_clean.txt "$combined" "$sorted_combined"

cat "$mismatches"

The deduplicated.txt file is the one I now use on Genetic Lifehacks. Let me know if you find any errors or if this helped you!

When to use React... maybe never?

I've used React on many projects in my career; it's a framework I'm quite comfortable with, can develop in quickly, and genuinely enjoy building things with. But in every single case I've doubted whether it was the right choice, and I've often come to believe it slowed development down by making the project tooling over-complex.

Recently at work I have been debating with my team whether to use React for a new project. I've been arguing against it, and have compiled a list of links to articles that make the case well:

https://adactio.com/journal/20837

https://joshcollinsworth.com/blog/self-fulfilling-prophecy-of-react

https://johan.hal.se/wrote/2024/01/24/concatenating-text/

https://begin.com/blog/posts/2024-01-26-removing-react-is-just-weakness-leaving-your-codebase

https://cassidoo.co/post/annoyed-at-react/

https://macwright.com/2024/01/03/miffed-about-react

https://dev.to/matfrana/react-where-are-you-going-5284

So it turns out PHP rules the web?

PHP is often the butt of jokes online, but I've always thought that quite unfair. It turns out I'm certainly not the only one, as outlined in the article 'The Internet of PHP', where I learned that PHP powers 77.2% of the top 10 million website backends, with the next contender being ASP.NET at a measly 6.9%. A large amount of this is WordPress (obviously), but either way, it's a huge share of the web, and to me the article's summary captures why:

There are languages that are even faster (Rust), have an even larger community (Node.js), or have more mature compilers (Java); but that tends to trade other values.

PHP hits a certain Goldilocks sweetspot. It is pretty fast, has a large community for productivity, features modern syntax, is actively developed, easy to learn, easy to scale, and has a large standard library. It offers high and safe concurrency at scale, yet without async complexity or blocking a main thread. It also tends to carry low maintenance cost due to a stable platform, and through a community that values compatibility and low dependency count. You will have different needs at times, of course, but for this particular sweetspot, PHP stands among very few others.

There is another thing unmentioned here: because of this large community, stability, and ease of use, PHP is also well practised by ChatGPT and GitHub Copilot. The training data is well stocked with popular frameworks like Laravel and even libraries such as Filament. I've found ChatGPT's pair-programming fluency in PHP to be of higher quality than in Node or Python - perhaps because the language has undergone fewer changes in recent years.

For me, PHP is a good choice for static sites requiring a CMS, blogs, and enterprise-style software (e.g. anything involving dashboards, tables, etc.). It's not always the best choice for these projects (h/t Python with Django), but it's certainly a worthy contender. That being said, I was still slightly surprised at how insanely popular it is!

The way things are going...

I have now been using ChatGPT and GitHub Copilot for coding for 3 months, and I just have to say, some days I am breezing through tasks that would otherwise take me ages. For me, it's perfect, as it excels at the tasks I actually struggle with, e.g. boring boilerplate and repetitive, data-entry-type things, leaving me more mental space to think about the big picture as well as stay more organised.

The thing is, in many cases I still have to tweak or alter some code - both of the above tools are essentially super-charged autocomplete, which mostly just speeds up my typing. They still need me to direct how the code fits together, and to read and correct mistakes.

Soon I may have a project using the low-code framework Oracle APEX, so recently I have spent some time watching tutorials and examples of projects built with it. I am pretty excited about this, as I can definitely see how quickly it could be used for certain types of project. Still, it's not 'no-code' - a decent knowledge of REST APIs, SQL, and CSS seems to be a must for anything but the most basic app.

Sometimes I still wonder (and slightly worry), as primarily a web developer, how much actual coding I will be doing on a daily basis in, say, 2-3 years' time. The way things are going, it might be very little, and at some point there may be professional app builders who know very little code - but I still find it hard to believe that they would not be at a disadvantage in some way. Coding is such a rewarding activity, so it would be sad for that skill to become obsolete, but building stuff quickly and well is also great - I guess we will see!

ChatGPT is not coming for my job just yet!

The dust has settled a bit after the wide release of ChatGPT, and my excitement has tempered somewhat. It is capable of some very cool stuff, but its main limitation is accuracy. It makes a lot of mistakes, and it asserts itself so confidently that you have to be very careful when using it. Despite this, I have been using it daily, finding it especially useful for SQL and React and as a general search engine for programming-language questions. It is very good with error messages, for instance, and if stackoverflow.com doesn't instantly give me the answer, often ChatGPT will. It is also surprisingly good at converting data. I turned a hastily written design brief into a well-formatted JSON structure, a task that would otherwise have been 15 minutes of cutting and pasting.

I think it's almost certain that at some stage AI will be able to build full applications and work on them from natural language, but how long that takes is anyone's guess. In the present, I spend more time than in the past wiring up open-source libraries and working in consoles rather than coding, and I mainly expect AI to mean that my work becomes more like that. So in the meantime, whilst I don't think it's coming for my whole job just yet, I am keeping a very close eye on it! 😀