Combining Ancestry and 23andMe Data with AWK
I'm a big fan of learning about health and genetics, and for me there is no better site than Genetic Lifehacks, which has a ton of well-written, accessible, and detailed articles. With a membership, it also lets you connect your own genetic data so each article can show whether a given genetic factor applies to you.
I originally just had 23andMe data, but in order to get more SNPs, I also took the AncestryDNA test. There is overlap in coverage between the two tests, but each also has unique SNPs (single nucleotide polymorphisms) for various traits.
The first thing I wanted to do on Genetic Lifehacks was to combine the files from both services for more comprehensive coverage in the articles. However, I ran into a few issues:
- Sometimes, the site would have an entry for an SNP, but the data would be -- or 00, indicating missing data. In some cases, the data was only missing in one dataset, but present in the other.
- There were some SNPs that had different values in the two datasets. In fact, I found 27 differences out of a possible 165,406 SNPs! While that's a small percentage, I wanted to catch them, especially if an important SNP is involved.
I've developed a slightly more in-depth process to combine the two datasets. For this project, I renamed my AncestryDNA data to ancestry.txt and my 23andMe data to 24andme.txt, and removed the comment lines and header from each file.
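For context, here is roughly what each raw file looks like once the headers are stripped. The rsID, position, and genotypes below are made up for illustration; both services use tab-separated columns, but AncestryDNA splits the genotype across two allele columns while 23andMe keeps it in one:

```
# ancestry.txt (AncestryDNA): rsid, chromosome, position, allele1, allele2
rs123	1	82154	A	G

# 24andme.txt (23andMe): rsid, chromosome, position, genotype
rs123	1	82154	AG
```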
Step 1: Convert the AncestryDNA File
The AncestryDNA file lists each allele in a separate column, so, following the main guide, I used this awk command to merge those columns into one for easier comparison with the 23andMe data:
awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' ancestry.txt > ancestrycombined.txt
This command prints the rsID, chromosome, and position columns unchanged and joins the two allele columns (4th and 5th) into a single genotype column, producing a clean four-column output in ancestrycombined.txt.
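As a quick sanity check, here is the same awk command run on a single made-up AncestryDNA row (the rsID and position are illustrative, not real data):

```shell
printf 'rs123\t1\t82154\tA\tG\n' |
awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}'
# → rs123	1	82154	AG
```

The two allele fields A and G come out merged into the single genotype AG.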
Step 2: Clean Up the Data (Line Endings and Missing SNPs)
Next, I had to convert both files to Unix line-end format (since they contained Windows-style carriage return characters) and strip out any rows where the alleles were missing (-- or 00):
tr -d '\r' < ancestrycombined.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > ancestryclean.txt
tr -d '\r' < 24andme.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > 24andmeclean.txt
- tr -d '\r' removes any carriage return (\r) characters from the file.
- The awk command filters out any lines where the fourth column contains -- or 00.
This left me with two clean datasets: ancestryclean.txt and 24andmeclean.txt.
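You can see the cleaning step in action on a few made-up rows; the first row carries a Windows-style carriage return, and the last two have missing genotypes:

```shell
printf 'rs111\t1\t100\tAG\r\nrs222\t1\t200\t--\nrs333\t1\t300\t00\n' |
tr -d '\r' |
awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }'
# → rs111	1	100	AG
```

Only the row with a real genotype survives, and its stray \r is gone.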
Step 3: Combine the Cleaned Datasets
I then combined the two cleaned datasets with cat:
cat 24andmeclean.txt ancestryclean.txt > combined.txt
This produced a single file (combined.txt) containing data from both services.
Step 4: Sort Alleles in Alphabetical Order
In some cases, the alleles were listed in different orders between the two datasets (e.g., one dataset might list CT while the other lists TC). To make sure I wouldn't mistakenly label these as different, I sorted the alleles alphabetically using awk:
awk -F'\t' '{
    split($4, chars, "");        # Split the 4th column (alleles) into an array of characters
    n = length(chars);           # Get the number of characters
    for (i = 1; i < n; i++) {    # Bubble sort to order the characters alphabetically
        for (j = i + 1; j <= n; j++) {
            if (chars[i] > chars[j]) {
                temp = chars[i];
                chars[i] = chars[j];
                chars[j] = temp;
            }
        }
    }
    sorted = "";                 # Join the sorted characters back together
    for (i = 1; i <= n; i++) {
        sorted = sorted chars[i];
    }
    $4 = sorted;                 # Replace the 4th column with the sorted string
    print $0;
}' OFS='\t' combined.txt > sorted_combined.txt
This ensured that genotypes like CT and TC were treated the same across both datasets, producing the file sorted_combined.txt.
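A condensed version of the same sort on one made-up row shows the normalization (note that splitting on the empty string to get individual characters is a gawk/one-true-awk extension, not strict POSIX, and reassigning $4 makes awk rebuild the record using OFS):

```shell
printf 'rs1\t1\t100\tTC\n' |
awk -F'\t' 'BEGIN{OFS="\t"} {
    n = split($4, c, "");            # split genotype into characters (gawk extension)
    for (i = 1; i < n; i++)          # bubble sort the characters
        for (j = i + 1; j <= n; j++)
            if (c[i] > c[j]) { t = c[i]; c[i] = c[j]; c[j] = t }
    s = ""; for (i = 1; i <= n; i++) s = s c[i];
    $4 = s;                          # rebuilds the record with tab as OFS
    print
}'
# → rs1	1	100	CT
```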
Step 5: Remove Duplicate Rows
Next, I needed to remove any duplicate SNP entries where the rsID and alleles were identical:
awk '!seen[$1$4]++' sorted_combined.txt > deduplicated.txt
This command treats the rsID ($1) plus genotype ($4) as a key and prints only the first row for each unique combination.
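On a small made-up input, the dedup keeps one copy of the repeated rs1 row (this only works because Step 4 already normalized CT/TC ordering, so true duplicates compare equal):

```shell
printf 'rs1\t1\t100\tCT\nrs1\t1\t100\tCT\nrs2\t1\t200\tAA\n' |
awk '!seen[$1$4]++'
# → rs1	1	100	CT
# → rs2	1	200	AA
```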
Step 6: Identify and Combine Mismatches
The final step was to identify any SNPs where the rsID was the same but the alleles differed between the two datasets. I also wanted to combine these mismatched alleles into a single field with both values separated by a |, like CT|TT. This is not an ideal solution, and for the file I upload to Genetic Lifehacks, I might just remove these entries or replace them with ??:
awk -F'\t' '{
    key = $1;                        # Set key as the rsID (1st column)
    if (key in seen) {               # If we've seen this rsID before
        if (seen[key] != $4) {       # And the allele differs
            print $0"\t"seen[key]"|"$4;  # Print both alleles, separated by a pipe
        }
    } else {
        seen[key] = $4;              # Otherwise, store the first allele
    }
}' deduplicated.txt > mismatches.txt
This script finds SNPs with different alleles for the same rsID and prints both versions, saving the result to mismatches.txt.
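Run on a made-up input where rs1 has conflicting genotypes, the mismatch check flags only that SNP:

```shell
printf 'rs1\t1\t100\tCT\nrs1\t1\t100\tTT\nrs2\t1\t200\tAA\n' |
awk -F'\t' '{
    key = $1;
    if (key in seen) {
        if (seen[key] != $4) print $0"\t"seen[key]"|"$4;
    } else seen[key] = $4;
}'
# → rs1	1	100	TT	CT|TT
```

rs2 appears only once, so it never triggers a mismatch line.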
Results
After following this process, I found 27 mismatches between the two datasets out of 165,406 SNPs. While that's a tiny percentage, it does show the variability in raw data between testing services, and it underlines the importance of double-checking your raw data before acting on any important result.
Bash Script That Does Everything
#!/bin/bash
# Check if two arguments are provided
if [ "$#" -ne 2 ]; then
echo "Usage: $0 24andme.txt ancestry.txt"
exit 1
fi
# Input files
file_24andme="$1"
file_ancestry="$2"
# Output files
combined="combined.txt"
sorted_combined="sorted_combined.txt"
deduplicated="deduplicated.txt"
mismatches="mismatches.txt"
# Step 1: Combine the two files after converting to Unix format and removing missing data
echo "Converting and cleaning files..."
awk 'BEGIN {FS="\t"};{print $1"\t"$2"\t"$3"\t"$4 $5}' "$file_ancestry" > ancestrycombined.txt
tr -d '\r' < ancestrycombined.txt | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > ancestry_clean.txt
# 23andMe reports a single letter for hemizygous calls (e.g., X/Y/MT); double it so both files use two-letter genotypes
tr -d '\r' < "$file_24andme" | awk -F'\t' '{if (length($4) == 1) $4 = $4$4; print $0}' OFS='\t' | awk -F'\t' '$4 != "--" && $4 != "00" { print $0 }' OFS='\t' > 24andme_clean.txt
# Combine the two cleaned files
echo "Combining files..."
cat 24andme_clean.txt ancestry_clean.txt > "$combined"
# Step 2: Sort alleles in alphabetical order
echo "Sorting alleles in alphabetical order..."
awk -F'\t' '{
    split($4, chars, "");        # Split the 4th column into an array of characters
    n = length(chars);           # Get the number of characters
    for (i = 1; i < n; i++) {    # Bubble sort to order the characters alphabetically
        for (j = i + 1; j <= n; j++) {
            if (chars[i] > chars[j]) {
                temp = chars[i];
                chars[i] = chars[j];
                chars[j] = temp;
            }
        }
    }
    sorted = "";                 # Join the sorted characters back together
    for (i = 1; i <= n; i++) {
        sorted = sorted chars[i];
    }
    $4 = sorted;                 # Replace the 4th column with the sorted string
    print $0;
}' OFS='\t' "$combined" > "$sorted_combined"
# Step 3: Remove duplicate SNPs where both the rsID and alleles are the same
echo "Removing duplicates..."
awk '!seen[$1$4]++' "$sorted_combined" > "$deduplicated"
# Step 4: Identify mismatches and combine alleles into one field (e.g., CT|TT)
echo "Identifying mismatches..."
awk -F'\t' '{
    key = $1;                        # Set key as the rsID (1st column)
    if (key in seen) {               # If we have seen this rsID before
        if (seen[key] != $4) {       # And the allele differs
            print $0"\t"seen[key]"|"$4;  # Print both alleles, separated by a pipe
        }
    } else {
        seen[key] = $4;              # Otherwise, store the first allele
    }
}' "$deduplicated" > "$mismatches"
}' "$deduplicated" > "$mismatches"
# Final output
echo "Process complete!"
echo "Results:"
echo " - Deduplicated file: $deduplicated"
echo " - Mismatches file: $mismatches"
rm ancestrycombined.txt ancestry_clean.txt 24andme_clean.txt "$combined" "$sorted_combined"
cat "$mismatches"
The deduplicated.txt file is the one I now use on Genetic Lifehacks. Let me know if you find any errors or if this helped you!