Latest blog posts

Data preprocessing tasks in big projects

There is a common source of friction (and anxiety) in psychology projects with large datasets: the pipeline that takes data from the data collection stage, wherever or whatever that may be, into its final form for statistical analysis. Frequently the statisticians are not given enough information, or don't have the tools, to process all the idiosyncrasies in the data, or the data collectors do not know how to present the data to the statisticians in the right way.

The data collector wishes to leave no stone unturned in collecting all the necessary data, which can involve reacting quickly and improvising when obstacles arise. But the statistician needs data that is reliably organised, transparent, documented and correct.

The way to solve this is to have a data preprocessor. Data preprocessing is difficult because it requires a) general programming skills, b) knowledge of statistical analysis, and c) knowledge of the data itself. Sadly, it is also an under-appreciated role (and can be quite gruelling!)

Data preprocessing essentially involves performing the organising and patching necessary to produce files that can go straight into statistical analysis software packages. Anything that a statistics package would find unwieldy, or that should be automated, is part of the preprocessing step.

Tasks

The following tasks are what I've found to make up the major part of data preprocessing:

  1. Organise and store iterations of the raw data, in a documented form that can be transparently traced back to the original source.

  2. Proactively document and patch discrepancies in the data, which, amongst a host of other things, can include:

    • Missing or incorrect ids
    • Incorrectly labelled data
    • Multiple versions of data collection instruments
    • Missing variables
    • Instrument coding errors
    • Document and filter irrelevant data (e.g. test data)
  3. Rename and organise variables and document this.

    The variables at the data collection stage may not be the best for the analysis stage. Frequently in the data collection context, lengthy variable names with prefixes (e.g. in a SQL database) will necessarily be used, or the data collector may not have named a variable correctly or descriptively enough.

  4. Perform merges:

    • Merge similar files into single indexed large files: for instance, an experiment may generate a file per participant, with each row representing a trial. But for analysis a single file is needed, so the files need to be merged and an index column added (see the sketch after this list).
    • Merge tables to create a new table for a specific analysis. It's often not necessary to bring everything together into one giant table, but if the dataset is mostly complete, up to a certain scale that can be the goal. For large, often incomplete datasets, it's better to have a top-down process guide this, whereby the statistician works backwards from the required analysis to work out the precise variables needed in a frame of data, which is then generated and labelled accordingly.
  5. Perform aggregations and transformations:

    Sometimes the data involves the calculation of established scores, or transformation from formats like JSON, or the analysis requires means, medians, standard deviations etc. These can be a task for the statistician, but most often they are better subsumed under the data preprocessor role, especially if the aggregations involve any kind of unusual transformation (e.g. the calculation of CIEDE2000 distance scores from RGB values). Again, the rule of thumb is that if it cannot be easily accomplished in statistics software, it should probably be a preprocessing step.
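
To make the above concrete, here is a minimal sketch in Python with pandas (R would work equally well). The file layout, column names and ID patches are all invented, purely for illustration:

import glob

import pandas as pd

# Task 2: known discrepancies, patched in one documented place.
# (These IDs and reasons are made up for illustration.)
ID_FIXES = {
    "P07 ": "P07",  # trailing space from manual data entry
    "p12": "P12",   # inconsistent capitalisation
}

# Task 4: merge one-file-per-participant raw data into a single table.
frames = []
for path in sorted(glob.glob("raw/participant_*.csv")):
    df = pd.read_csv(path)
    df["source_file"] = path  # Task 1: every row traceable to its raw file
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

# Task 2: apply the documented ID patches and filter out test data.
data["participant_id"] = data["participant_id"].replace(ID_FIXES)
data = data[data["participant_id"] != "TEST"]

# Task 3: rename collection-stage variables to analysis-friendly names.
data = data.rename(columns={"qx_rt_ms": "reaction_time_ms"})

# Task 5: a simple aggregation - mean reaction time per participant.
summary = (data.groupby("participant_id")["reaction_time_ms"]
               .mean()
               .reset_index())

data.to_csv("processed/trials.csv", index=False)
summary.to_csv("processed/participant_means.csv", index=False)

The point is that every patch, rename and filter lives in one documented place, so any number in the final analysis can be traced back to the raw files.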

How to do it

Communication: In an ideal world the statistician would be explicit up front about how they would like to receive the data, and which variables they need. This will guide the process, but at the very least there should be an open communication channel between the two roles. In addition, the data preprocessor should preferably advise data collectors on how to produce raw data that is easiest to work with, if they can...

How should data preprocessing itself be done? Ideally it should be automated. But most importantly, it needs to be documented and transparent, and done in such a way that after a year you can go back and figure out very quickly where a number has come from! The need for automation and flexibility means using a fully fledged programming language with strong support for data manipulation: so essentially, Python with Pandas, or R. The environment and language should allow:

  1. Text processing/cleaning - I have seen this cause difficulty, because accomplished statisticians can lack basic text programming skills (see the short example after this list)
  2. Streamlined loading of table data in different formats
  3. Merging tables (lengthwise and by key)
  4. Use of version control and diff merges (e.g. for pre-processing code)
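
On the first point, a couple of lines of pandas show the kind of routine text cleaning involved - a minimal sketch, with invented messy IDs and column names:

import pandas as pd

# Hypothetical messy participant IDs straight from data collection.
df = pd.DataFrame({"id": [" P01", "p02 ", "P03"]})

# Strip stray whitespace and normalise case in one vectorised pass.
df["id"] = df["id"].str.strip().str.upper()

# Pull out the numeric part with a regular expression when needed.
df["id_num"] = df["id"].str.extract(r"(\d+)", expand=False).astype(int)

Vectorised string operations like these are exactly the sort of thing that is unwieldy inside most statistics packages, which is why they belong in the preprocessing step.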

Summary

The effort and importance of the data preprocessing role is often not immediately apparent, especially at the start of projects. So, when you sense complicated data, it's best to plan ahead for the likely tasks of this role and to assign it to someone who has the confidence to take it on. When it's done well, nobody will notice, and perhaps it takes experiencing it being done badly to appreciate it being done well! Up-front planning for this work will definitely reduce potential sources of friction, and give a sense of streamlined flow between messy data collection and rigorous analysis.

New year, new webserver at nearlyfreespeech.net

I've been using nearlyfreespeech.net for a while for a side project, and also for ad-hoc sites that are needed from time to time during work. I have grown to really like its minimal, clutter-free setup. So this year I've decided to move my site over, away from AWS. The costs are about the same, and it feels better not to be giving any money to Amazon - they have enough.

The migration took about 5 minutes and went very smoothly, with the bulk of the time spent just trying to remember how to do LetsEncrypt on nearlyfreespeech. Ahh, the advantages of a no-dependency website. Smug mode.

Having given up Twitter (apart from some key accounts as RSS feeds via nitter), I am aiming to post more on this site in the coming year!

Reducing the toil of executing one off code on a remote server with SSH and SCP

This is a little Linux tip that I've started using quite frequently on one of my projects: how to transfer a script and execute it with ssh. I use it when I need to execute a script on a remote server as a one-off - say, to build a particular data output file or perform a one-off database procedure.

The issue is that I often need to tweak and experiment, but because the code is remote I don't have all my local tools available, and transferring the code with git commits is clunky. This is not an everyday occurrence: typically I am transferring the main codebase to remote servers with a pipeline using git. But if a one-off script needs to be developed iteratively on (e.g.) a remote staging/debugging environment, and it doesn't belong in the repo long term, then it would clog up the commit history with runtime bugs, 'Added semicolon' commits etc. Generally, the tweak, test, tweak method of working on a script is not what git repositories are for!

So, to achieve this quickly, there are a number of existing workflows that I have used, all of which involve a certain amount of toil.

  1. Fire up a local dev environment, develop and test the script on the local machine, then upload the script to the remote environment and run it.
  2. If there is no local dev environment, or setting one up is costly, develop directly on the remote debug server with vim and the console.
  3. If vim development is at a distinct disadvantage to an IDE, use scp to transfer code changes to the remote debug server.

I would love to get better at vim, but having spent years using IDEs, it's difficult to give up on my finely tuned mouse skills and keyboard shortcuts for navigating through codebases. I learn a bit more vim each time I use it, but I'm not 'there' yet.

So typically I was using option 3, which involves toil: e.g. if I'm developing a PHP script and make a syntax error, I have to go through the whole manual copy-to-server process just for a tiny change.

My aha moment was realising I could create a general-purpose script to upload and execute a script on a remote debug server, automating all parts of my option 3 workflow.

So when to use this pattern:

  • The environment is remote
  • It is safe to tweak and play with the environment
  • You want to edit the scripts with local tools

Here is a general pattern:

#!/bin/bash

# Copy the script to the server
scp my_script.php server:~/my_script.php

# Run the script which generates downloadable file my_report.csv
ssh server /bin/bash << EOF
cd ~
echo 'Creating report'
php my_script.php
echo 'Done'
EOF

# Copy the file back to local machine
today=$(date +"%Y-%m-%d")
scp server:~/my_report.csv "my_report_${today}.csv"

# Optionally remove all trace of existence
ssh server "rm -f ~/my_report.csv ~/my_script.php"

In my IDE, I can now tweak the PHP and double-click to execute the new file on the server, saving me countless clicks and shell commands.

The other advantage is that when the script is ready, I can run it from my local machine on the production environment simply by changing the ssh destinations, making human error much less likely!

How to make responsive Likert scales in CSS (like Qualtrics)

At work I've been updating many forms to be responsive and work on mobile. This is not as exciting as it sounds - but like any boring-but-necessary task, there is fun to be had searching for an optimal solution. Psychology is just full of forms and surveys containing Likert scales, so for future projects I came up with the solution below, based on radio input elements, and it works very nicely. I designed with the following criteria in mind:

  • Responsive, so horizontal for desktop / tablet, vertical for mobile.
  • Uniform element sizing - it's important that buttons are not bigger for some options than others, as that might introduce bias.
  • Accessible - it should allow browser keyboard shortcuts and use clear design to indicate the various states of the scale's buttons.
  • Should never resize to something crazy and unreadable.

Bonus features:

  • Input elements are inside the labels, so no 'id' or 'for' attributes are needed.
  • Stays centered on mobile but sticks to the left on desktop (though it can be centered there too, by using 'grid' instead of 'inline-grid').
  • The number of rows can be set by a CSS variable in the style attribute.

I looked at both flexbox and grid solutions. I used this great guide to grid to experiment a lot, as well as this article on container queries, which led me to a near solution with flex: the so-called Flexbox Holy Albatross. This was close to the final solution bar one issue: in the vertical arrangement, if the content of one box was bigger than the others', the sizes of the other boxes would not adjust to match, leaving one bigger than the rest. In the end I could see no way of getting this behaviour without specifying the number of columns somewhere in the CSS.

The other downside to this approach is that it relies on media queries, meaning that the scale won't behave as well in situations where it has variable widths. E.g. the media query I have specified works well in my full-page setup, but I had to add an extra media query for its display on my front page.

The Flexbox Holy Albatross would eliminate this downside, but I would then have to fix the row heights, which in my case is less achievable because the contents of the Likert items are unknown, while the width of the container is known.

Final setup:

[Live demo of the final responsive Likert scale]

It uses grid under the hood, with a single media query that switches the orientation when the viewport width goes below 680px. It uses an outline as well as a background shade to indicate selection, which is important for colour vision deficiencies. I chose this approach, rather than, say, indicating selection with a dark background and light text, because in binary choices it might not be clear which option was selected. The inputs are hidden with opacity, which means that you can still use keyboard shortcuts to navigate the form and switch the options with the arrow keys. It recreates the standard blue focus outline too (black when selected but not focused), which is important for keyboard users.

It is basically everything you could ever want in a Likert scale.

Source code

HTML

<div class="likert">
    <label><input name="l" type="radio" value="1"/><span>I don’t like it at all</span></label>
    <label><input name="l" type="radio" value="2"/><span>I don’t like it</span></label>
    <label><input name="l" type="radio" value="3"/><span>I am indifferent</span></label>
    <label><input name="l" type="radio" value="4"/><span>I like it</span></label>
    <label><input name="l" type="radio" value="5"/><span>I like it a lot</span></label>
</div>
    

CSS

Not all Likert scales have 5 elements; you can add a variable to the likert element's style attribute, e.g. style="--likert-rows:6", to change the number of elements. I was unable to keep the Likert buttons the same size using either grid or flex without specifying the number of rows somewhere. It would be great in the future to do this with attr() - but I couldn't get it to work.

I was also tempted to put all the colours as variables in the main likert class, to make them easy to tweak and override as a batch - but it didn't feel necessary in the end, so I kept it simple.

    .likert {
        --likert-rows: 5;
        display: inline-grid;
        max-width: 900px;
        grid-auto-rows: 1fr;
        gap: 1em;
        grid-template-columns: repeat(var(--likert-rows), minmax(0, 1fr));
    }

    @media only screen and (max-width: 680px) {
        .likert {
            grid-template-columns: minmax(0, 400px);
            justify-content: center;
        }
    }

    .likert input {
        max-width: 250px;
        position: fixed;
        opacity: 0;
        pointer-events: none;
    }


    .likert span {
        border-radius: 5px;
        display: flex;
        justify-content: center;
        align-items: center;
        text-align: center;
        box-sizing: border-box;
        width: 100%;
        height: 100%;
        padding: 20px;
        background: #dcdcdc;
        transition: background .2s ease-in-out;
    }

    .likert input:checked + span {
        outline: black auto 1px;
        background: transparent;
    }

    .likert input:focus + span {
        outline: -webkit-focus-ring-color auto 1px;
    }

    .likert span:hover {
        background: #f1f1f1;
        outline: lightgrey auto 0.5px;
    }
    

Presenting half finished projects for feedback

In my job, I tend to work as a sole developer, and quite often in a project there comes a stage where I need feedback on a specific feature before the rest of the project is in a presentable state. When working on something tricky, I will disable other functionality and add placeholders and special links to reach that feature (kind of like scaffolding on a building site), so I can try out different approaches. This could be something that it is important to lock down asap, before investing time building the rest of the project. Often these are things for which I cannot just rely on my own design intuition to get right.

So typically after finding a solution I like, I would set up and share a development release, and ask for some feedback and a discussion.

However, sometimes when people see their whole project for the first time in this unfinished state, it can be quite disappointing and confusing. This is the case even in the early stages when there is plenty of time to go, and even with warning that it is unfinished. People want to see their idea in an embryonic state but beautiful - not a jumbled, ugly mess.

This problem is compounded because tidying up the design and the visual considerations typically comes later on. I think most people imagine the project being built like a puzzle, with nicely finished parts just being wired up together, but the reality is more like building Frankenstein's monster: you start with the bones, add the organs, muscles etc., and the skin and fancy outfit only come at the end.

Initially I felt uncomfortable with the idea of 'hiding' the unfinished aspects; for some reason it seemed like it could be misleading. But the reality was that, to the uninformed, half-finished projects are just as misleading, or rather confusing, and they can even prevent proper attention being given to what's needed. So now I treat projects like a building site, and take stakeholders on much more carefully prepared, guided tours. Development releases now come later, when there is actual testing to do, and early ideas are shared with wireframes, videos or shared-screen tours. Here are some cute vintage work-in-progress animated gifs to lighten the appearance of unfinished areas of a project!