Latest blog posts

Live Photo Cleaner

Came across a great open source app for iOS that automatically removes the annoying, space-hogging 'live' effects from photos taken on iPhone. It's called LivePhotoCleaner. Cynicism alert: the Live Photos feature is surely about nothing other than filling up more iCloud space - with the side effect of wasted energy and resources. After reading David Graeber's Bullshit Jobs, I'd suggest that 'live photos' are a prime candidate for a 'bullshit feature'.

Solvitur ambulando

The Latin phrase solvitur ambulando translates as "it is solved by walking". It has a funny double meaning: classically it meant "the problem is best solved by actively trying to solve it", as opposed to thinking or planning up front. But some people now seem to use it to mean almost the opposite - solve the problem by literally going for a walk (and thinking it over as you go!)

Canceling my Spotify subscription

As part of a process of taking the advice in Stolen Focus I have cut out nearly all notifications, turning off all sounds and message-waiting indicators on email, Slack, WhatsApp, Messenger and everything else. The only way someone can interrupt me now is by text or phone - and over the past 2 months this has made a huge difference in the amount of time I spend on my phone or distracted from what I am working on.

My next move has been to cancel all subscription services. I remember the good ol' days before you had to subscribe to things, when you would pay money and have that thing 'forever'! Subscription services obviously make more money than that: basically, they trick you into thinking something is cheap, because you can't keep track of how much you are using it or whether you need it.

The issue is that my interest in things is very boom and bust, so I can be very interested in one thing for a few months, and then not do anything with it. Having the subscription gives me a nagging feeling that I should be using it, which meant that with Netflix, at least, I ended up watching rubbish...

Spotify was the last hurdle, as I do like to listen to a lot of music. But it's always been a slight source of guilt in the back of my head that of my £120 a year only a pittance goes to the artists - and in fact lots of my money presumably goes to popular artists who I think are, at worst, actively harmful to their listeners (😆). So instead, I am now looking forward to actually buying a new album to keep each month.

Data preprocessing tasks in big projects

There is a common source of friction (and anxiety) in psychology projects with large datasets: the pipeline of getting data from the data collection stage, wherever or whatever that may be, into the final form for statistical analysis. Frequently the statisticians are not provided with enough information or don't have the tools to process all the idiosyncrasies in the data, or the data collectors do not know how to present the data to the statisticians in the correct way.

The data collector wishes to leave no stone unturned in collecting all necessary data, which can involve reacting quickly and improvising when obstacles arise. But the statistician needs reliably organised data, and needs to ensure that it is transparent, documented and correct.

The way to solve this is to have a data preprocessor. Data preprocessing is difficult because it requires a) general programming skills, b) knowledge of statistical analysis, and c) knowledge of the data itself. Sadly, it is also an under-appreciated role (and can be quite gruelling!)

Data preprocessing essentially involves the organising and patching necessary to produce files that can go straight into statistical analysis packages. Anything that a statistics package would find unwieldy, or that should be automated, belongs in the preprocessing step.

Tasks

The following tasks are what I've found to be the major part of data preprocessing:

  1. Organise and store iterations of the raw data, in a documented form that can be transparently traced back to the original source.

  2. Proactively document and patch discrepancies in the data, which, amongst a host of other things, can include (a minimal patching sketch appears after this list):

    • Missing or incorrect ids
    • Incorrectly labelled data
    • Multiple versions of data collection instruments
    • Missing variables
    • Instrument coding errors
    • Irrelevant data that needs documenting and filtering out (e.g. test data)
  3. Rename and organise variables and document this.

    The variables at the data collection stage may not be the best for the analysis stage. Frequently in the data collection context, lengthy variable names with prefixes (e.g. in a SQL database) will necessarily be used, or the data collector may not have named variables correctly or descriptively enough.

  4. Perform merges:

    • Merge similar files into single indexed large files: For instance, an experiment may generate a file per participant, with each row representing a trial. But for analysis a single file is needed, so the files need to be merged and an index column added (see the sketch after this list).
    • Merge tables, to create a new table for a specific analysis. It's often not necessary to bring everything together into one giant table, though up to a certain scale that can be the goal if the dataset is mostly complete. For large, often incomplete datasets, it's better to have a top-down process guide this, whereby the statistician works backwards from the analysis required to work out the precise variables needed in a frame of data, which is then generated and labelled accordingly.
  5. Perform aggregations and transformations:

    Sometimes the data involves the calculation of established scores or transformation from formats like JSON, or the analysis requires means, medians, standard deviations, etc. These can be a task for the statistician, but most often they are better subsumed under the data preprocessor role, especially if the aggregations involve any kind of unusual transformation (e.g. the calculation of CIEDE2000 distance scores from RGB values). Again, the rule of thumb is that if it cannot be easily accomplished in statistics software, it should probably be a preprocessing step.
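
To make the patching task (item 2) concrete, here is a minimal sketch of what documented patching could look like in pandas. The file paths, column names and corrections are all hypothetical; the point is that every fix lives in code with an explanation, never in a hand-edited raw file.

```python
import pandas as pd

# Hypothetical raw export from the data collection stage.
raw = pd.read_csv("raw/session_export.csv")

# Every known discrepancy is patched here, in code, with a comment
# recording where it came from - never by hand-editing the raw file.
ID_CORRECTIONS = {
    "P0O7": "P007",  # letter O typed instead of zero at collection
    "p012": "P012",  # inconsistent capitalisation
}
raw["participant_id"] = raw["participant_id"].replace(ID_CORRECTIONS)

# Filter out irrelevant data (e.g. test runs), keeping a record of
# exactly which rows were dropped and why.
is_test = raw["participant_id"].isin(["TEST", "P000"])
raw.loc[is_test].to_csv("logs/dropped_test_rows.csv", index=False)
clean = raw.loc[~is_test].copy()
clean.to_csv("clean/session_export.csv", index=False)
```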
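
In the same spirit, here is a sketch of the merge and aggregation steps (items 4 and 5), again with invented paths and column names: per-participant trial files are concatenated lengthwise into one traceable frame, then summarised per participant for the statistician.

```python
import glob
import pandas as pd

# One hypothetical CSV per participant, each row representing a trial.
files = sorted(glob.glob("clean/trials_P*.csv"))

# Merge lengthwise into one frame, adding a participant column derived
# from the file name so every row can be traced back to its source.
frames = []
for path in files:
    df = pd.read_csv(path)
    df["participant_id"] = path.split("trials_")[1].removesuffix(".csv")
    frames.append(df)
trials = pd.concat(frames, ignore_index=True)

# Aggregations the statistician asked for: per-participant mean and
# standard deviation of reaction time, ready for the stats package.
summary = trials.groupby("participant_id")["reaction_time_ms"].agg(["mean", "std"])
summary.to_csv("for_analysis/reaction_time_summary.csv")
```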

How to do it

Communication: In an ideal world the statistician would be explicit up front about how they would like to receive the data, and which variables it should contain. This will guide the process, but at the very least there should be an open communication channel between the two roles. In addition, the data preprocessor should preferably advise the data collectors on how to produce raw data that is easiest to work with, if they can...

How should data preprocessing itself be done? Ideally it needs to be automated. But most importantly, it needs to be documented and transparent, and done in a way that after a year you can go back and figure out very quickly where a number has come from! The need for automation and flexibility means using a fully fledged programming language with strong support for data manipulation: so essentially, Pandas with Python, or R. The environment and language should allow:

  1. Text processing/cleaning - I have seen this cause difficulty because accomplished statisticians can lack basic text programming skills (a small sketch follows this list)
  2. Streamlined loading of table data in different formats
  3. Merging tables (lengthwise and by key)
  4. Use of version control and diff merges (e.g. for preprocessing code)
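
As an illustration of the first three points, here is the kind of small cleaning, loading and merging job that comes up constantly in pandas; the messy values and file names are invented, but typical.

```python
import pandas as pd

# Loading tables in different formats is one line each in pandas
# (read_excel needs the openpyxl package installed).
survey = pd.read_excel("raw/survey.xlsx")
sessions = pd.read_csv("raw/sessions.csv")

# Typical text cleaning: free-typed group labels such as " Control ",
# "control" and "CONTROL" normalised to a single canonical value.
survey["group"] = survey["group"].str.strip().str.lower()

# Merge by key, so each survey row gains its session metadata.
merged = survey.merge(sessions, on="participant_id", how="left")
merged.to_csv("clean/survey_with_sessions.csv", index=False)
```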

Summary

The effort and importance of the data preprocessing role is often not immediately apparent, especially at the start of projects. So, when you sense complicated data, it's best to plan ahead for the likely tasks of this role and to assign it to someone who has the confidence to take it on. When it's done well, nobody will notice, and perhaps it takes experience of it being done badly to appreciate it being done well! Up-front planning for this work will definitely reduce potential sources of friction, and give a sense of streamlined flow between messy data collection and rigorous analysis.

New year, new webserver at nearlyfreespeech.net

I've been using nearlyfreespeech.net for a while for a side project, and also for ad hoc sites that are needed from time to time during work. I have grown to really like its minimal, clutter-free setup. So this year I've decided to move my site over, away from AWS. The costs are about the same, and it feels better to not be giving any money to Amazon - they have enough.

The migration took about 5 minutes and went very smoothly, with the bulk of the time spent just trying to remember how to set up Let's Encrypt on nearlyfreespeech. Ahh, the advantages of a no-dependency website. Smug mode.

Having given up Twitter (apart from some key accounts as RSS feeds via Nitter), I am aiming to post more on this site in the coming year!