Data preprocessing tasks in big projects

There is a common source of friction (and anxiety) in psychology projects with large datasets: the pipeline that takes data from the data collection stage, wherever or whatever that may be, into its final form for statistical analysis. Frequently the statisticians are not given enough information, or lack the tools, to process all the idiosyncrasies in the data, and the data collectors do not know how to present the data to the statisticians in the form they need.

The data collector wishes to leave no stone unturned in collecting all necessary data, which can involve reacting quickly and improvising when obstacles arise. The statistician, on the other hand, needs reliably organised data that is transparent, documented and correct.

The way to solve this is to have a data preprocessor. Data preprocessing is difficult because it requires a) general programming skills, b) knowledge of statistical analysis, and c) knowledge of the data itself. Sadly, it is also an under-appreciated role (and can be quite gruelling!)

Data preprocessing essentially involves performing the organising and patching necessary to produce files that can go straight into statistical analysis software packages. Anything that a statistics package would find unwieldy, or that should be automated, is part of the preprocessing step.


The following tasks are what I've found to make up the major part of data preprocessing:

  1. Organise and store iterations of the raw data, in a documented form that can be transparently traced back to the original source.
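For instance, one minimal way to keep raw-data iterations traceable is a manifest that records a hash of each raw file alongside a note of where it came from. This is only a sketch; the function name, manifest filename and JSON layout are my own invention:

```python
import hashlib
import json
from pathlib import Path

def record_raw_file(path, source_note, manifest="raw_manifest.json"):
    """Append a hash-stamped entry for a raw data file to a JSON manifest,
    so each iteration of the raw data can be traced back to its source."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    entry = {"file": str(path), "sha256": digest, "source": source_note}
    manifest_path = Path(manifest)
    # Load existing entries (if any) and append, keeping the full history
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entries.append(entry)
    manifest_path.write_text(json.dumps(entries, indent=2))
    return entry
```

The hash lets you verify later that an archived file really is the one a given analysis was built from.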

  2. Proactively document and patch discrepancies in the data, which can include, amongst a host of other things:

    • Missing or incorrect ids
    • Incorrectly labelled data
    • Multiple versions of data collection instruments
    • Missing variables
    • Instrument coding errors
    • Irrelevant data that must be documented and filtered out (e.g. test data)
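A sketch of what documented patching can look like, assuming pandas; the column name, id corrections and test ids here are invented:

```python
import pandas as pd

# Hypothetical, documented corrections: every known id typo and every
# test/pilot id lives in one reviewable place in the preprocessing code.
ID_CORRECTIONS = {"P07a": "P007", "p12": "P012"}  # known typos in participant ids
TEST_IDS = {"TEST", "P000"}                       # pilot/test runs to exclude

def patch_ids(df):
    """Apply documented id corrections, then drop test rows."""
    df = df.copy()
    df["participant_id"] = df["participant_id"].replace(ID_CORRECTIONS)
    return df[~df["participant_id"].isin(TEST_IDS)]
```

Keeping the corrections in named constants, under version control, means every patch is visible and traceable rather than buried in a one-off edit to the raw file.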
  3. Rename and organise variables and document this.

    The variable names used at the data collection stage may not be the best for the analysis stage. Frequently in the data collection context, lengthy variable names with prefixes (e.g. in a SQL database) will necessarily be used, or the data collector has not named a variable correctly or descriptively enough.
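With pandas, the rename step can double as documentation if the mapping lives in one place. A sketch with invented raw and analysis names:

```python
import pandas as pd

# Hypothetical mapping from raw SQL-style names to analysis-friendly names.
# This dictionary *is* the documentation of the renaming.
RENAME_MAP = {
    "tbl_resp_q01_total": "anxiety_total",
    "tbl_resp_rt_ms": "reaction_time_ms",
}

def rename_variables(df, mapping=RENAME_MAP):
    """Rename columns, failing loudly if an expected raw name is missing,
    so silent schema drift in the raw data is caught early."""
    missing = set(mapping) - set(df.columns)
    if missing:
        raise KeyError(f"Expected raw columns not found: {sorted(missing)}")
    return df.rename(columns=mapping)
```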

  4. Perform merges:

    • Merge similar files to single indexed large files: For instance an experiment may generate a file per participant, with each row representing a trial. But for analysis a single file is needed, so the files need to be merged, and an index column added.
    • Merge tables, to create a new table for a specific analysis. It's often not necessary to bring everything together into one giant table, but if the dataset is mostly complete, up to a certain scale that can be the goal. For large, often incomplete datasets, it's better to have a top-down process guide this, whereby the statistician works backwards from the required analysis to work out the precise variables needed in a frame of data, and the frame is then generated and labelled accordingly.
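The file-per-participant merge described above can be sketched in a few lines of pandas; the directory layout and column names are assumed:

```python
import pandas as pd
from pathlib import Path

def merge_participant_files(directory, pattern="*.csv"):
    """Concatenate one-file-per-participant trial data into a single frame,
    adding an index column derived from each filename."""
    frames = []
    for path in sorted(Path(directory).glob(pattern)):
        frame = pd.read_csv(path)
        # e.g. 'P001.csv' -> participant id 'P001'
        frame.insert(0, "participant", path.stem)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

For the key-based merges between tables, `pd.merge` with an explicit `on=` key plays the equivalent role.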
  5. Perform aggregations and transformations:

    Sometimes the data involves the calculation of established scores or transformation from formats like JSON, or the analysis requires means, medians, standard deviations, etc. These can be a task for the statistician, but most often they are better subsumed under the data preprocessor role, especially if the aggregations involve any kind of unusual transformation (e.g. the calculation of CIEDE2000 distance scores from RGB values). Again, the rule of thumb is that if it cannot be easily accomplished in statistics software, it should probably be a preprocessing step.
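A minimal aggregation sketch in pandas, collapsing trial-level data to one summary row per participant (the column names are invented):

```python
import pandas as pd

def summarise_trials(trials):
    """Collapse trial-level data to one row per participant, with the
    mean, median and standard deviation of reaction time."""
    return (
        trials.groupby("participant")["rt_ms"]
        .agg(rt_mean="mean", rt_median="median", rt_sd="std")
        .reset_index()
    )
```

Delivering the statistician a file at this level of aggregation often saves them from wrestling with trial-level data in a statistics package.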

How to do it

Communication: In an ideal world the statistician is explicit up front about how they would like to receive the data, and which variables they need. This will guide the process, but at the very least there should be an open communication channel between the two roles. In addition, the data preprocessor should, where possible, advise the data collectors on how to produce raw data that is easiest to work with.

How should data preprocessing itself be done? Ideally it is automated. But most importantly, it needs to be documented and transparent, and done in such a way that after a year you can go back and figure out very quickly where a number has come from! The need for automation and flexibility means using a fully-fledged programming language with strong support for data manipulation: so essentially, Python with Pandas, or R. The environment and language should allow:

  1. Text processing/cleaning - I have seen this cause difficulty, because even accomplished statisticians can lack basic text-programming skills
  2. Streamlined loading of table data in different formats
  3. Merging tables (lengthwise and by key)
  4. Use of version control and diff merges (e.g. for pre-processing code)
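As an illustration of the text-cleaning point, pandas string methods handle most of the routine cleanup of free-text columns. The label values here are invented:

```python
import pandas as pd

def clean_condition_labels(series):
    """Normalise a free-text label column: strip stray whitespace,
    lowercase, and collapse runs of internal whitespace."""
    return (
        series.astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )
```

Small normalisations like this are exactly the kind of work that is painful in a statistics package but trivial in Python or R.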


The effort and importance of the data preprocessing role is often not immediately apparent, especially at the start of a project. So when you sense complicated data, it's best to plan ahead for the likely tasks of this role and to assign it to someone who has the confidence to take it on. When it's done well, nobody will notice, and perhaps it takes the experience of it being done badly to appreciate it being done well! Up-front planning for this work will definitely reduce potential sources of friction, and give a sense of streamlined flow between messy data collection and rigorous analysis.