June 20, 2018

Python in SPSS

Using Python in SPSS is great if you want to do complex calculations without leaving the SPSS environment. Python is much more flexible than SPSS syntax, and it's actually very easy to use. It is especially useful when you are collaborating with people who don't want to do all their analysis in Python (e.g. with Spyder), yet need complex data processing steps in their analysis, for example converting between colour spaces. The online documentation is pretty good, but I thought I'd post a very simple use case: converting a colour from sRGB to the LAB colour space. Doing this in SPSS syntax would be very tedious, but in Python I can just paste the conversion code into a loop and it's done. Here is the program:

DELETE VARIABLES LAB_L LAB_A LAB_B.
OUTPUT CLOSE *.

I start off with some syntax that deletes any variables with the same names as those I am about to create. This is especially useful during development, as you might run the script many times and don't want to delete the variables manually each time.

BEGIN PROGRAM Python.
import spss


spss.StartDataStep()
datasetObj = spss.Dataset()

# Manipulation of variables goes here!

spss.EndDataStep()
END PROGRAM.

This is the standard boilerplate needed in most Python SPSS scripts. BEGIN PROGRAM and END PROGRAM mark the region in which you are writing Python. Inside it you 'import spss' and start a data step. You then get a dataset object, which lets you iterate through rows and manipulate them, and you end the data step when you are done.
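If the code between StartDataStep and EndDataStep raises an error, EndDataStep is never reached. A minimal sketch of one way to guard against that, assuming it is wrapped in a small context manager; data_step is a hypothetical helper of my own, not part of the spss module, and it takes the module as an argument so it can be tried out with a stand-in:

```python
from contextlib import contextmanager


@contextmanager
def data_step(spss_module):
    """Open a data step, hand back the dataset, and always close the step."""
    spss_module.StartDataStep()
    try:
        yield spss_module.Dataset()
    finally:
        # Runs even if the body raises, so the data step is not left open
        spss_module.EndDataStep()
```

Inside BEGIN PROGRAM this would be used as `with data_step(spss) as datasetObj:` followed by the indented manipulation code.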

# Create the new variables
datasetObj.varlist.append('LAB_L',0)
datasetObj.varlist.append('LAB_A',0)
datasetObj.varlist.append('LAB_B',0)

Here I create the new variables. The second argument to append is the variable type, and 0 means numeric. LAB is a colour space with three dimensions: L, A and B.

# Get the variable indexes
rIndex = datasetObj.varlist['r'].index
gIndex = datasetObj.varlist['g'].index
bIndex = datasetObj.varlist['b'].index
LIndex = datasetObj.varlist['LAB_L'].index
AIndex = datasetObj.varlist['LAB_A'].index
BIndex = datasetObj.varlist['LAB_B'].index

In order to perform data manipulations you need the index of the variable in the dataset. I get all these indexes at the start and store them in their own variables.
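With more than a handful of variables, the six lookups could be collapsed into one dictionary. A minimal sketch, assuming only that the variable list supports `varlist[name].index` as used above; get_indexes and the FakeVar stand-in are hypothetical names for illustration, so the snippet also runs outside SPSS:

```python
from collections import namedtuple


def get_indexes(varlist, names):
    """Map each variable name to its column index in the dataset."""
    return {name: varlist[name].index for name in names}


# Stand-in for datasetObj.varlist so the example runs without SPSS
FakeVar = namedtuple('FakeVar', 'index')
fake_varlist = {'r': FakeVar(0), 'g': FakeVar(1), 'b': FakeVar(2)}

idx = get_indexes(fake_varlist, ['r', 'g', 'b'])
print(idx['g'])  # 1
```

In the real script the call would be `get_indexes(datasetObj.varlist, ['r', 'g', 'b', 'LAB_L', 'LAB_A', 'LAB_B'])`.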

for idx, row in enumerate(datasetObj.cases):
    r = row[rIndex]
    g = row[gIndex]
    b = row[bIndex]
    if r >= 0 and g >= 0 and b >= 0:
        LAB = calc_LAB(r, g, b)
    else:
        LAB = [None, None, None]

    datasetObj.cases[idx, LIndex] = LAB[0]
    datasetObj.cases[idx, AIndex] = LAB[1]
    datasetObj.cases[idx, BIndex] = LAB[2]

Here I iterate over each row (or case), pulling the R, G and B values into Python variables using the variable indexes (i.e. rIndex, gIndex and bIndex). Then, if none of them is negative, I send them to a function calc_LAB (which I will define later); otherwise the three LAB values are set to None. Finally I take the LAB values and put them into the correct place in the dataset.
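One thing to watch: under Python 2 a comparison such as `None >= 0` is simply False, so rows with a missing value fall through to the else branch, but under Python 3 the same comparison raises a TypeError. A minimal sketch of a more explicit guard, assuming missing values arrive from SPSS as None; lab_or_missing and its convert argument are hypothetical names, not part of the spss module:

```python
def lab_or_missing(r, g, b, convert):
    """Return convert(r, g, b), or three Nones if any input is missing or negative."""
    # Check for None first so the numeric comparison is never reached with it
    if None in (r, g, b) or min(r, g, b) < 0:
        return [None, None, None]
    return convert(r, g, b)
```

In the loop, `LAB = lab_or_missing(r, g, b, calc_LAB)` would then replace the if/else.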

You can see how powerful this can be: calc_LAB is actually a lengthy function that would be a real chore to program in syntax. Here is the full program with the function:

* Encoding: UTF-8.
*Importing all of the data.
DELETE VARIABLES LAB_L LAB_A LAB_B.
OUTPUT CLOSE *.
BEGIN PROGRAM Python.
import spss


spss.StartDataStep()
datasetObj = spss.Dataset()

# Create the new variables
datasetObj.varlist.append('LAB_L',0)
datasetObj.varlist.append('LAB_A',0)
datasetObj.varlist.append('LAB_B',0)

# Get the variable indexes
rIndex = datasetObj.varlist['r'].index
gIndex = datasetObj.varlist['g'].index
bIndex = datasetObj.varlist['b'].index
LIndex = datasetObj.varlist['LAB_L'].index
AIndex = datasetObj.varlist['LAB_A'].index
BIndex = datasetObj.varlist['LAB_B'].index


def calc_LAB(R, G, B):
    # Scale the 0-255 sRGB values to 0-1
    var_R = float(R) / 255.0
    var_G = float(G) / 255.0
    var_B = float(B) / 255.0

    # Undo the sRGB gamma curve to get linear RGB
    if var_R > 0.04045:
        var_R = pow((var_R + 0.055) / 1.055, 2.4)
    else:
        var_R = var_R / 12.92
    if var_G > 0.04045:
        var_G = pow((var_G + 0.055) / 1.055, 2.4)
    else:
        var_G = var_G / 12.92
    if var_B > 0.04045:
        var_B = pow((var_B + 0.055) / 1.055, 2.4)
    else:
        var_B = var_B / 12.92

    var_R = var_R * 100
    var_G = var_G * 100
    var_B = var_B * 100

    # Linear RGB to XYZ
    X = var_R * 0.4124 + var_G * 0.3576 + var_B * 0.1805
    Y = var_R * 0.2126 + var_G * 0.7152 + var_B * 0.0722
    Z = var_R * 0.0193 + var_G * 0.1192 + var_B * 0.9505

    # Normalise by the reference white point
    var_X = X / 95.047
    var_Y = Y / 100.000
    var_Z = Z / 108.883

    third = 1.0 / 3.0

    # Cube root above the threshold, linear segment below it
    if var_X > 0.008856:
        var_X = pow(var_X, third)
    else:
        var_X = (7.787 * var_X) + (16.0 / 116.0)
    if var_Y > 0.008856:
        var_Y = pow(var_Y, third)
    else:
        var_Y = (7.787 * var_Y) + (16.0 / 116.0)
    if var_Z > 0.008856:
        var_Z = pow(var_Z, third)
    else:
        var_Z = (7.787 * var_Z) + (16.0 / 116.0)

    # XYZ to LAB
    L = (116 * var_Y) - 16
    A = 500 * (var_X - var_Y)
    B = 200 * (var_Y - var_Z)

    return [L, A, B]


for idx, row in enumerate(datasetObj.cases):
    r = row[rIndex]
    g = row[gIndex]
    b = row[bIndex]
    if r >= 0 and g >= 0 and b >= 0:
        LAB = calc_LAB(r, g, b)
    else:
        LAB = [None, None, None]

    datasetObj.cases[idx, LIndex] = LAB[0]
    datasetObj.cases[idx, AIndex] = LAB[1]
    datasetObj.cases[idx, BIndex] = LAB[2]

spss.EndDataStep()
END PROGRAM.
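
Because calc_LAB doesn't touch the spss module, it can be checked in plain Python before it is run on real data. The sketch below is a compact restatement of the same function with the same constants, folded into two small helpers (_linearise and _lab_f are names I've made up for this example), followed by a quick check on three colours. White should come out very close to L=100 with a and b near 0, and black at L=0.

```python
def _linearise(c):
    """Undo the sRGB gamma curve for one channel scaled to 0..1."""
    return ((c + 0.055) / 1.055) ** 2.4 if c > 0.04045 else c / 12.92


def _lab_f(t):
    """Cube root above the threshold, linear segment below it."""
    return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0


def calc_LAB(R, G, B):
    # 0-255 sRGB -> linear RGB on a 0-100 scale
    r, g, b = (_linearise(float(v) / 255.0) * 100 for v in (R, G, B))
    # Linear RGB -> XYZ, same matrix as in the full program
    X = r * 0.4124 + g * 0.3576 + b * 0.1805
    Y = r * 0.2126 + g * 0.7152 + b * 0.0722
    Z = r * 0.0193 + g * 0.1192 + b * 0.9505
    # Normalise by the reference white, then XYZ -> LAB
    fx, fy, fz = _lab_f(X / 95.047), _lab_f(Y / 100.000), _lab_f(Z / 108.883)
    return [116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)]


for name, rgb in [("white", (255, 255, 255)), ("black", (0, 0, 0)), ("red", (255, 0, 0))]:
    L, A, B = calc_LAB(*rgb)
    print("%-5s L=%.2f a=%.2f b=%.2f" % (name, L, A, B))
```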