readspss introduction

readspss – importing SPSS files to R

Welcome to the readspss package. The package was written from scratch with additional code for special cases. Starting as a side project for me to learn Rcpp the package did grow over the years. Today readspss is well tested using every sav, por and zsav file I could lay my hands on. Import is considered feature complete, write support is available, just not yet for every feature.

Installation

Installation is provided using r-universe or remotes.

With r-universe:

options(repos = c(
  janmarvin = 'https://janmarvin.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
install.packages('readspss')

With devtools:

remotes::install_github("JanMarvin/readspss")

Import

For the import of (z)sav and por files read.sav() and read.por() are available.

library(readspss)

# example using sav
fl_sav <- system.file("extdata", "electric.sav", package = "readspss")
ds <- read.sav(fl_sav)

# example using zsav
fl_zsav <- system.file("extdata", "cars.zsav", package = "readspss")
dz <- read.sav(fl_zsav)

# example using por
fl_por <- system.file("extdata", "electric.por", package = "readspss")
dp <- read.por(fl_por)

Both functions return data.frame objects, containing numerics, dates, factors or characters.

Import options

For user specific demand the package supports many option available in foreign such as convert.factors and use.missings. If one is familiar with said package, one should have no problem adapting to readspss. If for example one does not want to have factors since they work differently in R than in SPSS this can be achieved using the following code.

# example using sav
fl_sav <- system.file("extdata", "electric.sav", package = "readspss")
ds <- read.sav(fl_sav, convert.factors = FALSE)

# example using zsav
fl_zsav <- system.file("extdata", "cars.zsav", package = "readspss")
dz <- read.sav(fl_zsav, convert.factors = FALSE)

# example using por
fl_por <- system.file("extdata", "electric.por", package = "readspss")
dp <- read.por(fl_por, convert.factors = FALSE)

Since many features are self explanatory not all will be explained. Of course readspss can handle sav files in different encodings; it handles file sets without data; all types of missings SPSS developers over the years came up with; short, long and longer strings; little and big endian files; sav, por and zsav compressed files; files without valid header information; old and new SPSS files. As stated above, every SPSS file I came across and during the development I came across many.

Using code by Ben Pfaff readspss can handle encrypted SPSS files.

flu <- system.file("extdata", "hotel.sav", package="readspss")
fle <- system.file("extdata", "hotel-encrypted.sav", package="readspss")

df_u <- read.sav(flu)
df_e <- read.sav(fle, pass = "pspp")

Import details

If available in the SPSS file, the resulting data.frame will contain attributes such as the datalabel, timestamp, filelabel, additional documentation and information regarding missings, labels, file encoding. A complete list of attributes can be found using ?read.sav and ?read.por.

Export

R data.frame objects can be exported using write.sav() and write.por().

library(readspss)

write.sav(cars, filepath = "cars.sav") # optional compress = TRUE

write.sav(cars, filepath = "cars.zsav") # optional compress = TRUE
#> Zsav compression is still experimental. Testing is welcome!

write.por(cars, filepath = "cars.por")

Export provides a few options to add a label, for compression of sav and zsav files and conversion of dates. Currently it is not possible to export strings longer than 255 chars. Obviously all exported files can be imported using SPSS and readspss (PSPP is expected to work).

Package development

One may wonder, why does the world need another package to import SPSS data to the R world. Similar tasks can be done by foreign, memisc and haven package. Well the first two packages use code from older releases of PSPP and R-Core most likely has neither time nor a need to update their codebase to a newer PSPP release. Still over the years the SPSS file format has changed. Not drastically but new features such as long strings were implemented. Features that foreign cannot handle. The newest of the three aforementioned packages, haven, is a wrapper around the ReadStat C library. The package development began around the time we started with readstata13 so it is around quite some time now. Contrary to many other people in the R world, I am not a huge fan of tibbles which are an integral part of haven. One can agree that this is a minor problem. My bigger problem with the package is, that I am not yet convinced that ReadStat and haven are tested enough. Even though I am sure that authors of both made sure that in most cases their software works, there are still cases where it does not. During the development process of readspss I reported a few bugs to the haven package. Among them were incorrectly trimmed long strings and a severe bug where por-files imported awfully incorrect values. All errors were found using publicly available data files, writing unit tests and comparing data across different R-packages, PSPP and various versions of SPSS. Until I see that such behavior is adopted by other packages, I simply do not trust them and maybe you should not either. If the import process of data fails, one does not have to worry about anything else.

The development of readspss began once development of readstata13 slowed down. Having written most of the c++ code to import dta-files, I learned a lot about binary files and Rcpp development. Since SPSS was another statistical software used at the university where I worked at that time, it felt natural to have a look at sav-files. Shortly after I learned that the dta-file documentation is priceless, not available for SPSS and development ceased for quite some time. In February 2018 I changed jobs, resulting in many train rides. A project was needed and development began again. Using the PSPP documentation and countless hours of trial and error lead to the current state of the package.

Thanks

readspss uses code of Ben Pfaff for the encryption part. It uses code from TDA by Goetz Rohwer and Ulrich Poetter for the conversion of numerics in the por-parser. The PSPP documentation was a huge help. Without the testing by Ulrich Poetter this package would not be as complete as it is.

Last words

Even though readspss is tested a lot, it would be great, if users would test their imports and exports and report their experience. Open an issue on the github page or write me an email and let me know what you think. Bug free code does not exist.