SPSS and large data files

My own experience with large files and many transformations suggests that SPSS is quite efficient if the job is designed appropriately. Poor performance is often a sign of inefficient coding.

A couple of simple suggestions for efficient processing of large files:

1) Make the working file as small as needed for the task at hand (a short sketch follows these points):

- Use /KEEP (or /DROP) on the GET FILE command to limit the variables read.

- Use SELECT IF following the GET FILE to cull records not needed.
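
For example, a minimal sketch of both points, assuming a hypothetical file survey.sav with variables id, region, and score:

* File and variable names here are purely illustrative.
GET FILE='survey.sav' /KEEP=id region score.
SELECT IF (region = 1).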

2) Minimize passes through the data:

- Specify all data modifications before calling any procedures, and eliminate EXECUTE (except in *rare* conditions where you need to force a pass through the data). The idea is to build the working file in the first pass and then avoid changing it with further modifications, as in the sketch below.
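
A minimal sketch of this idea, reusing the hypothetical variables from above; the queued transformations are not run until the procedure makes its single pass through the data:

* Transformations are only queued here; no EXECUTE is needed.
COMPUTE scorez = (score - 50) / 10.
RECODE region (1 THRU 3 = 1) (ELSE = 2) INTO region2.
* The one data pass happens when the procedure runs.
DESCRIPTIVES VARIABLES=scorez.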

- Minimize passes through the data by procedures. For example, one FREQUENCIES command with 10 variables is preferable to 10 FREQUENCIES commands with 1 variable each. One TABLES command can do the work of several FREQUENCIES and CROSSTABS commands.
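
For instance (the variable names q1 through q10 are purely illustrative):

* One data pass covers all ten variables, rather than ten separate passes.
FREQUENCIES VARIABLES=q1 q2 q3 q4 q5 q6 q7 q8 q9 q10.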

3) Consider alternative data designs:

- A very wide file (many variables) may reflect an inefficient design.

- Appends, merges, and table lookups should be carefully thought out - these are a common source of problems if you don't understand ADD FILES and MATCH FILES, or if the data themselves have problems.
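
One possible sketch of a table lookup (the file, variable, and key names are hypothetical; both files must already be sorted by the key):

* Attach region-level attributes from a sorted lookup file to each case.
SORT CASES BY region.
MATCH FILES /FILE=* /TABLE='regions.sav' /BY region.

Here /TABLE marks the lookup file, so each of its records is spread across all cases with a matching key rather than being paired one to one.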

4) Read Raynald's Data Management with SPSS book.

A comparison with SAS would require a carefully designed benchmark task - as Jon points out, they work differently. SAS has some data management features I find very useful for specialized applications, but SPSS can handle most tasks just fine. Either package can be inefficient with coding that does not reflect an understanding of how the product works.


This article was written by Dr. Dennis Deck at RMC Research Corporation and first appeared on SPSSX Discussion.