SPSS pitfalls: Combining files with custom variable attributes

Custom variable attributes are a useful SPSS feature, available since version 14 (released in 2005). They can be used to attach additional information, such as metadata or paradata, to variables and to store it with the survey data. However, compared to the attributes reserved by SPSS (like variable labels or value labels), user-defined attributes demand extra attention, and there are some pitfalls to look out for. When datasets are combined using MATCH FILES or ADD FILES, attributes or their values can easily be dropped unintentionally. This post demonstrates how SPSS (version 22) handles custom attributes in a MATCH FILES command, in order to raise awareness of the issue. (The ADD FILES command leads to similar results and is therefore not demonstrated here.) Finally, a way to avoid these problems in the first place is presented.

Some data for demo

The following code creates three datasets named FRUIT, COLOR and ORIGIN, each with two variables and three cases. The case data serves only to make the datasets matchable; the emphasis is on the additionally defined variable attributes.

Dataset FRUIT: No custom attributes

Figure: SPSS Data View of dataset FRUIT

Figure: SPSS Variable View of dataset FRUIT

NEW FILE.
DATA LIST /id 1-2 favfruit 3-9(A).
BEGIN DATA
1 Cherry
2 Banana
3 Apple
END DATA.
VARIABLE LABELS id "ID (label set in FRUIT)"
    /favfruit "Favourite Fruit".
DATASET NAME FRUIT.

Dataset COLOR: One custom attribute [z]

Figure: SPSS Data View of dataset COLOR

Figure: SPSS Variable View of dataset COLOR

NEW FILE.
DATA LIST /id 1-2 fcolor 3-9(A).
BEGIN DATA
1 Red
2 Yellow
3 Green
END DATA.
VARIABLE LABELS id "ID (label set in COLOR)"
    /fcolor "Fruit's Color".
VARIABLE ATTRIBUTE
    VARIABLES = id 
    ATTRIBUTE = z("set in COLOR")
    /VARIABLES = fcolor
    ATTRIBUTE = z("set in COLOR").
DATASET NAME COLOR.

Dataset ORIGIN: Three custom attributes [x], [y] and [z]

Figure: SPSS Data View of dataset ORIGIN

Figure: SPSS Variable View of dataset ORIGIN

NEW FILE.
DATA LIST /id 1-2 forigin 3-9(A).
BEGIN DATA
1 Spain
2 Panama
3 Italy
END DATA.
VARIABLE LABELS id "ID (label set in ORIGIN)"
    /forigin "Fruit's Country of Origin".
VARIABLE ATTRIBUTE
    VARIABLES = id
    ATTRIBUTE = x("set in ORIGIN") y("set in ORIGIN") z("set in ORIGIN")
    /VARIABLES = forigin
    ATTRIBUTE = x("set in ORIGIN") y("set in ORIGIN") z("set in ORIGIN").
DATASET NAME ORIGIN.
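
Besides the Variable View, the custom attributes can also be listed in the output Viewer. The following is a minimal sketch, assuming the datasets above are still open in the session; it uses DISPLAY ATTRIBUTES, which prints the user-defined attributes of the active dataset.

```spss
* List the custom attributes of COLOR.
DATASET ACTIVATE COLOR.
DISPLAY ATTRIBUTES.

* List the custom attributes of ORIGIN, which stays the active dataset afterwards.
DATASET ACTIVATE ORIGIN.
DISPLAY ATTRIBUTES.
```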

Blending the data

Note that the custom attribute [z] is defined in both COLOR and ORIGIN, and that the recurring variable “id” carries different attribute values in each dataset. Which attributes and values end up in the combined dataset after MATCH FILES has been applied?

MATCH FILES #1: Missing attribute-definitions and -values

The order of the files in the following syntax matters:

NEW FILE.
MATCH FILES FILE = FRUIT
    /FILE = COLOR
    /FILE = ORIGIN
    /BY id.
EXECUTE.
DATASET NAME MATCHEDFILES1.

This syntax results in a 3×4 data matrix (three cases, four variables) that contains the case data from FRUIT, COLOR and ORIGIN, with the variables in the order of the files.

Figure: SPSS Data View of MATCH FILES result #1

Figure: SPSS Variable View of MATCH FILES result #1

However, the Variable View reveals some effects that may seem counter-intuitive:

The custom attributes [x] and [y] were not transferred, although the associated data (ORIGIN) was matched successfully.
The variable “id” has no value for [z], although this attribute was set for “id” in two of the included files (COLOR and ORIGIN).
The variable “forigin”, on the other hand, did receive its value for [z].
The variable label for “id” was taken from the first file in the matching sequence (FRUIT), not from the last matched file (ORIGIN).

MATCH FILES #2: Missing attribute-values

A slight change to the MATCH FILES command leads to a different result. In the next example, only ORIGIN and COLOR are swapped in the sequence of files:

NEW FILE.
MATCH FILES FILE = FRUIT
    /FILE = ORIGIN
    /FILE = COLOR
    /BY id.
EXECUTE.
DATASET NAME MATCHEDFILES2.

Figure: SPSS Data View of MATCH FILES result #2

Figure: SPSS Variable View of MATCH FILES result #2

Again, the result is a 3×4 data matrix. This time, however:

The custom attributes [x], [y] and [z] are all included.
All attribute values are still missing for the variable “id”.

MATCH FILES #3: Complete attribute-definitions and -values

In the final attempt, the sequence of matched files starts with the dataset that contains attribute definitions and values for all of the attributes used in the demo datasets.

NEW FILE.
MATCH FILES FILE = ORIGIN
    /FILE = COLOR
    /FILE = FRUIT
    /BY id.
EXECUTE.
DATASET NAME MATCHEDFILES3.

Figure: SPSS Data View of MATCH FILES result #3

Figure: SPSS Variable View of MATCH FILES result #3
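
To compare the three results without relying on the Variable View, the attributes of each merged dataset can be printed one after another. This is only a sketch, assuming MATCHEDFILES1 to MATCHEDFILES3 were created as shown above and are still open in the session; no output is reproduced here.

```spss
* Print the custom attributes of each merged dataset in turn.
DATASET ACTIVATE MATCHEDFILES1.
DISPLAY ATTRIBUTES.
DATASET ACTIVATE MATCHEDFILES2.
DISPLAY ATTRIBUTES.
DATASET ACTIVATE MATCHEDFILES3.
DISPLAY ATTRIBUTES.
```

MATCHEDFILES3 is activated last, so the session ends in the same state as after the third MATCH FILES example.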

Summary

When datasets with different custom attributes are combined in SPSS, the order of the files in the command also determines how the variable properties are integrated. Once MATCH FILES has processed a file that contains a set of custom attributes, it does not add further attributes from later files. However, SPSS does fill in attribute values if the same attribute exists in several files and no existing value needs to be replaced. In other words: if a variable exists in multiple source files, the attribute definitions and values of the first file in the MATCH FILES sequence are transferred.

Thus, it can be worth including an extra file in the matching sequence that contains nothing but the attribute information and is processed by MATCH FILES first:

NEW FILE.
* No BEGIN DATA block follows: this dataset only carries dictionary information.
DATA LIST /id 1-2 favfruit 3-9(A) fcolor 3-9(A) forigin 3-9(A).
VARIABLE LABELS id "set in METADATA"
    /favfruit "set in METADATA"
    /fcolor "set in METADATA"
    /forigin "set in METADATA".
VARIABLE ATTRIBUTE
    VARIABLES = id
    ATTRIBUTE = x("set in METADATA") y("set in METADATA") z("set in METADATA")
    /VARIABLES = favfruit
    ATTRIBUTE = x("set in METADATA") y("set in METADATA") z("set in METADATA")
    /VARIABLES = fcolor
    ATTRIBUTE = x("set in METADATA") y("set in METADATA") z("set in METADATA")
    /VARIABLES = forigin
    ATTRIBUTE = x("set in METADATA") y("set in METADATA") z("set in METADATA").
DATASET NAME METADATA.

NEW FILE.
MATCH FILES FILE = METADATA
    /FILE = FRUIT
    /FILE = COLOR
    /FILE = ORIGIN
    /BY id.
EXECUTE.
DATASET NAME MATCHEDFILES4.

Figure: SPSS Data View of MATCH FILES result #4

Figure: SPSS Variable View of MATCH FILES result #4
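
The introduction notes that ADD FILES leads to similar results. For readers who want to check this themselves, the following sketch stacks the demo datasets with ADD FILES instead of matching them. It assumes FRUIT, COLOR and ORIGIN are still open in the session; the dataset name ADDEDFILES1 is made up for this example, and the resulting attributes are not shown here.

```spss
* Stack the cases of the three demo datasets (BY is optional for ADD FILES).
NEW FILE.
ADD FILES FILE = FRUIT
    /FILE = COLOR
    /FILE = ORIGIN.
EXECUTE.
DATASET NAME ADDEDFILES1.

* Inspect which custom attributes survived the combination.
DISPLAY ATTRIBUTES.
```

Swapping the order of the FILE subcommands, as in the MATCH FILES examples, should make it possible to see whether the file sequence matters in the same way.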

Downloads

example.sps