Summary of Merging SAS Data Sets

Merging SAS Data Sets

Lesson Summary

This page contains

a text summary of the material taught in the lesson
syntax for statements and options
a sample program
points to remember.

I. Text Summary

To go to the page where a task, programming feature, or concept was presented, select a link.

Results of Match-Merging
Basic DATA step match-merging produces an output data set that contains values from all observations in all input data sets. If an input data set doesn't have any observations for a given value of the common variable, the observation in the output data set contains missing values for the variables that are unique to that input data set.

Preparing Data for Match-Merging
To prepare for match-merging, you first reference all SAS libraries where input data sets are stored, or where you will store the results of the match-merge. Then you determine whether your input data sets are in the appropriate form to be merged. All data sets to be merged must be sorted or indexed by the values of a common variable with the same type and length in all data sets to be merged. If your data are not appropriately sorted or indexed, you can use PROC SORT to sort data before match-merging. You can sort the existing data set permanently or create a separate, sorted output data set.

Performing a Basic Match-Merge
After all input data sets are sorted or indexed by the value of a BY variable, you can merge the data sets using a DATA step that contains MERGE and BY statements.

Renaming Variables
If variables with the same name appear in more than one input data set, DATA step match-merging overwrites values of the like-named variable in the first data set in which it appears with values of the like-named variable in subsequent data sets. To prevent overwriting, you can rename variables using the RENAME= option in the MERGE statement.

Excluding Unmatched Observations
By default, DATA step match-merging combines all observations in all input data sets. However, you may want to select only observations that match for two or more specific input data sets. To exclude unmatched observations from your output data set, you can use the IN= data set option and the subsetting IF statement in your DATA step. The IN= data set option creates and names a variable that indicates whether the data set contributed data to the current observation. The subsetting IF statement checks the IN= values and outputs only those observations that appear in the data sets for which IN= is specified.

Selecting Variables
To select variables to drop or keep in your output data set, you use the DROP= and KEEP= data set options. You can use the KEEP= option instead of the DROP= option if more variables are dropped than kept. You can specify these options in either the DATA statement or the MERGE statement, depending on whether or not you want to process values of the variables in that DATA step. If you don't process certain variables and you don't want them to appear in the new data set, specify them in the DROP= option in the MERGE statement. If you do need to process a variable in the original data set, you must specify the variable in the DROP= option in the DATA statement. Used in the DATA statement, the DROP= option simply drops the variables from the new data set. However, they are still read from the original data set and are available within the DATA step.

Match-Merge Processing
To predict the results of match-merging correctly, you need to understand how the DATA step processes data in match-merges.

Compiling
To prepare to merge data sets, SAS software

reads the descriptor portions of data sets listed in the MERGE statement
reads the remainder of the DATA step program
creates the program data vector (PDV), an area of memory where SAS software builds your data set one observation at a time
assigns a tracking pointer to each data set listed in the MERGE statement.

If variables with the same name appear in more than one data set, the variable from the first data set that contains the variable (in order as listed in the MERGE statement) determines the length of the variable.
Executing
After compiling the DATA step, SAS software sequentially match-merges observations by moving the pointers down each observation of each data set and checking to see whether the BY values match.

If Yes, the observations are written to the PDV in the order that the data sets appear in the MERGE statement. (Remember that values of any like-named variable are overwritten by values of the like-named variable in subsequent data sets.) SAS software writes the combined observation to the new data set and retains the values in the PDV until the BY value changes in all the data sets.

If No, SAS software determines which of the values comes first and writes the observation containing this value to the PDV. Then the observation is written to the new data set.

When the BY value changes in all the input data sets, the PDV is initialized to missing. The DATA step merge continues to process every observation in each data set until it exhausts all observations in all data sets.
Handling Unmatched Observations and Missing Values
All observations written to the PDV, including observations that have missing data and no matching BY values, are written to the output data set.

If an observation contains missing values for a variable, the observation in the output data set contains the missing values as well. Observations with missing values for the BY variable appear at the top of the output data set.

If an input data set doesn't have any observations for a given value of the common variable, the observation in the output data set contains missing values for the variables unique to that input data set.

II. Syntax
To go to the page where a statement or option was presented, select a link.

LIBNAME libref 'SAS-data-library';

PROC SORT DATA=SAS-data-set OUT=SAS-data-set;

BY variable(s);

RUN;

DATA output-SAS-data-set
(RENAME=(old-variable-name=new-variable-name)
DROP=variable(s));

MERGE SAS-data-set-1 SAS-data-set-2
(RENAME=(old-variable-name=new-variable-name)
DROP=variable(s) IN= variable);

BY variable(s);

IF expression;

RUN;

You can use the KEEP= option instead of the DROP= option if more variables are dropped than kept.

III. Sample Program

     libname clinic 'c:\stress\labdata';
     data clinic.combined(drop=id);       
        merge clinic.demog(in=indemog 
                           rename=(date=BirthDate)) 
              clinic.visit(drop=weight in=invisit
                           rename=(date=VisitDate));  
        by id;
        if indemog and invisit;
     run;

IV. Points to Remember

You can merge any number of data sets using the DATA step.
You can rename any number of variables in each occurrence of the RENAME= option.
In match-merging, the IN= data set option can apply to any data set in the MERGE statement. The RENAME=, DROP=, and KEEP= options can apply to any data set in the DATA or MERGE statements.
When you specify multiple data set options for a given data set, enclose them in a single set of parentheses.