Binding Datasets

append, merge and joinby bind the datasets by attaching the “using dataset” stored on disk to the “master dataset” stored in memory.

append

append binds datasets vertically by adding observations from the using dataset to the master dataset.

The datasets should contain mostly the same variables; absent variables from one dataset will be given missing values in the combined file. append applies, for instance, when we want to combine datasets of different times or regions with mostly the same variables.

. use file1
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   1995        1    child        A |
  2. |  2   1995        1    child        B |
  3. |  3   1995        1   mother        B |
  4. |  1   1995        2   father        C |
  5. |  2   1995        2   mother        B |
     |--------------------------------------|
  6. |  3   1995        2    child        A |
     +--------------------------------------+

. use file2, clear
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   2005        1    child        A |
  2. |  2   2005        1    child        A |
  3. |  3   2005        1   mother        B |
  4. |  1   2005        2   father        B |
  5. |  2   2005        2   mother        C |
     |--------------------------------------|
  6. |  3   2005        2    child        B |
     +--------------------------------------+

. append using file1
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   2005        1    child        A |
  2. |  2   2005        1    child        A |
  3. |  3   2005        1   mother        B |
  4. |  1   2005        2   father        B |
  5. |  2   2005        2   mother        C |
     |--------------------------------------|
  6. |  3   2005        2    child        B |
  7. |  1   1995        1    child        A |
  8. |  2   1995        1    child        B |
  9. |  3   1995        1   mother        B |
 10. |  1   1995        2   father        C |
     |--------------------------------------|
 11. |  2   1995        2   mother        B |
 12. |  3   1995        2    child        A |
     +--------------------------------------+

. append using file
adds the observations from file to the master file.

. append using file, keep (var1 var2 var3)
appends the file file and keeps only the three specified variables in the combined dataset.

. append using file1 file2, generate (filenum)
appends file1 and file2 to the master file and assigns a number to the observations to indicate the source file.

We can also create an id for each source file by ourselves that are meaningful to the datasets before appending the datasets.
. use file, clear
. generate str4 source = ”2016”
. save fileb
. use masterfile, clear
. generate str4 source = ”2015”
. append using fileb

. append using file1b file2b
appends file1 and file2 to the master file.

merge

merge combines datasets horizontally by adding variables to the existing dataset.

The datasets are combined by one or more key variables. The key variables must have the same names and data type (numeric or string), otherwise we will receive an error message.

If one dataset contains more observations than the other, missing values will be generated where not matched.

types of merges

There are three types of match merges that we frequently use.

merge 1:1 varlist using filename
In one-to-one merge, we use a unique key variable in both the master and the using datasets to merge files. In both files there should be only one unique value for the key variable.

merge 1:m varlist using filename
merge m:1 varlist using filename
In one-to-many and many-to-one merges, one observation from one dataset is matched to many observations in the other dataset. If the master dataset has many observations to match with the single observation in the using dataset, we use m:1; or we use 1:m if it is the using dataset that has many to match.

One-to-many or many-to-one merges are most frequently met when dealing with hierarchical data.

. use file5, clear
. list


     +---------------------+
     | family   area   SES |
     |---------------------|
  1. |      1      N     1 |
  2. |      2      S     2 |
     +---------------------+

. use file1, clear
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   1995        1    child        A |
  2. |  2   1995        1    child        B |
  3. |  3   1995        1   mother        B |
  4. |  1   1995        2   father        C |
  5. |  2   1995        2   mother        B |
     |--------------------------------------|
  6. |  3   1995        2    child        A |
     +--------------------------------------+

. merge m:1 family using file5


Result                           # of obs.
-----------------------------------------
not matched                             0
matched                                 6  (_merge==3)
-----------------------------------------

. list


   +-----------------------------------------------------------------+
   | id   year   family     role   rating   area   SES        _merge |
   |-----------------------------------------------------------------|
1. |  1   1995        1    child        A      N     1   matched (3) |
2. |  2   1995        1    child        B      N     1   matched (3) |
3. |  3   1995        1   mother        B      N     1   matched (3) |
4. |  1   1995        2   father        C      S     2   matched (3) |
5. |  2   1995        2   mother        B      S     2   matched (3) |
   |-----------------------------------------------------------------|
6. |  3   1995        2    child        A      S     2   matched (3) |
   +-----------------------------------------------------------------+

The fourth type, many-to-many merges, can yield random results and be problematic. We will not discuss it here.

multiple key variables

We can merge files using more than one key variable when a single variable is not adequate to match the two datasets.

. use file3, clear
. list


     +----------------------+
     | id   family     role |
     |----------------------|
  1. |  1        1    child |
  2. |  2        1    child |
  3. |  3        1   mother |
  4. |  1        2   father |
  5. |  2        2   mother |
     |----------------------|
  6. |  3        2    child |
     +----------------------+

. use file4, clear
. list


     +-----------------------------+
     | id   year   family   rating |
     |-----------------------------|
  1. |  1   1995        1        A |
  2. |  2   1995        1        B |
  3. |  3   1995        1        B |
  4. |  1   1995        2        C |
  5. |  2   1995        2        B |
     |-----------------------------|
  6. |  3   1995        2        A |
     +-----------------------------+

. merge 1:1 id family using file3


Result                           # of obs.
-----------------------------------------
not matched                             0
matched                                 6  (_merge==3)
-----------------------------------------

. list


   +----------------------------------------------------+
   | id   year   family   rating     role        _merge |
   |----------------------------------------------------|
1. |  1   1995        1        A    child   matched (3) |
2. |  1   1995        2        C   father   matched (3) |
3. |  2   1995        1        B    child   matched (3) |
4. |  2   1995        2        B   mother   matched (3) |
5. |  3   1995        1        B   mother   matched (3) |
   |----------------------------------------------------|
6. |  3   1995        2        A    child   matched (3) |
   +----------------------------------------------------+

_merge

After the merge the _merge variable will be automatically created. It tells us how the observations have been matched.

The _merge variable, in the simple cases, has three values:
_merge == 1, observation appeared in the master file only
_merge == 2, observation appeared in the using file only
_merge == 3, observation appeared in both files

The _merge variable must be dropped or renamed before we perform the next merge.

tab _merge to find out how your merge went.

joinby

joinby forms all pairwise combinations within groups.

. use file6, clear
. list


   +---------------------------+
   | industry   company   rank |
   |---------------------------|
1. |        A         a      1 |
2. |        A         b      2 |
3. |        A         c      3 |
4. |        B         l      4 |
5. |        B         m      5 |
   |---------------------------|
6. |        B         n      6 |
   +---------------------------+

. use file7, clear
. list


   +-----------------+
   | industry   year |
   |-----------------|
1. |        A   1995 |
2. |        A   2005 |
3. |        B   1995 |
4. |        B   2005 |
   +-----------------+

. use file6, clear
. joinby industry using file7
. list


     +----------------------------------+
     | industry   company   rank   year |
     |----------------------------------|
  1. |        A         a      1   1995 |
  2. |        A         a      1   2005 |
  3. |        A         b      2   1995 |
  4. |        A         b      2   2005 |
  5. |        A         c      3   1995 |
     |----------------------------------|
  6. |        A         c      3   2005 |
  7. |        B         l      4   2005 |
  8. |        B         l      4   1995 |
  9. |        B         m      5   1995 |
 10. |        B         m      5   2005 |
     |----------------------------------|
 11. |        B         n      6   1995 |
 12. |        B         n      6   2005 |
     +----------------------------------+

For the unmatched observations from the datasets, the default is to be ignored. But you can also set them to be kept in different ways by the unmatched(both/master/using) option.