append binds datasets vertically by adding observations from the using dataset to the master dataset.
The datasets should contain mostly the same variables; absent variables from one dataset will be given missing values in the combined file. append applies, for instance, when we want to combine datasets of different times or regions with mostly the same variables.
. use file1
. list
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 1995 1 child A |
2. | 2 1995 1 child B |
3. | 3 1995 1 mother B |
4. | 1 1995 2 father C |
5. | 2 1995 2 mother B |
|--------------------------------------|
6. | 3 1995 2 child A |
+--------------------------------------+
. use file2, clear
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 2005 1 child A |
2. | 2 2005 1 child A |
3. | 3 2005 1 mother B |
4. | 1 2005 2 father B |
5. | 2 2005 2 mother C |
|--------------------------------------|
6. | 3 2005 2 child B |
+--------------------------------------+
. append using file1
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 2005 1 child A |
2. | 2 2005 1 child A |
3. | 3 2005 1 mother B |
4. | 1 2005 2 father B |
5. | 2 2005 2 mother C |
|--------------------------------------|
6. | 3 2005 2 child B |
7. | 1 1995 1 child A |
8. | 2 1995 1 child B |
9. | 3 1995 1 mother B |
10. | 1 1995 2 father C |
|--------------------------------------|
11. | 2 1995 2 mother B |
12. | 3 1995 2 child A |
+--------------------------------------+
. append using file
adds the observations from file to the master file.
. append using file, keep (var1 var2 var3)
appends the file file and keeps only the three specified variables in the combined dataset.
. append using file1 file2, generate (filenum)
appends file1 and file2 to the master file and assigns a number to the observations to indicate the source file.
We can also create an id for each source file by ourselves that are meaningful to the datasets before appending the datasets.
. use file, clear
. generate str4 source = ”2016”
. save fileb
. use masterfile, clear
. generate str4 source = ”2015”
. append using fileb
. append using file1b file2b
appends file1 and file2 to the master file.
merge combines datasets horizontally by adding variables to the existing dataset.
The datasets are combined by one or more key variables. The key variables must have the same names and data type (numeric or string), otherwise we will receive an error message.
If one dataset contains more observations than the other, missing values will be generated where not matched.
There are three types of match merges that we frequently use.
merge 1:1 varlist using filename
In one-to-one merge, we use a unique key variable in both the master and the using datasets to merge files. In both files there should be only one unique value for the key variable.
merge 1:m varlist using filename
merge m:1 varlist using filename
In one-to-many and many-to-one merges, one observation from one dataset is matched to many observations in the other dataset. If the master dataset has many observations to match with the single observation in the using dataset, we use m:1; or we use 1:m if it is the using dataset that has many to match.
One-to-many or many-to-one merges are most frequently met when dealing with hierarchical data.
. use file5, clear
. list
+---------------------+
| family area SES |
|---------------------|
1. | 1 N 1 |
2. | 2 S 2 |
+---------------------+
. use file1, clear
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 1995 1 child A |
2. | 2 1995 1 child B |
3. | 3 1995 1 mother B |
4. | 1 1995 2 father C |
5. | 2 1995 2 mother B |
|--------------------------------------|
6. | 3 1995 2 child A |
+--------------------------------------+
. merge m:1 family using file5
Result # of obs.
-----------------------------------------
not matched 0
matched 6 (_merge==3)
-----------------------------------------
. list
+-----------------------------------------------------------------+
| id year family role rating area SES _merge |
|-----------------------------------------------------------------|
1. | 1 1995 1 child A N 1 matched (3) |
2. | 2 1995 1 child B N 1 matched (3) |
3. | 3 1995 1 mother B N 1 matched (3) |
4. | 1 1995 2 father C S 2 matched (3) |
5. | 2 1995 2 mother B S 2 matched (3) |
|-----------------------------------------------------------------|
6. | 3 1995 2 child A S 2 matched (3) |
+-----------------------------------------------------------------+
The fourth type, many-to-many merges, can yield random results and be problematic. We will not discuss it here.
We can merge files using more than one key variable when a single variable is not adequate to match the two datasets.
. use file3, clear
. list
+----------------------+
| id family role |
|----------------------|
1. | 1 1 child |
2. | 2 1 child |
3. | 3 1 mother |
4. | 1 2 father |
5. | 2 2 mother |
|----------------------|
6. | 3 2 child |
+----------------------+
. use file4, clear
+-----------------------------+
| id year family rating |
|-----------------------------|
1. | 1 1995 1 A |
2. | 2 1995 1 B |
3. | 3 1995 1 B |
4. | 1 1995 2 C |
5. | 2 1995 2 B |
|-----------------------------|
6. | 3 1995 2 A |
+-----------------------------+
. merge 1:1 id family using file3
Result # of obs.
-----------------------------------------
not matched 0
matched 6 (_merge==3)
-----------------------------------------
. list
+----------------------------------------------------+
| id year family rating role _merge |
|----------------------------------------------------|
1. | 1 1995 1 A child matched (3) |
2. | 1 1995 2 C father matched (3) |
3. | 2 1995 1 B child matched (3) |
4. | 2 1995 2 B mother matched (3) |
5. | 3 1995 1 B mother matched (3) |
|----------------------------------------------------|
6. | 3 1995 2 A child matched (3) |
+----------------------------------------------------+
After the merge the _merge variable will be automatically created. It tells us how the observations have been matched.
The _merge variable, in the simple cases, has three values:
_merge == 1, observation appeared in the master file only
_merge == 2, observation appeared in the using file only
_merge == 3, observation appeared in both files
The _merge variable must be dropped or renamed before we perform the next merge.
tab _merge to find out how your merge went.
joinby forms all pairwise combinations within groups.
. use file6, clear
. list
+---------------------------+
| industry company rank |
|---------------------------|
1. | A a 1 |
2. | A b 2 |
3. | A c 3 |
4. | B l 4 |
5. | B m 5 |
|---------------------------|
6. | B n 6 |
+---------------------------+
. use file7, clear
+-----------------+
| industry year |
|-----------------|
1. | A 1995 |
2. | A 2005 |
3. | B 1995 |
4. | B 2005 |
+-----------------+
. use file6, clear
+----------------------------------+
| industry company rank year |
|----------------------------------|
1. | A a 1 1995 |
2. | A a 1 2005 |
3. | A b 2 1995 |
4. | A b 2 2005 |
5. | A c 3 1995 |
|----------------------------------|
6. | A c 3 2005 |
7. | B l 4 2005 |
8. | B l 4 1995 |
9. | B m 5 1995 |
10. | B m 5 2005 |
|----------------------------------|
11. | B n 6 1995 |
12. | B n 6 2005 |
+----------------------------------+
For the unmatched observations from the datasets, the default is to be ignored. But you can also set them to be kept in different ways by the unmatched(both/master/using) option.