fillin varlist creates additional rows of observations by filling in all combinations of the specified variables. A new variable _fillin will be created automatically to indicate where the observations come from: 1 if the observations come from the original dataset, or 0 if they are filled in.
Suppose we have a dataset on a course.
. list
+-----------------------------------+
| id course placem~t attend~e |
|-----------------------------------|
1. | 1 A 1 3 |
2. | 1 B 2 4 |
3. | 2 A 3 2 |
4. | 2 C 1 3 |
5. | 3 B 2 4 |
|-----------------------------------|
6. | 3 C 3 5 |
+-----------------------------------+
. fillin id course adds observations from all combinations of id and course; missing values have been created where the person does not have a course record.
. list
+---------------------------------------------+
| id course placem~t attend~e _fillin |
|---------------------------------------------|
1. | 1 A 1 3 0 |
2. | 1 B 2 4 0 |
3. | 1 C . . 1 |
4. | 2 A 3 2 0 |
5. | 2 B . . 1 |
|---------------------------------------------|
6. | 2 C 1 3 0 |
7. | 3 A . . 1 |
8. | 3 B 2 4 0 |
9. | 3 C 3 5 0 |
+---------------------------------------------+
expand [=]exp creates duplicates of each observation with n copies specified in the expression, where the original observation is kept and n-1 copies are created.
Using the same data before fillin above:
. expand attendance makes duplicates of the observations by the number of attendance so that each person will have a record of each attendance of each course. We can then, for instance, add course performance data to each attendance.
. list
. sort id course
+-----------------------------------+
| id course placem~t attend~e |
|-----------------------------------|
1. | 1 A 1 3 |
2. | 1 A 1 3 |
3. | 1 A 1 3 |
4. | 1 B 2 4 |
5. | 1 B 2 4 |
|-----------------------------------|
6. | 1 B 2 4 |
7. | 1 B 2 4 |
8. | 2 A 3 2 |
9. | 2 A 3 2 |
10. | 2 C 1 3 |
|-----------------------------------|
11. | 2 C 1 3 |
12. | 2 C 1 3 |
13. | 3 B 2 4 |
14. | 3 B 2 4 |
15. | 3 B 2 4 |
|-----------------------------------|
16. | 3 B 2 4 |
17. | 3 C 3 5 |
18. | 3 C 3 5 |
19. | 3 C 3 5 |
20. | 3 C 3 5 |
|-----------------------------------|
21. | 3 C 3 5 |
+-----------------------------------+
expand [=]exp, generate(newvar) creates a new variable to indicate if the observations come from the existing dataset or if they are the expanded ones.
We had a dataset with variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10
+------------------------------------------------------------------------------------------------------------+
| company sector id datetime rank score year rating |
|------------------------------------------------------------------------------------------------------------|
1. | 0000 Banks/Midcap 79092 2014.03.10 00:25:00 1 15.61 2013 3 |
2. | 0001 Integrated Oil 132083 2014.02.28 19:22:00 4 9.89 2013 2 |
3. | 0001 Oil & Gas Exploration & Production 71894 2014.02.26 00:17:00 4 5.57 2013 3 |
4. | 0001 Oil & Gas Exploration & Production 9928 2014.02.28 04:15:00 2 8.76 2013 3 |
5. | 0001 Oil & Gas Exploration & Production 122739 2014.02.26 03:05:00 3 8.11 2013 3 |
|------------------------------------------------------------------------------------------------------------|
6. | 0001 Oil & Gas Exploration & Production 122739 2014.11.10 04:13:00 3 8.11 2014 2 |
7. | 000R Internet 142116 2014.02.27 01:10:00 4 9.14 2013 3 |
8. | 000R Internet 142116 2014.02.18 00:18:00 4 9.14 2013 2 |
9. | 000R Internet 73111 2014.02.19 15:21:00 1 14.96 2013 2 |
10. | 000R Internet 73111 2014.10.31 06:50:00 1 14.96 2013 1 |
+------------------------------------------------------------------------------------------------------------+
We wanted to build models comparing results on ranks 1 versus 2, 2 versus 3, 3 versus the first runner-up, and the last runner-up versus the first non-runner-up. All sounded fine, until it occurred to us that there could be only one runner-up within a group (sector * year).
In this case, we would want to expand the groups where we only have one runner up, and recode it to rank 4 and 5; the first non-runner-up would be rank 6. Below is what we did.
gen rank2 = .
replace rank2 = rank if rank <= 3
//expand the RUs if we only have one rank == 4
gen expd = .
by sector year rank (score), sort: replace expd = 1 if score[1] == score[_N] & rank == 4
expand expd + 1 if expd == 1
sort sector year rank id score
sort _all
by sector year rank, sort: replace rank2 = 4 if mod(_n,2) == 1 & expd == 1
by sector year rank, sort: replace rank3 = 5 if mod(_n,2) == 0 & expd == 1
//1st RU, last RU
by sector year rank, sort: egen max = max(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 4 if rank == 4 & score == max & expd != 1
by sector year rank, sort: egen min = min(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 5 if rank == 4 & score == min & expd != 1
An example of using fillin and expand to change time periods in panel data:
Social Science Computing Cooperative, UW-Madison, Stata Programming Techniques for Panel Data: Changing Time Periods
by prefix in combination with functions sum(), max(), min(), and mean() can serve many purposes and be quite helpful in handling hierarchical data.
Results Spreading with mean() in calculating summary statistics:
Social Science Computing Cooperative, UW-Madison, Stata for Researchers: Working with Groups
Another example on spreading results with sum() in creating group id:
Nicholas J. Cox and William Gould, How do I create individual identifiers numbered from 1 upwards?
Creating indicators with sum() to refer to locations of certain values of a variable:
Nicholas J. Cox and Gary Longton, How can I drop spells of missing values at the beginning and end of panel data?