Spreading Rows in Hierarchical Data

fillin; expand

fillin

fillin varlist creates additional rows of observations by filling in all combinations of the specified variables. A new variable _fillin will be created automatically to indicate where the observations come from: 1 if the observations come from the original dataset, or 0 if they are filled in.

Suppose we have a dataset on a course.
. list


     +-----------------------------------+
     | id   course   placem~t   attend~e |
     |-----------------------------------|
  1. |  1        A          1          3 |
  2. |  1        B          2          4 |
  3. |  2        A          3          2 |
  4. |  2        C          1          3 |
  5. |  3        B          2          4 |
     |-----------------------------------|
  6. |  3        C          3          5 |
     +-----------------------------------+

. fillin id course adds observations from all combinations of id and course; missing values have been created where the person does not have a course record.
. list


     +---------------------------------------------+
     | id   course   placem~t   attend~e   _fillin |
     |---------------------------------------------|
  1. |  1        A          1          3         0 |
  2. |  1        B          2          4         0 |
  3. |  1        C          .          .         1 |
  4. |  2        A          3          2         0 |
  5. |  2        B          .          .         1 |
     |---------------------------------------------|
  6. |  2        C          1          3         0 |
  7. |  3        A          .          .         1 |
  8. |  3        B          2          4         0 |
  9. |  3        C          3          5         0 |
     +---------------------------------------------+

expand

expand [=]exp creates duplicates of each observation with n copies specified in the expression, where the original observation is kept and n-1 copies are created.

Case 1

Using the same data before fillin above:
. expand attendance makes duplicates of the observations by the number of attendance so that each person will have a record of each attendance of each course. We can then, for instance, add course performance data to each attendance.
. list
. sort id course


     +-----------------------------------+
     | id   course   placem~t   attend~e |
     |-----------------------------------|
  1. |  1        A          1          3 |
  2. |  1        A          1          3 |
  3. |  1        A          1          3 |
  4. |  1        B          2          4 |
  5. |  1        B          2          4 |
     |-----------------------------------|
  6. |  1        B          2          4 |
  7. |  1        B          2          4 |
  8. |  2        A          3          2 |
  9. |  2        A          3          2 |
 10. |  2        C          1          3 |
     |-----------------------------------|
 11. |  2        C          1          3 |
 12. |  2        C          1          3 |
 13. |  3        B          2          4 |
 14. |  3        B          2          4 |
 15. |  3        B          2          4 |
     |-----------------------------------|
 16. |  3        B          2          4 |
 17. |  3        C          3          5 |
 18. |  3        C          3          5 |
 19. |  3        C          3          5 |
 20. |  3        C          3          5 |
     |-----------------------------------|
 21. |  3        C          3          5 |
     +-----------------------------------+

expand [=]exp, generate(newvar) creates a new variable to indicate if the observations come from the existing dataset or if they are the expanded ones.

Case 2

We had a dataset with variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10


     +------------------------------------------------------------------------------------------------------------+
     | company                               sector       id              datetime   rank   score   year   rating |
     |------------------------------------------------------------------------------------------------------------|
  1. |    0000                         Banks/Midcap    79092   2014.03.10 00:25:00      1   15.61   2013        3 |
  2. |    0001                       Integrated Oil   132083   2014.02.28 19:22:00      4    9.89   2013        2 |
  3. |    0001   Oil & Gas Exploration & Production    71894   2014.02.26 00:17:00      4    5.57   2013        3 |
  4. |    0001   Oil & Gas Exploration & Production     9928   2014.02.28 04:15:00      2    8.76   2013        3 |
  5. |    0001   Oil & Gas Exploration & Production   122739   2014.02.26 03:05:00      3    8.11   2013        3 |
     |------------------------------------------------------------------------------------------------------------|
  6. |    0001   Oil & Gas Exploration & Production   122739   2014.11.10 04:13:00      3    8.11   2014        2 |
  7. |    000R                             Internet   142116   2014.02.27 01:10:00      4    9.14   2013        3 |
  8. |    000R                             Internet   142116   2014.02.18 00:18:00      4    9.14   2013        2 |
  9. |    000R                             Internet    73111   2014.02.19 15:21:00      1   14.96   2013        2 |
 10. |    000R                             Internet    73111   2014.10.31 06:50:00      1   14.96   2013        1 |
     +------------------------------------------------------------------------------------------------------------+

We wanted to build models comparing results on ranks 1 versus 2, 2 versus 3, 3 versus the first runner-up, and the last runner-up versus the first non-runner-up. All sounded fine, until it occurred to us that there could be only one runner-up within a group (sector * year).

In this case, we would want to expand the groups where we only have one runner up, and recode it to rank 4 and 5; the first non-runner-up would be rank 6. Below is what we did.


gen rank2 = .
replace rank2 = rank if rank <= 3

//expand the RUs if we only have one rank == 4 
gen expd = .
by sector year rank (score), sort: replace expd = 1 if score[1] == score[_N] & rank == 4 
expand expd + 1 if expd == 1
sort sector year rank id score
sort _all
by sector year rank, sort: replace rank2 = 4 if mod(_n,2) == 1 & expd == 1
by sector year rank, sort: replace rank3 = 5 if mod(_n,2) == 0 & expd == 1

//1st RU, last RU
by sector year rank, sort: egen max = max(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 4 if rank == 4 & score == max & expd != 1

by sector year rank, sort: egen min = min(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 5 if rank == 4 & score == min & expd != 1

Case 3

An example of using fillin and expand to change time periods in panel data:
Social Science Computing Cooperative, UW-Madison, Stata Programming Techniques for Panel Data: Changing Time Periods

by prefix with sum(), max(), min(), mean() etc.

by prefix in combination with functions sum(), max(), min(), and mean() can serve many purposes and be quite helpful in handling hierarchical data.

Results Spreading with mean() in calculating summary statistics:
Social Science Computing Cooperative, UW-Madison, Stata for Researchers: Working with Groups

Another example on spreading results with sum() in creating group id:
Nicholas J. Cox and William Gould, How do I create individual identifiers numbered from 1 upwards?

Creating indicators with sum() to refer to locations of certain values of a variable:
Nicholas J. Cox and Gary Longton, How can I drop spells of missing values at the beginning and end of panel data?