Reading and writing Stata binary files is much faster than using text files. We may want to work with a file in Stata format as early as possible.
Stata will always go to the current working directory when working with files.
use filename, clear loads a Stata file (.dta) from the current working directory. clear clears the memory for Stata to load a new file. If the workspace is already clear, we can skip that as well.
save filename, replace saves a Stata file to the current working directory. replace overwrites the file saved earlier with the same file name. If such a file does not exist, we can skip the replace option.
To load a file located in a subdirectory under the working directory, we need to specify the relative path.
. use folder/mydata.dta, clear
To tell Stata to read a file elsewhere, we need to specify a full path.
. use ~/elsewhere/mydata.dta, clear
We can also read a file over the Internet.
. use http://www.stata-press.com/data/r14/auto, clear
sysuse loads example datasets installed with Stata
. sysuse auto, clear
import/export delimited reads/saves text-delimited files. The data can be tab-separated (often .txt) or comma-separated (.csv). If the extension is not specified, Stata assumes that it is a .csv file.
. import delimited filename, clear
. import delimited var1 var2 var3 using filename.csv, clear
loads the dataset and names the three variables var1, var2 and var3.
. export delimited output.txt, delimiter(tab) exports a file and saves it as a tab-delimited text file.
Sometimes we need to take a few more steps to read a text file. Let's look at a real case.
Let's say we have a text file downloaded from a database and we tried to read it into Stata by
. import delimited output.txt
. list in 1/10
+-------------------------------------------------------------------------------------------+
| v1 |
|-------------------------------------------------------------------------------------------|
1. | |
2. | .. |
3. | |
4. | ticker evtdate car |
5. | EGAS 12FEB2014 0.02356 |
|-------------------------------------------------------------------------------------------|
6. | EGAS 27JUL2012 -0.00276 |
7. | HGR 03JUL2002 0.01478 |
8. | HGR 03MAY2005 -0.05070 |
9. | HGR 05AUG2002 0.02337 |
10. | HGR 20JAN2004 0.04206 |
+-------------------------------------------------------------------------------------------+
We can see that all variables are squeezed into one column "v1", that there are empty rows, and that the variable names actually are the fourth row. To fix those problems, first let's try
. import delimited output.txt, rowrange(4:l) varnames(4) delimiters(space)
. list ticker-car in 1/5
rowrange(4:l) reads data from the fourth to the last row (l stands for "last"). varnames(4) means the fourth row is for the variable names. delimiters(space) tells Stata to use space as the delimiters.
+--------------------------------------------------------------------------------------------------------+
| ticker v2 v3 v4 v5 v6 evtdate v8 v9 v10 v11 v12 v13 car |
|--------------------------------------------------------------------------------------------------------|
1. | EGAS . . . . 12FEB2014 . .02356 . |
2. | EGAS . . . . 27JUL2012 -.00276 . . |
3. | HGR . . . . 03JUL2002 . . .01478 |
4. | HGR . . . . 03MAY2005 . -.0507 . |
5. | HGR . . . . 05AUG2002 . . .02337 |
+--------------------------------------------------------------------------------------------------------+
Now the first few empty rows have been removed; and (some) variables are properly named. However, there was something not right. We had three variables, and now we've got a lot more (260! if we list them all). What happened? Stata used multiple spaces as delimiters, and due to that values from one variable went to different columns.
Here is what we want to do:
. import delimited output.txt, rowrange(4:l) varnames(4) delimiters(space, collapse)
Here delimiters(space, collapse) collapsed the multiple spaces into just one.
. list in 1/5
+----------------------------------------+
| ticker evtdate car v4 |
|----------------------------------------|
1. | EGAS 12FEB2014 .02356 |
2. | EGAS 27JUL2012 -.00276 |
3. | HGR 03JUL2002 .01478 |
4. | HGR 03MAY2005 -.0507 |
5. | HGR 05AUG2002 .02337 |
+----------------------------------------+
Now it's about right. We got the correct variable names, just in the wrong places. We need to fix that.
. drop ticker
. rename evtdate ticker
. rename car evtdate
. rename v4 car
. list in 1/5
+------------------------------+
| ticker evtdate car |
|------------------------------|
1. | EGAS 12FEB2014 .02356 |
2. | EGAS 27JUL2012 -.00276 |
3. | HGR 03JUL2002 .01478 |
4. | HGR 03MAY2005 -.0507 |
5. | HGR 05AUG2002 .02337 |
+------------------------------+
Done.
import/export excel reads/saves worksheets from Microsoft Excel (.xls and .xlsx) files.
. import excel filename.xlsx, clear loads the dataset but the variable names will be lost. What we see are Excel's column names.
. import excel filename.xls, firstrow loads the dataset with the original variable names.
. export excel output.xlsx, firstrow(varlabels) sheet("output1") saves the dataset as an Excel file "output.xlsx" to the sheet "output1".