head(), str(), summary() and table() are useful functions for getting a sense of our data: the structure of our objects and their summary statistics.
head() returns the first several rows of an object. The default number of rows is 6.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We can set the number of rows to display with the n argument.
head(mtcars, n = 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
The tail() function, by contrast, returns the last several rows of the object.
tail(mtcars, 2)
##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
## Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
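Both functions also accept a negative n, which is easy to overlook. The short sketch below illustrates this with the built-in mtcars data (32 rows): a negative n means "everything except that many rows" from the opposite end. The object names first_six and last_two are made up for this example.

```r
# head() with negative n: all rows except the last 26.
# mtcars has 32 rows, so this keeps the first 6.
first_six <- head(mtcars, n = -26)

# tail() with negative n: all rows except the first 30,
# which keeps the last 2.
last_two <- tail(mtcars, n = -30)

# Both should match the positive-n calls used above
identical(first_six, head(mtcars))
identical(last_two, tail(mtcars, 2))
```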
The str() function displays the structure of an object.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can also inspect a single vector, such as one column of the data frame.
str(iris$Species)
## Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary() provides us with the basic summary statistics.
Summarizing the dataset:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
Summarizing a column:
summary(iris$Species)
## setosa versicolor virginica
## 50 50 50
summary() will also tell us the number of missing values, if there are any.
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## filter, lag
## The following objects are masked from 'package:base':
## intersect, setdiff, setequal, union
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 69.05 138.09 166.76 241.66 1001.18 6528
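As a small self-contained illustration of the same behaviour, the sketch below uses the Ozone column of the built-in airquality data, which contains missing values. This dataset is our choice for the example and is not the one summarized above; the names oz and n_missing are made up here.

```r
# Ozone readings from the built-in airquality data; some are missing
oz <- airquality$Ozone

# summary() adds an "NA's" entry counting the missing values
summary(oz)

# The same count, computed directly
n_missing <- sum(is.na(oz))
n_missing

# Missing values propagate through most statistics unless removed
mean(oz)                # NA
mean(oz, na.rm = TRUE)  # mean of the observed values only
```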
table() gives us a frequency table.
table(iris$Species)
## setosa versicolor virginica
## 50 50 50
table() is even more useful for cross-tabulation.
table(iris$Species, iris$Petal.Width)
##              0.1 0.2 0.3 0.4 0.5 0.6  1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
##   setosa       5  29   7   7   1   1  0   0   0   0   0   0   0   0   0
##   versicolor   0   0   0   0   0   0  7   3   5  13   7  10   3   1   1
##   virginica    0   0   0   0   0   0  0   0   0   0   1   2   1   1  11
##
##              1.9  2 2.1 2.2 2.3 2.4 2.5
##   setosa       0  0   0   0   0   0   0
##   versicolor   0  0   0   0   0   0   0
##   virginica    5  6   6   3   8   3   3
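A cross-tabulation against a numeric variable with many distinct values gets wide quickly. One possible way to make it more compact, sketched below, is to bin the numeric variable with cut() first. The break points and the names width_bin and tab are arbitrary choices for this example.

```r
# Bin Petal.Width into intervals (0,0.5], (0.5,1], ..., (2,2.5]
width_bin <- cut(iris$Petal.Width, breaks = c(0, 0.5, 1, 1.5, 2, 2.5))

# Cross-tabulate species against the binned widths
tab <- table(iris$Species, width_bin)
tab

# Add row and column totals
addmargins(tab)

# Row proportions: share of each species falling into each bin
prop.table(tab, margin = 1)
```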
To examine a subset of data, we can print the cases in the R console by subsetting the data frame.
Rows (cases) and columns (variables):
mtcars[10:15, 2:5]
##                    cyl  disp  hp drat
## Merc 280             6 167.6 123 3.92
## Merc 280C            6 167.6 123 3.92
## Merc 450SE           8 275.8 180 3.07
## Merc 450SL           8 275.8 180 3.07
## Merc 450SLC          8 275.8 180 3.07
## Cadillac Fleetwood   8 472.0 205 2.93
mtcars[mtcars$mpg > 25, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
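Row conditions and column selection can be combined in a single call. The sketch below shows two roughly equivalent ways to do this: bracket indexing with a logical condition, and the base R subset() function. The object names and the particular condition (mpg above 25 and 5 gears) are arbitrary.

```r
# Bracket indexing: logical condition for rows, names for columns
by_brackets <- mtcars[mtcars$mpg > 25 & mtcars$gear == 5,
                      c("mpg", "cyl", "gear")]
by_brackets

# subset(): column names can be used unquoted inside the call
by_subset <- subset(mtcars, mpg > 25 & gear == 5,
                    select = c(mpg, cyl, gear))

# Both approaches should select the same rows and columns
all.equal(by_brackets, by_subset)
```

subset() is convenient for interactive work, while bracket indexing is generally the safer choice inside functions.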
cor(), t.test() and lm() are useful functions for exploratory analysis.
cor() produces correlations.
cor(mtcars[, 1:3])
##             mpg        cyl       disp
## mpg   1.0000000 -0.8521620 -0.8475514
## cyl  -0.8521620  1.0000000  0.9020329
## disp -0.8475514  0.9020329  1.0000000
The use argument lets us decide how to handle missing data. For instance, we can apply pairwise deletion of missing values.
a <- mtcars
a[2, 1] <- NA
a[5, 3] <- NA
cor(a[,1:3], use = "pairwise.complete.obs")
##             mpg        cyl       disp
## mpg   1.0000000 -0.8521139 -0.8578945
## cyl  -0.8521139  1.0000000  0.8984661
## disp -0.8578945  0.8984661  1.0000000
Or we can apply listwise deletion of missing values.
cor(a[,1:3], use = "complete.obs")
##             mpg        cyl       disp
## mpg   1.0000000 -0.8600113 -0.8578945
## cyl  -0.8600113  1.0000000  0.9017140
## disp -0.8578945  0.9017140  1.0000000
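To make the difference between the two options concrete, the sketch below reproduces each by hand. Listwise deletion drops every row that has any missing value before computing all correlations, whereas pairwise deletion drops rows separately for each pair of variables. The object names (b, listwise, by_hand, pairwise, ok) are made up for this example.

```r
# Same setup as above, on a three-column copy of mtcars
b <- mtcars[, 1:3]
b[2, "mpg"] <- NA
b[5, "disp"] <- NA

# Listwise deletion is equivalent to removing incomplete rows first
listwise <- cor(b, use = "complete.obs")
by_hand  <- cor(na.omit(b))
all.equal(listwise, by_hand)

# Pairwise deletion uses, for each pair, the rows complete for that pair
pairwise <- cor(b, use = "pairwise.complete.obs")
ok <- complete.cases(b[, c("mpg", "cyl")])
all.equal(pairwise["mpg", "cyl"], cor(b$mpg[ok], b$cyl[ok]))
```

Here the mpg and cyl correlation keeps row 5 under pairwise deletion but loses it under listwise deletion, which is why the two matrices differ in that cell.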
t.test() performs one- and two-sample t-tests.
Before performing a t-test, let's create a grouping variable in the mtcars data using an arbitrary cutoff.
mtcars$group <- NA
mtcars$group[mtcars$cyl > 4] <- 1
mtcars$group[mtcars$cyl <= 4] <- 2
Now we can perform an independent two-group t-test by group:
t.test(mpg ~ group, data = mtcars)
## Welch Two Sample t-test
## data: mpg by group
## t = -6.5737, df = 15.266, p-value = 8.09e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.258678 -6.773357
## sample estimates:
## mean in group 1 mean in group 2
## 16.64762 26.66364
We can also run paired and one-sample t-tests with t.test().
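A minimal sketch of both variants follows. The reference value mu = 20 is an arbitrary choice for illustration, and the paired example uses the built-in sleep data (ten subjects, each measured under two drugs) rather than mtcars, because mtcars has no natural pairing. The object names are made up for this example.

```r
# One-sample t-test: does mean mpg differ from an arbitrary value of 20?
one_sample <- t.test(mtcars$mpg, mu = 20)
one_sample

# Paired t-test: the sleep data list the same 10 subjects in the same
# order within each group, so the two vectors are paired by position
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)
paired
```

The two vectors are passed separately here instead of using a formula, since the formula interface does not accept the paired argument in recent versions of R.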
lm() fits linear models. Below are some arbitrary examples of using lm() to fit linear regressions. In the model formula, + adds independent variables, : specifies an interaction without main effects, and * specifies an interaction together with the main effects.
model1 <- lm(mpg ~ cyl, data = mtcars)
model2 <- lm(mpg ~ cyl + disp + vs, data = mtcars)
model3 <- lm(mpg ~ cyl : disp, data = mtcars)
model4 <- lm(mpg ~ cyl * disp, data = mtcars)
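One way to check what each operator does is to look at the names of the fitted coefficients. The sketch below refits two of the models under made-up names (m_star, m_long, m_colon) and confirms that cyl * disp is shorthand for cyl + disp + cyl:disp.

```r
# Interaction with main effects, written two ways
m_star <- lm(mpg ~ cyl * disp, data = mtcars)
m_long <- lm(mpg ~ cyl + disp + cyl:disp, data = mtcars)

names(coef(m_star))                    # intercept, cyl, disp, cyl:disp
all.equal(coef(m_star), coef(m_long))  # the two fits are the same

# Interaction only: no separate cyl or disp terms
m_colon <- lm(mpg ~ cyl:disp, data = mtcars)
names(coef(m_colon))                   # intercept and cyl:disp
```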
summary() returns the summary statistics of the model.
summary(model2)
## Call:
## lm(formula = mpg ~ cyl + disp + vs, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3899 -2.0944 -0.6386  1.2222  7.0974
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.88091    4.47352   8.021 9.82e-09 ***
## cyl         -1.75044    0.87236  -2.007   0.0545 .
## disp        -0.02029    0.01045  -1.941   0.0624 .
## vs          -0.63372    1.89594  -0.334   0.7407
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.103 on 28 degrees of freedom
## Multiple R-squared:  0.7605, Adjusted R-squared:  0.7349
## F-statistic: 29.64 on 3 and 28 DF,  p-value: 7.792e-09
We can also extract the model components we are interested in, which are stored in a list.
summary(model2)$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 35.8809111 4.47351580  8.0207409 9.819906e-09
## cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
## disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
## vs          -0.6337231 1.89593627 -0.3342534 7.406791e-01
summary(model2)$residuals
## Mazda RX4 Mazda RX4 Wag Datsun 710
## -1.13129465 -1.13129465 -3.25371876
## Hornet 4 Drive Hornet Sportabout Valiant
## 1.89121145 4.12832076 -2.07848079
## Duster 360 Merc 240D Merc 230
## -0.27167924 -0.86835242 -2.58808527
## Merc 280 Merc 280C Merc 450SE
## -2.14333940 -3.54333940 0.11959088
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 1.01959088 -1.08040912 -1.89878439
## Lincoln Continental Chrysler Imperial Fiat 128
## -2.14230884 1.75181708 5.75167571
## Honda Civic Toyota Corolla Toyota Corona
## 3.69079460 7.09744356 -4.30816494
## Dodge Challenger AMC Javelin Camaro Z28
## 0.07598519 -0.50812667 -1.47461628
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 5.44006892 0.65776382 -0.43782931
## Lotus Europa Ford Pantera L Ferrari Dino
## 4.08449245 1.04567742 -2.73570021
## Maserati Bora Volvo 142E
## -0.76900778 -4.38990061
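To see which components are available, we can list the names of the model object and of its summary. The sketch below refits the same formula as model2 under the made-up name fit, and shows the accessor functions coef() and residuals() as alternatives to the $ operator.

```r
# Same formula as model2, refitted so the example is self-contained
fit <- lm(mpg ~ cyl + disp + vs, data = mtcars)
fit_summary <- summary(fit)

# Both objects are lists; names() shows what they contain
names(fit)
names(fit_summary)

coef(fit)                  # coefficient estimates only
fit_summary$coefficients   # estimates, standard errors, t and p values
fit_summary$r.squared      # multiple R-squared
head(residuals(fit), 3)    # first three residuals
```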