head(), str(), summary(), and table() are useful functions for getting a sense of our data in terms of the structure and summary statistics of the objects.
head() returns the first several rows of an object. The default number of rows is 6.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We can set the number of rows to display with the n argument.
head(mtcars, n = 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
The tail() function, by contrast, returns the last several rows of an object.
tail(mtcars, 2)
##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
## Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
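Both head() and tail() also accept a negative n, which drops rows instead of keeping them. The following is a minimal sketch using only the built-in mtcars data; the object names are made up for illustration.

```r
# mtcars has 32 rows.
all_but_last_two <- head(mtcars, n = -2)   # everything except the last 2 rows
all_but_first_30 <- tail(mtcars, n = -30)  # everything except the first 30 rows

nrow(all_but_last_two)
rownames(all_but_first_30)
```

A negative n is convenient when the number of rows to discard is known but the total number of rows is not.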
The str() function displays the structure of an object.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We can also inspect a single vector, such as one column of the data frame.
str(iris$Species)
## Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
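str() is not limited to data frames. As a small sketch, the nested list below (its name and contents are invented for illustration) shows how str() summarizes each element, and how the max.level argument limits how deep it descends.

```r
# A hypothetical nested list, made up for this example
study <- list(
  id     = 1L,
  tags   = c("pilot", "2020"),
  scores = data.frame(item = 1:3, value = c(2.5, 3.0, 4.1))
)

str(study)                 # full structure, including the nested data frame
str(study, max.level = 1)  # top-level elements only
```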
summary() provides us with the basic summary statistics.
Summarizing the dataset:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
Summarizing a column:
summary(iris$Species)
##     setosa versicolor  virginica
##         50         50         50
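summary() will also tell us the number of missing values, if there are any. The example below uses the storms data from the dplyr package; ts_diameter is the column name in the dplyr version used here, and other versions may name it differently.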
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(storms$ts_diameter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00   69.05  138.09  166.76  241.66 1001.18    6528
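To count missing values directly, one option is to combine is.na() with colSums(). This sketch uses the built-in airquality data, which has missing values in its Ozone and Solar.R columns, rather than storms, so it runs without extra packages; the variable name is an assumption for illustration.

```r
# airquality ships with base R
summary(airquality$Ozone)                    # the NA's entry reports the missing count

na_per_column <- colSums(is.na(airquality))  # missing values in each column
na_per_column

mean(airquality$Ozone)                # NA, because missing values propagate
mean(airquality$Ozone, na.rm = TRUE)  # drop missing values before averaging
```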
table() gives us a frequency table.
table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50
table() is even more useful for cross-tabulation.
table(iris$Species, iris$Petal.Width)
##
##              0.1 0.2 0.3 0.4 0.5 0.6  1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
##   setosa       5  29   7   7   1   1  0   0   0   0   0   0   0   0   0
##   versicolor   0   0   0   0   0   0  7   3   5  13   7  10   3   1   1
##   virginica    0   0   0   0   0   0  0   0   0   0   1   2   1   1  11
##
##              1.9  2 2.1 2.2 2.3 2.4 2.5
##   setosa       0  0   0   0   0   0   0
##   versicolor   0  0   0   0   0   0   0
##   virginica    5  6   6   3   8   3   3
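Cross-tabulations are easier to read when both variables have only a few distinct values. As a sketch, the example below tabulates cylinders against gears in mtcars (a choice made here for illustration), labels the dimensions with dnn, and then adds totals and row proportions.

```r
# Cross-tabulate number of cylinders by number of gears
tab <- table(mtcars$cyl, mtcars$gear, dnn = c("cyl", "gear"))
tab

addmargins(tab)                        # append row and column totals
round(prop.table(tab, margin = 1), 2)  # proportions within each row (cyl)
```

Setting margin = 2 in prop.table() would give proportions within each column instead.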
To examine a subset of the data, we can subset the data frame and print the selected cases in the R console.
Selecting rows (cases) and columns (variables) by position:
mtcars[10:15, 2:5]
##                    cyl  disp  hp drat
## Merc 280             6 167.6 123 3.92
## Merc 280C            6 167.6 123 3.92
## Merc 450SE           8 275.8 180 3.07
## Merc 450SL           8 275.8 180 3.07
## Merc 450SLC          8 275.8 180 3.07
## Cadillac Fleetwood   8 472.0 205 2.93
Selecting rows by a condition:
mtcars[mtcars$mpg > 25, ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
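Conditions can be combined with & (and) or | (or), and columns can be selected by name. The sketch below shows one way to do this with bracket indexing and the equivalent call to subset(); the cutoff values and object names are arbitrary choices for illustration.

```r
# Rows with mpg above 25 and weight below 2 (1000 lbs); keep three named columns
light_efficient <- mtcars[mtcars$mpg > 25 & mtcars$wt < 2, c("mpg", "wt", "cyl")]
light_efficient

# subset() expresses the same selection without repeating mtcars$
light_efficient2 <- subset(mtcars, mpg > 25 & wt < 2, select = c(mpg, wt, cyl))
all.equal(light_efficient, light_efficient2)
```

One difference to keep in mind: if the condition evaluates to NA for some rows, bracket indexing returns those rows filled with NA, whereas subset() drops them.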
cor(), t.test(), and lm() are useful functions for exploratory analysis.
cor() computes correlations (Pearson by default). Given several columns, it returns a correlation matrix.
cor(mtcars[,1:3])
##             mpg        cyl       disp
## mpg   1.0000000 -0.8521620 -0.8475514
## cyl  -0.8521620  1.0000000  0.9020329
## disp -0.8475514  0.9020329  1.0000000
The use argument controls how missing values are handled. For instance, "pairwise.complete.obs" applies pairwise deletion, so each correlation is computed from all cases that are complete for that pair of variables.
a <- mtcars
a[2, 1] <- NA
a[5, 3] <- NA
cor(a[,1:3], use = "pairwise.complete.obs")
##             mpg        cyl       disp
## mpg   1.0000000 -0.8521139 -0.8578945
## cyl  -0.8521139  1.0000000  0.8984661
## disp -0.8578945  0.8984661  1.0000000
Alternatively, "complete.obs" applies listwise deletion, dropping every case that has a missing value in any of the variables.
cor(a[,1:3], use = "complete.obs")
##             mpg        cyl       disp
## mpg   1.0000000 -0.8600113 -0.8578945
## cyl  -0.8600113  1.0000000  0.9017140
## disp -0.8578945  0.9017140  1.0000000
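For comparison, it may help to see what happens when use is left at its default, "everything": any correlation involving a variable with missing values comes back as NA. The sketch below repeats the same two modifications to a copy of mtcars; the names m_default and m_listwise are made up for illustration.

```r
a <- mtcars
a[2, 1] <- NA   # one missing mpg value
a[5, 3] <- NA   # one missing disp value

colSums(is.na(a[, 1:3]))   # how many values are missing in each column

m_default <- cor(a[, 1:3])                         # default: use = "everything"
m_default                                          # NA wherever mpg or disp is involved

m_listwise <- cor(a[, 1:3], use = "complete.obs")  # listwise deletion, no NA
m_listwise
```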
t.test() performs one- and two-sample t-tests.
Let’s first create a grouping variable in the mtcars data, using an arbitrary cutoff, before performing a t-test.
# Group 1: more than 4 cylinders; group 2: 4 cylinders or fewer
mtcars$group <- NA
mtcars$group[mtcars$cyl > 4] <- 1
mtcars$group[mtcars$cyl <= 4] <- 2
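Now we can perform an independent two-sample t-test of mpg by group. By default, t.test() runs Welch's test, which does not assume equal variances in the two groups.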
t.test(mpg ~ group, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by group
## t = -6.5737, df = 15.266, p-value = 8.09e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.258678  -6.773357
## sample estimates:
## mean in group 1 mean in group 2
##        16.64762        26.66364
We can also run paired and one-sample t-tests with t.test().
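As a sketch of those two variants: the one-sample test below compares the mean of mpg against an arbitrary reference value of 20, and the paired test uses the built-in sleep data, in which the same 10 subjects were measured under two drugs. The object names are assumptions made for this example.

```r
# One-sample t-test: is mean mpg different from 20 (an arbitrary value)?
one_sample <- t.test(mtcars$mpg, mu = 20)
one_sample

# Paired t-test: each subject in sleep appears once in group 1 and once in group 2
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)
paired
```

A paired test is only meaningful when the two vectors are in the same subject order, as they are in sleep.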
lm() fits linear models. Below are some arbitrary examples of using lm() to fit linear regressions. In the model formula, + adds independent variables, : specifies an interaction without the main effects, and * specifies an interaction together with the main effects.
model1 <- lm(mpg ~ cyl, data = mtcars)
model2 <- lm(mpg ~ cyl + disp + vs, data = mtcars)
model3 <- lm(mpg ~ cyl : disp, data = mtcars)
model4 <- lm(mpg ~ cyl * disp, data = mtcars)
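One way to check what each formula operator does is to look at the names of the fitted coefficients. The sketch below refits the interaction models under new, illustrative names (m_star, m_full, m_colon) and compares them.

```r
m_star  <- lm(mpg ~ cyl * disp, data = mtcars)            # main effects + interaction
m_full  <- lm(mpg ~ cyl + disp + cyl:disp, data = mtcars) # the same model, spelled out
m_colon <- lm(mpg ~ cyl:disp, data = mtcars)              # interaction term only

names(coef(m_star))    # intercept, cyl, disp, and cyl:disp
names(coef(m_colon))   # intercept and cyl:disp only

all.equal(coef(m_star), coef(m_full))  # the two specifications agree
```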
summary() returns the summary statistics of the model.
summary(model2)
##
## Call:
## lm(formula = mpg ~ cyl + disp + vs, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3899 -2.0944 -0.6386  1.2222  7.0974
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.88091    4.47352   8.021 9.82e-09 ***
## cyl         -1.75044    0.87236  -2.007   0.0545 .
## disp        -0.02029    0.01045  -1.941   0.0624 .
## vs          -0.63372    1.89594  -0.334   0.7407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.103 on 28 degrees of freedom
## Multiple R-squared: 0.7605, Adjusted R-squared: 0.7349
## F-statistic: 29.64 on 3 and 28 DF, p-value: 7.792e-09
We can also extract the model components of interest, which are stored in a list.
summary(model2)$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 35.8809111 4.47351580  8.0207409 9.819906e-09
## cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
## disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
## vs          -0.6337231 1.89593627 -0.3342534 7.406791e-01
summary(model2)$residuals
##           Mazda RX4       Mazda RX4 Wag          Datsun 710
##         -1.13129465         -1.13129465         -3.25371876
##      Hornet 4 Drive   Hornet Sportabout             Valiant
##          1.89121145          4.12832076         -2.07848079
##          Duster 360           Merc 240D            Merc 230
##         -0.27167924         -0.86835242         -2.58808527
##            Merc 280           Merc 280C          Merc 450SE
##         -2.14333940         -3.54333940          0.11959088
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood
##          1.01959088         -1.08040912         -1.89878439
## Lincoln Continental   Chrysler Imperial            Fiat 128
##         -2.14230884          1.75181708          5.75167571
##         Honda Civic      Toyota Corolla       Toyota Corona
##          3.69079460          7.09744356         -4.30816494
##    Dodge Challenger         AMC Javelin          Camaro Z28
##          0.07598519         -0.50812667         -1.47461628
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2
##          5.44006892          0.65776382         -0.43782931
##        Lotus Europa      Ford Pantera L        Ferrari Dino
##          4.08449245          1.04567742         -2.73570021
##       Maserati Bora          Volvo 142E
##         -0.76900778         -4.38990061
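To see which components are available, names() can be applied to the summary object. Base R also provides extractor functions such as coef() and residuals() that work on the fitted model directly. The sketch below refits model2 so that it runs on its own; the object name s is an assumption for illustration.

```r
model2 <- lm(mpg ~ cyl + disp + vs, data = mtcars)  # same model as in the text
s <- summary(model2)

names(s)                    # components stored in the summary list
coef(model2)                # estimated coefficients
head(residuals(model2), 3)  # first three residuals
s$r.squared                 # multiple R-squared
s$adj.r.squared             # adjusted R-squared
```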