A Glimpse of Your Data

Summary Statistics

head(), tail(), str(), summary() and table() are useful functions for getting a sense of our data: the structure of an object and its summary statistics.

head(), tail()

head() returns the first several rows of an object. The default number of rows is 6.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We can set how many rows are displayed with the n argument.

head(mtcars, n = 10)

##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

The tail() function, conversely, returns the last several rows of the object.

tail(mtcars, 2)

##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
## Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
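Both functions also accept a negative n, which drops rows instead of keeping them. The following is a minimal sketch on the same mtcars data; the variable names are chosen here purely for illustration.

```r
# Negative n: head() keeps everything except the last |n| rows,
# tail() keeps everything except the first |n| rows.
all_but_last_two  <- head(mtcars, n = -2)
all_but_first_two <- tail(mtcars, n = -2)

nrow(mtcars)                    # total number of rows
nrow(all_but_last_two)          # two fewer than the total
rownames(all_but_first_two)[1]  # the third car in the original data
```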

str()

The str() function displays the structure of an object.

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We can also inspect a single vector, such as one column of the data frame.

str(iris$Species)

##  Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
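str() is not limited to data frames and vectors. As a small sketch, the made-up list below (nested and its contents are hypothetical, invented for this example) shows how the max.level argument limits how deep str() descends into nested objects.

```r
# A small nested list, invented for illustration.
nested <- list(
  id     = 1L,
  scores = c(90.5, 82, 77.25),
  meta   = list(source = "survey", tags = c("a", "b"))
)

str(nested)                 # full structure, including the inner list
str(nested, max.level = 1)  # stop at the top level
```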

summary()

summary() provides us with the basic summary statistics.

Summarizing the dataset:

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Summarizing a column:

summary(iris$Species)

##     setosa versicolor  virginica 
##         50         50         50

summary() also reports the number of missing values (shown as NA's), if there are any. The example below uses the storms dataset, which comes with the dplyr package.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

summary(storms$ts_diameter)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   69.05  138.09  166.76  241.66 1001.18    6528
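Missing values can also be counted directly, which is handy when we want the counts for every column at once. The sketch below uses airquality from the built-in datasets package instead of storms, so that it runs without extra packages; na_per_column is a name chosen for this example.

```r
# airquality is a built-in dataset with missing values in some columns.
na_per_column <- colSums(is.na(airquality))  # number of NAs in each column
na_per_column

# Count for a single column; this matches the NA's entry printed by summary().
sum(is.na(airquality$Ozone))

# Proportion of rows with no missing value in any column.
mean(complete.cases(airquality))
```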

table()

table() gives us a frequency table.

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

table() is especially useful for cross-tabulation: given two vectors, it counts how often each combination of values occurs.

table(iris$Species, iris$Petal.Width)

##             
##              0.1 0.2 0.3 0.4 0.5 0.6  1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
##   setosa       5  29   7   7   1   1  0   0   0   0   0   0   0   0   0
##   versicolor   0   0   0   0   0   0  7   3   5  13   7  10   3   1   1
##   virginica    0   0   0   0   0   0  0   0   0   0   1   2   1   1  11
##             
##              1.9  2 2.1 2.2 2.3 2.4 2.5
##   setosa       0  0   0   0   0   0   0
##   versicolor   0  0   0   0   0   0   0
##   virginica    5  6   6   3   8   3   3
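Cross-tabulations are usually easier to read when both variables take only a few distinct values. As a sketch under that assumption, the example below tabulates cylinders against gears in mtcars and converts the counts to row proportions with prop.table(); counts and row_props are names chosen for this example.

```r
# Cross-tabulate number of cylinders (rows) by number of gears (columns).
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)
counts

# Row proportions: each row sums to 1.
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)

# Add row and column totals to the counts.
addmargins(counts)
```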

listing cases

To examine a subset of the data, we can print selected cases in the R console by subsetting the data frame with square brackets: row selection goes before the comma and column selection after it.

Rows (cases) and columns (variables):

mtcars[10:15, 2:5]

##                    cyl  disp  hp drat
## Merc 280             6 167.6 123 3.92
## Merc 280C            6 167.6 123 3.92
## Merc 450SE           8 275.8 180 3.07
## Merc 450SL           8 275.8 180 3.07
## Merc 450SLC          8 275.8 180 3.07
## Cadillac Fleetwood   8 472.0 205 2.93

Rows that satisfy a condition:

mtcars[mtcars$mpg > 25, ]

##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
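Conditions can be combined, and columns can be selected in the same step. The sketch below shows two equivalent ways of writing such a selection; the thresholds and the names by_index and by_subset are arbitrary choices for illustration.

```r
# Combine conditions with & (and) or | (or), and pick columns by name.
by_index <- mtcars[mtcars$mpg > 25 & mtcars$hp < 70, c("mpg", "hp")]

# subset() expresses the same selection without repeating the data frame name.
by_subset <- subset(mtcars, mpg > 25 & hp < 70, select = c(mpg, hp))

by_subset
```

One difference worth knowing: when the condition evaluates to NA for a row, subset() drops that row, while logical indexing with square brackets returns a row filled with NA.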

Exploratory Analysis

cor(), t.test() and lm() are useful functions for exploratory analysis.

correlations

cor() produces correlations.

cor(mtcars[,1:3])

##             mpg        cyl       disp
## mpg   1.0000000 -0.8521620 -0.8475514
## cyl  -0.8521620  1.0000000  0.9020329
## disp -0.8475514  0.9020329  1.0000000

The use argument controls how missing values are handled. For instance, use = "pairwise.complete.obs" applies pairwise deletion: each correlation is computed from the rows that are complete for that particular pair of variables.

a <- mtcars
a[2, 1] <- NA
a[5, 3] <- NA

cor(a[,1:3], use = "pairwise.complete.obs")

##             mpg        cyl       disp
## mpg   1.0000000 -0.8521139 -0.8578945
## cyl  -0.8521139  1.0000000  0.8984661
## disp -0.8578945  0.8984661  1.0000000

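Alternatively, use = "complete.obs" applies listwise deletion: any row with a missing value in any of the variables is dropped before the correlations are computed.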

cor(a[,1:3], use = "complete.obs")

##             mpg        cyl       disp
## mpg   1.0000000 -0.8600113 -0.8578945
## cyl  -0.8600113  1.0000000  0.9017140
## disp -0.8578945  0.9017140  1.0000000
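For comparison, it may help to see what happens when use is not set at all. With the default, use = "everything", a missing value propagates to every correlation that involves the affected variable. The sketch below uses a fresh copy of the data (b and default_cor are names chosen for this example) with a single missing value.

```r
b <- mtcars[, 1:3]
b[2, "mpg"] <- NA  # introduce one missing value

# Default (use = "everything"): correlations between mpg and the other variables become NA.
default_cor <- cor(b)
default_cor

# Pairs that do not involve mpg are unaffected.
default_cor["cyl", "disp"]
```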

t-tests

t.test() performs one- and two-sample t-tests.

Before performing a t-test, let’s create a grouping variable in the mtcars data using an arbitrary cutoff: cars with more than 4 cylinders go in group 1, and the rest go in group 2.

mtcars$group <- NA
mtcars$group[mtcars$cyl > 4] <- 1
mtcars$group[mtcars$cyl <= 4] <- 2

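Now we can perform an independent two-sample t-test comparing mpg between the two groups. By default, t.test() does not assume equal variances, so it runs Welch's version of the test.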

t.test(mpg ~ group, data = mtcars)

## 
##  Welch Two Sample t-test
## 
## data:  mpg by group
## t = -6.5737, df = 15.266, p-value = 8.09e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.258678  -6.773357
## sample estimates:
## mean in group 1 mean in group 2 
##        16.64762        26.66364

We can also perform paired t-tests and one-sample t-tests with t.test().
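A minimal sketch of both variants follows. The reference mean of 20 is an arbitrary value chosen for illustration, and the paired example uses the built-in sleep dataset rather than mtcars, because it contains two measurements on the same ten subjects.

```r
# One-sample t-test: is the mean mpg different from 20?
# (20 is an arbitrary reference value.)
one_sample <- t.test(mtcars$mpg, mu = 20)
one_sample

# Paired t-test on the sleep data: the same 10 subjects
# were measured under two drugs (group "1" and group "2").
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]
paired <- t.test(drug1, drug2, paired = TRUE)
paired
```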

linear regressions

lm() fits linear models. Below are some arbitrary examples of using lm() to fit linear regressions. In a model formula, + adds independent variables, : specifies an interaction term only (without the main effects), and * specifies the main effects together with their interaction.

model1 <- lm(mpg ~ cyl, data = mtcars)
model2 <- lm(mpg ~ cyl + disp + vs, data = mtcars)
model3 <- lm(mpg ~ cyl : disp, data = mtcars)
model4 <- lm(mpg ~ cyl * disp, data = mtcars)
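One way to see the difference between : and * is to look at which coefficients each formula produces. The sketch below refits the two interaction models; star and expanded are names chosen for this example.

```r
# Compare which terms each formula expands to.
names(coef(lm(mpg ~ cyl:disp, data = mtcars)))
names(coef(lm(mpg ~ cyl * disp, data = mtcars)))

# cyl * disp is shorthand for cyl + disp + cyl:disp, so the fits agree.
star     <- lm(mpg ~ cyl * disp, data = mtcars)
expanded <- lm(mpg ~ cyl + disp + cyl:disp, data = mtcars)
all.equal(coef(star), coef(expanded))
```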

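Applied to a fitted model, summary() reports the residual quartiles, the coefficient table with standard errors and p-values, and overall fit statistics such as R-squared.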

summary(model2)

## 
## Call:
## lm(formula = mpg ~ cyl + disp + vs, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3899 -2.0944 -0.6386  1.2222  7.0974 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.88091    4.47352   8.021 9.82e-09 ***
## cyl         -1.75044    0.87236  -2.007   0.0545 .  
## disp        -0.02029    0.01045  -1.941   0.0624 .  
## vs          -0.63372    1.89594  -0.334   0.7407    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.103 on 28 degrees of freedom
## Multiple R-squared:  0.7605, Adjusted R-squared:  0.7349 
## F-statistic: 29.64 on 3 and 28 DF,  p-value: 7.792e-09

The object returned by summary() is a list, so we can extract the model components of interest with $.

summary(model2)$coefficients

##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) 35.8809111 4.47351580  8.0207409 9.819906e-09
## cyl         -1.7504373 0.87235809 -2.0065582 5.454156e-02
## disp        -0.0202937 0.01045432 -1.9411790 6.236295e-02
## vs          -0.6337231 1.89593627 -0.3342534 7.406791e-01

summary(model2)$residuals

##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##         -1.13129465         -1.13129465         -3.25371876 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##          1.89121145          4.12832076         -2.07848079 
##          Duster 360           Merc 240D            Merc 230 
##         -0.27167924         -0.86835242         -2.58808527 
##            Merc 280           Merc 280C          Merc 450SE 
##         -2.14333940         -3.54333940          0.11959088 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##          1.01959088         -1.08040912         -1.89878439 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##         -2.14230884          1.75181708          5.75167571 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##          3.69079460          7.09744356         -4.30816494 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##          0.07598519         -0.50812667         -1.47461628 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##          5.44006892          0.65776382         -0.43782931 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##          4.08449245          1.04567742         -2.73570021 
##       Maserati Bora          Volvo 142E 
##         -0.76900778         -4.38990061
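Base R also provides extractor functions that work on the fitted model directly, which avoids relying on the internal names of the list. The sketch below refits the same formula as model2 under the name fit, so that it runs on its own.

```r
fit <- lm(mpg ~ cyl + disp + vs, data = mtcars)  # same formula as model2

coef(fit)                   # estimated coefficients
head(residuals(fit), 3)     # first few residuals
summary(fit)$r.squared      # multiple R-squared
confint(fit, level = 0.95)  # confidence intervals for the coefficients
```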