Using vectors and matrices in R

Originally for Statistics 133, by

Modes and Classes

It was mentioned earlier that all the elements of a vector must be of the same mode. To see the mode of an object, you can use the mode function. What happens if we try to combine objects of different modes using the c function? The answer is that R will find a common mode that can accommodate all the objects, resulting in the mode of some of the objects changing. For example, let's try combining some numbers and some character strings:

> both = c('dog',3,'cat','mouse',7,12,9,'chicken')
> both
[1] "dog"     "3"       "cat"     "mouse"   "7"       "12"      "9"      
[8] "chicken"
> mode(both)
[1] "character"

You can see that the numbers have been changed to characters because they are now displayed surrounded by quotes. They also will no longer behave like numbers:

> both[2] + both[5]
Error in both[2] + both[5] : non-numeric argument to binary operator

The error message means that the two values can no longer be added. If you really need to treat character strings like numbers, you can use the as.numeric function:

> as.numeric(both[2]) + as.numeric(both[5])
[1] 10

Of course, the best thing is to avoid combining objects of different modes with the c function. We'll see later that R provides an object known as a list that can store different types of objects without having to change their modes.
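As a rough sketch of how the common mode is chosen (the object name kept is made up for illustration): logical values give way to numbers, and both give way to character strings. A string that doesn't look like a number can't be converted back, so as.numeric returns NA for it and issues a warning. The last lines preview a list, using sapply to apply mode to each element in turn.

```r
# Logical values combined with numbers are converted to numbers
mode(c(TRUE, 2, 3.5))                          # "numeric"

# Adding a single character string turns everything into character
c(TRUE, 2, 'dog')                              # "TRUE" "2" "dog"

# as.numeric can't convert 'dog'; it returns NA (the warning is silenced here)
suppressWarnings(as.numeric(c('3', 'dog')))    # 3 NA

# A list keeps the original mode of each element
kept = list('dog', 3, TRUE)
sapply(kept, mode)                             # "character" "numeric" "logical"
```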

Reading Vectors

Once you start working with larger amounts of data, it becomes very tedious to enter data into the c function, especially considering the need to put quotes around character values and commas between values. To read data from a file or from the terminal without the need for quotes and commas, you can use the scan function. To read from a file (or a URL), pass it a quoted string with the name of the file or URL you wish to read; to read from the terminal, call scan() with no arguments, and enter a completely blank line when you're done entering your data. Additionally, on Windows or Mac OS X, you can substitute a call to the file.choose() function for the quoted string with the file name, and you'll be presented with the familiar file chooser used by most programs on those platforms.

Suppose there's a file called numbers in your working directory. (You can get your working directory by calling the getwd() function, or set it using the setwd function or File -> Change dir selection in the R console.) Let's say the contents of this file look like this:

12 7
9 8 14 10
17

The scan function can be used to read these numbers as follows:

> nums = scan('numbers')
Read 7 items
> nums
[1] 12  7  9  8 14 10 17

The optional what= argument to scan can be used to read vectors of character or logical values, but remember a vector can only hold objects all of which are of the same mode.
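Here is a minimal sketch of scan with and without the what= argument. To keep it self-contained, it writes the numbers to a temporary file instead of assuming a file called numbers exists, and it uses scan's text= argument to supply input directly. The names numfile, animals and flags are made up for this example.

```r
# Create a small file with the same contents as the numbers file above
numfile = tempfile()
writeLines(c('12 7', '9 8 14 10', '17'), numfile)
nums = scan(numfile, quiet = TRUE)     # quiet=TRUE suppresses "Read 7 items"
nums

# what='' asks for character values; no quotes or commas needed in the input
animals = scan(text = 'dog cat mouse chicken', what = '', quiet = TRUE)
mode(animals)                          # "character"

# what=logical() asks for logical values
flags = scan(text = 'TRUE FALSE TRUE', what = logical(), quiet = TRUE)
mode(flags)                            # "logical"
```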

Missing Values

No matter how carefully we collect our data, there will always be situations where we don't know the value of a particular variable. For example, we might conduct a survey where we ask people 10 questions, and occasionally we forget to ask one, or people don't know the proper answer. We don't want values like this to enter into calculations, but we can't just eliminate them because then observations that have missing values won't "fit in" with the rest of the data.

In R, missing values are represented by the special value NA. For example, suppose we have a vector of 9 values, but the fourth one is missing. We can enter a missing value by passing NA to the c function just as if it were a number (no quotes needed):

x = c(1,4,7,NA,12,19,15,21,20)

R will also recognize the unquoted string NA as a missing value when data is read from a file or URL.
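A small sketch of this behavior, using scan's text= argument to stand in for a file (the name y is made up for this example):

```r
# The unquoted NA in the input is read as a missing value
y = scan(text = '1 4 7 NA 12', quiet = TRUE)
y                  # 1 4 7 NA 12
is.na(y[4])        # TRUE
mode(y)            # "numeric"; the NA did not change the mode of the vector
```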

Missing values are different from other values in R in two ways:

  1. As a rule, any computation involving a missing value will return a missing value.
  2. Unlike other quantities in R, we can't directly test to see if something is equal to a missing value with the equality operator (==). We must use the built-in is.na function, which will return TRUE if a value is missing and FALSE otherwise.

Here are some simple R statements that illustrate these points:

> x = c(1,4,7,NA,12,19,15,21,20)
> mean(x)
[1] NA
> x == NA
[1] NA NA NA NA NA NA NA NA NA

Fortunately, these problems are fairly easy to solve. In the first case, many functions (like mean, min, max, sd, quantile, etc.) accept an na.rm=TRUE argument, which tells the function to remove any missing values before performing the computation:

> mean(x,na.rm=TRUE)
[1] 12.375

In the second case, we just need to remember to always use is.na whenever we are testing to see if a value is a missing value.

> is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

By combining a call to is.na with the logical "not" operator (!) we can filter out missing values in cases where no na.rm= argument is available:

> x[!is.na(x)]
[1]  1  4  7 12 19 15 21 20
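Because is.na returns a logical vector, it can also be used to count and locate missing values. The following is a short sketch using the same x as above:

```r
x = c(1,4,7,NA,12,19,15,21,20)

sum(is.na(x))            # how many values are missing: 1
which(is.na(x))          # position of the missing value: 4

# Filter first, then compute on the values that remain
median(x[!is.na(x)])     # 13.5
```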

Matrices

A very common way of storing data is in a matrix, which is basically a two-way generalization of a vector. Instead of a single index, we can use two indexes, one representing a row and the second representing a column. The matrix function takes a vector and makes it into a matrix in a column-wise fashion. For example,

> mymat = matrix(1:12,4,3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

The last two arguments to matrix tell it the number of rows and columns the matrix should have. If you use a named argument, you can specify just one dimension, and R will figure out the other:

> mymat = matrix(1:12,ncol=3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

To create a matrix by rows instead of by columns, the byrow=TRUE argument can be used:

> mymat = matrix(1:12,ncol=3,byrow=TRUE)
> mymat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
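One way to see the relationship between the two filling orders is the sketch below (the names by_col and by_row are made up for this example). Filling a 4x3 matrix by rows gives the transpose of the 3x4 matrix filled by columns from the same values; t is R's transpose function.

```r
by_col = matrix(1:12, nrow=3)                # filled down the columns
by_row = matrix(1:12, ncol=3, byrow=TRUE)    # filled across the rows

dim(by_col)                  # 3 4
dim(by_row)                  # 4 3
all(by_row == t(by_col))     # TRUE
```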

When data is being read from a file, you can simply embed a call to scan into a call to matrix. Suppose we have a file called matrix.dat with the following contents:

7 12 19 4
18 7 12 3
9 5 8 42

We could create a 3×4 matrix, read in by rows, with the following command:

matrix(scan('matrix.dat'),nrow=3,byrow=TRUE)
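To try this without creating matrix.dat by hand, the sketch below writes the same three lines to a temporary file first (datfile and mat are names made up for this example):

```r
# Write the contents shown above to a temporary file
datfile = tempfile()
writeLines(c('7 12 19 4', '18 7 12 3', '9 5 8 42'), datfile)

# scan returns one long vector; matrix reshapes it row by row
mat = matrix(scan(datfile, quiet = TRUE), nrow=3, byrow=TRUE)
dim(mat)       # 3 4
mat[2,]        # 18 7 12 3
mat[3,4]       # 42
```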

To access a single element of a matrix, we need to specify both the row and the column we're interested in. Consider the following matrix, containing the numbers from 1 to 10:

> m = matrix(1:10,5,2)
> m
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Now suppose we want the element in row 4 and column 1:

> m[4,1]
[1] 4

If we leave out either one of the subscripts, we'll get the entire row or column of the matrix, depending on which subscript we leave out:

> m[4,]
[1] 4 9
> m[,1]
[1] 1 2 3 4 5
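Subscripts don't have to be single numbers. As a brief sketch using the same m: a vector of row numbers selects several rows at once, a logical condition selects the rows where it holds, and the drop=FALSE argument keeps a single row in matrix form instead of turning it into a vector.

```r
m = matrix(1:10,5,2)

m[c(1,3),]                 # rows 1 and 3, returned as a 2 x 2 matrix
m[m[,1] > 3, 2]            # column 2 for rows where column 1 exceeds 3: 9 10
dim(m[4,,drop=FALSE])      # 1 2; row 4 is still a matrix
```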

Data Frames

One shortcoming of vectors and matrices is that they can only hold one mode of data; they don't allow us to mix, say, numbers and character strings. If we try to do so, R will change the mode of the other elements in the vector to conform. For example:

> c(12,9,"dog",7,5)
[1] "12"  "9"   "dog" "7"   "5"

Notice that the numbers got changed to character values so that the vector could accommodate all the elements we passed to the c function. In R, a special object known as a data frame resolves this problem. A data frame is like a matrix in that it represents a rectangular array of data, but each column in a data frame can be of a different mode, allowing numbers, character strings and logical values to coexist in a single object in their original forms. Since most interesting data problems involve a mixture of character variables and numeric variables, data frames are usually the best way to store information in R. (It should be mentioned that if you're dealing with data of a single mode, a matrix may be more efficient than a data frame.) Data frames correspond to the traditional "observations and variables" model that most statistical software uses, and they are also similar to database tables. Each row of a data frame represents an observation; the elements in a given row represent information about that observation. Each column, taken as a whole, has all the information about a particular variable for the data set.
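As a small sketch of columns with different modes living side by side, consider the data frame below. The name pets and all of its values are invented purely for illustration. The stringsAsFactors=FALSE argument is included so that the character column stays character in older versions of R as well, and sapply applies mode to each column in turn.

```r
# Hypothetical data: one character, one numeric and one logical column
pets = data.frame(name   = c('dog','cat','mouse'),
                  weight = c(30.5, 9.2, 0.04),
                  indoor = c(FALSE, TRUE, TRUE),
                  stringsAsFactors = FALSE)

# Each column keeps its own mode
sapply(pets, mode)     # name "character", weight "numeric", indoor "logical"
```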

For small datasets, you can enter each of the columns (variables) of your data frame using the data.frame function. For example, let's extend our temperature example by creating a data frame that has the day of the month, the minimum temperature and the maximum temperature:

> temps = data.frame(day=1:10,
+                min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
+                max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))
> head(temps)
  day  min  max
1   1 50.7 59.5
2   2 52.8 55.7
3   3 48.6 57.3
4   4 53.0 71.5
5   5 49.9 69.8
6   6 47.9 68.8

Note that the names we used when we created the data frame are displayed with the data. (You can add names after the fact with the names function.) Also, instead of typing the name temps to see the data frame, we used a call to the head function. This shows just the first six observations (by default) of the data frame, and is a handy way to check that a large data frame really looks the way you think it does. (There's a function called tail that shows the last lines in an object as well.)
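The names and tail functions mentioned above can be sketched as follows. The replacement names low and high, and the copy called renamed, are made up for this example so that temps itself is left unchanged.

```r
temps = data.frame(day=1:10,
               min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
               max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

names(temps)              # "day" "min" "max"
tail(temps, 3)            # the last three rows (days 8, 9 and 10)

# Assigning to names() changes the column names after the fact
renamed = temps
names(renamed)[2:3] = c('low','high')
names(renamed)            # "day" "low" "high"
```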

If we try to look at the class or mode of a data frame, it's not that informative:

> class(temps)
[1] "data.frame"
> mode(temps)
[1] "list"

We'll see the same results for every data frame we use. To look at the modes of the individual columns of a data frame, we can use the sapply function. This function simplifies operations that would require loops in other languages, and automatically returns the appropriate results for the operation it performs. To use sapply on a data frame, pass the data frame as the first argument to sapply, and the function you wish to use as the second argument. So to find the modes of the individual columns of the temps data frame, we could use

> sapply(temps,mode)
      day       min       max 
"numeric" "numeric" "numeric" 

Notice that sapply even labeled the result with the name of each column.
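The same pattern works with other functions. In the sketch below, which rebuilds temps so that it runs on its own, sapply returns a named vector when the function gives one value per column and a matrix when it gives several.

```r
temps = data.frame(day=1:10,
               min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
               max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

sapply(temps, class)      # day is "integer"; min and max are "numeric"
sapply(temps, mean)       # one mean per column, labeled with the column names

# range returns two values per column, so the result is a 2 x 3 matrix
sapply(temps, range)
```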

Suppose we want to concentrate on the maximum daily temperature (which we've called max in our data frame) among the days recorded. There are several ways we can refer to the columns of a data frame:

  1. Probably the easiest way to refer to this column is to use a special notation that eliminates the need to put quotes around the variable names (unless they contain blanks or other special characters). Separate the data frame name from the variable name with a dollar sign ($):
    > temps$max
     [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7 
    
  2. We can treat the data frame as if it were a matrix. Since the maximum temperature is in the third column, we could say
    > temps[,3]
     [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7 
    
  3. Since we named the columns of temps we can use a character subscript:
    > temps[,"max"]
     [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7
    
  4. When you use a single subscript with a data frame, it refers to a data frame consisting of just that column. R also provides a special subscripting method (double brackets) to extract the actual data (in this case a vector) from the data frame:
    > temps['max']
        max
    1  59.5
    2  55.7
    3  57.3
    4  71.5
    5  69.8
    6  68.8
    7  67.5
    8  66.0
    9  66.1
    10 61.7
    > temps[['max']]
    [1] 59.5 55.7 57.3 71.5 69.8 68.8 67.5 66.0 66.1 61.7

    Notice that this second form is identical to temps$max. We could also use the equivalent numerical subscript (in this case 3) with single or double brackets.
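The equivalence of these forms can be checked with the identical function. The sketch below rebuilds temps so that it runs on its own, and ends with one possible use of the extracted column: finding the day with the highest maximum temperature.

```r
temps = data.frame(day=1:10,
               min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
               max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))

# All of these return the same numeric vector
identical(temps$max, temps[,3])          # TRUE
identical(temps$max, temps[,'max'])      # TRUE
identical(temps$max, temps[['max']])     # TRUE
identical(temps$max, temps[[3]])         # TRUE

# Single brackets with one subscript return a data frame instead
class(temps['max'])                      # "data.frame"
class(temps[['max']])                    # "numeric"

# which.max gives the position of the largest value
temps$day[which.max(temps$max)]          # 4
```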