This chapter looks at some of the ways R can display information graphically. It is a brief introduction to the basic plotting commands. It assumes that you know how to enter data or read data files, which is covered in the first chapter, and that you are familiar with the different data types.
In each of the topics that follow it is assumed that two different data sets, w1.dat and trees91.csv, have been read and defined using the same variables as in the first chapter. Both of these data sets come from the study discussed on the web site given in the first chapter. We assume that they are read using “read.csv” into variables w1 and tree:
> w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
> names(w1)
[1] "vals"
> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> names(tree)
[1] "C" "N" "CHBR" "REP" "LFBM" "STBM" "RTBM" "LFNCC"
[9] "STNCC" "RTNCC" "LFBCC" "STBCC" "RTBCC" "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC" "STKCC" "RTKCC" "LFMGCC" "STMGCC" "RTMGCC" "LFPCC" "STPCC"
[25] "RTPCC" "LFSCC" "STSCC" "RTSCC"
A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of the data is w1$vals.
To create a strip chart of this data use the stripchart command:
> help(stripchart)
> stripchart(w1$vals)
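When several observations share the same value, the default strip chart draws them on top of one another. The following is a minimal sketch of the method option listed on the stripchart help page. The vector vals is made-up data standing in for w1$vals (an assumption, so that the example runs without the data file).

```r
# Made-up values standing in for w1$vals (an assumption for illustration).
vals <- c(0.5, 0.6, 0.6, 0.7, 0.7, 0.7, 0.8, 0.8, 0.9, 1.1)

# Default: repeated values are drawn on top of one another.
stripchart(vals)

# Stack repeated values so that each observation is visible.
stripchart(vals, method = "stack")

# Alternatively, add a small amount of random noise to separate the points.
stripchart(vals, method = "jitter")
```

Stacking tends to suit data with many exact repeats, while jittering is a reasonable choice when the values are merely close together.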
A histogram is a very common plot. It plots the frequencies with which the data falls within certain ranges. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals.
To plot a histogram of the data use the “hist” command:
> hist(w1$vals)
> hist(w1$vals,main="Distribution of w1",xlab="w1")
Histogram Options
Many of the basic plot commands accept the same options. The help(hist) command lists the options specific to the hist command, and help(plot) may list further options that are shared with the other plotting commands. Experiment with different options to see what you can do.
As you can see R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals. Here we look at just one way, varying the domain size and number of breaks. If you would like to know more about the other options check out the help page:
> help(hist)
You can specify the number of breaks to use using the breaks option. Here we look at the histogram for various numbers of breaks:
> hist(w1$vals,breaks=2)
> hist(w1$vals,breaks=4)
> hist(w1$vals,breaks=6)
> hist(w1$vals,breaks=8)
> hist(w1$vals,breaks=12)
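When breaks is a single number, R treats it as a suggestion and rounds the break points to convenient values, so the number of intervals drawn may differ from the number requested. The sketch below shows one way to inspect what R actually chose. It uses made-up data in place of w1$vals (an assumption), and plot=FALSE so that the histogram object is returned without drawing.

```r
# Made-up values standing in for w1$vals (an assumption for illustration).
vals <- c(0.43, 0.52, 0.58, 0.61, 0.66, 0.70, 0.74, 0.79, 0.85, 0.93, 1.02, 1.15)

# A single number is only a suggestion; R picks "pretty" break points.
h <- hist(vals, breaks = 4, plot = FALSE)
h$breaks   # the break points R actually chose
h$counts   # the number of observations in each interval

# A vector of break points is used exactly as given.
# The vector must span the full range of the data.
h2 <- hist(vals, breaks = seq(0.4, 1.2, by = 0.2), plot = FALSE)
h2$breaks
h2$counts
```

Passing an explicit vector is the more predictable choice when the intervals need to line up with specific values.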
You can also vary the size of the domain using the xlim option. This option takes a vector with two entries in it, the left value and the right value:
> hist(w1$vals,breaks=12,xlim=c(0,10))
> hist(w1$vals,breaks=12,xlim=c(-1,2))
> hist(w1$vals,breaks=12,xlim=c(0,2))
> hist(w1$vals,breaks=12,xlim=c(1,1.3))
> hist(w1$vals,breaks=12,xlim=c(0.9,1.3))
The options for adding titles and labels are exactly the same as for strip charts. You should always annotate your plots and there are many different ways to add titles and labels. One way is within the hist command itself:
> hist(w1$vals,
main='Leaf BioMass in High CO2 Environment',
xlab='BioMass of Leaves')
If you have a plot already and want to change or add a title, you can use the title command:
> title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')
Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.
It is not uncommon to add other kinds of plots to a histogram. For example, one of the options to the stripchart command adds the strip chart to a plot that has already been drawn, so you can have a histogram with the strip chart drawn across the top. The addition of the strip chart might give you a better idea of the density of the data:
> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> stripchart(w1$vals,add=TRUE,at=15.5)
A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. Here we provide examples using two different data sets. The first is the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals. The second is the tree data frame from the trees91.csv data file which is also mentioned at the top of the page.
We first use the w1 data set and look at the boxplot of this data set:
> boxplot(w1$vals)
Again, this is a very plain graph, and the title and labels can be specified in exactly the same way as in the stripchart and hist commands:
> boxplot(w1$vals,main='Leaf BioMass in High CO2 Environment', ylab='BioMass of Leaves')
Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab option to specify the axis label. There are a large number of options for this command. To see more of the options see the help page:
> help(boxplot)
As an example you can specify that the boxplot be plotted horizontally by specifying the horizontal option:
> boxplot(w1$vals,
main='Leaf BioMass in High CO2 Environment',
xlab='BioMass of Leaves',
horizontal=TRUE)
The option to plot the box plot horizontally can be put to good use to display a box plot on the same image as a histogram. You need to specify the add option, specify where to put the box plot using the at option, and turn off the addition of axes using the axes option:
> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=15.5,add=TRUE,axes=FALSE)
If you are feeling really crazy you can take a histogram and add a box plot and a strip chart:
> hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
> boxplot(w1$vals,horizontal=TRUE,at=16,add=TRUE,axes=FALSE)
> stripchart(w1$vals,add=TRUE,at=15)
Some people shell out good money to have this much fun.
For the second part on boxplots we will look at the second data frame, “tree,” which comes from the trees91.csv file. To reiterate the discussion at the top of this page and the discussion in the data types chapter, we need to specify which columns are factors:
> tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
> tree$C <- factor(tree$C)
> tree$N <- factor(tree$N)
We can look at the boxplot of just the data for the stem biomass:
> boxplot(tree$STBM,
main='Stem BioMass in Different CO2 Environments',
ylab='BioMass of Stems')
That plot does not tell the whole story. It is for all of the trees, but the trees were grown in different kinds of environments. The boxplot command can be used to plot a separate box plot for each level. In this case the data is held in “tree$STBM,” and the different levels are stored as factors in “tree$C.” The command to create different boxplots is the following:
> boxplot(tree$STBM~tree$C)
Note that for the level called “2” there are four outliers which are plotted as little circles. There are many options to annotate your plot including different labels for each level. Please use the help(boxplot) command for more information.
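By default the whiskers of a box plot extend to the most extreme observation that lies within 1.5 times the interquartile range of the box, and anything beyond that is drawn as a separate circle. The sketch below shows how to retrieve the numbers behind a box plot. The data and the grouping factor are made up for illustration (an assumption) and are not taken from trees91.csv.

```r
# Made-up values with one unusually large observation (an assumption).
vals <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# plot=FALSE returns the numbers behind the plot without drawing it.
b <- boxplot(vals, plot = FALSE)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # observations beyond the whiskers, drawn as circles

# With a grouping factor, one column of statistics is returned per level.
group <- factor(rep(c("A", "B"), each = 5))
bg <- boxplot(vals ~ group, plot = FALSE)
bg$names  # the level that each column of bg$stats belongs to
bg$stats
```

Here the value 30 is reported in b$out, and the upper whisker stops at 9, the largest observation that is not flagged as an outlier.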
A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the tree data frame from the trees91.csv data file which is mentioned at the top of the page. In particular we look at the relationship between the stem biomass (“tree$STBM”) and the leaf biomass (“tree$LFBM”).
The command to plot each pair of points as an x-coordinate and a y-coordinate is “plot”:
> plot(tree$STBM,tree$LFBM)
It appears that there is a strong positive association between the biomass in the stems of a tree and the leaves of the tree. It appears to be a linear relationship. In fact, the correlation between these two sets of observations is quite high:
> cor(tree$STBM,tree$LFBM)
[1] 0.911595
Getting back to the plot, you should always annotate your graphs. The title and labels can be specified in exactly the same way as with the other plotting commands:
> plot(tree$STBM,tree$LFBM,
main="Relationship Between Stem and Leaf Biomass",
xlab="Stem Biomass",
ylab="Leaf Biomass")
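Since the relationship appears to be linear, it can be helpful to draw a fitted line over the scatter plot. The following is a minimal sketch using the lm and abline commands. The vectors stem and leaf are made-up numbers standing in for tree$STBM and tree$LFBM (an assumption), so the fitted values say nothing about the real data.

```r
# Made-up measurements standing in for tree$STBM and tree$LFBM (an assumption).
stem <- c(0.3, 0.5, 0.8, 1.1, 1.4, 1.9, 2.3)
leaf <- c(0.2, 0.4, 0.5, 0.8, 0.9, 1.3, 1.5)

plot(stem, leaf,
     main = "Relationship Between Stem and Leaf Biomass",
     xlab = "Stem Biomass",
     ylab = "Leaf Biomass")

# Fit a least-squares line and draw it over the existing scatter plot.
fit <- lm(leaf ~ stem)
abline(fit, col = 2)

# The correlation and the fitted coefficients summarize the same relationship.
cor(stem, leaf)
coef(fit)
```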
The final type of plot that we look at is the normal quantile plot. This plot is used to judge whether your data is close to being normally distributed. The plot cannot prove that the data is normally distributed, but it can reveal clear departures from normality. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals.
The command to generate a normal quantile plot is qqnorm. You can give it one argument, the univariate data set of interest:
> qqnorm(w1$vals)
You can annotate the plot in exactly the same way as all of the other plotting commands given here:
> qqnorm(w1$vals,
main="Normal Q-Q Plot of the Leaf Biomass",
xlab="Theoretical Quantiles of the Leaf Biomass",
ylab="Sample Quantiles of the Leaf Biomass")
After you create the normal quantile plot you can also add the theoretical line that the data should fall on if they were normally distributed:
> qqline(w1$vals)
In this example you should see that the data is not quite normally distributed. There are a few outliers, and it does not match up at the tails of the distribution.
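To get a feel for what a departure from the line looks like, it may help to compare a sample that really is normal with one that is not. The sketch below uses simulated data rather than w1$vals (an assumption), with a fixed seed so the plots are repeatable. By default qqline draws the line through the first and third quartiles.

```r
set.seed(1)                  # fixed seed so the example is repeatable
normalVals <- rnorm(100)     # drawn from a normal distribution
skewedVals <- rexp(100)      # drawn from a right-skewed distribution

# Points from the normal sample should stay close to the line.
qn <- qqnorm(normalVals, main = "Normal Sample")
qqline(normalVals)

# Points from the skewed sample should bend away from the line in the tails.
qs <- qqnorm(skewedVals, main = "Skewed Sample")
qqline(skewedVals)

# qqnorm also returns the plotted coordinates:
# x holds the theoretical quantiles and y holds the data.
head(qn$x)
```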
One common task is to plot multiple data sets on the same plot. In many situations the way to do this is to create the initial plot and then add additional information to the plot. For example, to plot bivariate data the plot command is used to initialize and create the plot. The points command can then be used to add additional data sets to the plot.
First define a set of normally distributed random numbers and then plot them. (This same data set is used throughout the examples below.)
> x <- rnorm(10,sd=5,mean=20)
> y <- 2.5*x - 1.0 + rnorm(10,sd=9,mean=0)
> cor(x,y)
[1] 0.7400576
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> x1 <- runif(8,15,25)
> y1 <- 2.5*x1 - 1.0 + runif(8,-6,6)
> points(x1,y1,col=2)
Note that in the previous example, the color for the second set of data points is set using the col option. You can try different numbers to see what colors are available. For most installations there are at least eight options from 1 to 8. Also note that in the example above the points are plotted as circles. The symbol that is used can be changed using the pch option.
> x2 <- runif(8,15,25)
> y2 <- 2.5*x2 - 1.0 + runif(8,-6,6)
> points(x2,y2,col=3,pch=2)
Again, try different numbers to see the various options. Another helpful option is to add a legend. This can be done with the legend command. The options for the command, in order, are the x and y coordinates on the plot to place the legend followed by a list of labels to use. There are a large number of other options so use help(legend) to see more options. For example a list of colors can be given with the col option, and a list of symbols can be given with the pch option.
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
> points(x1,y1,col=2,pch=3)
> points(x2,y2,col=4,pch=5)
> legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))
The three data sets displayed on the same graph.
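Because the data here are random, fixed coordinates such as (14,70) may fall outside the plotting region on some runs. The legend command also accepts a keyword such as "topleft" in place of the coordinates. The following is a sketch of that variant, with a fixed seed added for repeatability (the seed value is arbitrary).

```r
set.seed(2)                  # arbitrary fixed seed for repeatability
x  <- rnorm(10, sd = 5, mean = 20)
y  <- 2.5 * x - 1.0 + rnorm(10, sd = 9, mean = 0)
x1 <- runif(8, 15, 25)
y1 <- 2.5 * x1 - 1.0 + runif(8, -6, 6)

plot(x, y, xlab = "Independent", ylab = "Dependent", main = "Random Stuff")
points(x1, y1, col = 2, pch = 3)

# A keyword places the legend relative to the plot region,
# so it stays visible whatever the range of the random data.
lg <- legend("topleft", legend = c("Original", "one"),
             col = c(1, 2), pch = c(1, 3))
```

Other keywords include "topright", "bottomleft", "bottomright", and "center".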
Another common task is to change the limits of the axes to change the size of the plotting area. This is achieved using the xlim and ylim options in the plot command. Both options take a vector of length two that holds the minimum and maximum values.
> plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff",xlim=c(0,30),ylim=c(0,100))
> points(x1,y1,col=2,pch=3)
> points(x2,y2,col=4,pch=5)
> legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))
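The plot command chooses its limits from the first data set only, so points added later can fall outside the plotting region. One way around this, sketched below, is to compute the limits from all of the data with the range command before plotting. The seed value is arbitrary and is only there to make the example repeatable.

```r
set.seed(3)                  # arbitrary fixed seed for repeatability
x  <- rnorm(10, sd = 5, mean = 20)
y  <- 2.5 * x - 1.0 + rnorm(10, sd = 9, mean = 0)
x1 <- runif(8, 15, 25)
y1 <- 2.5 * x1 - 1.0 + runif(8, -6, 6)

# Limits that cover both data sets.
xr <- range(c(x, x1))
yr <- range(c(y, y1))

plot(x, y, xlab = "Independent", ylab = "Dependent", main = "Random Stuff",
     xlim = xr, ylim = yr)
points(x1, y1, col = 2, pch = 3)
```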
The par command can be used to set different graphical parameters. One of these is mfrow, which controls how several plots are arranged on one image. The plots are arranged in an array where the default number of rows and columns is one. The mfrow parameter is a vector with two entries. The first entry is the number of rows of images, and the second entry is the number of columns. In the example below the plots are arranged in two rows with three plots across.
> numberWhite <- rhyper(30,4,5,3)
> numberChipped <- rhyper(30,2,7,3)
> par(mfrow=c(2,3))
> boxplot(numberWhite,main="first plot")
> boxplot(numberChipped,main="second plot")
> plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
> hist(numberWhite,main="fourth plot")
> hist(numberChipped,main="fifth plot")
> mosaicplot(table(numberWhite,numberChipped),main="sixth plot")
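A setting made with par stays in effect for later plots on the same device. The par command returns the previous values of the parameters it changes, so a common pattern is to save them and restore them afterwards. The following is a small sketch of that pattern using simulated data (an assumption for illustration).

```r
set.seed(4)                  # arbitrary fixed seed for repeatability
vals <- rnorm(50)

# Save the old value of mfrow while switching to one row with two columns.
op <- par(mfrow = c(1, 2))
boxplot(vals, main = "first plot")
hist(vals, main = "second plot")

# Restore the layout that was in effect before.
par(op)
par("mfrow")
```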
There are times when you do not want to plot specific points but wish to plot a density. This can be done using the smoothScatter command.
> numberWhite <- rhyper(30,4,5,3)
> numberChipped <- rhyper(30,2,7,3)
> smoothScatter(numberWhite,numberChipped,
xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")
The smoothScatter command can be used to plot densities.
Note that the previous example may benefit from superimposing a grid to help delimit the points of interest. This can be done using the grid command.
> numberWhite <- rhyper(30,4,5,3)
> numberChipped <- rhyper(30,2,7,3)
> smoothScatter(numberWhite,numberChipped,
xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")
> grid(4,3)
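The marble counts take only a few distinct values, so an ordinary scatter plot would draw many points exactly on top of one another. The jitter command, used in the earlier plot titled "Pulling Marbles With Jitter", is another way to deal with this: it adds a small amount of random noise to each value. The following sketch shows the effect, with an arbitrary fixed seed so the result is repeatable.

```r
set.seed(5)                  # arbitrary fixed seed for repeatability
numberWhite   <- rhyper(30, 4, 5, 3)
numberChipped <- rhyper(30, 2, 7, 3)

# Only a few distinct pairs exist, so many of the 30 points coincide.
nrow(unique(data.frame(numberWhite, numberChipped)))

# jitter moves each value by a small random amount.
jitteredWhite   <- jitter(numberWhite)
jitteredChipped <- jitter(numberChipped)

plot(jitteredWhite, jitteredChipped,
     xlab = "Number White Marbles Drawn",
     ylab = "Number Chipped Marbles Drawn",
     main = "Pulling Marbles With Jitter")
```

The jittered values are for display only. Any calculations should still use the original counts.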
There are times that you want to explore a large number of relationships. A number of relationships can be plotted at one time using the pairs command. The idea is that you give it a matrix or a data frame, and the command will create a scatter plot of all combinations of the data.
> uData <- rnorm(20)
> vData <- rnorm(20,mean=5)
> wData <- uData + 2*vData + rnorm(20,sd=0.5)
> xData <- -2*uData+rnorm(20,sd=0.1)
> yData <- 3*vData+rnorm(20,sd=2.5)
> d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
> pairs(d)
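A table of correlations can be a useful companion to the pairs plot, and a subset of the columns can be passed when the full grid is too crowded. The sketch below rebuilds the same kind of data frame with an arbitrary fixed seed so that it runs on its own.

```r
set.seed(6)                  # arbitrary fixed seed for repeatability
uData <- rnorm(20)
vData <- rnorm(20, mean = 5)
wData <- uData + 2 * vData + rnorm(20, sd = 0.5)
xData <- -2 * uData + rnorm(20, sd = 0.1)
yData <- 3 * vData + rnorm(20, sd = 2.5)
d <- data.frame(u = uData, v = vData, w = wData, x = xData, y = yData)

# Correlation between every pair of columns, rounded for readability.
cm <- round(cor(d), 2)
cm

# Scatter plots for a subset of the columns only.
pairs(d[, c("u", "w", "x")])
```

Since x is constructed as -2 times u plus a small amount of noise, the entry for u and x should be close to -1, which matches the tight downward line in the corresponding panel.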
A shaded region can be plotted using the polygon command. The polygon command takes a pair of vectors, x and y, and shades the region enclosed by the coordinate pairs. In the example below a blue square is drawn. The vertices are defined starting from the lower left. Five pairs of points are given because the starting point and the ending point are the same.
> x = c(-1,1,1,-1,-1)
> y = c(-1,-1,1,1,-1)
> plot(x,y)
> polygon(x,y,col='blue')
A more complicated example is given below. In this example the rejection region for a right-sided hypothesis test is plotted, and it is shaded in red. A set of custom axes is constructed, and symbols are plotted using the expression command.
> stdDev <- 0.75
> x <- seq(-5,5,by=0.01)
> y <- dnorm(x,sd=stdDev)
> right <- qnorm(0.95,sd=stdDev)
> plot(x,y,type="l",xaxt="n",ylab="p",
xlab=expression(paste('Assumed Distribution of ',bar(x))),
axes=FALSE,ylim=c(0,max(y)*1.05),xlim=c(min(x),max(x)),
frame.plot=FALSE)
> axis(1,at=c(-5,right,0,5),
pos = c(0,0),
labels=c(expression(' '),expression(bar(x)[cr]),expression(mu[0]),expression(' ')))
> axis(2)
> xReject <- seq(right,5,by=0.01)
> yReject <- dnorm(xReject,sd=stdDev)
> polygon(c(xReject,xReject[length(xReject)],xReject[1]),
c(yReject,0, 0), col='red')
Using polygon to produce a shaded region.
The axes are drawn separately from the curve. This is done by first suppressing the plotting of the axes in the plot command (axes=FALSE) and then drawing each axis with its own call to the axis command. Also note that the expression command is used to plot a Greek character and to produce subscripts.
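The same approach extends to a two-sided test, where both tails are shaded. The sketch below wraps the construction of the polygon vertices in a small function. The function tailRegion is a hypothetical helper written for this illustration and is not part of R, and the default axes are used to keep the example short.

```r
stdDev <- 0.75
x <- seq(-5, 5, by = 0.01)
y <- dnorm(x, sd = stdDev)

# Hypothetical helper: vertices of the region under the curve between
# 'from' and 'to', closed along the horizontal axis.
tailRegion <- function(from, to, stdDev, by = 0.01) {
  xs <- seq(from, to, by = by)
  ys <- dnorm(xs, sd = stdDev)
  list(x = c(xs, xs[length(xs)], xs[1]),
       y = c(ys, 0, 0))
}

# Critical values for a two-sided test at the 5% level.
left  <- qnorm(0.025, sd = stdDev)
right <- qnorm(0.975, sd = stdDev)
leftRegion  <- tailRegion(-5, left, stdDev)
rightRegion <- tailRegion(right, 5, stdDev)

plot(x, y, type = "l", ylab = "p", xlab = "x")
polygon(leftRegion$x, leftRegion$y, col = "red")
polygon(rightRegion$x, rightRegion$y, col = "red")
```

The last two vertices of each region lie on the horizontal axis, which is what closes the shaded shape.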