Scatter plots with R

After the histogram, scatter plots are pretty much always the first plot I make when exploring a dataset. The code for the plots below mainly comes from two sites, both of which are brilliant sources of info on using R.

  1. R for public health [1]
  2. Quick R [2]

I’ve only made small changes to fit my data, and to include themes, custom titles, etc.

Data I’m using

To make it easy to replicate, I put the data I’m using in a gist.

temporaryFile <- tempfile()
download.file("https://gist.githubusercontent.com/epijim/8819934/raw/6c76df80eb095065a9ce0fa4b8f94410ad528fed/college_data.csv"
              ,destfile=temporaryFile, method="curl")
mydata <- read.csv(temporaryFile)

# convert wine budget to thousands
mydata$Wine_budget <- mydata$Wine_budget/1000

options(scipen=999) # no scientific notation

# Just in case, I'll duplicate college as a string variable.
mydata$College_char <- as.character(mydata$College)

Scatter plot with rug

A scatter plot with a rug plot on each axis.

library(ggplot2)

png("scatterplot_1_rug.png", width=800)
# print() makes sure the plot is drawn even when the script is source()d
print(
  ggplot(mydata, aes(Founded, Wine_budget))+
    geom_point()+
    geom_rug(col="darkred", alpha=0.4)+ # one tick per point along each axis
    theme_bw()+
    xlab("Year college was founded")+
    ylab("Wine budget in thousands")+
    ggtitle("Scatter of year Cambridge college founded and wine budget")
)
dev.off()

Scatter plot with histograms

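Scatter plot with the marginal distribution of each variable on its axis. The code below draws the margins as density curves (smoothed histograms), filled by college rating.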

library(ggplot2)
library(gridExtra)
#placeholder plot - prints nothing at all
empty <- ggplot()+geom_point(aes(1,1), colour="white") +
  theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank()
  )

#scatterplot of x and y variables
scatter <- ggplot(mydata,aes(Founded, Wine_budget))+
  geom_point(aes(color=Rating))+
  scale_color_manual(values = c("orange", "purple","blue"))+
  theme_bw()+
  theme(legend.position=c(1,1),legend.justification=c(1,1))+
  xlab("Year college was founded")+
  ylab("Wine budget in thousands of £")

#marginal density of x - plot on top
plot_top <- ggplot(mydata, aes(Founded, fill=Rating)) +
  geom_density(alpha=.5) +
  scale_fill_manual(values = c("orange", "purple","blue")) +
  theme_bw()+
  theme(legend.position = "none",axis.title.x = element_blank()) # removes density x

#marginal density of y - plot on the right
plot_right <- ggplot(mydata, aes(Wine_budget, fill=Rating)) +
  geom_density(alpha=.5) +
  coord_flip() +
  scale_fill_manual(values = c("orange", "purple","blue")) +
  theme_bw()+
  theme(legend.position = "none",axis.title.y = element_blank()) # removes density y

#arrange the plots together, with appropriate height and width for each row and column
png("scatterplot_2_hist.png",width=800)
# gridExtra 2.0 and later take the title as top= (older versions used main=)
grid.arrange(plot_top, empty, scatter, plot_right,
             ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4),
             top="Scatter of year Cambridge college founded and wine budget \nby A/B/C Board of Graduate Studies college grading")
dev.off()

Dealing with large datasets

When you have lots of points, it becomes hard to see the outliers in a scatter plot. A super simple solution is to make the points a bit transparent, although that can hide the outliers.
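As a rough sketch of the transparency trick in base R: the data here is simulated (a dense cluster plus a few scattered outliers), and the 20% opacity is an arbitrary choice, not a recommendation.

```r
# Simulated data (an assumption, not the dataset used in this post):
# a dense cluster of 20,000 points plus 20 far-away outliers.
set.seed(1)
x <- c(rnorm(20000), runif(20, -8, 8))
y <- c(rnorm(20000), runif(20, -8, 8))

# adjustcolor() adds an alpha channel; alpha.f = 0.2 means 20% opaque.
point_col <- adjustcolor("black", alpha.f = 0.2)

# Overlapping points build up to darker shades, so the dense core stands out,
# while isolated outliers are drawn faintly.
plot(x, y, pch = 16, cex = 0.6, col = point_col,
     xlab = "x (simulated)", ylab = "y (simulated)")
```

In ggplot2 the equivalent is the alpha argument, e.g. geom_point(alpha = 0.2).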

In the plot below there are more than 16,000,000 data points. As it’s financial data on the price paid for bitcoins, almost all of it is tightly clustered in one spot. This makes it hard to see the outliers, where trades occurred well below or above the market value of a bitcoin. The hexbin plot groups the points into hexagonal cells and shades each cell by how many points fall in it, so you can see both the cluster and the outliers.
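The bitcoin data isn't in the gist, so here is a minimal sketch of how columns like When (days since Jan 1st 1970) and priceperbitcoin_log could be derived. The raw column names timestamp and price, and the three example rows, are made up for illustration.

```r
# Hypothetical raw trades: the column names `timestamp` and `price`
# and the values below are assumptions for illustration only.
trades <- data.frame(
  timestamp = as.POSIXct(c("2013-01-01 09:00:00",
                           "2013-01-02 15:30:00",
                           "2013-01-11 23:59:00"), tz = "UTC"),
  price = c(0.01, 10, 1000)
)

# as.Date() drops the time of day; as.numeric() on a Date gives
# the number of days since 1970-01-01.
trades$When <- as.numeric(as.Date(trades$timestamp))

# A log10 scale keeps very cheap and very expensive trades on one axis.
trades$priceperbitcoin_log <- log10(trades$price)
```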

library(hexbin)
png("scatter_2_hexbin.png", width=800)
  bin<-hexbin(bitcoin$When, bitcoin$priceperbitcoin_log, xbins=50)
  plot(bin,
       main="Price per bitcoin over time in 16,748,682 Mt.Gox transactions",
       xlab="Days since Jan 1st 1970",
       ylab="Log10 of price per bitcoin"
       )
dev.off()

Hexbin plot.

Scatter heat maps

The next plot colours each point by how densely other points are packed around it. This code came out of a Stack Overflow question. [3]

## Use densCols() output to get density at each point
density <- densCols(mydata_naomit$Undergrad_applications,
                    mydata_naomit$Undergrad_acceptances,
                    colramp=colorRampPalette(c("black", "white")))
mydata_naomit$density <- col2rgb(density)[1,] + 1L

## Map densities to colors
colours <-  colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                            "#FCFF00", "#FF9400", "#FF3100"))(256)
mydata_naomit$colours <- colours[mydata_naomit$density]

## Plot it, reordering rows so that densest points are plotted on top
png("scatter_3_heatmap.png")
plot(Undergrad_acceptances~Undergrad_applications,
     data=mydata_naomit[order(mydata_naomit$density),],
     pch=20, col=colours, cex=2,
     xlab="Undergraduate applications",
     ylab="Undergraduate acceptances")
dev.off()

Scatter heat maps.
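To see what the densCols() and col2rgb() steps are doing, here is the same idea on simulated data (two clusters of different sizes, purely an assumption for illustration). The variable names are mine, not from the original answer.

```r
# Simulated stand-in data (an assumption): a big loose cluster
# and a small tight one.
set.seed(42)
x <- c(rnorm(500, 0, 1), rnorm(100, 4, 0.5))
y <- c(rnorm(500, 0, 1), rnorm(100, 4, 0.5))

# densCols() returns one colour per point, picked from the ramp by local
# kernel density; with a black-to-white ramp, lighter means denser.
grey_cols <- densCols(x, y, colramp = colorRampPalette(c("black", "white")))

# For a grey, red = green = blue, so the red channel (0-255) alone encodes
# the density rank; adding 1L turns it into a valid index from 1 to 256.
dens_index <- col2rgb(grey_cols)[1, ] + 1L

# Look up the final colour for each point in a 256-step palette.
palette256 <- colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                                 "#FCFF00", "#FF9400", "#FF3100"))(256)
point_cols <- palette256[dens_index]

# Draw sparse points first so the densest end up on top.
ord <- order(dens_index)
plot(x[ord], y[ord], pch = 20, col = point_cols[ord],
     xlab = "x (simulated)", ylab = "y (simulated)")
```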

Scatter contour plots

And a scatter plot with a coloured contour line around the points.

png("scatter_3_contour.png", width=800)
plot(mydata_naomit,
     xlab="Undergraduate applications",
     ylab="Undergraduate acceptances",
     pch=1, cex=.4)
contour(z, drawlabels=FALSE, nlevels=levels, col=colours, add=TRUE)
dev.off()

Contour scatter plot.
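The snippet above doesn't show where z, levels and colours come from. One plausible way to build them is a 2D kernel density estimate with MASS::kde2d(). This is a sketch on simulated data, and the grid size, number of levels and colour ramp are all arbitrary choices of mine.

```r
library(MASS)  # for kde2d(); MASS ships with standard R installations

# Simulated stand-in for the two plotted columns (an assumption).
set.seed(7)
apps <- rnorm(300, mean = 500, sd = 120)
accs <- 0.25 * apps + rnorm(300, sd = 20)

# 2D kernel density estimate on a 50 x 50 grid; returns a list with x, y, z.
z <- kde2d(apps, accs, n = 50)

# Number of contour levels and one colour per level (both arbitrary).
levels <- 8
colours <- colorRampPalette(c("blue", "red"))(levels)

plot(apps, accs, pch = 1, cex = 0.4,
     xlab = "Undergraduate applications (simulated)",
     ylab = "Undergraduate acceptances (simulated)")
# contour() accepts the kde2d() list directly and overlays it with add = TRUE.
contour(z, drawlabels = FALSE, nlevels = levels, col = colours, add = TRUE)
```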

Interactive scatter plots

The rCharts package allows you to make awesome, interactive charts. There are a number of JavaScript libraries available through it, and each does things slightly differently. None of it seems well documented, and it’s often confusing if you try to relate the documentation from the original libraries back to the rCharts plots. I found it easier to either edit the charts once they are in HTML, or to use str() on a plot environment to get a rough idea of what’s editable.

#require(devtools)
#install_github('ramnathv/rCharts')

require(rCharts)

# factor() so that levels() below works even if College was read in as character
selectedfactor <- factor(mydata$College)

rchart <- rPlot(Firsts_2013 ~ Wine_budget ,
            data = mydata,
            type = "point",
            width = 500,
            color = "College_char")
rchart$guides(y = list(min = 0, max = 40))
rchart$guides(x = list(min = 0, max = 400)) # Wine_budget is already in thousands
rchart$guides(
  color = list(
    numticks = length( levels( selectedfactor ) ),
    labels = as.character( levels( selectedfactor ) )
  ),
  y=list(title="% with firsts"),
  x=list(title="Yearly wine budget in thousands (£)"))
rchart$set(width = 550)
rchart$set(height = 460)

rchart$save('index.html', cdn=FALSE)

Most of the above is pretty self-explanatory, except maybe cdn. If you know you’ll have internet access, set it to TRUE. If you have no internet, or don’t want to rely on the rCharts site, set it to FALSE and put the JavaScript that runs the plot on your own computer/server.

3D scatter plots

A 3D scatter plot from my post about mapping ski data. The 3D scatter plot is super simple to make: code is here.

3D scatter plot of a day skiing.

Scatter plot matrices

Code for these is on my page on the scatter plot matrix, here.

Two types of scatter plot matrix.

  1. Part of the ggplot series on: http://rforpublichealth.blogspot.co.uk
  2. Quick R has a dedicated page on pretty much every simple type of plot: http://www.statmethods.net
  3. The original question is here: http://stackoverflow.com/questions/7073315/how-do-i-create-a-continuous-density-heatmap-of-2d-scatter-data-in-r
