After the histogram, scatter plots are pretty much always the first plot I make when exploring a dataset. The code for the plots below mainly comes from two sites, which are both brilliant sources for info on using R.
I’ve only made small changes to fit my data, and to include themes, custom titles, etc.
Data I’m using
To make it easy to replicate, I put the data I’m using in a gist.
temporaryFile <- tempfile()
download.file("https://gist.githubusercontent.com/epijim/8819934/raw/6c76df80eb095065a9ce0fa4b8f94410ad528fed/college_data.csv",
              destfile=temporaryFile, method="curl")
mydata <- read.csv(temporaryFile)

# convert wine budget to thousands
mydata$Wine_budget <- mydata$Wine_budget/1000
options(scipen=999) # no scientific notation

# Just in case, I'll duplicate college as a string variable.
mydata$College_char <- as.character(mydata$College)
Scatter plot with rug
A scatter plot with a rug plot on each axis.
library(ggplot2)

png("scatterplot_1_rug.png",width=800)
ggplot(mydata,aes(Founded,Wine_budget))+
  geom_point()+
  geom_rug(col="darkred",alpha=0.4)+
  theme_bw()+
  xlab("Year college was founded")+
  ylab("Wine budget in thousands")+
  ggtitle("Scatter of year Cambridge college founded and wine budget")
dev.off()
Scatter plot with histograms
Scatter plot with the distribution of each variable drawn along its axis. The code below uses density curves, a smoothed alternative to histograms.
library(ggplot2)
library(gridExtra)

#placeholder plot - prints nothing at all
empty <- ggplot()+geom_point(aes(1,1), colour="white") +
  theme(
    plot.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank()
  )

#scatterplot of x and y variables
scatter <- ggplot(mydata,aes(Founded, Wine_budget))+
  geom_point(aes(color=Rating))+
  scale_color_manual(values = c("orange", "purple","blue"))+
  theme_bw()+
  theme(legend.position=c(1,1),legend.justification=c(1,1))+
  xlab("Year college was founded")+
  ylab("Wine budget in thousands of £")

#marginal density of x - plot on top
plot_top <- ggplot(mydata, aes(Founded, fill=Rating)) +
  geom_density(alpha=.5) +
  scale_fill_manual(values = c("orange", "purple","blue")) +
  theme_bw()+
  theme(legend.position = "none",axis.title.x = element_blank()) # hides the x axis title

#marginal density of y - plot on the right
plot_right <- ggplot(mydata, aes(Wine_budget, fill=Rating)) +
  geom_density(alpha=.5) +
  coord_flip() +
  scale_fill_manual(values = c("orange", "purple","blue")) +
  theme_bw()+
  theme(legend.position = "none",axis.title.y = element_blank()) # hides the y axis title

#arrange the plots together, with appropriate height and width for each row and column
#(gridExtra 2.0 and later take the title as top=; older versions used main=)
png("scatterplot_2_hist.png",width=800)
grid.arrange(plot_top, empty, scatter, plot_right,
             ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4),
             top="Scatter of year Cambridge college founded and wine budget \nby A/B/C Board of Graduate Studies college grading")
dev.off()
Dealing with large datasets
When you have lots of points, it becomes hard to see the outliers in a scatter plot. A super simple solution is to make the points a bit transparent, although that can hide the outliers.
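As a rough illustration of the transparency trick, here is a minimal sketch in base R. The data frame `sim` is simulated purely for this example (it is not the college or bitcoin data), and the alpha value of 0.1 is an arbitrary choice.

```r
# Simulated data (an assumption for illustration only):
# one dense cluster of 20,000 points plus 20 scattered outliers.
set.seed(1)
sim <- data.frame(
  x = c(rnorm(20000), runif(20, -6, 6)),
  y = c(rnorm(20000), runif(20, -6, 6))
)

# Each point is drawn at 10% opacity, so dense regions build up to a
# dark colour while isolated points stay faint.
pt_col <- adjustcolor("black", alpha.f = 0.1)

plot(sim$x, sim$y, pch = 16, col = pt_col,
     xlab = "x (simulated)", ylab = "y (simulated)",
     main = "Semi-transparent points")
```

In ggplot2 the equivalent is `geom_point(alpha = 0.1)`. Notice how the 20 outliers are barely visible, which is exactly the drawback mentioned above.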
In the plot below there are more than 16,000,000 data points. As it’s financial data, on the price paid for bitcoins, it’s all very closely clustered in one spot. This makes it hard to see the outliers, where trades occurred that were well below or above the market value of a bitcoin. The hexbin plot allows you to see this data.
library(hexbin)

png("scatter_2_hexbin.png", width=800)
bin<-hexbin(bitcoin$When, bitcoin$priceperbitcoin_log, xbins=50)
plot(bin,
     main="Price per bitcoin over time in 16,748,682 Mt.Gox transactions",
     xlab="Days since Jan 1st 1970",
     ylab="Log10 of price per bitcoin"
)
dev.off()
Scatter heat maps
The next plot colours each point based on how close other points are to it. This code came out of a Stack Overflow question (linked at the bottom of this page).
## mydata_naomit is not defined in this post; the name suggests it is
## mydata with incomplete rows removed (e.g. via na.omit())

## Use densCols() output to get density at each point
density <- densCols(mydata_naomit$Undergrad_applications,
                    mydata_naomit$Undergrad_acceptances,
                    colramp=colorRampPalette(c("black", "white")))
mydata_naomit$density <- col2rgb(density)[1,] + 1L

## Map densities to colors
colours <- colorRampPalette(c("#000099", "#00FEFF", "#45FE4F",
                              "#FCFF00", "#FF9400", "#FF3100"))(256)
mydata_naomit$colours <- colours[mydata_naomit$density]

## Plot it, reordering rows so that densest points are plotted on top
png("scatter_3_heatmap.png")
plot(Undergrad_acceptances~Undergrad_applications,
     data=mydata_naomit[order(mydata_naomit$density),],
     pch=20, col=colours, cex=2,
     xlab="Undergraduate applications",
     ylab="Undergraduate acceptances")
dev.off()
Scatter contour plots
And a scatter plot with a coloured contour line around the points.
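# z, levels and colours are not defined in this post.
# plot() on a data frame only draws a single scatter plot when the
# data frame has exactly two columns.
png("scatter_3_contour.png", width=800)
plot(mydata_naomit,
     xlab="Undergraduate applications",
     ylab="Undergraduate acceptances",
     pch=1, cex=.4)
contour(z, drawlabels=FALSE, nlevels=levels, col=colours, add=TRUE)
dev.off()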
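The contour call uses `z`, `levels` and `colours` without showing where they come from. One plausible setup, which is my assumption rather than the original code, is a two-dimensional kernel density estimate from `MASS::kde2d()`. The sketch below uses a simulated two-column data frame `xy` in place of `mydata_naomit`, and the number of levels and the palette are arbitrary choices.

```r
library(MASS)  # provides kde2d()

# Simulated two-column data standing in for mydata_naomit (an assumption).
set.seed(1)
xy <- data.frame(applications = rnorm(500, mean = 600, sd = 150),
                 acceptances  = rnorm(500, mean = 150, sd = 40))

levels  <- 10                          # number of contour levels (assumed)
colours <- rev(heat.colors(levels))    # one colour per level (assumed palette)

# 2D kernel density estimate on a 50 x 50 grid; returns a list with x, y, z.
z <- kde2d(xy[, 1], xy[, 2], n = 50)

plot(xy, pch = 1, cex = .4,
     xlab = "Applications (simulated)", ylab = "Acceptances (simulated)")
contour(z, drawlabels = FALSE, nlevels = levels, col = colours, add = TRUE)
```

`contour()` accepts the list returned by `kde2d()` directly, which is why `z` can be passed as the only data argument.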
Interactive scatter plots
The rCharts package allows you to make awesome, interactive charts. There are a number of libraries available, and each does things slightly differently. None of it seems well documented, and it’s often confusing if you try to relate the documentation from the original libraries back to the rCharts plots. I found it easier to either edit the charts once they are in html, or to use str() on a plot environment to get a rough idea of what’s editable.
#require(devtools)
#install_github('ramnathv/rCharts')
require(rCharts)

# factor() so that levels() still works if read.csv() returned a character column
selectedfactor <- factor(mydata$College)

rchart <- rPlot(Firsts_2013 ~ Wine_budget, data = mydata, type = "point",
                width = 500, color = "College_char")
rchart$guides(y = list(min = 0, max = 40))
rchart$guides(x = list(min = 0, max = 400000))
rchart$guides(
  color = list(
    numticks = length(levels(selectedfactor)),
    labels = as.character(levels(selectedfactor))
  ),
  y = list(title="% with firsts"),
  x = list(title="Yearly wine budget in thousands (£)"))
rchart$set(width = 550)
rchart$set(height = 460)
rchart$save('index.html', cdn=FALSE)
3D scatter plots
Scatter plot matrices
Code for these is on my page on the scatter plot matrix, here.
The original question is here: http://stackoverflow.com/questions/7073315/how-do-i-create-a-continuous-density-heatmap-of-2d-scatter-data-in-r