Sunday, December 4, 2016

Fun with Data - Basics of R - Part 3 Visualization

In this section we'll cover some basic visualizations with R using the built-in plot function and the ggplot2 library. Before we start, here's a quick review of the topics we covered in the previous two sections.

Part 1 - Getting R and R-Studio on your machine, understanding the layout of R-Studio and loading data into R. We also covered some basic functions for viewing the contents of a loaded dataset.

Part 2 - Understanding prompts in R-Studio, variables, vectors and data frames. We also covered the summary, mean and length functions.

Now that we can load data into R and see what's inside, let's try to visualize it. Visualization is a great way to make the meaning of data visible, especially when we're dealing with a large amount of data. For example, suppose we're planning a sports event at school and want to group students by height. We could work out the distribution by reading through the raw data, but a graph makes it much easier to see what proportion of students falls into each height category.

Today we'll see how to create a scatter plot using both the basic plot function and ggplot2. R comes with many freely available datasets, which you can list by typing data(). We'll use the airquality, mtcars and pressure datasets to create sample graphs. Since they're already in R, we don't have to load them, but do look at their contents using some of the functions we covered in previous sections, or by simply typing their names in the console.

airquality
mtcars
pressure
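
Typing a name prints the whole dataset, which can be a lot to scroll through. Here is a small sketch of a lighter way to look inside, using head and summary along with str and dim (the last two are standard base R functions, though I'm assuming they may be new to you):

```r
# Peek at the built-in datasets without printing every row
head(mtcars)        # first six rows of mtcars
str(airquality)     # column names, types and a few sample values
summary(pressure)   # min, quartiles, mean and max of each column

# dim() returns the number of rows and the number of columns
mtcars_size <- dim(mtcars)
mtcars_size
```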

Splitting the View Window
You can split the view window to see more than one graph or plot on the screen. For example, par(mfrow = c(1, 2)) splits the window into 2 side-by-side panels (1 row, 2 columns). Changing it to c(2, 2) splits it into 4.
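
As a rough sketch of how this might look in practice, the example below draws two plots side by side and then restores the original layout. The choice of columns and the variable name old_par are just for illustration.

```r
# Remember the current layout so it can be restored afterwards
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

# Left panel: vapor pressure of mercury against temperature
plot(pressure$temperature, pressure$pressure,
     xlab = "Temperature", ylab = "Pressure", main = "pressure")

# Right panel: ozone against temperature (rows with missing values are skipped)
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temperature", ylab = "Ozone", main = "airquality")

# Go back to one plot per window
par(old_par)
layout_after <- par("mfrow")
```

Saving the value returned by par() is optional, but it keeps later plots from unexpectedly landing in a split window.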

Install ggplot2 Package
install.packages("ggplot2")  # installs the package (only needed once)
library(ggplot2)             # loads the package into the current session

Scatter Plot
In the mtcars dataset we can see the weight of each car (wt, in units of 1000 lbs) and how many miles it runs per gallon (mpg). We can find out whether these two are correlated by creating a scatter plot.

plot(mtcars$wt, mtcars$mpg, main = "Scatterplot Example", xlab = "Car Weight", ylab = "Miles Per Gallon", pch = 19)

I'll explain what each of the parameters means.
main -> Title of the plot
xlab -> Label of the X-axis
ylab -> Label of the Y-axis
pch -> Type of symbol used for the points. You can see the full list on the help page for points (type ?points in the console).
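
To get a feel for the pch values, here is a minimal sketch that draws one point for each of the integer codes 0 to 25. The variable name pch_codes and the layout choices are my own, purely for illustration.

```r
# pch accepts the integer codes 0 to 25; draw one point per code
pch_codes <- 0:25
plot(pch_codes, rep(1, length(pch_codes)), pch = pch_codes,
     main = "pch symbols 0 to 25", xlab = "pch value", ylab = "",
     yaxt = "n", ylim = c(0.5, 1.5))

# Print each code just below its symbol
text(pch_codes, 0.8, labels = pch_codes, cex = 0.7)
```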

Now, the scatter plot alone can be a little difficult to read. So, you can add fit lines to see whether the two variables are positively or negatively correlated.

abline(lm(mtcars$mpg ~ mtcars$wt), col = "red")       # regression line (y ~ x)
lines(lowess(mtcars$wt, mtcars$mpg), col = "green")   # lowess line (x, y)
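
If you'd like a number to go with the picture, a small sketch like the one below computes the correlation coefficient and the slope of the regression line. The names r, fit and slope are just illustrative. Both values come out negative for this data, which matches the downward-sloping regression line.

```r
# Correlation coefficient: ranges from -1 to 1, and the sign gives the direction
r <- cor(mtcars$wt, mtcars$mpg)
r

# Slope of the fitted regression line (change in mpg per unit of weight)
fit <- lm(mpg ~ wt, data = mtcars)
slope <- coef(fit)[["wt"]]
slope
```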


You can create the same graph using ggplot2, which is visually more pleasing and provides a lot of options for customizing your graph. (Check out the package documentation for further information.)

# Create the same graph using ggplot2
qplot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Miles/Gallon")

# The following syntax can be used if the two vectors are already in the same data frame.
qplot(wt, mpg, data = mtcars, xlab = "Car Weight", ylab = "Miles/Gallon")
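
A note for readers on newer versions of ggplot2: as far as I know, qplot has been marked as deprecated since version 3.4.0, so it may print a warning. Here is a minimal sketch of what I'd assume to be the equivalent plot written with the ggplot function (the variable name p is just for illustration):

```r
library(ggplot2)

# Same scatter plot built layer by layer: data + aesthetic mapping + geometry
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Car Weight", y = "Miles/Gallon")

print(p)   # printing the object draws the plot
```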


We can also add color to see how a third variable relates to these two. In this example, we use the number of cylinders (cyl).
qplot(wt, mpg, data = mtcars, xlab = "Car Weight", ylab = "Miles/Gallon", color = cyl)
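
Because cyl is stored as a number, ggplot2 colors the points along a continuous gradient. If you'd rather have one distinct color per cylinder count, one option is to wrap it in factor. This is only a sketch of that idea, written with the ggplot function, and the name p_cyl is my own:

```r
library(ggplot2)

# factor(cyl) treats 4, 6 and 8 as categories, so each gets its own color
p_cyl <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Car Weight", y = "Miles/Gallon", color = "Cylinders")

print(p_cyl)
```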

From this graph we can infer that lighter cars with fewer cylinders usually run more miles per gallon. Similarly, we can create different visualizations by comparing other variables, depending on what information we're trying to convey to our audience.
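
As one more sketch along the same lines, here is how the airquality dataset could be explored in the same way, plotting ozone against temperature and coloring by month. The names aq and p_aq are made up for this example, and dropping the rows with missing ozone readings is my own choice to keep the output free of warnings.

```r
library(ggplot2)

# Drop rows where Ozone is missing so ggplot2 does not warn about them
aq <- airquality[!is.na(airquality$Ozone), ]

# Ozone against temperature, colored by month (5 = May ... 9 = September)
p_aq <- ggplot(aq, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point() +
  labs(x = "Temperature (F)", y = "Ozone (ppb)", color = "Month")

print(p_aq)
```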

A great reference for learning about visualization with R is the R Graphics Cookbook by Winston Chang:
Chang, W. (2012). R Graphics Cookbook. O'Reilly Media, Inc.