Showing posts with label R. Show all posts

Sunday, December 4, 2016

Fun with Data - Basics of R - Part 3 Visualization

In this section we'll cover some basic visualizations with R using the built-in plot function and the ggplot2 library. Before we start, here's a quick review of the topics we covered in the previous two sections.

Part 1 - Getting R and R-Studio on your machine, understanding the structure of R-Studio and loading data into R. We also covered some basic functions for viewing the content of a loaded dataset.

Part 2 - Understanding prompts in R-Studio, variables, vectors and data frames. Also, summary, mean and length functions.

Now that we can load data into R and see what's inside, let's try to visualize it. Visualization is a great way to make the meaning of data visible, especially when we're dealing with large amounts of data. For example, suppose we're planning a sports event at school and trying to group students based on their heights. We can see the distribution by looking at the raw data, but once we create a graph it is much easier to see what proportion of students falls into each height category.

Today we'll see how to create a scatter plot using both the basic plot function and ggplot2. R comes with many freely available datasets, which you can list by typing data(). We'll use the airquality, mtcars and pressure datasets to create sample graphs. Since they're already in R, we don't have to load them, but do look at their content using some of the functions we covered in previous sections, or by simply typing their names in the console.

airquality
mtcars
pressure

Splitting the View Window
You can split the view window to see more than one graph or plot in the screen. For example, par(mfrow = c(1, 2)) splits the window into 2. Changing it to c(2, 2) will split it into 4. 
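As a minimal sketch of the split view, the following draws two plots side by side using the built-in pressure and mtcars datasets. The choice of plots, the titles and the name old_par are only for illustration.

```r
# Split the window and keep the previous settings so they can be restored
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

plot(pressure, main = "Pressure")                    # left panel
plot(mtcars$wt, mtcars$mpg, main = "Weight vs MPG")  # right panel

par(old_par)                      # back to one plot per window
```

Restoring the old settings at the end keeps later plots from unexpectedly landing in a split window.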

Install ggplot2 Package
install.packages("ggplot2")  # installs the package
library(ggplot2)                    # loads the library into the workspace 

Scatter Plot
In the mtcars dataset we can see the weight of each car and how many miles it runs per gallon. We can find out if there's any correlation between these two by creating a scatter plot.

plot(mtcars$wt, mtcars$mpg, main="Scatterplot Example", xlab="Car Weight", ylab="Miles Per Gallon", pch=19)

I'll explain what each of the parameters means.
main -> Give a name to the plot
xlab -> Name of X-axis
ylab -> Name of Y-axis
pch -> Type of symbol for the graph. You can see the full list from this page.

Now, this alone can be a little difficult to read. So, you can add fit lines to the plot to see whether the two variables are positively or negatively correlated.

abline(lm(mtcars$mpg~mtcars$wt), col="red")  # regression line (y~x)
lines(lowess(mtcars$wt, mtcars$mpg), col="green")   # lowess line (x, y)
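To put a number on what the fit lines suggest, here is a small sketch using cor() and lm() on the same two columns. The variable names r and fit are only illustrative.

```r
# Pearson correlation between car weight and miles per gallon
r <- cor(mtcars$wt, mtcars$mpg)
r            # negative: heavier cars tend to run fewer miles per gallon

# The same kind of linear model that abline() draws above
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)    # intercept and slope (change in mpg per 1000 lbs of weight)
```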


You can create the same graph using ggplot2, which is visually more pleasing and provides a lot of options to decorate your graph. (Check out the package documentation for further information.)

# Create the same graph using ggplot2
qplot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Mile/Gallon")

# The following syntax can be used if the two vectors are already in the same data frame.
qplot(wt, mpg, data=mtcars, xlab = "Car Weight", ylab = "Mile/Gallon")


We can also add color to see how a third variable relates to these two. In this example, it is the number of cylinders in each car.
qplot(wt, mpg, data=mtcars, xlab = "Car Weight", ylab = "Mile/Gallon", color = cyl)

From this graph we can infer that lighter cars with fewer cylinders usually run more miles per gallon. Similarly, we can create different visualizations by comparing other variables, or in other words, by considering what information we're trying to convey to our audience.
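Recent ggplot2 releases mark qplot() as deprecated, so here is a sketch of the same colored scatter plot written with ggplot() and geom_point(). Wrapping cyl in factor() and the legend title "Cylinders" are my own choices here, not part of the original example.

```r
library(ggplot2)

# Same scatter plot, built layer by layer
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Car Weight", y = "Mile/Gallon", color = "Cylinders")

print(p)
```

Using factor(cyl) gives each cylinder count its own distinct color, whereas passing the plain number produces a continuous color gradient.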

A great reference for learning about visualizations with R is R Graphics Cookbook by Winston Chang:
Chang, W. (2012). R Graphics Cookbook. O'Reilly Media, Inc.

Sunday, October 16, 2016

Fun with Data - Basics of R - Part 2

In our first post on the R programming language I covered how to download R and R Studio, the structure of R Studio, and loading datasets into R. I meant to continue the series (as I always do), but couldn't get back to it any sooner. Recently I've started working on a workshop where I'll be teaching R to beginners. So, I thought this would be the best time to add more content to this series as well.

In this Part 2 I'll write about the following: understanding the prompts of R Studio, doing basic calculations in R, all about variables, functions, the concept of a vector in R, and data frames. Some of these I should have covered in Part 1, but better late than never!

Prompts in R Studio
  • In the console a new line starts with >, which means R is waiting for us to enter a command
  • If we give it an incomplete command, it shows + on the next line and waits for the rest. Press the Esc key to cancel and return to a new prompt.
  • To quit R type q()
Doing Basic Calculations in R
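  • The order of arithmetic operations is (done first to done last): ^, then * and /, then + and -
  • ^ is used for raising to a power. Multiplication and division share the same priority, as do addition and subtraction; operators of equal priority are evaluated left to right, and parentheses can be used to change the order.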
  • At the prompt, we enter the expression that we want evaluated and when we hit enter, it will compute the result for us. For example: > 10 + 22 will return [1] 32
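A few expressions to try at the prompt show how R orders arithmetic operations. The numbers are arbitrary examples.

```r
2 + 3 * 4      # multiplication before addition: 14
(2 + 3) * 4    # parentheses override the default order: 20
2 * 3 ^ 2      # exponent before multiplication: 18
8 / 2 * 4      # / and * have equal priority, evaluated left to right: 16
10 - 4 + 3     # - and + have equal priority, evaluated left to right: 9
```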
All About Variables 
  • Variables are symbols that store assigned values. We can store a computation under a new variable or change the existing value of an old variable.
  • Variable names in R are case sensitive (x and X are two different variables).
  • It is good practice to choose meaningful variable names, so that they are easy to refer to in complex calculations.
To assign a value: variable_name <- value
Example: x <- 100
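Here is a short sketch of the points above: storing a value, changing it, and case sensitivity. The names X and total are just illustrative.

```r
x <- 100         # store a value
x <- x + 1       # change the existing value using the current one
X <- 5           # a different variable: names are case sensitive
total <- x + X   # store the result of a computation under a new name
total            # 106
```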

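ALERT! Names to Avoid!
Every programming language sets some names aside for its own purposes. In R, the single-letter names c q t C D F I T already belong to built-in functions and constants (for example, T and F are shorthand for TRUE and FALSE). R will not stop you from assigning to them, but doing so hides the built-in versions and can cause confusing results. (So, don't use them for your personal variables ^-^)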
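A small sketch of why reusing one of these names is risky, using T as the example. R accepts the assignment, and the built-in meaning only comes back once our own copy is removed. The names before and after are illustrative.

```r
T <- 0                  # accepted by R, but now T no longer means TRUE
before <- isTRUE(T)     # FALSE: code that relies on T would misbehave

rm(T)                   # remove our copy so the built-in T is visible again
after <- isTRUE(T)      # TRUE

c(before, after)
```

Writing TRUE and FALSE in full avoids the problem, since those two cannot be reassigned.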

Functions
A function is a sub-program that performs a specific task, for example finding the square root of a given value. Functions help us avoid repetition and make the same task easy to run again in the future.
Try this code to understand how functions work -
firstFunction <- function(n){n*n}
This function, named firstFunction, returns the square of any number. Test it out by passing different values to the function. Think of what other functions you could write.
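For instance, calling the function with a few different inputs might look like this. The second function, cube, is a hypothetical example of another function you could write.

```r
firstFunction <- function(n) { n * n }

firstFunction(4)            # 16
firstFunction(2.5)          # 6.25
firstFunction(c(1, 2, 3))   # works element by element on a vector: 1 4 9

# A hypothetical second function, for illustration only
cube <- function(n) { n^3 }
cube(3)                     # 27
```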

Vector
The word vector has different meanings in different contexts. In math and physics, a vector is a quantity with both magnitude and direction. But in R, a vector is a sequence of data elements of the same basic type. It can be defined by combining its members with the c() function. Example: x <- c(1, 2, 4, 5).
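The "same basic type" rule can be seen by mixing numbers and text in one vector. In this sketch the name mixed is illustrative.

```r
x <- c(1, 2, 4, 5)
class(x)                  # "numeric"

mixed <- c(1, "two", 3)   # one string forces every element to become text
class(mixed)              # "character"
mixed                     # "1" "two" "3"
```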

Once we have a vector of numbers we can apply certain built-in functions to them to get useful summaries. For example:
> sum(x)        ## sums the values in the vector
> length(x)    ## produces the number of values in the vector, ie its length
> mean(x)     ## the average (mean)

Data Frame
A data frame can be created by defining different variables for each column as vectors and then joining them together.
Example: Let us assume we have a list of different fruits with their names, colors and sizes.
> name      <- c("apple", "banana", "peach", "watermelon", "grape")
> color      <- c("red", "yellow", "peach", "green", "red")
> size_cm <- c(10, 15, 8, 40, 2)

Then we combine these three columns to create the data frame named fruits.data.
> fruits.data <- data.frame(name, color, size_cm)

To see the values of the data frame -
> fruits.data
        name  color size_cm
1      apple    red      10
2     banana yellow      15
3      peach  peach       8
4 watermelon  green      40
5      grape    red       2
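Once the data frame exists, its columns can be used like the vectors from the previous section. A brief sketch, rebuilding the same fruits.data so the block runs on its own; the name big and the 9 cm cutoff are arbitrary choices for illustration.

```r
name    <- c("apple", "banana", "peach", "watermelon", "grape")
color   <- c("red", "yellow", "peach", "green", "red")
size_cm <- c(10, 15, 8, 40, 2)
fruits.data <- data.frame(name, color, size_cm)

nrow(fruits.data)             # number of rows: 5
fruits.data$size_cm           # one column, returned as a vector
mean(fruits.data$size_cm)     # average size: 15

# Keep only the rows where the fruit is larger than 9 cm
big <- fruits.data[fruits.data$size_cm > 9, ]
big
```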

--------------------------------------------------------------------------------------------------------------------------
I think we've covered a lot of basic concepts already, so I'll stop here today. In the next post of this series, I'll write about setting the working directory, manipulating datasets, and playing around with some plots/visualizations, and I hope I can make it sometime soon!