Tuesday, December 1, 2015

Fun with Data - Basics of R - Part 1

It's been a while since my last technical post, and I still need to finish my database series. When I started grad school, my plan was to write a post on every exciting new thing I learned, for my own reference, but time constraints got in the way. Recently I've been learning to use R, a widely used statistical software for data analytics, and I thought it'd be useful to write about the basics for anyone who'd like to learn it from scratch. I'll keep these posts short and write more as I learn.

Using R on your machine
I'm using R through RStudio, which requires R to be installed on your computer. Both are available for Windows, macOS, and Linux (including Ubuntu). Since I'm on macOS, I'll describe how it works on this platform, but it should be pretty similar on the others.

First of all, make sure you have R installed. In your terminal, type which R (with a capital R), which prints the path where R is installed, or simply type R, which starts the R console. If you don't have it installed, you can easily install it from the following link - https://cran.r-project.org/bin/macosx/
Then download and install RStudio from the following link - https://www.rstudio.com/products/rstudio/download/
If the installation completed properly, you can run the software, which will look like the screen below, but without any data of course. I've marked the fields as A, B, C, and D to explain their meaning and usage.


The top left field A is where you write scripts, as in any other text editor. You can write your script and select any portion of it to run. Or you can use the bottom left field B, the R console, to type commands directly and see their output. It works much like Terminal or iTerm.

The top right field C lists the data frames in your workspace once any data frame or table has been imported or created. In my case, I have three tables loaded right now. The bottom right field D shows all graphical output, such as bar plots, scatterplots, etc.

Loading data into R
The most commonly used data formats are .csv and .txt; the data is usually gathered from other sources and then loaded into R. Once you have the file saved on your machine, get its full path, which you can find by viewing the file's information or properties. For example, my ihis_0005.csv file is saved under the /Documents/fall2015_classes/SODA_EVD directory. The command to read a csv file is simply read.csv(), and the file path goes inside the parentheses, in quotes. You'll want to give the table a name, which is ihis_data in my case. So the command will be,
your_table_name <- read.csv("your_file_path")
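If you'd like to try read.csv() before you have a data file of your own, here's a minimal sketch. The file, its columns (age, sex) and the name demo_data are all made up for illustration; they're not from my IHIS data.

```r
# Hypothetical example: write a tiny CSV to a temporary file, then read it back.
# The column names and values below are invented for illustration only.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("age,sex", "34,female", "51,male", "27,female"), tmp_path)

# By default read.csv() treats the first line of the file as column names.
demo_data <- read.csv(tmp_path)

str(demo_data)    # shows each column and its type
head(demo_data)   # shows the first few rows
```

With a real file, you'd just replace tmp_path with the quoted path to your own .csv file.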

Once your data is loaded, it'll show up in the top right section. R makes it easy to see a summary of the data: just type summary(your_table_name), which is most useful for quantitative variables. It returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each numeric column. For categorical variables, the table() command is more useful. To see how many times each value occurs in a column with a categorical variable, type table(your_table_name$column_name). The $ sign selects a column from the table.
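Here's a small sketch of summary() and table() on a toy data frame. The data frame demo_data and its columns are assumptions I made up for the example, so the numbers mean nothing beyond showing what the commands do.

```r
# Made-up data frame for illustration; not the IHIS data from this post.
demo_data <- data.frame(
  age = c(34, 51, 27, 45),
  sex = c("female", "male", "female", "female")
)

# Quantitative column: minimum, quartiles, median, mean, and maximum.
summary(demo_data$age)

# Categorical column: how many rows fall into each category.
sex_counts <- table(demo_data$sex)
sex_counts
```

Calling summary(demo_data) on the whole table works too, and summarizes every column in one go.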

You can also see all the data frames in your workspace with the ls() command, where ls means list, and delete any data frame with rm(your_table_name), where rm means remove. To delete multiple tables at once, list the names of all the tables you'd like to delete, like this -
rm(list = c("table1", "table2", ...))
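To see ls() and rm() in action, here's a minimal sketch. The objects table1, table2 and table3 are hypothetical and created only so there's something to remove.

```r
# Hypothetical objects, created only so there is something to list and remove.
table1 <- data.frame(x = 1:3)
table2 <- data.frame(y = 4:6)
table3 <- data.frame(z = 7:9)

ls()                               # lists the objects currently in the workspace

rm(table1)                         # removes a single object
rm(list = c("table2", "table3"))   # removes several objects, given by name

ls()                               # table1, table2 and table3 no longer appear
```

Note that rm() only removes the objects from your R session; the files they were loaded from stay on disk.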

These are the preparatory steps before we can actually go ahead and play with our data. I'll write more about how to actually have fun with it in my next post (which I believe will happen soon!) :)
