User Tools

Site Tools


r:flamescourse

By Goele Bossaert (KULeuven), 29/09/2014-02/10/2014

Links

Books

R Studio

  • open a new r script
  • left upper corner = script (= programme)
  • lower left = console
  • upper right corner = workspace & history
  • lower right corner
    • files = file manager
    • plots
    • packages
    • help = same as http://cran.r-project.org/manuals.html (more or less), in the manuals, there are
      • Introduction to R
      • R Data Import/Export
      • R installation and Administration
      • the rest is more advanced

Ask question to R

in the console, question mark + word

?plot

→ opens related documentation in the help screen

Basic information

  • R is case-sensitive
  • end a command with ; or a new line
  • comment lines start with #
  • assign a value with ← or =
x<-5
x=5

R studio != R

  • R studio = syntax coloured
  • R = doesn't have other windows, so you have to ask for them to be shown

Packages

  • R studio
    • lower right corner, tab “packages”
    • click on the link = get information
  • R
    • library() : list of loaded packages
    • library(help=nlme) : help on the nlme package
  • packages have to be installed only once, but loaded every session
  • install a new package (e.g. rsiena)
    • R studio
      • in the “packages” tab, click on install packages and type RSiena
    • R
      • install.packages(“RSiena”)
  • load a package
    • R studio
      • click on the checkbox
    • R
      • library(“RSiena”)
  • list of loaded packages
    • search()

Exercise 1

  • list of available packages
library()
  • load package robustreg
    • actually, first need to install it
install.package("robustreg")
library("robustreg")
  • check the content of this package
library(help=robustreg)
  • check which packages are loaded in your system
search()

working directory

getwd()
setwd()

remove data from the workspace

# remove vector test1
rm(list="test1")

#remove all vectors
rm(list=ls())

Data structures

Vectors

Numeric vectors

cf. script_1.R

Functions:

  • order() = list the place of the smallest number, then the second smallest, etc.
  • rev() = reverse
    • e.g. rev(order(x)) : list from the biggest to the smallest places
  • sort() : output the numbers from smallest to biggest (N.B. : doesn't change your vector)
  • unique(): prune the vector of equal values (if y[2,3,2,4], unique will give [2,3,4])

Numeric functions

  • abs() : absolute value
  • trunc() : cut the decimals
  • round() : round to most proximate integer
    • round(z, digits=2) outputs 2 decimals

Shortcuts

y <- 1:8
#y=(1,2,3,4,5,6,7,8)
z <- 20:1
seq(from=-2, to=3, by=0.5)
u <- rep(2:5,2)
#u=(2,3,4,5,2,3,4,5)
v <- rep(2:5,2:5)
#v=[2 2 3 3 3 4 4 4 4 5 5 5 5 5]

Exercise 2

  1. Create the vector test1 with numbers 1.5, 0.7, 45.6
  2. Create the following vectors:
    1. create a vector x with elements (1,2,3,..,100)
    2. create a vector y with elements (0,5,10,15,….,500)
    3. create a vector z1 with elements (1,1,1,2,2,2,….,50,50,50)
    4. create a vector z2 with elements (1,2,2,3,4,4,…,9,10,10)
test1<-c(1.5, 0.7, 45.6)
test1
x<-c(1:100)
x
y<-c(seq(from=0, to=500, by=5))
y
z<-rep(1:50,3)
z
z1<-sort(z1)
z1
z1<-sort(rep(1:50,3))
z1
z1<-sort(rep(1:50,each=3))
z1
 
z2.1 <- rep(1:2,5)
z2.1
 
z2<-rep(1:10,z2.1)
z2
 
z2 <- rep(1:10, rep(c(1,2),5))
z2

Logical vectors

x <- c(9,10,8,5,9)
y <- x<9
 
#y = FALSE FALSE TRUE TRUE FALSE

N.B.: TRUE = 1 and FALSE = 0

  • Logical expressions : <, ⇐, >, >=, ==, !=
  • logical operators : &, |, !, %in%
    • e.g. x %in% c(3, 6, 2, 10)

How many values >= 10 in a given vector ?

x <- 1:20
y <-x>10
y
sum(y)

Character vectors

nchar(x) = number of characters in x

Mixing type in a vector

  • less information → most information: logical < numeric < character
  • str(x) will give you the type of vector
  • missing values
    • NA for logical & numeric data
    • “” for character

Subset of vectors

x[subset]

y <- c(33, 55, 4, 22, 89)
y
#2nd and 3rd values
y[c(2,3)]
#everything exept the 5th value
y[-5]
y[y > 30]
 
=== Exercise 3 ===
<code>
y1 <- c(seq(from=-5, to=10))
y1
 
y1 <- -5:10
y1
 
y2 <- y1>-2 & y1<7
y2
 
sum(y2)
 
progr <- c("SAS progr", "R progr", "SPSS progr")
progr
 
stat <- substr(progr, c(1,1,1), c(3,1,4))
stat
 
stat <- substr(progr, 1, c(3,1,4))
stat
 
first <- c(1,1,1)
last <- nchar(progr)-6
stat <- substr(prog, first, last)
stat

Matrices

= sum of vectors of same type , same length

bind vectors by binding rows (rbind()) or columns (cbind()) together

x <- rbind(c(1:4), c(5:8))
x

x = 2×4 integer matrix 2 = rows, 4 = columns

name rows and colums (by default [1,] and [,1];

  • dimnames(x)
  • rownames(x)
  • colnames (x)

Exercise 3

#exercise 3
v <- rnorm(100)
v
 
mat1 <- matrix(data=v, nrow=10, ncol=10, byrow=TRUE)
mat1
 
mat1 <- rbind(c(1:10), mat1)
mat1
 
mat1 <- rbind(mat1, 10:1)
mat1
 
mat1 <- cbind(1:12, mat1)
mat1

Data frames

Matrix extended : can contain more than numbers in any “cell”.

Merge

  • merge() to merge data frames
authors <- data.frame(
         surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
         nationality = c("US", "Australia", "US", "UK", "Australia"),
         deceased = c("yes", rep("no", 4)))
books <- data.frame(
         name = c("Tukey", "Venables", "Tierney",
                  "Ripley", "Ripley", "McNeil", "R Core"),
         title = c("Exploratory Data Analysis",
                   "Modern Applied Statistics ...",
                   "LISP-STAT",
                   "Spatial Statistics", "Stochastic Simulation",
                   "Interactive Data Analysis",
                   "An Introduction to R"),
         other.author = c(NA, "Ripley", NA, NA, NA, NA,
                          "Venables & Smith"))
authors
books
 
m1 <- merge(authors, books, by.x = "surname", by.y = "name")
m1
m2 <- merge(books, authors, by.x = "name", by.y = "surname")
m2
  • order has to be correct : x related to authors in the first merge() and w to books
  • by default, takes only what is in common (if only in one data frame, not taken) = natural join (all = FALSE)
    • change it to all=TRUE to get a full outer join
    • change it to all.x=TRUE = take all from x data frame (and intersection between x and y, but not what is in y and not in x) = left outer join
    • same with all.y=TRUE = right outer join

Selection

sub1 <- authors[1:3, 1:2]
sub1
 
sub2 <- authors$nationality
sub2

With the dollar sign ($) = list (other construct, not seen yet)

Can also use subset():

?airquality #need the datasets packages loaded
 
sub1 <- subset(airquality, Temp > 80, select = c(Ozone, Temp))
sub1
 
sub2 <-     subset(airquality, Day == 1, select = -Temp)
sub2
 
sub3 <-     subset(airquality, select = Ozone:Wind)
sub3

order values in the data frame

  • need the doBy library to be loaded (and installed)
air10 <- airquality[1:10,] # nothing after the comma = take all variables
air10
 
sort_air10 <- orderBy(~Temp, data=air10)
sort_air10

Exercise 5

  • Data Loblolly(library(datasets))
  1. Check the data description
  2. Ask about the attributes of the data
  3. Take a subset of data Loblolly which contains the observations from 5 until 12 and only the variables height and age
  4. Take a subset of Loblolly data which has only seed levels 307 and 311
  5. Sort the data Loblolly in a way that:
    1. age will be in descending order (from the largest to the smallest);
    2. age will be sorted in descending order (from the largest to the smallest), height will be sorted in ascending order (from the smallest to the largest) for every level of age
?Loblolly
str(Loblolly)
subl1 <- Loblolly[5:12, 1:2]
subl1
subl2 <- subset(Loblolly, Seed == 307 | Seed == 311,)
subl2
Loblolly_age_desc <- orderBy(~-age, data=Loblolly)
Loblolly_age_desc
Loblolly_age_desc2 <- orderBy(~-age+height, data=Loblolly)
Loblolly_age_desc2

Lists

= sum of vectors (differents types and different length)

#list
x <-1:2
y <- letters[1:2]
z <-1:3
list(x,y,z)
 
#name list
mixlist <- list(logica = c(T,T,T,F), plant = c("tree", "bush","grass"), comment="these components are unrelated")

Reading data from files

  • clickodrome : workspace > import dataset > from text file…
  • with code
chol <- read.table(file=file.choose(), header=TRUE)

If no window opens, check for it in the background

  • import .csv and .xls/.xlsx

package foreign to import from SPSS or SAS

  • SPSS: read.spss
  • SAS: read.xport or read.ssd

Writing your own functions

List of existing functions

builtins()
#or
ls(baseenv(), all = TRUE)

Create a function

Name <- function(arg1, arg2...) {expression}

We will use a t-test as example (even though t-test function already exists).

t.test.p <- function(x, mu = 0)
{
	n <- length(x)
	t.stat <- (mean(x) - mu)/(sd(x)/n^0.5)
	2 * (1 - pt(abs(t.stat), n - 1))
}
 
y <- 5:27
y
t.test.p(y)
  • mu = 0, by default mu equals zero (if not specified)
  • by default, the function outputs the last thing you calculated in your function
    • to circumvent this, create a list with the different values as latest command in your function
  • tip: first test the commands of your function, then put it in a function
  • all variables created by the function are publicly available afterwards

Exercise 6

Write a function which returns the most elementary statistics for a sample x :

min, median, max, mean, sd and length. Apply your function on a vector x with values from 25 to 80.

v = c(25:80)
v
elem_stats <- function(v) 
{
  min1 = min(v)
  median1 = median(v)
  max1 = max(v)
  mean1 = mean(v)
  sd1 = sd(v)
  length1 = length(v)
  list (min = min1, median = median1, max = max1, mean = mean1, sd = sd1, length = length1)
}
 
elem_stats(v)

Graphics with R

  • high-level graphic functions (e.g. plot(), curve()) = drawing a graphic
  • low-level graphic functions to draw extra things on the graphic (title, legend, etc.)
  • graphs are output on the “Plots” tab in the lower right corner window, you can navigate between graphs with the two arrows (← and →) under the tabls
  • par() to set or look at parameter from the graph
  • mfrow() is a par() parameter to split the screen into several sections to have several graphics on one screen
    • par(mfrow(c(2,2))) va créer 4 graphiques (2 x 2)
  • colours: palette() for a list ; colors() for more choice (list them), but use the name corresponding: colors(“sienna4”)
  • add an extra point : points(1000, 260)
    • to add text: text(locator(1), “new point”, cex=1.5)
      • run this line separately, because you have to point where you want to put the text (because locator() was used)
    • add a line with abline() for a straight line or lines() for a curve
    • add a legend with legend()

Exercise 7

plot(sin, -2*pi, 2*pi, col=1, lty=1)
plot(cos, -2*pi, 2*pi, col=2, lty=2, add=TRUE)
plot(tan, -2*pi, 2*pi, col=3, lty=3, add=TRUE)
legend(locator(1), c('Sine', 'Cosine', 'Tangent'), lty=c(1, 2, 3), col=c(1, 2, 3))

01/10/2014

Using R for analysing univariate data

  • different distributions, cumulative probabilities
  • median absolute deviation : explanation

02/10/2014

Exporting data

write.csv(galileo,file='galileo.csv',row.names=FALSE)
 
write.csv(central.park.cloud,file='central_park_cloud',row.names=FALSE)
 
write.csv(cfb,file='cfb',row.names=FALSE)
 
write.csv(alltime.movies,file='alltime.movies',row.names=FALSE)
 
write.csv(normtemp,file='normtemp',row.names=FALSE)

Statistical inference for discrete data

Chi square test

chisq() smokers (pipe, sigaret), dead or alive

chol 		<- read.table(file=file.choose(), header=TRUE)
names(chol)
tab.chol 	<- table(chol$SMOKE, chol$MORT)
tab.chol
chisq.test(chol$SMOKE, chol$MORT)

output:

	Pearson's Chi-squared test

data:  chol$SMOKE and chol$MORT
X-squared = 1.6677, df = 2, p-value = 0.4344

p-value > 0.05 → cannot reject hypothesis (that smokers and non-smokers are different concerning death).

Fisher test

t-test

  • t-test is a parametric test → need to check that data are normally distributed
  • use shapiro.test() to check normality
  • if normal, use t.test()
  • if not, use wilcox.test()

Two sample t-test

  • check that the two samples are normal (with shapiro.test())
  • check whether variances are equal or not (to know what to test)
    • F test to compare two variances: var.test()

Correlation analysis

  • cor()

Regression analysis

  • use lm() for linear model
  • this gives you the a and b from y=ax+b so you can draw it with abline() in your graph
  • different models (e.g. polynomial model)
r/flamescourse.txt · Last modified: 2014/10/03 11:47 by carl