R: Automation with Loops#

Motivation#

Automation (in this chapter)#

As you have seen, and will continue to see, in activities and assignments, you often need to repeat essentially the same task over and over again:

  • Strip $ and , from each of the columns that represent monetary amounts

  • \(NULL\) out all columns that contain more than 90% missing values

  • Transform columns whose values are “too skewed” with logarithms

  • Combine sales data on stores from 2000 different files.

  • Calculate course grades using a complex weighting function for everyone in a class.

  • Try a wide range of tweaks for model parameters in order to find the best fitting one.

Loops are a great way to repeat the same chunk of commands again and again, and conditionals allow the exact code to vary based on certain conditions (if \(>\) 90% missing values, NULL out a column, otherwise leave it be).

Condition (in next chapter)#

There will be many times where we want to do something to data, but what we do depends on the type of data, values in the data, etc.

  • If a column is categorical, replace any missing values with a new level called “Missing”. If a column is quantitative, replace any missing values with the median.

  • Alumni who have donated $10,000 or more should be classified as “major donors”, otherwise “non-major donors”, but if they have never given they should be classified as “never donors”.

  • Discretize a quantitative variable: depending on income, classify as “lower class”, “middle class”, or “upper class”.

  • If a column contains more than 30% missing values, \(NULL\) it out.

Conditional statements (if/then) are necessary to handle these types of processes.

User Defined Functions (in next chapter)#

R has plenty of functions built in for the mean, median, correlation, boxplot, hist, etc. What if you need your own functions and routines?

  • Customized plots (TV show ratings by season)

  • Trimmed mean (throw away observations too far from median)

  • Determine shortest path between two points on a network

  • Create “sentiment” by weighted frequencies of words in a Twitter post, online review, etc.

Being able to program your own functions and routines will save you tremendous time and is now a critical skill for anyone in business analytics. R is one way to program, as are popular alternatives such as Python, Java, and C++.

Review of Last Chapter#

Variables#

  • Names are case-sensitive, e.g., \(y\) and \(Y\) can be given two different values

  • Some punctuation is allowed in names, e.g. \(x.new\) and \(last.weeks\_sales\)

  • Definitions can be recursive. You can increase \(d\) by 3 by doing \(d <- d+3\)

d <- 2  #standard 
d       #name by itself will print to screen its contents
2 -> d  #also works
d
d = 2 #also works, but typically for defining arguments in functions
d
d <- d+3
d
2
2
2
5

An example of saving#

Evaluate:

\[\frac{ e^{\sqrt{-5+2\times3}}}{1 + e^{\sqrt{-5+2\times3}} + 2e^{\sqrt{-5+2\times3}}}\]
y <- sqrt(-5 + 2*3) 
#print to screen since the results are not being saved
exp(y)/( 1+exp(y)+2*exp(y) ) 
0.296922742475655

Since the expression \(\sqrt{-5+2 \times 3}\) appears multiple times in the equation, it’s useful to left-arrow that computation to something and use it instead (especially helpful if the expression involving raising \(e\) to a power is going to be evaluated for numbers other than \(\sqrt{-5+2 \times 3}\)).
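A minimal sketch of taking that idea one step further: the exponential itself is also saved, so it is computed only once. The name \(ey\) is ours and does not appear elsewhere in these notes.

```r
y  <- sqrt(-5 + 2*3)   # the repeated expression, computed once
ey <- exp(y)           # e raised to that power, also computed once
ey / (1 + ey + 2*ey)   # same value as the longer expression above
```

To evaluate the fraction for a different power, only the first line would need to change.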

Vectors#

A vector is an ordered collection of one or more numbers, letters, or words (text is anything within double quotes). To make a vector we use the command \(c()\), separating the elements of the vector by commas. My convention is to give vectors names that are in lower case. Note: if you try to mix numbers into a vector that otherwise contains characters, R will treat those numbers as characters.

d <- c(2) #Since only 1 value, the c() is technically not necessary 
# but it doesn't hurt
d <- c(3,9,10,2); d
d <- c("how","are","u"); d
d <- c("how","r","u",2,"day"); d  #If one element is text, all becomes text
  1. 3
  2. 9
  3. 10
  4. 2
  1. 'how'
  2. 'are'
  3. 'u'
  1. 'how'
  2. 'r'
  3. 'u'
  4. '2'
  5. 'day'
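If you are ever unsure whether R is treating a vector as numbers or as text, \(class\) will tell you. A small sketch (the names \(nums\) and \(mixed\) are made up for this illustration):

```r
nums  <- c(3, 9, 10, 2)
mixed <- c("how", "r", "u", 2, "day")
class(nums)                 # "numeric"
class(mixed)                # "character"
mixed[4]                    # the 2 is now the text "2"
as.numeric(mixed[4]) + 1    # convert back to a number before doing arithmetic
```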

Factors#

A factor stores the values of a categorical variable. By treating something as a factor, R will keep track of all its levels (possible values) which is useful for plotting and modeling. Having a vector of letters/words does not allow this.

fact <- factor( c("how","are","are","are","you","you") )
fact
levels(fact)
plot(fact)  #note:  barplot( table(fact) )  also works!
  1. how
  2. are
  3. are
  4. are
  5. you
  6. you
Levels:
  1. 'are'
  2. 'how'
  3. 'you'
  1. 'are'
  2. 'how'
  3. 'you'
../_images/9b1512149f2fecf41a1ea99f4813e40d16ff0f2f02ceb5657d5e7e3d1d4e2ef3.png

Adding a new value to a factor#

We can’t add a “new” (previously unseen) value to a factor the way we can with a numerical vector. To add a “new” value (a new level), you first have to add a level to the factor. You can rename factor levels by specifying which position you want to change.

fact[5] <- "how"  #this is ok because "how" is not a new value
fact
fact[5] <- "newlevel" #this is not ok because "newlevel" IS a brand new value
fact
levels(fact) <- c( levels(fact),"newlevel" )  #Add new level
fact[5] <- "newlevel"  #Change 5th value to this
fact
  1. how
  2. are
  3. are
  4. are
  5. how
  6. you
Levels:
  1. 'are'
  2. 'how'
  3. 'you'
Warning message in `[<-.factor`(`*tmp*`, 5, value = "newlevel"):
“invalid factor level, NA generated”
  1. how
  2. are
  3. are
  4. are
  5. <NA>
  6. you
Levels:
  1. 'are'
  2. 'how'
  3. 'you'
  1. how
  2. are
  3. are
  4. are
  5. newlevel
  6. you
Levels:
  1. 'are'
  2. 'how'
  3. 'you'
  4. 'newlevel'

Changing and combining levels#

You can change the name of a level (and therefore change all values in that factor with that level) by modifying the \(levels\) vector.

fact
#change "are" to "renamed"
levels(fact)[which(levels(fact)=="are")] <-"renamed"; fact
  1. how
  2. are
  3. are
  4. are
  5. newlevel
  6. you
Levels:
  1. 'are'
  2. 'how'
  3. 'you'
  4. 'newlevel'
  1. how
  2. renamed
  3. renamed
  4. renamed
  5. newlevel
  6. you
Levels:
  1. 'renamed'
  2. 'how'
  3. 'you'
  4. 'newlevel'

You can combine levels (and give them a new name if you want) as well.

#combine "how" and "you" into a level called "Yes"
levels(fact)[ which( levels(fact) %in% c("how","you")) ] <- "Yes"; fact
#combine "everything but" the "Yes" level to be "No"
levels(fact)[ which( !levels(fact) %in% c("Yes")) ] <- "No"; fact
  1. Yes
  2. renamed
  3. renamed
  4. renamed
  5. newlevel
  6. Yes
Levels:
  1. 'renamed'
  2. 'Yes'
  3. 'newlevel'
  1. Yes
  2. No
  3. No
  4. No
  5. No
  6. Yes
Levels:
  1. 'No'
  2. 'Yes'

Creating a pattern using \(rep\)#

The \(rep(x,times)\) command is useful for creating a vector with a certain pattern, i.e., creating a pattern \(x\) repeated \(times\) times.

  • A long vector of 0s

rep(0,25)
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
  6. 0
  7. 0
  8. 0
  9. 0
  10. 0
  11. 0
  12. 0
  13. 0
  14. 0
  15. 0
  16. 0
  17. 0
  18. 0
  19. 0
  20. 0
  21. 0
  22. 0
  23. 0
  24. 0
  25. 0
  • Christmas

c("fa",rep("la",8))
  1. 'fa'
  2. 'la'
  3. 'la'
  4. 'la'
  5. 'la'
  6. 'la'
  7. 'la'
  8. 'la'
  9. 'la'
  • The sequence AABBBC repeated 5 times

rep(  c(rep("A",2),rep("B",3),"C"), 5 )
  1. 'A'
  2. 'A'
  3. 'B'
  4. 'B'
  5. 'B'
  6. 'C'
  7. 'A'
  8. 'A'
  9. 'B'
  10. 'B'
  11. 'B'
  12. 'C'
  13. 'A'
  14. 'A'
  15. 'B'
  16. 'B'
  17. 'B'
  18. 'C'
  19. 'A'
  20. 'A'
  21. 'B'
  22. 'B'
  23. 'B'
  24. 'C'
  25. 'A'
  26. 'A'
  27. 'B'
  28. 'B'
  29. 'B'
  30. 'C'

Creating a sequence using \(seq\)#

\(seq(from=,to=,by=,length=)\) is incredibly useful for creating regular sequences. Note: specify at most 3 of the 4 arguments, since the remaining one is determined by the others. (\(length\) is an abbreviation of the full argument name \(length.out\).)

  • The numbers 1 to 10

seq(from=1,to=10,by=1)
1:10 #note: any integer sequence from a to b can be invoked by a:b
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  • A sequence starting at 365 and going down by 1, with total length 20

seq(from=365,by=-1,length=20)
  1. 365
  2. 364
  3. 363
  4. 362
  5. 361
  6. 360
  7. 359
  8. 358
  9. 357
  10. 356
  11. 355
  12. 354
  13. 353
  14. 352
  15. 351
  16. 350
  17. 349
  18. 348
  19. 347
  20. 346
  • A sequence of numbers between 5 and 12 of length 20

seq(from=5,to=12,length=20)
  1. 5
  2. 5.36842105263158
  3. 5.73684210526316
  4. 6.10526315789474
  5. 6.47368421052632
  6. 6.84210526315789
  7. 7.21052631578947
  8. 7.57894736842105
  9. 7.94736842105263
  10. 8.31578947368421
  11. 8.68421052631579
  12. 9.05263157894737
  13. 9.42105263157895
  14. 9.78947368421053
  15. 10.1578947368421
  16. 10.5263157894737
  17. 10.8947368421053
  18. 11.2631578947368
  19. 11.6315789473684
  20. 12
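When \(from\), \(to\), and the length are given, R works out the spacing for us. A short sketch that checks this for the last example (the names \(s\) and \(step\) are ours):

```r
s <- seq(from = 5, to = 12, length.out = 20)
step <- (12 - 5) / (20 - 1)          # spacing implied by from, to, and the length
all.equal(diff(s), rep(step, 19))    # TRUE: every gap equals that spacing
seq_len(5)                           # 1 2 3 4 5, same as 1:5
```

There are 20 numbers but only 19 gaps between them, which is why the divisor is 19.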

Overview of Loops#

Loops allow us to repeat a specified “skeleton” of commands a certain number of times (though at each iteration there may be small nuances in what the code is doing).

  • A fixed number (e.g., calculate the “lifetime value” for each customer in a dataset). We use a \(for\) loop here because we want to do something for each of a known number of individuals.

  • An unspecified number (e.g., try passwords until we guess the correct one). We use a \(while\) loop because we want to continue running the chunk of code while a certain condition is true.

Writing a loop is strictly a time-saver. Technically, there’s nothing you can do with a loop that you can’t do by writing out multiple lines of code and running them.

However, as you’ll soon see, writing a loop is such a time-saver and shortcut that once you know how to write one you’ll never go back.

for() loops#

A \(for\) loop is a control structure in R (and in most other programming languages) that can accomplish these tasks with ease.

Imagine you wanted to create a sequence 3, 4, 6, 9, 13, 18, 24, \(\ldots\) (the difference between first two elements is 1, the difference between the next two elements is 2, the difference between the next two elements is 3, etc.).

It’s easy enough to hard-code this:

x <- 3
x[2] <- x[1] + 1
x[3] <- x[2] + 2
x[4] <- x[3] + 3
x[5] <- x[4] + 4
x[6] <- x[5] + 5
x[7] <- x[6] + 6
x
  1. 3
  2. 4
  3. 6
  4. 9
  5. 13
  6. 18
  7. 24

However, if you want the sequence to be 1000 elements long, you’d end up writing 1000 lines of code.

Motivation#

The \(for\) loop serves as shorthand for these 1000 lines of code:

x <- 3
for (position in 2:1000) { 
  x[position] <- x[position-1] + (position-1) 
}
head(x,15)
tail(x)
  1. 3
  2. 4
  3. 6
  4. 9
  5. 13
  6. 18
  7. 24
  8. 31
  9. 39
  10. 48
  11. 58
  12. 69
  13. 81
  14. 94
  15. 108
  1. 494518
  2. 495513
  3. 496509
  4. 497506
  5. 498504
  6. 499503

What a time saver!
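One way to convince yourself the loop did what we intended is to build the same sequence a second way and compare. The sketch below uses \(cumsum\) (a built-in function for running totals that is not otherwise used in this chapter); the name \(x.check\) is ours.

```r
x <- 3
for (position in 2:1000) {
  x[position] <- x[position - 1] + (position - 1)
}
# Start at 3, then add 1, then 1+2, then 1+2+3, etc.
x.check <- 3 + c(0, cumsum(1:999))
all.equal(x, x.check)   # TRUE if the two constructions agree
```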

\(for\) syntax#

This is what a \(for\) loop looks like.

for(looping.variable in vector.of.values) {
  command 1
  command 2
  command 3
  etc.
}

All \(for\) loops will have the same basic syntax.

  • First line has \(for( . in . ) \) followed by a pair of curly brackets.

  • The set of commands in curly brackets will be run each time the code “goes through the loop”, i.e., during each iteration.

  • The objects to the left and to the right of \(in\) require special attention.

\(for\) execution#

for( looping.variable in vector.of.values ) {
  commands
}

So what happens when R runs a \(for\) loop?

  • As soon as R sees the \(for\), it immediately dives inside the parentheses and sets the left of the \(in\) to be the first element of the right of the \(in\). In this case, R runs \(looping.variable <- vector.of.values[1]\).

  • R executes the code inside curly brackets (using the current definition of \(looping.variable\), if any lines of code refer to it). Once it runs out of code, R goes back to the \(( in )\).

  • R will then set the left of the \(in\) equal to the second element of the right of the \(in\). In this case, R runs \(looping.variable <- vector.of.values[2]\).

  • etc. Once \(looping.variable\) has had a chance to equal each element of \(vector.of.values\), the loop terminates.

The key things to remember:

  • \(looping.variable\) is the name of the “looping variable”. It’s up to you what to call this! Coming up with an informative name (like \(position\), \(element\), \(iteration\), \(row\), etc.) depending on the context is a good idea.

  • \(vector.of.values\) is what you are “looping over”. It can be the name of a vector containing the values that you want \(looping.variable\) to take, or you can hardcode the vector directly, e.g. \(1:nrow(DATA)\), \(c("a","e","i","o","u")\), etc.

  • \(looping.variable\) will equal the first element of \(vector.of.values\) the first time through the loop, the second element of \(vector.of.values\) the second time through the loop, etc.
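To watch these rules in action, here is a small sketch that loops over a hardcoded vector of vowels and records each value the looping variable takes. The names \(letter\) and \(visited\) are made up for this illustration.

```r
visited <- c()   # will record the values of the looping variable, in order
for (letter in c("a", "e", "i", "o", "u")) {
  visited <- c(visited, letter)   # letter is "a" the first time, "e" the second, ...
  print(letter)
}
visited
letter   # after the loop ends, the looping variable keeps its last value
```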

Why did the motivating loop work?#

Why did the motivating \(for\) loop basically serve as a shortcut for nearly 1000 lines of code?

Let’s write out what R is doing behind the scenes.

for (position in 2:1000) {
  x[position] <- x[position-1] + (position-1)
}

is just shorthand for

x[2] <- x[1] + 1
x[3] <- x[2] + 2
x[4] <- x[3] + 3
#...
x[999] <- x[998]+998
x[1000] <- x[999]+999

Example of for loop#

Imagine we want to replace all values in a vector that are -999 (a common placeholder used to represent a missing value) with \(NA\).

x <- c(6,8,0,0,10,-999,-999,4,-999,153)
to.change <- which(x==-999)  #positions inside x that contain -999
to.change
for (position in to.change) {x[position] <- NA}
x
  1. 6
  2. 7
  3. 9
  1. 6
  2. 8
  3. 0
  4. 0
  5. 10
  6. <NA>
  7. <NA>
  8. 4
  9. <NA>
  10. 153

The \(for\) loop has been set up so that the name of the looping variable is \(position\) and the vector of values to loop over is called \(to.change\), which contains the integers 6, 7, and 9. The loop is equivalent to:

position <- to.change[1]  #position becomes 6
x[position] <- NA  #6th element of x is made NA
position <- to.change[2] #position becomes 7
x[position] <- NA #7th element of x is made NA
position <- to.change[3] #position becomes 9
x[position] <- NA #9th element of x is made NA
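For comparison, R can often do this kind of replacement without a loop by putting a logical condition inside the square brackets. This is an alternative to the loop above, shown only as a sketch; the loop version is the one this chapter is about.

```r
x <- c(6, 8, 0, 0, 10, -999, -999, 4, -999, 153)
x[x == -999] <- NA   # every position where x equals -999 is made NA at once
x
which(is.na(x))      # positions 6, 7, and 9, the same ones the loop changed
```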

Example of for loop with moving average#

In forecasting, a popular method for “smoothing out” a jumpy time series is to take a moving average of (say) the current observation together with the two observations before and the two after it. Imagine we have a time series called \(demands\) that contains the demands of a product over 50 weeks. We’d like to create a series of moving averages.

#Make up a bogus time series of 50 demands by randomly picking numbers 5-20 
set.seed(474); demands <- sample(5:20,size=50, replace=TRUE) 
moving.avg <- rep(NA,50)#Initialize moving.avg to be 50 NAs
for (time in 3:48) { #why 3 and 48?
  moving.avg[time] <- mean(demands[(time-2):(time+2)] )
}
moving.avg #Some elements are NA by design
  1. <NA>
  2. <NA>
  3. 11.8
  4. 12
  5. 12.4
  6. 13.6
  7. 14.2
  8. 15.6
  9. 15.4
  10. 13
  11. 11
  12. 9.4
  13. 8.8
  14. 10.4
  15. 12.6
  16. 12.8
  17. 14.6
  18. 13
  19. 12.6
  20. 12.2
  21. 14.6
  22. 14
  23. 14.8
  24. 14.6
  25. 12.8
  26. 10.4
  27. 11.6
  28. 11.4
  29. 12
  30. 13.6
  31. 14.4
  32. 14.2
  33. 15
  34. 13.6
  35. 13.4
  36. 12.8
  37. 10.8
  38. 11.2
  39. 11.4
  40. 12.2
  41. 13.4
  42. 12.8
  43. 11.8
  44. 12
  45. 10
  46. 9.6
  47. 9.4
  48. 10.2
  49. <NA>
  50. <NA>

The \(for\) loop is running a series of commands behind the scenes.

moving.avg[3] <- mean(demands[1:5])  #since time is 3
moving.avg[4] <- mean(demands[2:6])  #since time is 4
#etc.
moving.avg[48] <- mean(demands[46:50])  #since time is 48
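As a cross-check, base R's \(stats::filter\) can compute the same centered five-term average in one line. The sketch below rebuilds the loop result and compares the two; the name \(ma.filter\) is ours.

```r
set.seed(474); demands <- sample(5:20, size = 50, replace = TRUE)
moving.avg <- rep(NA, 50)
for (time in 3:48) {
  moving.avg[time] <- mean(demands[(time - 2):(time + 2)])
}
# Equal weights of 1/5 on the current value and its two neighbors on each side
ma.filter <- as.numeric(stats::filter(demands, rep(1/5, 5), sides = 2))
all.equal(moving.avg, ma.filter)   # TRUE if both approaches agree
```

Both versions leave the first two and last two positions as \(NA\), since a full window of five observations is not available there.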

Followup - initialization#

In the examples, we “initialized” a vector then used the \(for\) loop to update/define the elements one by one.

If we had not done the initialization step, the code would have given an error. This is because the vector didn’t exist yet in the environment, and you can’t put something into an object that doesn’t exist!

If you are using a \(for\) to fill in elements of a vector one by one, be sure to always initialize that vector, either by defining it to be a bunch of 0s, or just left-arrowing it to an empty vector (e.g., \(x <- c()\)). As long as R knows about the object it is putting elements into, the code runs fine.
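Here is a small sketch of both situations. The name \(never.defined\) is hypothetical and assumed not to exist in your environment; \(tryCatch\) is used only so that the error is captured as text instead of stopping the code.

```r
# Assigning into an element of a vector that does not exist yet gives an error
msg <- tryCatch({
  never.defined[1] <- 10
  "no error"
}, error = function(e) conditionMessage(e))
msg   # the text of the error R produced

squares <- rep(0, 5)                   # initialize first ...
for (i in 1:5) { squares[i] <- i^2 }   # ... then fill in one element at a time
squares
```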

Looping over character vectors#

Mechanically, looping over values of a character vector works the same as looping over a numerical vector (the looping variable first equals the first element of the looping vector, then second element of the looping vector, etc.). However, there is an important difference, namely in how the result of a computation is stored in a vector during each iteration of the loop (if that’s what you are doing).

In previous examples, we often used the value of the looping variable to store the result of some computation (the moving average, the weighted grade, the bankroll) into a vector we had defined outside the loop (\(moving.avg\), \(wg\), \(bankroll\)). For example, the looping variable \(time\) was used to define \(moving.avg[time]\). This made sense since \(time\) was an integer and referred to a valid position in the vector \(moving.avg\).

What happens when the looping vector is a text vector (e.g., levels of a factor)? If \(time\) was equal to the word \(dog\), can we do something like \(moving.avg["dog"]\)? Surprisingly, yes!

Example: frequency table by hand#

Let’s use a \(for\) loop to create a frequency table of the values in the vector \(grades\) (in other words, let’s write code that reproduces the output from running \(table()\)).

How should we set this up? The first question to ask is what will be on the left and right sides of the \(in\) in the \(for\) loop.

The name of the looping variable (left of \(in\)) is up to us. Let’s name it \(g\), which is short for “grades”.

The object on the right of \(in\) is the looping vector and contains the values that we want to loop over. Thus, \(levels(grades)\) will be the looping vector since we want to do something for each letter grade.

grades <- factor(c("A","B","A","A","B","B","B","C","B-","B-"))#vector of grades
table(grades) #our target
#Initialize count to be an "empty vector" and define elements as we go through loop
count <- c()
for ( g in levels(grades) ) {
  count[g] <- length(which(grades==g))
}
count
grades
 A  B B-  C 
 3  4  2  1 
A
3
B
4
B-
2
C
1

What code is being run behind the scenes?

g <- levels(grades)[1] #g becomes "A"
count[g] <- length(which(grades==g))#  count["A"] <- length(which(grades=="A"))
g <- levels(grades)[2] #g becomes "B"
count[g] <- length(which(grades==g))#  count["B"] <- length(which(grades=="B"))
g <- levels(grades)[3] #g becomes "B-"
count[g] <- length(which(grades==g))#  count["B-"] <- length(which(grades=="B-"))
g <- levels(grades)[4] #g becomes "C"
count[g] <- length(which(grades==g))#  count["C"] <- length(which(grades=="C"))

Named vectors#

You can give a vector (numeric or text) names when you create it with \(c()\), or after the fact with \(names\). If you left-arrow a value into an element by name, and that name doesn’t exist, R will append that value to the vector with the given name.

x <- c(bird=21,hog=44,cow=99); x
x <- c(21,44,99); x
names(x)<-c("bird","hog","cow"); x
x["hog"]<-33; x["peacock"]<-1992; x #change and add elements
bird
21
hog
44
cow
99
  1. 21
  2. 44
  3. 99
bird
21
hog
44
cow
99
bird
21
hog
33
cow
99
peacock
1992

When a vector has names we call it a named vector, and you can refer to an element of that vector by its position or by its name (in quotes). You can extract the names of a named vector with the \(names\) function.
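A short sketch of these ideas using the same animals as above (\(unname\) and \(%in%\) are built-in R tools that are not otherwise used in this section):

```r
x <- c(bird = 21, hog = 44, cow = 99)
names(x)              # "bird" "hog" "cow"
x[2]; x["hog"]        # the same element, by position and by name
unname(x["cow"])      # just the value 99, with its name dropped
"hog" %in% names(x)   # TRUE: a way to check whether a name is present
```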

while() loops#

In a \(for\) loop, you have a specific list of values to loop over (1, 2, 3, …, to the number of rows in the dataframe, all elements of a vector of words, etc.). Sometimes you do not know how many times the loop will run, but you’ll know when it’s done. For example: while such and such is true, continue to do this and that.

This is the job of a \(while\) loop. A \(while\) loop allows you to repeat a set of commands as long as some logical condition is \(TRUE\).

Example: Vegas roulette

Betting until you are broke. (A natural variation is to also stop once you have doubled your money.) Perhaps you start with $10, bet $1 each time, and you want to keep track of the total number of games you get to play (and perhaps the money you had after each bet). The probability of winning is 18/38 and losing is 20/38.

Recall that \(sample(c(-1,1),size=1,prob=c(20/38,18/38))\) will randomly generate a -1 or a 1 with the desired probabilities.

Example: Vegas roulette#

Vegas Roulette problem. In English: “while the amount of money is larger than 0, randomly determine the outcome of the next bet and adjust the amount of money accordingly and increment the total number of bets placed by 1”.

money <- 10  #vector that will keep track of bankroll
number.of.bets <- 0  #Initialize a variable to keep track of the number of bets
set.seed(21) #set random number seed for reproducibility
while(min(money) >= 1){
  number.of.bets <- number.of.bets + 1 #Increment the number of bets that have been made
  result <- sample( c(-1,1),size=1,prob=c(20/38,18/38))#get random outcome of bet
  money <- c(money,tail(money,1) + result)  #add an additional element 
}
plot(money, xlab="Bet Number", ylab="Money", main=paste(number.of.bets, 'bets'), type="l")
../_images/f40691e61e9428dae8f3379a12b0c3daae0f916d8cc0aec3112391309e383c8a.png
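The loop above stops only when the money runs out. Here is a sketch of a variant that also stops once the bankroll has doubled to $20. The second condition and the use of \(&&\) (logical "and") are our additions, not part of the original example.

```r
money <- 10; number.of.bets <- 0
set.seed(21)
# Keep betting while we still have money AND have not yet doubled it
while (tail(money, 1) >= 1 && tail(money, 1) < 20) {
  number.of.bets <- number.of.bets + 1
  result <- sample(c(-1, 1), size = 1, prob = c(20/38, 18/38))
  money <- c(money, tail(money, 1) + result)
}
number.of.bets
tail(money, 1)   # either 0 (broke) or 20 (doubled)
```

Here \(tail(money,1)\) is used in the condition since only the most recent bankroll matters for deciding whether to place another bet.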

Example: Vegas roulette - why does this work?#

Just like a \(for\) loop, a \(while\) loop is really just a shortcut for a bunch of code.

#while loop
money <- 10; number.of.bets <- 0
while ( min(money) >= 1) { 
  number.of.bets <- number.of.bets + 1 
  result <- sample( c(-1,1),size=1,prob=c(20/38,18/38)) 
  money <- c(money,tail(money,1) + result)  }
#equivalent commands
money <- 10; number.of.bets <- 0
if( min(money) >= 1 ) { 
  number.of.bets <- number.of.bets + 1 
  result <- sample( c(-1,1),size=1,prob=c(20/38,18/38)) 
  money <- c(money,tail(money,1) + result)  }
if( min(money) >= 1 ) { 
  number.of.bets <- number.of.bets + 1 
  result <- sample( c(-1,1),size=1,prob=c(20/38,18/38)) 
  money <- c(money,tail(money,1) + result)  }
if( min(money) >= 1 ) { 
  number.of.bets <- number.of.bets + 1 
  result <- sample( c(-1,1),size=1,prob=c(20/38,18/38)) 
  money <- c(money,tail(money,1) + result)  }
#etc.

Basic Syntax of a while loop#

commands before loop
while( logical condition ) {
  repeat this chunk of code
}
commands after loop
  1. The logical condition (e.g., \(x > 2\) or \(x != 0\)) is checked.

  2. If the logical condition evaluates to \(TRUE\), we go to step 3) and the code inside the curly brackets is executed. If it evaluates to \(FALSE\), R skips the code in the curly brackets and moves on to step 4), where the commands after the loop are run.

  3. The code inside the curly brackets is executed, then R goes back to step 1) and checks the logical condition in the parentheses again.

  4. The commands after the loop are evaluated.
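A tiny sketch of these four steps: keep halving a number while it is larger than 2, counting how many times we halve. The names \(x\) and \(halvings\) are made up for this illustration.

```r
x <- 100; halvings <- 0   # commands before loop
while (x > 2) {           # step 1: check the condition
  x <- x / 2              # step 3: run the chunk, then go back to step 1
  halvings <- halvings + 1
}
halvings                  # step 4: commands after loop; this is 6
x                         # 1.5625, the first value that is not larger than 2
```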

Example - try, try again#

Four people arrive at a party wearing unique hats. At the end of the party, the host randomly assigns hats to people. If anyone has an incorrect hat, the host collects them all and randomly assigns them again. How many attempts does it take until the assignment is correct? The loop below keeps running while fewer than four hats are matched to their owners.

correct <- 1:4#Without loss of generality, take the "correct" assignment to be 1 2 3 4
attempts <- 0#initialize the number of attempts at 0
hats <- 0#Initial assignment
set.seed(5244);
#correct == hats gives a vector of TRUE or FALSE depending on if the elements match up
#sum(correct==hats) will equal 4 when the assignment is correct
while( sum(correct == hats ) < 4  ) {  
  attempts <- attempts + 1#increment attempts by 1
  hats <- sample(1:4,size=4,replace=FALSE)#generate a random sequence of 1, 2, 3, 4
}
attempts#Note:  different value of attempts if change set.seed()
47
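The same simulation can also be written with \(break\), which exits a loop immediately, instead of putting the stopping rule in the parentheses of \(while\). The sketch below uses \(repeat\) (a loop that runs until a \(break\) is reached) and an \(if\) statement (covered in the next chapter); treat it as one possible way to set this up.

```r
correct <- 1:4
attempts <- 0
set.seed(5244)
repeat {
  attempts <- attempts + 1
  hats <- sample(1:4, size = 4, replace = FALSE)
  if (all(hats == correct)) { break }   # leave the loop once everyone has their own hat
}
attempts
```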

Example - drunken sailor#

A drunken sailor staggers about. For each step, a random direction is chosen and he moves one foot forward. How many steps will it take him to reach a distance 10 feet from where he started?

#Let x and y be the current coordinates of the sailor
x <- 0; y <- 0; location <- 0 #define starting location
steps <- 0
x.all <- c(); y.all <- c()
set.seed(54240)
while(location < 10) {
  x.all <- c(x.all, x); y.all <- c(y.all, y)
  angle <- 2*pi*runif(1) #choose a random direction
  x <- x + cos(angle) #new horizontal coordinate after step
  y <- y + sin(angle) #new vertical coordinate after step
  location <- sqrt( x^2 + y^2 ) #location after step
  steps <- steps + 1 #update steps
}
plot(0,0,xlim=c(-10,10),ylim=c(-10,10),xlab="x",ylab="y")
curve(sqrt(10^2-x^2),add=TRUE)
curve(-sqrt(10^2-x^2),add=TRUE)  #Draw a circle 10 units from initial location
lines( x.all, y.all ) #add lines to connect previous and current locations
title(paste(steps,"steps taken")) #Give plot a title based on value of steps
../_images/8dbb43b248c1da17149f01fa2af64f0084fad9274c70c71ccf5c02672750469a.png

Strategies for Writing Loops#

When you are writing your code for a specific case, make your code as general as possible.

  • Decide on the name of your looping variable ahead of time, and think about what the looping vector will be. You’ll want to get your code working for one specific value in the looping vector.

  • Don’t hard-code the value you have chosen! Instead, left-arrow the name of your looping variable to be the value you are testing, and write your code in terms of the looping variable.

Example: average for each day of week#

Goal: find the average demand for each day of the week in \(EX7.BIKE\) (in \(regclass\)).

This is another natural application for a \(for\) loop since we want to do the same thing (calculate the average demand) for each level of the \(Day\) column.

To write the loop, let us first get the code that will go in the curly brackets (calculating the average demand for a particular day) working, then generalize it so that it works inside a loop. Let us name the looping variable \(day\) and choose the test case to be “Monday”.

library(regclass); data(EX7.BIKE)
day <- "Monday"  #arbitrarily choose Monday and get the code to work for this day
#take a subset of all rows with Monday, since that is what day equals now
SUB <- subset(EX7.BIKE,Day==day)
#find average demand, removing any NAs if they happen to be present
avg.demand <- mean(SUB$Demand,na.rm=TRUE) 
avg.demand
4744.85714285714

Now let’s generalize this. We want to do the calculation for each day of the week, so the looping vector will be the levels of the \(Day\) column. Since we are storing the average in \(avg.demand\), we need to initialize this vector outside the loop.

avg.demand <- c()#initialize vector to keep track of average demands
for ( day in levels(EX7.BIKE$Day) ) {   
  SUB <- subset(EX7.BIKE,Day==day)  
  avg.demand[day] <- mean(SUB$Demand,na.rm=TRUE) 
}
avg.demand
avg.demand[4]; avg.demand["Sunday"]
Friday
4978.15789473684
Monday
4744.85714285714
Saturday
4719.9696969697
Sunday
4511.85245901639
Thursday
4981.81355932203
Tuesday
4880.83636363636
Wednesday
4809.80357142857
Sunday: 4511.85245901639
Sunday: 4511.85245901639

The vector we created is a named vector since we were looping over a character vector. The elements are the averages, and each element has a name!
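The same loop pattern can be tried without loading \(regclass\). The sketch below uses a small made-up data frame called \(BIKES\) (not the real \(EX7.BIKE\) data) and compares the loop's result against the built-in \(tapply\) function, which computes a summary for each level of a factor.

```r
# BIKES is a tiny made-up data frame for illustration only
BIKES <- data.frame(
  Day = factor(c("Monday", "Tuesday", "Monday", "Tuesday", "Sunday")),
  Demand = c(100, 200, 300, 400, 50)
)
avg.loop <- c()
for (day in levels(BIKES$Day)) {
  avg.loop[day] <- mean(BIKES$Demand[BIKES$Day == day], na.rm = TRUE)
}
avg.loop
avg.tapply <- tapply(BIKES$Demand, BIKES$Day, mean, na.rm = TRUE)
all.equal(unname(avg.loop), as.vector(avg.tapply))   # TRUE if the averages agree
```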

Debugging for and while loops#

Although it might not feel like it at this point, these have all been examples of “simple” loops. Because the syntax and logic of \(for\) and \(while\) loops can become quite involved and tricky, you’ll often find that you have a bug in the code and it won’t do what you want!

One way to debug a loop is to pretend you are R and run each line of code that R would be running behind the scenes as the loop executes. After each line, you compare what the command actually did to what you wanted it to do. Eventually, you’ll catch the error!

Imagine we want to write a \(for\) loop to make \(x\) be the sequence 1, 4, 49, 196, 529, \(\ldots\) (the \(i\)th element is \((i^2-2)^2\)).

x <- c()
for (i in 1:10) { 
  x <- (i^2-2)^2
}
x
9604

Why is \(x\) equal to 9604 and not the sequence 1, 4, 49, 196, \(\ldots\), 9604?

To debug, run each command manually as R would be running it and see where it does not go as planned.

x <- c()
i <- 1#As R jumps into the loop, i gets left-arrowed to first element of 1:10
x <- (i^2-2)^2 #line in curly bracket 
x  #Good so far, that is the first number we expected
i <- 2#Going back to start of loop, i gets left-arrowed to second element of 1:10
x <- (i^2-2)^2 #line in curly bracket 
x  #oops, the result is just a number and not the vector 1 4
1
4

Ah. We forgot to put \([i]\) after the \(x\), so \(x\) gets overwritten each time the code in the curly brackets is executed!

x <- c()
for (i in 1:10) { 
  x[i] <- (i^2-2)^2
}
x
  1. 1
  2. 4
  3. 49
  4. 196
  5. 529
  6. 1156
  7. 2209
  8. 3844
  9. 6241
  10. 9604

Common errors#

Over the years and to this day, I still make the same types of errors when writing loops! You too will make each of these errors numerous times, so it’s helpful to know what is quite often the issue.

  • Forgetting to “initialize” the vector in which you are going to store results. You cannot assign to \(x[i]\) unless R “knows” about \(x\) first. This is why we have \(x <- c()\) or \(x <- rep(0,100)\) right before \(for\) loops.

  • Giving a single value instead of a vector on the right-hand side of \(in\). For example \(for(i in length(x)) \{ \#do stuff \}\). Most likely, you want \(1:length(x)\) instead of just \(length(x)\) since the latter is just a single value.

  • Forgetting to actually store the result of a computation (or using the wrong index) in each iteration of a \(for\) loop. You probably want \(x[i] <- computation\) but instead you’ll just put \(x <- computation\), or you’ll have the wrong index (the loop is over values of \(i\) but in your square brackets you’ll have a different letter, or just a single number).

#BAD
for(i in 1:5){pop[i]<-2*i-5}#can't put something in vector pop unless R knows about it
#GOOD
pop <- c()
for (i in 1:5) { pop[i] <- 2*i-5 }
#BAD
x <- c(3,5,8)
results <- c()
for(i in length(x)){results[i]<-round(x[i]/3+5)}#i will ONLY equal 3 in this example
#GOOD
for(i in 1:length(x)) { results[i] <- round(x[i]/3+5) }  
#BAD
results <- c()
for(i in 1:length(x)){results<-(i-2)^2/5}#results overwrites itself each iteration
for(j in 1:length(x)){ results[i] <- (j-1)^2*8 } #loop uses j but index refers to i
#GOOD
for (j in 1:length(x)) { results[j] <- (j-2)^2/5 }  
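A related pitfall worth knowing about: \(1:length(x)\) misbehaves when \(x\) happens to be empty, because \(1:0\) counts down from 1 to 0. The built-in \(seq\_along\) avoids this. A small sketch (the name \(runs\) is ours):

```r
x <- c()        # an empty vector
1:length(x)     # 1 0, so a loop over this would run twice!
seq_along(x)    # integer(0), so a loop over this would not run at all

runs <- 0
for (i in seq_along(x)) { runs <- runs + 1 }
runs            # 0
```

For a non-empty vector, \(seq\_along(x)\) and \(1:length(x)\) give the same values.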

Message Motivation and \(cat\)#

At some point, you will be writing loops and functions that can take a long time to finish. You may be unsure if R has “frozen”, or if R is deep in thought on some complex computation.

Adding messages to give yourself updates on what R is doing can be extremely useful, and can also help in debugging code that you have written (e.g., to make sure the value of a variable is what you think it is).

The \(cat\) function prints to the screen the contents of its arguments, regardless of whether the code is currently working its way through a loop or within a function. Here’s a loop that prints out progress as it evaluates.

for (i in 1:5) {
  cat(i)
}
12345

This is somewhat disappointing because all the messages are printed one after another with no spaces.

\(cat\) and \(paste\)#

\(paste\) and \(cat\) play very well together because you can construct custom messages to print out from inside your function or loop.

x <- 5; y <- 2; z <- 3
cat(paste("x is", x, "y is", y, "z is", z, sep=" * "))
x is * 5 * y is * 2 * z is * 3

One important thing is to have the “return symbol” as a \(paste\) argument, or all your messages will be on one line. The symbol is “backslash n” (typed as \n inside the quotes).

for(i in c(5,1,4)){
  cat(paste("i is currently",i)) #no return character -> all on same line
}
for(i in c(5,1,4)){
  cat(paste("i is currently",i,"\n")) #\n makes any later output on the next line
}
i is currently 5i is currently 1i is currently 4
i is currently 5 
i is currently 1 
i is currently 4
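For a long loop you usually do not want a message at every iteration. One common approach, sketched below, is to print only every so often using the remainder operator \(%%\) inside an \(if\) statement (covered in the next chapter). The loop itself, which just adds up the numbers 1 to 1000, and the choice of 250 are made up for illustration.

```r
total <- 0
for (i in 1:1000) {
  total <- total + i
  if (i %% 250 == 0) {   # TRUE only when i is a multiple of 250
    cat("finished iteration", i, "of 1000\n")
  }
}
total
```

Note that \(cat\) can also take several arguments directly; it separates them with a space, so \(paste\) is only needed when you want a different separator.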