# Stratified random sampling in R using dplyr (archive)


*Matthew E. Aiello-Lammens*

*July 10, 2014*

#### Setup

Let’s say I have a number of sample units for which I have observed some characteristic(s) at two time points. In my specific case, I have species abundance data for 120 plots in 1992 and 2011. Using these data, I calculated the species turn-over between the two time points for each plot. I then shuffled the 2011 plots, leading to a random pairing of plots between the two time points, and recalculated the turn-over. There are many cases in which we may want to do something similar, and many non-parametric randomization methods use a similar setup.

The particular problem I faced is that the plots were stratified into broad vegetation types: Fynbos, Thicket, and Grassland. When shuffling the 2011 plots, I wanted to shuffle plots *only within* their vegetation type. I thought up a number of complicated ways to write a function to do this, and even started coding one up. Then I thought about how I could use `dplyr` to carry out stratified random sampling. Here’s an example of how it works.

#### Make a data set

Here is a sample data set including 20 plots (p1, …, p20), each randomly assigned to one of three categories. I’ve printed out the data set, since it’s small.

```r
## Load dplyr
require( dplyr )
```

```
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
```

```r
## Make data.frame
df <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                  category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                  stringsAsFactors = FALSE )

## Print data.frame, arranged by category
print( arrange( df, category ) )
```

```
##    plot category
## 1    p3        a
## 2    p8        a
## 3   p12        a
## 4    p1        b
## 5    p2        b
## 6    p4        b
## 7    p5        b
## 8    p9        b
## 9   p15        b
## 10  p17        b
## 11  p19        b
## 12  p20        b
## 13   p6        c
## 14   p7        c
## 15  p10        c
## 16  p11        c
## 17  p13        c
## 18  p14        c
## 19  p16        c
## 20  p18        c
```
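Because `sample()` draws at random, rerunning the code above gives a different category assignment each time, so a fresh run will not reproduce the listing printed here. The following is a minimal sketch of one way to make the example repeatable and to check how many plots landed in each stratum. The seed value and the names `df_seeded` and `stratum_sizes` are arbitrary choices for illustration and are not part of the original gist.

```r
## Fix the random number generator so the assignment is repeatable
## (the seed value 42 is arbitrary)
set.seed( 42 )

## Same construction as above, stored under a hypothetical name
df_seeded <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                         category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                         stringsAsFactors = FALSE )

## Number of plots per category (the stratum sizes)
stratum_sizes <- table( df_seeded$category )
print( stratum_sizes )
```

Knowing the stratum sizes is useful later on: a category with only one plot can only ever be paired with itself in a within-category shuffle.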

#### Simple shuffle

Shuffling plots while disregarding their category classification is easy: just use `sample`. Below I’ve printed out the shuffled plot pairs.

```r
## Shuffle plots
plots_shuffled <- sample( df$plot )

## Print plots and plots_shuffled together
print( cbind( df$plot, plots_shuffled ) )
```

```
##             plots_shuffled
##  [1,] "p1"  "p5"
##  [2,] "p2"  "p13"
##  [3,] "p3"  "p16"
##  [4,] "p4"  "p14"
##  [5,] "p5"  "p1"
##  [6,] "p6"  "p3"
##  [7,] "p7"  "p20"
##  [8,] "p8"  "p2"
##  [9,] "p9"  "p11"
## [10,] "p10" "p19"
## [11,] "p11" "p10"
## [12,] "p12" "p17"
## [13,] "p13" "p9"
## [14,] "p14" "p7"
## [15,] "p15" "p6"
## [16,] "p16" "p8"
## [17,] "p17" "p15"
## [18,] "p18" "p4"
## [19,] "p19" "p12"
## [20,] "p20" "p18"
```
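To see why the simple shuffle is not enough for stratified data, one can count how many of the random pairings join plots from different categories. This is a small sketch that rebuilds a data set in the same way as above. The names `df_demo`, `shuffled`, `partner_category`, and `crosses` are made up for the illustration, and the seed is arbitrary.

```r
set.seed( 1 )  # arbitrary seed, only so the sketch is repeatable

## Hypothetical data set, built like the one in this post
df_demo <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                       category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                       stringsAsFactors = FALSE )

## Simple shuffle that ignores category
shuffled <- sample( df_demo$plot )

## Look up the category of each shuffled partner
partner_category <- df_demo$category[ match( shuffled, df_demo$plot ) ]

## TRUE where a plot has been paired with a plot from another category
crosses <- df_demo$category != partner_category

## Number of cross-category pairings out of 20
print( sum( crosses ) )
```

Any count above zero means the shuffle has mixed vegetation types, which is exactly what the stratified version is meant to prevent.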

#### Stratified random sampling (shuffling)

But what if we want to account for the category classification? Here’s how I used `dplyr` to perform stratified random sampling.

```r
## Use dplyr group_by and mutate to randomly sample within category
df <-
  group_by( df, category ) %>%
  mutate( strat_rsamp = sample( plot ) )

print( arrange( df, category ) )
```

```
## Source: local data frame [20 x 3]
## Groups: category
##
##    plot category strat_rsamp
## 1    p3        a         p12
## 2    p8        a          p3
## 3   p12        a          p8
## 4    p1        b          p5
## 5    p2        b         p20
## 6    p4        b          p9
## 7    p5        b         p15
## 8    p9        b         p17
## 9   p15        b          p1
## 10  p17        b          p2
## 11  p19        b          p4
## 12  p20        b         p19
## 13   p6        c         p11
## 14   p7        c         p14
## 15  p10        c          p7
## 16  p11        c         p10
## 17  p13        c         p16
## 18  p14        c         p18
## 19  p16        c         p13
## 20  p18        c          p6
```
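The listing above can be checked by eye, but with 120 plots that gets tedious. Here is a hedged sketch of a programmatic check that every shuffled partner stays within its own category and that each plot is used exactly once. The names `df_check`, `same_category`, and `is_permutation` are hypothetical, and the seed is arbitrary.

```r
library( dplyr )

set.seed( 2 )  # arbitrary seed, only so the sketch is repeatable

## Hypothetical data set, built like the one in this post
df_check <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                        category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                        stringsAsFactors = FALSE )

## Stratified shuffle, following the group_by + mutate approach above
df_check <- df_check %>%
  group_by( category ) %>%
  mutate( strat_rsamp = sample( plot ) ) %>%
  ungroup()

## Category of each shuffled partner
partner_category <- df_check$category[ match( df_check$strat_rsamp, df_check$plot ) ]

## TRUE if every partner shares the plot's category
same_category <- all( partner_category == df_check$category )

## TRUE if the shuffled column is a rearrangement of the plot column
is_permutation <- identical( sort( df_check$strat_rsamp ), sort( df_check$plot ) )

print( c( same_category = same_category, is_permutation = is_permutation ) )
```

Both values come out `TRUE` because `sample( plot )` is evaluated separately inside each group, so it can only rearrange the plot names that belong to that group.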

We could also return just a vector of the shuffled samples, without the data.frame. This is convenient, but not very pretty code-wise:

```r
( group_by( df, category ) %>%
    mutate( strat_rsamp = sample( plot ) ) )$strat_rsamp
```

```
##  [1] "p5"  "p20" "p8"  "p9"  "p15" "p16" "p11" "p3"  "p19" "p18" "p14"
## [12] "p12" "p13" "p10" "p4"  "p6"  "p1"  "p7"  "p17" "p2"
```
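In the use case described in the Setup section, the shuffle is not done once but repeated many times, with the statistic recalculated after each shuffle. The sketch below shows what that loop might look like. It uses made-up abundance columns (`abund_1992`, `abund_2011`) and a hypothetical stand-in statistic (`mean_abs_change`) in place of a real turn-over measure, so it illustrates the mechanics only and is not the analysis from the original study.

```r
library( dplyr )

set.seed( 3 )  # arbitrary seed, only so the sketch is repeatable

## Made-up data: one abundance value per plot at each time point
plots <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                     category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                     abund_1992 = rpois( 20, lambda = 10 ),
                     abund_2011 = rpois( 20, lambda = 10 ),
                     stringsAsFactors = FALSE )

## Hypothetical stand-in for a turn-over measure: mean absolute change
mean_abs_change <- function( before, after ) mean( abs( after - before ) )

## Observed value with the true pairing of plots
observed <- mean_abs_change( plots$abund_1992, plots$abund_2011 )

## One stratified shuffle of the 2011 values, then recompute the statistic.
## Indices are shuffled rather than calling sample() on the values, because
## sample( x ) on a single number x draws from 1:x instead of returning x.
one_shuffle <- function( d ) {
  shuffled <- d %>%
    group_by( category ) %>%
    mutate( abund_2011 = abund_2011[ sample.int( length( abund_2011 ) ) ] ) %>%
    ungroup()
  mean_abs_change( shuffled$abund_1992, shuffled$abund_2011 )
}

## Null distribution from 199 stratified shuffles (the count is arbitrary)
null_values <- replicate( 199, one_shuffle( plots ) )

## Proportion of shuffles with a statistic at least as large as observed
print( mean( null_values >= observed ) )
```

Shuffling character plot names, as in the post, sidesteps the single-number behaviour of `sample()`; it only becomes a concern when numeric values are shuffled directly and a group happens to contain one row.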

#### Conclusion

There you have it - stratified random sampling. There may be an even easier way to do this (perhaps I missed a function or didn’t dive into `sample` enough?), but this seems pretty easy to me. Thanks `dplyr`!

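On the question of an easier way: one base-R possibility, which is not part of the original gist, is `ave()`. It applies a function separately within the groups defined by a grouping vector and returns the results in the original row order. A minimal sketch, with `df_base` as a made-up name and an arbitrary seed:

```r
set.seed( 4 )  # arbitrary seed, only so the sketch is repeatable

## Hypothetical data set, built like the one in this post
df_base <- data.frame( plot = paste( "p", 1:20, sep = "" ),
                       category = sample( x = letters[1:3], size = 20, replace = TRUE ),
                       stringsAsFactors = FALSE )

## Shuffle plot names within each category using base R only
df_base$strat_rsamp <- ave( df_base$plot, df_base$category, FUN = sample )

## Print, ordered by category
print( df_base[ order( df_base$category ), ] )
```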
Matthew Aiello-Lammens

2014-07-10T14:53:44.110Z

The above is a little Gist that I wrote this morning. The source code can be found here: https://gist.github.com/96c9e597471d48a8f69d.git