Vignette: CRLMM genotyping (100K and 500K)

Author: Henrik Bengtsson
Created on: 2009-01-15
Last updated: 2009-01-15


Introduction

This document shows how to estimate genotypes based on the CRLMM algorithm.  The setup is such that it tries to replicate the estimates of the justCRLMMv2() of the oligo package.  Support for CRLMM in aroma.affymetrix is currently for the 100K and the 500K chip types.  The GWS5 or GWS6 chip types are yet not supported.

Setup 

If this is your first analysis in aroma.affymetrix, please make sure to first read the first few pages in the online 'Users Guide'.  This will explain the importance of following a well defined directory structure and file names.  Understanding this is important and will save you a lot of time.

Annotation data

annotationData/
    chipTypes/
      Mapping250K_Nsp/
       Mapping250K_Nsp.cdf
       Mapping250K_Nsp,na26,HB20080915.ugp 

Raw data

rawData/
   HapMap270,500K,CEU,testSet/
     Mapping250K_Nsp/
       NA06985.CEL    NA06991.CEL    NA06993.CEL
       NA06994.CEL    NA07000.CEL    NA07019.CEL

 There should be 6 CEL files in total.   This data was downloaded from the HapMap website.

Analysis

library("aroma.affymetrix");
log <- verbose <- Arguments$getVerbose(-8, timestamp=TRUE);

Setup raw data set

cdf <- AffymetrixCdfFile$byChipType("Mapping50K_Nsp");
print(cdf);

AffymetrixCdfFile:
Path: annotationData/chipTypes/Mapping250K_Nsp
Filename: Mapping250K_Nsp.cdf
Filesize: 185.45MB
Chip type: Mapping250K_Nsp
RAM: 11.01MB
File format: v4 (binary; XDA)
Number of cells: 6553600
Number of units: 262338
Cells per unit: 24.98
Number of QC units: 6

csR <- AffymetrixCelSet$byName("HapMap270,500K,CEU,testSet", cdf=cdf);
print(csR);

AffymetrixCelSet:
Name: HapMap270
Tags: 500K,CEU,testSet
Path: rawData/HapMap270,500K,CEU,testSet/Mapping250K_Nsp
Platform: Affymetrix
Chip type: Mapping250K_Nsp
Number of arrays: 6
Names: NA06985, NA06991, ..., NA07019
Time period: 2005-11-16 15:17:51 -- 2005-11-18 19:08:24
Total file size: 375.88MB
RAM: 0.01MB

SNPRMA (normalization and summarization)

ces <- justSNPRMA(csR, normalizeToHapmap=TRUE, returnESet=FALSE, verbose=log);
print(ces);

SnpChipEffectSet:
Name: HapMap270
Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-
Path: plmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-/Mapping250K_Nsp
Platform: Affymetrix
Chip type: Mapping250K_Nsp,monocell
Number of arrays: 6
Names: NA06985, NA06991, ..., NA07019
Time period: 2009-01-10 17:34:45 -- 2009-01-10 17:34:46
Total file size: 57.34MB
RAM: 0.01MB
Parameters: (probeModel: chr "pm", mergeStrands: logi FALSE)

Genotype calling using the CRLMM model

Setting up the model:

crlmm <- CrlmmModel(ces, tags="*,oligo");
print(crlmm);

CrlmmModel:
Data set: HapMap270
Chip type: Mapping250K_Nsp,monocell
Input tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-
Output tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Parameters: (balance: num 1.5; minLLRforCalls: num [1:3] 5 1 5; recalibrate: logi FALSE;flavor: chr "v2").
Path: crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
RAM: 0.00MB

Fitting the model:

units <- fit(crlmm, ram="oligo", verbose=log);
str(units);

int [1:262264] 135666 261964 227992 3670 227993 159191 212754 24401 ...

Accessing individual genotype calls and confidence scores

Retrieving the genotype calls (without actually loading anything):

callSet <- getCallSet(crlmm);
print(callSet);

AromaUnitGenotypeCallSet:
Name: HapMap270
Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Full name: HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Number of files: 6
Names: NA06985, NA06991, ..., NA07019
Path (to the first file): crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
Total file size: 3.00MB
RAM: 0.01MB

calls <- extractGenotypes(callSet, units=2001:2005);
print(calls);

NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
 [1,] "AA" "AA" "AA" "AA" "AA" "AA"
 [2,] "AA" "AB" "AB" "BB" "AB" "AB"
 [3,] "AA" "AB" "AB" "AB" "AA" "AB"
 [4,] "BB" "AB" "AB" "BB" "BB" "BB"
 [5,] "AA" "AA" "AA" "AA" "AA" "AA"

Retrieving the confidence scores for the calls:

confSet <- getConfidenceScoreSet(crlmm);
print(confSet);

AromaUnitSignalBinarySet:
Name: HapMap270
Tags: 500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Full name: HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo
Number of files: 6
Names: NA06985, NA06991, ..., NA07019
Path (to the first file): crlmmData/HapMap270,500K,CEU,testSet,QN,HapMapRef,RMA,oligo,+-,CRLMM,oligo/Mapping250K_Nsp
Total file size: 6.01MB
RAM: 0.01MB

scores <- extractMatrix(confSet, units=2001:2005);
print(scores);

NA06985 NA06991 NA06993 NA06994 NA07000 NA07019
 [1,] 0.9998 0.9998 0.9998 0.9999 0.9999 0.9998
 [2,] 0.9968 0.9932 0.9931 0.9998 0.9976 0.9974
 [3,] 0.9997 0.9993 0.9967 0.9988 0.9998 0.9975
 [4,] 0.9986 0.9961 0.9894 0.9994 0.9994 0.9989
 [5,] 0.9998 0.9999 0.9998 0.9999 0.9998 0.9998

Plotting raw CNs and raw genotypes annotated by genotypes

gi <- getGenomeInformation(cdf);
units <- getUnitsOnChromosome(gi, chromosome=2, region=c(75,90)*1e6);
pos <- getPositions(gi, units=units) / 1e6;

Get data for the last array:

array <- 6;
ce <- getFile(ces, array);
data <- extractTotalAndFreqB(ce, units=units);
str(data);

num [1:1306, 1:2] 2312 2185 1994 3242 4156 ...
 - attr(*, "dimnames")=List of 2
 ..$ : NULL
 ..$ : chr [1:2] "total" "freqB"

Extract the (total) theta and the freqB columns:

theta <- data[,"total"];
freqB <- data[,"freqB"];

We calculate the copy-neutral reference as the robust average of all arrays:

ceR <- getAverageFile(ces);
dataR <- extractTotalAndFreqB(ceR, units=units);
thetaR <- dataR[,"total"];

Then we calculate the log2 copy-number ratios:

M <- log2(theta/thetaR);

Finally, we extract the corresponding genotype calls:

cf <- getFile(callSet, array);
calls <- extractGenotypes(cf, units=units);

Next, we plot the (x,C) and (x,freqB) along the genome annotated with colors according to genotype:

col <- c(AA=1, AB=2, BB=3)[calls];
xlim <- range(pos, na.rm=TRUE);
xlab <- "Physical position (Mb)";
Mlab <- expression(log[2](theta/theta[R]));
Blab <- expression(beta == theta[B]/theta);
subplots(2, ncol=1);
par(mar=c(3,4,1,1)+0.1, pch=".");
plot(pos, M, col=col, cex=3, xlim=xlim, ylim=c(-2,2), xlab=xlab, ylab=Mlab);
stext(side=3, pos=0, getName(ce));
stext(side=3, pos=1, "Chr 2");
plot(pos, freqB, col=col, cex=3, xlim=xlim, ylim=c(0,1), xlab=xlab, ylab=Blab);

 

Comparing with CRLMM in oligo

path <- file.path("oligoData", getFullName(csR), getChipType(csR, fullname=FALSE));
path <- Arguments$getWritablePathname(path);
if (!isDirectory(path)) {
  mkdirs(getParent(path));
  oligo:::justCRLMMv2(getPathnames(csR), tmpdir=path, recalibrate=FALSE, balance=1.5, verbose=TRUE);
}

Comparing genotype calls

Genotype calls according to oligo

calls0 <- readSummaries("calls", path);
dimnames(calls0) <- NULL;

Genotype calls according to aroma.affymetrix

units <- indexOf(cdf, pattern="^SNP");
unitNames <- getUnitNames(cdf, units=units);
units <- units[order(unitNames)];
calls <- extractGenotypes(callSet, units=units, encoding="oligo");
dimnames(calls) <- NULL;

Differences between aroma.affymetrix and oligo

count <- 0;
for (cc in 1:ncol(calls)) {
  idxs <- whichVector(calls[,cc] != calls0[,cc]);
  count <- count + length(idxs);
  cat(sprintf("%s: ", getNames(callSet)[cc]));
  if (length(idxs) > 0) {
    map <- c("AA", "AB", "BB");
    cat(paste(map[calls[idxs,cc]], map[calls0[idxs,cc]], sep="!="), sep=",");
  }
  cat("\n");
}

We identify the following difference for each of the genotyped samples:

NA06985: AB!=AA, AB!=BB, AB!=AA, BB!=AB, BB!=AB
NA06991: AB!=AA
NA06993: AA!=AB, BB!=AB
NA06994: AB!=AA, BB!=AB
NA07000: AB!=BB, AA!=AB
NA07019: AA!=AB, AA!=AB, AA!=AB, AA!=AB

Note that these differences are only in one allele, that is, one implementation calls a SNP heterozygote whereas the other calls a homozygote.

cat(sprintf("Averages number of discrepances per array: %.1f\n", count/ncol(calls)));

Averages number of discrepances per array: 2.7

errorRate <- count/length(calls);
cat(sprintf("Concordance rate: %.5f%%\n", 100*(1-errorRate)));

Concordance rate: 99.99898%

Comparing confidence scores

Confidence scores according to oligo:

conf0 <- readSummaries("conf", path);
dimnames(conf) <- NULL;

Confidence scores according to aroma.affymetrix:

conf <- extractMatrix(confSet, units=units);
dimnames(conf) <- NULL;

Differences between aroma.affymetrix and oligo:

delta <- conf - conf0;
avgDelta <- mean(abs(delta), na.rm=TRUE);
cat(sprintf("Averages difference: %.2g\n", avgDelta));

Averages difference: 7.1e-06

Pairwise plots:

subplots(ncol(conf));
par(mar=c(3,2,1,1)+0.1);
lim <- c(0,1);
for (cc in 1:ncol(conf)) {
  plot(NA, xlim=lim, ylim=lim);
  abline(a=0, b=1, col="#999999");
  points(conf[,cc], conf0[,cc], pch=".", cex=3);
  rho <- cor(conf[,cc], conf0[,cc]);
  stext(side=3, pos=0, line=-1, sprintf("rho=%.4f", rho));
  stext(side=3, pos=0, getNames(confSet)[cc]);
  cat(sprintf("Array #%d: Correlation: %.4f\n", cc, rho));
}

 

Appendix

Information about R and packages used

> sessionInfo()

R version 2.8.1 Patched (2008-12-22 r47296)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:

[1] splines tools stats graphics grDevices utils datasets
[8] methods base

other attached packages:

[1] pd.mapping250k.nsp_0.4.1 oligoClasses_1.3.8
[4] AnnotationDbi_1.3.12 preprocessCore_1.3.4 RSQLite_0.7-0
[7] DBI_0.2-4 Biobase_2.1.7 aroma.affymetrix_1.0.0
[10] aroma.apd_0.1.4 R.huge_0.1.6 affxparser_1.15.1
[13] aroma.core_1.0.0 aroma.light_1.11.1 oligo_1.5.9
[16] digest_0.3.1 matrixStats_0.1.3 R.rsp_0.3.4
[19] R.cache_0.1.7 R.utils_1.1.3 R.oo_1.4.6
[22] R.methodsS3_1.0.3