How to: Create a CDF file from scratch (or from CLF and PGF files)

Author: Elizabeth Purdom (pruned by Henrik Bengtsson)
Created on: 2008-12-19
Last updated: 2014-01-22

I am collecting here functions that Mark Robinson and I have developed to create CDFs from scratch. Ultimately the goal will be to create single R functions, but right now these are functions that make up a series of steps (not all of which are in Perl).

An example of steps for Ensembl CDF for the exon array:

Download the Ensembl Annotation with biomaRt and use it to map the Affymetrix probesets (exons) to the gene boundaries in Ensembl using annotateProbesets.R
Make single file of all probes and their mapping with convertProbesetCSV_differentInput.pl (i.e. now map the corresponding probes)
Create CDF with flat2Cdf.R

To filter, see example below.

Make CDF (Main File)

The main file flat2Cdf.R contains flat2Cdf() for making the CDF, which is a function in R that takes a 'flat' file and converts it to a binary CDF file. The flat file is a tab-delimited file with each row corresponding to a probe to be included in the CDF. Right now the format of a 'flat' file is a bit rigid.

It must be:

tab-delimited
have a header row
A column 'X' and 'Y' with the x,y coordinates of a file (there is an argument that allows to pass other names, but not really tested yet)
A column identifying a probe to a group and a column identifying a probe to a unit (for exon arrays, group=exon, unit=gene; for gene arrays, unit=group=gene); these identifications will the group and unit names respectively.
Other columns are allowed and will be ignored.

Example:  
Probe_ID        X       Y       Probe_Sequence  Group_ID
Unit_ID  
15      14      0       TCTCCAGTGAAGTGCACATTGCTCA       3029044
ENSG00000106144  
17      16      0       TGATCGCCTGTCTGCAGATAGGGCA       2400195
ENSG00000090432  
29      28      0       TGTAGCTACATGAGGTCTCAGCAGT       2690968
ENSG00000121577  
34      33      0       TGGGATGAATGATCAGGAACTGCTG       3032901
ENSG00000130226  
35      34      0       TCTGGAAATGCATCAGGGACATCTG       3087574
ENSG00000155975  
51      50      0       TATGACAGGATAGCACCAGACGAGC       2808483
ENSG00000112992  
58      57      0       TGAGTACGAATGTGCATCTTCAGTC       2783822
ENSG00000138738  
59      58      0       TATGTGACTGGATTGCAAGTCTCTC       2403372
ENSG00000158161  
62      61      0       CATGTAGATCTATGCCAGTTGAATC       2507416
ENSG00000048991

The R function takes the following arguments:

filename: of flat file
chipType: the chip type of the CDF
tags: tags to add to the chip type
ucol: integer giving the column with the unit id
gcol: integer giving the column with the group id (can be same as unit id)
col.class: column classes of file (see help on read.table());
splitn: parameter that controls the number of initial chunks that are unwrapped (number of characters of unit names used to keep units together for initial chunks)
rows: number of rows on the array
cols: number of columns on the array
xynames: names for the X,Y columns

Example (within R):

flat2Cdf(file="hjay.r1.flat", chipType="hjay", tag="r1,TC")

Convert Affymetrix annotation to Flat file (Perl)

Affymetrix generally has one file that gives the probe to probeset identification. Then there might be a separate file that relates probesets to genes. This information must be combined to get a single flat file for flat2Cdf(). The Perl script convertProbesetCSV_differentInput.pl converts these files to an output 'flat' file. This may not work for all input (I use the file *.probe.tab from Affymetrix).

It takes 3 arguments:

a file with the probeset assignments to genes (comma-delimited, one row per probeset)
file with probe assignments to probesets (tab-delimited, one row per probe, e.g. *.probe.tab)
outfile name (will be tab-delimited)

Optional arguments concerning the first file:

-p column with the probeset (default 1)
-g column with the genes (default 2)

Example (system command line; one row):

convertProbesetCSV_differentInput.pl ensembl2affy.csv MoEx-1_0-st-v1.probe.tab MoEnsembl50.flat -p 1 -g 6

Map probesets to outside gene definition

The annotateProbesets() function (R script annotateProbesets.R) is for taking Affymetrix's probesets ('psr') and mapping them to a different gene definition. This might be a bit exon array specific. It was designed for Human Ensembl annotation, though I used it for Mouse. It could be extended to be just probes, perhaps.

This is an R function with the following arguments:

probesetFile: file name with the psr regions
exonBoundariesFile: file name of delimited file with the exon boundaries that you want to map the probesets to (new gene definitions)
probesetFileParameters, exonBoundariesFileParameters: list of parameters to pass to read.table() in reading the file; note that header=TRUE and stringsAsFactors=FALSE are hard-coded in and can not be changed by user (and should not be passed in this argument)
probesetFileNames: vector of the header (names) of the necessary columns to match the probeset to the annotation (see default assignments to see what these are). This allows the user to have a file with different headers, which will be detected and converted to the standard headers in the default. Note that the order of this argument is ESSENTIAL. Also, the sixth value (by default NA) can be an additional column (like level); it is only used to table the matching and non-matching probesets in the verbose output, so is not essential. All columns will be returned in the final output (though in different order)
exonFileNames: similar to probesetFileNames, but see default to see the specifics for the exonBoundariesFile. Note that if gene_id=exon_id, and the stop and start are the boundaries of the gene, it should just map to the gene and not coding region specific.
overlapType: either a character vector specifying overlap type or a user-defined function that returns the fraction of overlap of a set of genes (untested, see internal code to see operation of function). If character vector, should be either "none" (no adjustment), "probeset" (percentage of probeset overlap in the resulting merged gene), "coding" (untested -- the percentage of coding overlap as fraction of coding length of resulting merged gene), or "genomic" (untested-the percentage of genomic overlap as fraction of length of the resulting merged gene)
overlapCutoff: between 0 and 1; the required shared region required to merge genes together (as defined by overlapType function); if genes are not merged, the overlapping probesets are removed.

Some strange assumptions about the files (should be fixed...):

'probesets' file assumes Affymetrix format for chromosome data: chr1, chr2, etc. and +/- for strand
assume numeric for chromosomes in exonBoundaries (1-24)...; and -1/1 for strand
assumes that X,Y is encoded as 23,24 in exonBoundaries...very human specific, should be careful

Description of the overlapping strategy (It is run after probesets are assigned to genes based on the inputted exonBoundariesFile.)

If overlapType is "none" there is no further action done (and probesets can be assigned to multiple genes)
When there is overlap (defined as probesets assigned to multiple genes, and not based on the external exon boundaries), there is an initial scan that creates groups of genes that overlap (if A overlaps B and B overlaps C, but C does not overlap A, then A,B,C are still all included in an overlapping group).
For each group, the % that are common to all of the genes in the group is calculated based on the function given by overlapType.
If this % is greater or equal to overlapCutoff then the genes in the group (and all of their probesets, regardless of which gene they are from) are merged into one gene.
Otherwise all of the genes are kept distinct and any overlapping probesets are removed.
overlapCutoff=1 means that only genes with 100% overlap (as defined by the function in overlapType) are merged and otherwise probesets mapped to multiple genes are removed.
overlapCutoff > 1 means there is absolutely no merging of genes, even if 100% equivalent (note: error is returned if overlapType returns > 1)
The merged gene has a name that is the concatenation of the names of the genes in the set: "Gene1.Gene2.Gene3" etc.

Example Usage:

path <- "HumanEnsembl50Build"
probesetName <- file.path(path, "HuEx-1_0-st-v2.na26.hg18.probeset.csv")
annotExonBoundaries <- file.path(path, "exonBoundaries_proteinCoding_20080819.txt")

colClasses <- rep("NULL", 39); # 39=number of columns in HuEx-1_0-st-v2.naXX.hg18.probeset.csv
colClasses[c(1:5,16)] <- NA;   # the columns that I want to pull in
overlapType <- "probeset"
overlapCutoff <- 1.0
pbsetHeaderNames <- c("probeset_id", "seqname", "strand", "start", "stop", "level")
exonHeaderNames <- c("gene_id", "exon_id", "chr", "strand", "start", "end")
annotateProbesets(probesetFile=probesetName,
    exonBoundariesFile=annotExonBoundaries,
    probesetFileParameters=list(sep=",", colClasses=colClasses, comment.char="#"),
    probesetFileNames=pbsetHeaderNames,
    exonBoundariesFileParameters=list(sep="\t", colClasses=NA, comment.char="#"),
    exonBoundariesFileNames=exonHeaderNames,
    overlapType=overlapType, overlapCutoff=overlapCutoff,
    outdir=path, verbose=TRUE)

Filter probes (Perl)

If you already have an existing flat file, you might want to create a new CDF with just a subset of the probes. The Perl script selectProbes.pl takes a starting flat file and a file with the xy coordinates of the probes to keep and returns a new flat file.

The program takes 3 arguments:

the file with the original probe to probeset and unit assignments (ie.flat file)
file with new probes to keep (x,y in column 1 and 2 respectively)
outfile name

Then if I want to filter out some probes based on my own analysis of my data, I can write these probes to a file

cells <- unlist(ensb50NOvHuCells, use.names=FALSE)
probeValues <- readRawData(secLgNHuOnHu_HM, indices=cells, fields=c("xy","intensities"))
probeValues$geneId <- rep(names(ensb50NOvHuCells), times=perGeneProbeCountHu)
probeValues$Indices <- cells
write.table(probeValues[myProbes,c("x","y")], file="probeFile.txt", col.names=TRUE, row.names=FALSE, quote=FALSE, sep="\t")

Then in command line:

selectProbes.pl HuEnsembl50_nooverlap.flat probeFile.txt probeFile.flat

Then back to R:

flat2Cdf(file=probeFile.flat, chipType="HuEx-1_0-st-v2", tags="U-Ensembl50,G-Affy,customMH,joint,80PctLowMo,80PctHighHu,EP", col.class=c("integer", "character")[c(1,1,1,2,2,2)], splitn=5)

Then the CDF must be moved to the correct annotationData/chipTypes/ subdirectory.

Combine Affymetrix CLF and PGF file formats

Affymetrix sometimes provides CLF and PGF files. These are not necessary fixed formats. CLF files (that I have) are tab-delimited, each line a probe with three columns corresponding to 1) Probe id 2) x position 3) y position. PGF files are not tab-delimited files, but rather nested and give the assignment of probes to probesets.

#level0=probeset_id\\ttype\\tprobeset_name
  #level1=\\tatom_id\\texon_position"
  #level2=\\t\\tprobe_id\\ttype\\tgc_count\\tprobe_length\\tinterrogation_position\\tprobe_sequence"

For example:

1\\t\\tPSR150003628_x_st
  \\t1\\t130
  \\t\\t5892621\\tpm:st\\t16\\t25\\t13\\tGTCCTCCTCTACGAGGCTCTCGTCC

\\t2\\t155
  \\t\\t4734090\\tpm:st\\t17\\t25\\t13\\tTCCGCGTCTCTCACGCCCTCGTCCT
  \\t3\\t161
  \\t\\t3931023\\tpm:st\\t15\\t25\\t13\\tTCTCTCACGCCCTCGTCCTCTCTGA

The Perl script combineProbeInfo.pl creates a single tab-delimited file with each row for a probe that looks like this after read into R. (This is the same format as the BGP).

##   probeset_id type     probeset_name atom_id exon_position probe_id type.1 gc_count
## 1           1   NA PSR150003628_x_st       1           130 5892621  pm:st       16
## 2           1   NA PSR150003628_x_st       2           155 4734090  pm:st       17
## 3           1   NA PSR150003628_x_st       3           161 3931023  pm:st       15
## 4           1   NA PSR150003628_x_st       4            70 3244040  pm:st       12
## 5           1   NA PSR150003628_x_st       5            78 5124413  pm:st       13
##   probe_length  interrogation_position            probe_sequence    x    y  
## 1           25                      13 GTCCTCCTCTACGAGGCTCTCGTCC 2060 2301  
## 2           25                      13 TCCGCGTCTCTCACGCCCTCGTCCT  649 1849  
## 3           25                      13 TCTCTCACGCCCTCGTCCTCTCTGA 1422 1535  
## 4           25                      13 CACCTTTTGTTAGTCCGGAACTCAG  519 1267  
## 5           25                      13 GTTAGTCCGGAACTCAGAGGAATCG 1852 2001

The program takes 1 argument which is the prefix, then looks for the files <prefix>.pgf and <prefix>.clf. It creates the file <prefix>.probeflat with the per probe info and also <prefix>.psr which is a tab-delimited file with each row corresponding to a probeset in the CLF/PGF, and the columns giving the name and probeset id.

Currently there are not really checks that the PGF and the CLF contain the same information.

Note there are no further 'unit'/gene groupings here. The Perl script addGeneId.pl can take this file plus another probeset->Gene mapping file and add the necessary column with that information.