How to: Create a CDF file from scratch

Author: Elizabeth Purdom (pruned by Henrik Bengtsson)
Created on: 2008-12-19
Last updated: 2009-12-12

I am collecting here functions that Mark Robinson and I have developed to create CDFs from scratch. Ultimately the goal will be to create single R functions, but right now these are functions that make up a series of steps (not all of which are in perl).

An example of steps for Ensembl CDF for the exon array:

  1. Download the Ensembl Annotation with BiomaRt and use it to map the Affymetrix probesets (exons) to the gene boundaries in Ensembl using annotateProbesets.R
  2. Make single file of all probes and their mapping with convertProbesetCSV_differentInput.pl (i.e. now map the corresponding probes)
  3. Create CDF with flat2cdf.R

To filter, see example below

Make CDF (Main File)

The main file flat2cdf.R contains flat2Cdf() for making the CDF, which is a function in R that takes a 'flat' file and converts it to a binary CDF file. The flat file is a tab-delimited file with each row corresponding to a probe to be included in the CDF.  Right now the format of a 'flat' file is a bit rigid.

It must be:

  • tab-delimited
  • have a header row
  • A column 'X' and 'Y' with the x,y coordinates of a file (there is an argument that allows to pass other names, but not really tested yet)
  • A column identifying a probe to a group and a column identifying a probe to a unit (for exon arrays, group=exon, unit=gene; for gene arrays, unit=group=gene); these identifications will the group and unit names respectively.
  • Other columns are allowed and will be ignored.

Example:
Probe_ID        X       Y       Probe_Sequence  Group_ID        Unit_ID
15      14      0       TCTCCAGTGAAGTGCACATTGCTCA       3029044 ENSG00000106144
17      16      0       TGATCGCCTGTCTGCAGATAGGGCA       2400195 ENSG00000090432
29      28      0       TGTAGCTACATGAGGTCTCAGCAGT       2690968 ENSG00000121577
34      33      0       TGGGATGAATGATCAGGAACTGCTG       3032901 ENSG00000130226
35      34      0       TCTGGAAATGCATCAGGGACATCTG       3087574 ENSG00000155975
51      50      0       TATGACAGGATAGCACCAGACGAGC       2808483 ENSG00000112992
58      57      0       TGAGTACGAATGTGCATCTTCAGTC       2783822 ENSG00000138738
59      58      0       TATGTGACTGGATTGCAAGTCTCTC       2403372 ENSG00000158161
62      61      0       CATGTAGATCTATGCCAGTTGAATC       2507416 ENSG00000048991

The R function takes the following arguments:

  • file name: of flat file
  • chipType: the chip type of the CDF
  • tags: tags to add to the chip type
  • ucol: integer giving the column with the unit id
  • gcol: integer giving the column with the group id (can be same as unit id)
  • col.class: column classes of file (see help on read.table());
  • splitn: parameter that controls the number of initial chunks that are unwrapped (number of characters of unit names used to keep units together for initial chunks)
  • rows: number of rows on the array
    cols: number of columns on the array
  • xynames: names for the X,Y columns

Example (within R):

flat2Cdf(file="hjay.r1.flat", chipType="hjay", tag="r1,TC");

Convert Affymetrix annotation to Flat file (Perl)

Affymetrix generally has one file that gives the probe to probeset idenfication. Then there might be a separate file that relates probesets to genes. This information must be combined to get a single flat file for flat2Cdf(). The perl script convertProbesetCSV_differentInput.pl converts these files to an output 'flat' file. This may not work for all input (I use the file *.probe.tab from Affymetrix).

It takes 3 arguments:

  1. a file with the probeset assignments to genes (comma-delimited, one row per probeset)
  2. file with probe assignments to probesets (tab-delimited, one row per probe, e.g. *.probe.tab)
  3. outfile name (will be tab-delimited)

Optional arguments concerning the first file:

-p column with the probeset (default 1)
-g column with the genes (default 2)

Example (command line; one row):

convertProbesetCSV_differentInput.pl ensembl2affy.csv MoEx-1_0-st-v1.probe.tab MoEnsembl50.flat -p 1 -g 6

Map probesets to outside gene definition

The annotateProbesets() function (annotateProbesets.R) is for taking Affymetrix's probesets ('psr') and mapping them to a different gene definition. This might be a bit exon array specific. It was designed for Human Ensembl annotation, though I used it for Mouse. It could be extended to be just probes, perhaps.

This is an R function with the following arguments:

  • probesetFile=file name with the psr regions
  • exonBoundariesFile=file name of delimited file with the exon boundaries that you want to map the probesets to (new gene definitions)
  • probesetFileParameters,exonBoundariesFileParameters=list of parameters to pass to read.table() in reading the file; note that header=TRUE and stringsAsFactors=FALSE are hard-coded in and can not be changed by user (and should not be passed in this argument)
  • probesetFileNames=vector of the header (names) of the necessary columns to match the probeset to the annotation (see default assignments to see what these are). This allows the user to have a file with different headers, which will be detected and converted to the standard headers in the default. Note that the order of this argument is ESSENTIAL. Also, the sixth value (by default NA) can be an additional column (like level); it is only used to table the matching and non-matching probesets in the verbose output, so is not essential. All columns will be returned in the final output (though in different order)
  • exonFileNames=similar to probesetFileNames, but see default to see the specifics for the exonBoundariesFile. Note that if gene_id=exon_id, and the stop and start are the boundaries of the gene, it should just map to the gene and not coding region specific.
  • overlapType=either a character vector specifying overlap type or a user-defined function that returns the fraction of overlap of a set of genes (untested, see internal code to see operation of function). If character vector, should be either 'none' (no adjustment), 'probeset' (percentage of probeset overlap in the resulting merged gene), 'coding' (untested -- the percentage of coding overlap as fraction of coding length of resulting merged gene), or 'genomic' (untested-the percentage of genomic overlap as fraction of length of the resulting merged gene)
  • overlapCutoff=between 0,1; the required shared region required to merge genes together (as defined by overlapType function); if genes are not merged, the overlapping probesets are removed.

Some strange assumptions about the files (should be fixed...):

  • 'probesets' file assumes Affymetrix format for chromosome data: chr1, chr2, etc. and +/- for strand
  • assume numeric for chromosomes in exonBoundaries (1-24)...; and -1/1 for strand
  • assumes that X,Y is encoded as 23,24 in exonBoundaries...very human specific, should be careful

Description of the overlapping strategy (It is run after probesets are assigned to genes based on the inputted exonBoundaries file.)

  1. If overlapType is 'none' there is no further action done (and probesets can be assigned to multiple genes)
  2. When there is overlap (defined as probesets assigned to multiple genes, and not based on the external exon boundaries), there is an initial scan that creates groups of genes that overlap (if A overlaps B and B overlaps C, but C doesn't overlap A, A,B,C are still all included in an overlapping group).
  3. For each group, the % that are common to all of the genes in the group is calculated based on the function given by 'overlapType'.
  4. If this % is greater or equal to overlapCutoff then the genes in the group (and *all* of their probesets, regardless of which gene they are from) are merged into one gene.
  5. Otherwise all of the genes are kept distinct and *any* overlapping probesets are removed.
  6. overlapCutoff=1 means that only genes with 100% overlap (as defined by the function in overlapType) are merged and otherwise probesets mapped to multiple genes are removed.
  7. overlapCutoff>1 means there is absolutely no merging of genes, even if 100% equivalent (note: error is returned if overlapType returns > 1)
  8. The merged gene has a name that is the concatination of the names of the genes in the set: "Gene1.Gene2.Gene3" etc.         

Example Usage:

path <- "HumanEnsembl50Build";
probesetName <- file.path(path, "HuEx-1_0-st-v2.na26.hg18.probeset.csv");
annotExonBoundaries <- file.path(path, "exonBoundaries_proteinCoding_20080819.txt");

colClasses <- rep("NULL", 39); # 39=number of columns in HuEx-1_0-st-v2.naXX.hg18.probeset.csv
colClasses[c(1:5,16)] <- NA;   # the columns that I want to pull in
overlapType <- "probeset";
overlapCutoff <- 1.0;
pbsetHeaderNames <- c("probeset_id", "seqname", "strand", "start", "stop", "level");
exonHeaderNames <- c("gene_id", "exon_id", "chr", "strand", "start", "end");
annotateProbesets(probesetFile=probesetName,
    exonBoundariesFile=annotExonBoundaries,
    probesetFileParameters=list(sep=",", colClasses=colClasses, comment.char="#"),
    probesetFileNames=pbsetHeaderNames,
    exonBoundariesFileParameters=list(sep="\t", colClasses=NA, comment.char="#"),
    exonBoundariesFileNames=exonHeaderNames,
    overlapType=overlapType, overlapCutoff=overlapCutoff, 
    outdir=path, verbose=TRUE);

Filter probes (Perl)

If you already have an existing flat file, you might want to create a new CDF with just a subset of the probes. selectProbes.pl takes a starting flat file and a file with the xy coordinates of the probes to keep and returns a new flat file.

The program takes 3 arguments:

  1. the file with the original probe to probeset and unit assignments (ie.flat file)
  2. file with new probes to keep (x,y in column 1 and 2 respectively)
  3. outfile name

Then if I want to filter out some probes based on my own analysis of my data, I can write these probes to a file

cells <- unlist(ensb50NOvHuCells, use.names=FALSE);
probeValues <- readRawData(secLgNHuOnHu_HM, indices=cells, fields=c("xy","intensities"));
probeValues$geneId <- rep(names(ensb50NOvHuCells), times=perGeneProbeCountHu);
probeValues$Indices <- cells;
write.table(probeValues[myProbes,c("x","y")], file="probeFile.txt", col.names=TRUE, row.names=FALSE, quote=FALSE, sep="\t");

Then in command line:

selectProbes.pl HuEnsembl50_nooverlap.flat probeFile.txt probeFile.flat

Then back to R:

flat2Cdf(file=probeFile.flat, chipType="HuEx-1_0-st-v2", tags="U-Ensembl50,G-Affy,customMH,joint,80PctLowMo,80PctHighHu,EP", col.class=c("integer", "character")[c(1,1,1,2,2,2)], splitn=5);

Then the CDF must be moved to the correct annotationData/ directory.

 

Combine Affymetrix CLF and PGF file formats

Affymetrix sometimes provides CLF and PGF files. These are not necessary fixed formats. CLF files (that I have) are tab-delimited, each line a probe with three columns corresponding to 1) Probe id 2) x position 3) y position. PGF files are not tab-delimited files, but rather nested and give the assignment of probes to probesets.

#level0=probeset_id\ttype\tprobeset_name
#level1=\tatom_id\texon_position"
 #level2=\t\tprobe_id\ttype\tgc_count\tprobe_length\tinterrogation_position\tprobe_sequence"
 For example:
     1\t\tPSR150003628_x_st                               
    \t1\t130              
    \t\t5892621\tpm:st\t16\t25\t13\tGTCCTCCTCTACGAGGCTCTCGTCC         
    \t2\t155                                                                                    
    \t\t4734090\tpm:st\t17\t25\t13\tTCCGCGTCTCTCACGCCCTCGTCCT
    \t3\t161        
    \t\t3931023\tpm:st\t15\t25\t13\tTCTCTCACGCCCTCGTCCTCTCTGA
 

The file combineProbeInfo.pl creates a single tab-delimited file with each row for a probe that looks like this after read into R. (This is the same format as the BGP).

    ##  probeset_id type     probeset_name atom_id exon_position probe_id type.1 gc_count probe_length
    ##1           1   NA PSR150003628_x_st       1           130  5892621  pm:st       16           25
    ##2           1   NA PSR150003628_x_st       2           155  4734090  pm:st       17           25
    ##3           1   NA PSR150003628_x_st       3           161  3931023  pm:st       15           25
    ##4           1   NA PSR150003628_x_st       4            70  3244040  pm:st       12           25
    ##5           1   NA PSR150003628_x_st       5            78  5124413  pm:st       13           25
    ##  interrogation_position            probe_sequence    x    y
    ##1                     13 GTCCTCCTCTACGAGGCTCTCGTCC 2060 2301
    ##2                     13 TCCGCGTCTCTCACGCCCTCGTCCT  649 1849
    ##3                     13 TCTCTCACGCCCTCGTCCTCTCTGA 1422 1535
    ##4                     13 CACCTTTTGTTAGTCCGGAACTCAG  519 1267
    ##5                     13 GTTAGTCCGGAACTCAGAGGAATCG 1852 2001

The program takes 1 argument which is the prefix, then looks for the files <prefix>.pgf and <prefix>.clf. It creates the file <prefix>.probeflat with the per probe info and also <prefix>.psr which is a tab-delimited file with each row corresponding to a probeset in the CLF/PGF, and the columns giving the name and probeset id.

Currently there are not really checks that the PGF and the CLF contain the same information.

Note there are no further 'unit'/gene groupings here.addGeneId.pl can take this file plus another probeset->Gene mapping file and add the necessary column with that information.