I'm trying to RMA normalize a particular gene expression dataset concerning diffuse large B-cell lymphoma using custom gene-level annotation CDF (chip definition file) files from Brainarray. Unfortunately, the RMA-normalized expression matrix is all NaNs, and I don't understand why.

The dataset (GSE31312) is freely available at the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I'm using the affy package to perform the RMA normalization. Since the problem is specific to the dataset, the following R code to reproduce the problem is unfortunately quite cumbersome (2 GB download, 8.8 GB unpacked).

    #biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils"))
    library("GEOquery") # To automatically download the data

Download the array data to the work dir.

Untar the data in a dir called CEL

    #Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncomment this line

... gz files

    gz.files <- list.files("CEL", pattern = "\\.gz$",
    cel.files <- list.files("CEL", pattern = "\\.cel$",

Download, install, and load the custom Brainarray Ensembl ENSG gene annotation package

    download.file(paste0("",

Perform the RMA normalization with and without the custom CDF.

    affy.rma <- justRMA(filenames = cel.files, verbose = TRUE)
    ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE,

As can be seen, the function returns, without warning, an expression matrix exprs(ensg.rma) in which every entry is NaN.

Interestingly, there are quite a few NaNs in exprs(affy.rma) when using the standard CDF.

    sum(apply(is.na(exprs(affy.rma)), 1, all)) # There are relatively few NaNs in total (but there really should be none)

Can anybody help me track down the origin of the NaNs and get some useful expression values?

Anders

Edit: A single file appears to be the issue (but not quite)

I've tried using the expresso function to perform background correction only (with no normalization and summarization), which also yields NaNs. So the problem appears to stem from the background correction. However, one might worry that one or more bad arrays are the cause.

What actually happens underneath the RMA hood when justRMA is given a single array, I'm not sure. But I imagine that the quantile normalization step becomes the identity function and the summarization (median polish) simply stops after the first iteration.

Anyway, to perform the one-by-one RMA normalization we run:

    ensg <- vector("list", length(cel.files)) # Initialize a list
    ensg[[...]] <- justRMA(filenames = file, verbose = TRUE,
    cat("File", which(file == cel.files), "done.\n\n")
    # Extract the expression matrices for each file and combine them
    X <- as.data.frame(do.call(cbind, lapply(ensg, exprs)))

By looking at head(X) it appears that GSM776462.CEL is all NaNs. Next, I count the number of NaNs appearing in other places:

    sum(is.na(X)) # = 0

Hence GSM776462.CEL appears to be the culprit. But the regular CDF annotation does not give any problems:

    affy <- justRMA(filenames = "CEL/GSM776462.CEL")

It is also weird that using the regular CDF gives seemingly randomly scattered NaNs in the expression matrix. I still don't quite know what to make of this.

Edit2: NaNs vanish when excluding sample but not with standard CDF

For what it is worth, when I exclude the GSM776462.CEL file and RMA normalize with and without the custom CDF files, the NaNs only disappear in the custom CDF case.

    # Trying with all other CEL than GSM776462.CEL
    cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE)
    a.776462 <- justRMA(filenames = cel.files2)
    e.776462 <- justRMA(filenames = cel.files2, verbose = TRUE,

Edit3: No NAs or NaNs in "raw data"

For what it is worth, I've tried to read in the raw probe-level expression values to check if they contain NAs or NaNs.

... CEL-files one-by-one and checks for any missing values.

    for (file in cel.
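On the guess in this post that quantile normalization reduces to the identity when only one array is supplied: under the textbook definition of quantile normalization this can be checked with a few lines of base R. The sketch below is only an illustration. `quantile_normalise` is a made-up helper, not the routine used by affy, and it says nothing about what justRMA actually does internally or about the median polish step.

```r
# Textbook quantile normalisation (hypothetical helper, NOT affy's implementation).
# x: numeric matrix, probes in rows, arrays in columns, no missing values.
quantile_normalise <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), !anyNA(x))
  # Sort every array (column); matrix() keeps the shape even for a single column.
  sorted <- matrix(apply(x, 2, sort), nrow = nrow(x))
  # Reference distribution: mean of each quantile across arrays.
  target <- rowMeans(sorted)
  out <- x
  for (j in seq_len(ncol(x))) {
    # Replace each value by the reference value of the same rank.
    out[, j] <- target[rank(x[, j], ties.method = "first")]
  }
  out
}

one_array  <- matrix(c(5, 1, 3, 2), ncol = 1)          # a single toy "array"
two_arrays <- cbind(a = c(5, 1, 3), b = c(2, 6, 4))   # two toy "arrays"

norm_one <- quantile_normalise(one_array)
norm_two <- quantile_normalise(two_arrays)

print(isTRUE(all.equal(norm_one, one_array)))  # single array: values come back unchanged
print(norm_two)                                # two arrays: both share one distribution
```

With a single column the reference distribution is just the sorted column itself, so every value maps back to itself. With two or more columns the values change but every column ends up with the same set of values.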
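For tracking down where the NaNs sit in an expression matrix, a per-sample summary is often quicker than scanning head() output. The following is a minimal base-R sketch. `summarise_nan` and the toy matrix are invented for illustration and are not part of affy; with the real data the input would presumably be something like the matrix returned by exprs().

```r
# Hypothetical helper (not part of affy): count missing values per sample.
# expr: numeric matrix with probesets in rows and samples in columns.
summarise_nan <- function(expr) {
  stopifnot(is.matrix(expr), is.numeric(expr))
  n_missing <- colSums(is.na(expr))  # is.na() is TRUE for both NA and NaN
  data.frame(
    sample      = if (is.null(colnames(expr))) seq_len(ncol(expr)) else colnames(expr),
    n_missing   = unname(n_missing),
    all_missing = unname(n_missing == nrow(expr)),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for an expression matrix: sample B.CEL is entirely NaN,
# sample C.CEL contains one scattered NaN.
toy <- matrix(c(1, 2, 3,  NaN, NaN, NaN,  4, NaN, 6), nrow = 3,
              dimnames = list(paste0("probeset", 1:3),
                              c("A.CEL", "B.CEL", "C.CEL")))

res <- summarise_nan(toy)
print(res)
```

Samples flagged `all_missing` are candidates for a bad array, while a small non-zero `n_missing` points to scattered NaNs of the kind seen with the standard CDF.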