Puneet Wadhwa's BIOINFORMATICS BLOG

Thursday, March 02, 2006

Waking a sleeping giant: annotating the Incyte Gene Collection (IGC)

Hey Readers:

The following article about my project appeared on my company's blog. We have some great results from mining the Incyte Gene Collection (IGC), and I am very excited to share them with you.

Here it goes:
--------------
With over one million cDNAs for human, rat, monkey, and dog, the IGC is a monster collection that, on statistical grounds, is certain to contain some good stuff. Consider this: of the more than 400,000 human cDNAs, Incyte has categorized 11,377 as full length and 16,756 as potentially full length. However, while these latter clone sets have been fully sequenced, they were never submitted to GenBank and no annotations (if they ever existed) were passed along to Open Biosystems. Currently, the only way to mine the IGC is by BLASTing query sequences online against our Incyte clone database. The IGC is a potentially valuable resource, but with largely (and frustratingly) unknown content.

To enable better exploitation of the IGC, we have begun a high-throughput annotation program with two goals: (1) to associate human RefSeq accession numbers with IGC clones when this can be done with high confidence, and (2) to discover which human RefSeq-associated IGC sequences are not found in the Mammalian Gene Collection (MGC). I’d like to share with you some preliminary results from our pilot study.

We began by filtering the ~28,000 full-length and potential full-length human cDNAs by size and selecting the set of 367 sequences that are 3 kilobases or longer. These sequences were then BLASTed against every CDS in human RefSeq and filtered for 100% identity. Even at this high stringency there were 118 “hits”, that is, IGC sequences that contain a complete human RefSeq CDS. The 118 coding sequences were then BLASTed against the MGC, yielding 47 hits at 100% identity. Taking into account a few cases in which the same IGC sequence corresponded to two or more RefSeq accessions, there were 64 RefSeq-certified IGC cDNAs not found in the MGC.
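For readers who would like to try a similar screen on their own sequences, the size filter and the 100% identity filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the pilot study: it assumes BLAST's 12-column tabular output (`-outfmt 6` in BLAST+, `-m 8` in the older blastall) with IGC sequences as queries and RefSeq coding sequences as subjects, plus a separately prepared table of CDS lengths. The function names, the `cds_lengths` dictionary and the rule used to decide that a CDS is "complete" are all hypothetical.

```python
MIN_LENGTH = 3000  # "3 kilobases or longer"


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_sequences(records, min_length=MIN_LENGTH):
    """Keep the records whose sequence is at least min_length bases long."""
    return {name: seq for name, seq in records if len(seq) >= min_length}


def complete_cds_hits(blast_lines, cds_lengths):
    """Return (igc_id, refseq_id) pairs where the clone contains the whole CDS.

    Assumed columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    cds_lengths maps each RefSeq CDS identifier to its length in bases.
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        igc_id, refseq_id = fields[0], fields[1]
        percent_identity = float(fields[2])
        mismatches, gap_opens = int(fields[4]), int(fields[5])
        # Subject coordinates are reversed for minus-strand alignments.
        s_start, s_end = sorted((int(fields[8]), int(fields[9])))
        # The reported percentage is rounded, so also require zero
        # mismatches and zero gaps before calling a hit perfect.
        perfect = percent_identity == 100.0 and mismatches == 0 and gap_opens == 0
        # "Complete" here means the alignment spans the CDS end to end.
        spans_cds = s_start == 1 and s_end == cds_lengths.get(refseq_id)
        if perfect and spans_cds:
            hits.add((igc_id, refseq_id))
    return hits
```

In use, something like `long_sequences(read_fasta(open("igc.fasta")))` would give the size-selected set (the file name is a placeholder), and `complete_cds_hits` would then be run over the tabular BLAST report for those sequences.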

For 25 of the brave new 64, there was no MGC clone within the same UniGene cluster; for the 39 others, one or more MGC clones were found in the same cluster. However, spot-checking of these MGC sequences confirmed that they are either apparent splice variants of the RefSeq CDS or contain single nucleotide substitutions. So far, so good*. If you are curious about these preliminary results, a spreadsheet can be downloaded from our website by navigating to www.OpenBiosystems.com > Genomics > Mammalian Resources > cDNAs > Incyte Gene Collection and clicking on the data file icon under the ordering information for IHS1380.
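The UniGene comparison described above amounts to a simple partition of the clones. Here is a minimal sketch, assuming every IGC and MGC clone has already been assigned a UniGene cluster ID; the function name, the two input dictionaries and the example IDs are hypothetical and are not taken from our own scripts.

```python
def split_by_mgc_cluster(igc_cluster, mgc_cluster):
    """Partition IGC clones by whether an MGC clone shares their cluster.

    igc_cluster: dict mapping IGC clone ID -> UniGene cluster ID
    mgc_cluster: dict mapping MGC clone ID -> UniGene cluster ID
    Returns two sorted lists of IGC clone IDs:
    (clones with no MGC clone in their cluster, clones with at least one).
    """
    clusters_with_mgc = set(mgc_cluster.values())
    without_mgc, with_mgc = [], []
    for clone_id, cluster_id in igc_cluster.items():
        if cluster_id in clusters_with_mgc:
            with_mgc.append(clone_id)
        else:
            without_mgc.append(clone_id)
    return sorted(without_mgc), sorted(with_mgc)


# Made-up identifiers, for illustration only.
example_igc = {"igc_a": "Hs.1", "igc_b": "Hs.2", "igc_c": "Hs.3"}
example_mgc = {"mgc_x": "Hs.2", "mgc_y": "Hs.9"}
no_mgc, has_mgc = split_by_mgc_cluster(example_igc, example_mgc)
```

Clones in the second list are the ones worth spot-checking by hand, since an MGC clone in the same cluster may still differ from the RefSeq CDS.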

We have already begun BLASTing the larger set of ~28,000 against the RefSeq coding sequences. Our goal is to identify 1,000 RefSeq-certified human IGC cDNAs that are not in the MGC. Eventually we hope to dig deeper by identifying IGC cDNAs that are not perfect matches to any RefSeq CDS but represent plausible splice variants or SNPs. I am also intrigued by the possibility of identifying IGC cDNAs that fall completely outside RefSeq and so represent entirely novel genes. We shall see!
