Extracting ORF (CDS portion) from Sequence using Bioperl
I have been recently wrestling with creation of an ORF Database from Refseq Human Sequences, by extracting out the CDS portion of the sequence from the Genbank record. I needed this database for my project since we were analyzing our proprietary Incyte Gene Collection (IGC) for exact matches to Refseq sequences over the ORF (CDS) region.
Building this database was not easy, and I hope others could learn from this experience of mine...
I first downloaded the RAW Genbank files (.gbff) and then extracted them into a folder. I then searched these files using unix commands like grep etc. to find out which files contained Human species. And then, I wrote a regular expressions based C#.NET GBFF Parser, which used regular expressions to extract out the features from the record like gene name, definition, locus, CDS, species etc. I then grabbed the sequence from the Genbank record using my parser, and then stripped out the CDS part using my program, from the CDS begin to the CDS end locations on the sequence. This was NOT very easy and certainly not the best way.
I now have a surefire way of extracting ORF's from the Genbank records, and it is very elegant, uses very less amount of code, and has been written in Bioperl. Attached below, is the code to do the same. However, is you use or distribute this code, please leave its header and author information intact, and provide a link back to my website.
!/usr/bin/perl
#===========================================
# ORF DATABASE BUILDER UTILITY - Puneet Wadhwa #
# http://puneetwadhwa.blogspot.com * pwadhwa@gmail.com #
#-------------------------------------------
# The purpose of this utility is to fetch the Genbank #
# record live from the Genbank website, and strip out #
# and display the CDS sequence from the record. #
# Takes a list of NCBI Accessions as input and outputs a #
# fasta record. #
# USAGE: cat input_file ./GenerateOrfDatabase.pl #
# Input file contains NCBI accessions (one per line) for #
# which the CDS is to be extracted. #
#===============================================
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqIO::swiss;
use Bio::Seq;
use Bio::SeqIO::FTHelper;
use Bio::SeqFeature::Generic;
use Bio::Species;
use Bio::AnnotationI;
use Bio::FeatureHolderI;
use Bio::SeqFeatureI;
my $gb = new Bio::DB::GenBank;
my $seqout = new Bio::SeqIO(-fh => \*STDOUT, -format => 'fasta');
%acc_hash = 0;
while (<>)
{
chomp;
$acc = $_;
my $seq_obj = $gb->get_Seq_by_acc($acc);
foreach my $feat ( $seq_obj->top_SeqFeatures )
{
if ( $feat->primary_tag eq 'CDS' )
{
my $cds_obj = $feat->spliced_seq;
print ">".$seq_obj->display_id()."\n".$cds_obj->seq."\n";
}
}
}
0;