Tuesday, November 4, 2008

Perl chop-sequence.pl

Perl code that takes a sequence and a read length (N) and outputs reads of size N that overlaps by two nucleotides.

Sample run
[ppt@bioinf tut09]$ ./chop-sequence.pl aacgtacgcttt 5 ; cat ./chopped-sequence.txt
aacgt
gtacg
cgctt
ttt

Code
1 #!/usr/bin/perl
2 #
3 # 1. Takes as input two parameters: (i) a DNA sequence and (ii) the size N of the chopped fragments.
4 # 2. Generates fragments of size N that are such that they overlap in two nucleotides with another fragment at its right side. Each fragment is writte n in a separate line in a file called chopped-sequence.txt.
5 # file: chop-sequences.pl
6
7 my $arg_count = $#ARGV + 1;
8 unless ($arg_count == 2) { die "Need two arguments $!\n"; }
9
10 my $out_file = "chopped-sequence.txt";
11 my $seq = shift;
12 my $chop_len = shift;
13 my $seq_len = length $seq;
14 my $overlap = 2;
15
16 open OUT, ">$out_file" or die "Can't create output file $out_file $!\n";
17
18 if ($seq_len <= $chop_len) {
19 #print "$seq\n";
20 print OUT "$seq\n";
21 close OUT;
22 exit;
23 }
24
25 for (my $i = 0; $i < $seq_len; $i += $chop_len - $overlap) {
26 $contig = substr $seq, $i, $chop_len;
27
28 # for this case, don't output tc on second line
29 # acctcacctcc 5
30 if ((length $contig) == $overlap) {
31 last;
32 }
33
34 #print "$contig $seq $i $chop_len $seq_len \n";
35 print OUT "$contig\n";
36
37 # handle boundary conditions, two case
38 # acctcc 5
39 if ( (length $contig) < $chop_len ) {
40 # reached the end
41 last;
42 }
43 }
44 close OUT;
45

Trivia:
Q. How do you copy-n-paste this code?
A. One way is to use a 1-liner awk script to filter out the first two columns.

No comments: