I need some help with a combination of awk
& while
loop.
I have two simple files with columns (normal ones are very large), one representing simple intervals for an ID=10(of coding regions(exons),for chromosome 10 here):
#exons.bed
10 60005 60100
10 61007 61130
10 61200 61300
10 61500 61650
10 61680 61850
and the other representing sequenced reads(=just intervals again but smaller) with an other value as last column, that I ll need later:
#reads.bed
10 60005 60010 34
10 61010 61020 40
10 61030 61040 22
10 61065 61070 35
10 61100 61105 41
So, I would like to search in a quick and efficient way and find which read intervals (of which line in the file) and how many, fall in one coding region:
exon 1(first interval of table 1) contains reads of line 1,2,3, etc.
of reads.file(2nd table)
so that I can get the value of 4th column of these lines later, for each exon.
I 've written a code,that probably needs some corrections on the while loop, since I cannot make it parse the reads lines one by one for each awk. Here it is:
while read chr a b cov; do #for the 4-column file
#if <a..b> interval of read falls inside exon interval:
awk '($2<=$a && $b <= $3) {print NR}' exons.bed >> out_lines.bed
done < reads.bed
At the moment I can make the awk line running when I give manually a,b, but I want to make it run automatically for each a,b pair by file.
Any suggestion on changing syntax, or way of doing it, is highly appreciated!
FOLLOW UP
Finally I worked it out with this code:
awk 'NR==FNR{
a[NR]=$2;
b[NR]=$3;
next; }
{ #second file
s[i]=0; m[i]=0; k[i]=0; # Add sum and mean calculation
for (i in a){
if($2>=a[i] && $3<=b[i]){ # 2,3: cols of second file here
k[i]+=1
print k #Count nb of reads found in
out[i]=out[i]" "FNR # keep Nb of Line of read
rc[i]=rc[i]" "FNR"|"$4 #keep Line and cov value of $4th col
s[i]= s[i]+$4 #sum over coverages for each exon
m[i]= s[i]/k[i] #Calculate mean (k will be the No or
#reads found on i-th exon)
}}
}
END{
for (i in out){
print "Exon", i,": Reads with their COV:",rc[i],\
"Sum=",s[i],"Mean=",m[i] >> "MeanCalc.txt"
}}' exons.bed reads.bed
OUTPUT:
Exon 2 : Reads with their COV: 2|40 3|22 4|35 5|41 Sum= 138 Mean= 34.5
etc.
uniq
in it — isn't a valid command; it seems to be missing a program name at the beginning, What is that supposed to be? – Scott Mar 20 at 12:06