
Hi, I would like to extract strings from an input file like the one below:

>a11
UCUUUGGUUAUCUAGCUGUAUGA
>a11
UCUUUGGUUAUCUAGCUGUAUGA
>b22
UGGUCGACCAGUUGGAAAGUAAU
>b22
ACUUCACCUGGUCCACUAGCCGU
>b22
AGGUUGUCUGUGAUGAGUUCG
>t33
UUAAUGCUAAUCGUGAUAGGGGU
>t33
CAGUAACAAAGAUUCAUCCUUGU

A line that starts with ">" is a header, and the line below it is a sequence.

I would like to extract only the sequences whose header starts with ">b22".

This is my code, which does not give the proper answer.

def extractData():
    filename = ("data.txt")
    infile = open(filename,'r')

    for x in infile.readlines():
        x = x.strip()
        if x.startswith(">"):
            header = x
        else:
            sequence = x
        if header.startswith(">b22"):
            print(header, sequence)
    infile.close()

extractData()

It gives a result like this:

>b22 UCUUUGGUUAUCUAGCUGUAUGA
>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 AGGUUGUCUGUGAUGAGUUCG

But my expected result is this:

>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 AGGUUGUCUGUGAUGAGUUCG

Can somebody fix this, please? What is wrong, and what should I change to get the correct result?

Thank you in advance.


2 Answers


Minor changes in your code:

def extractData():
    filename = "data.txt"
    infile = open(filename, 'r')
    header = ''  # no header seen yet

    for x in infile.readlines():
        x = x.strip()
        if x.startswith(">"):
            header = x
        else:
            sequence = x
            # Test the header only once its sequence line has been read.
            if header.startswith(">b22"):
                print(header, sequence)
                header = ''

    infile.close()

extractData()

By the way, you can use a debugger to find out what is wrong with the flow of a program. If you are new to Python, I would recommend Eclipse with the PyDev plugin for interactive debugging. Link: Tutorial on PyDev in Eclipse

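Having said that, the issue appears because if header.startswith(">b22") is evaluated for every line read from the file, header lines included. On a ">b22" header line the test already passes, but sequence still holds the previous record's value, so that stale sequence is printed first. When you move the test inside the else block, it is only evaluated after a sequence line has been read, and never for header lines.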
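To make the difference concrete, here is a small sketch that runs both control flows over an in-memory list instead of data.txt and collects the lines rather than printing them. The function names and the shortened SAMPLE list are only for illustration and are not part of the original code.

```python
def check_every_line(lines):
    """Question's flow: the header test runs after every line."""
    out = []
    header = sequence = ''
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            header = line
        else:
            sequence = line
        # Runs on header lines too, so a stale sequence can be emitted.
        if header.startswith('>b22'):
            out.append(header + ' ' + sequence)
    return out


def check_after_sequence(lines):
    """Fixed flow: the header test runs only on sequence lines."""
    out = []
    header = ''
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            header = line
        elif header.startswith('>b22'):
            out.append(header + ' ' + line)
    return out


# Shortened, hypothetical sample in the same layout as the question's file.
SAMPLE = [
    '>a11', 'UCUUUGGUUAUCUAGCUGUAUGA',
    '>b22', 'UGGUCGACCAGUUGGAAAGUAAU',
    '>b22', 'ACUUCACCUGGUCCACUAGCCGU',
    '>t33', 'UUAAUGCUAAUCGUGAUAGGGGU',
]
```

With this sample, check_every_line(SAMPLE) returns four entries (the first one pairs ">b22" with the ">a11" sequence), while check_after_sequence(SAMPLE) returns just the two ">b22" records.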

Great, but it doesn't work with the line header = ''. If I erase it, then it works. Why set the header to an empty string? Thank you @chandan –  Karyo 21 hours ago

Fixed the problem. Once the header has been used with its sequence, i.e. print(header, sequence), it should be safe to set it to an empty string. –  Chandan 19 hours ago

Here is a different approach:

>>> with open('data.txt') as f:
...     for line in f:
...         if line.startswith('>b22'):
...             print('{0} {1}'.format(line.strip(), next(f).strip()))
...
>b22 UGGUCGACCAGUUGGAAAGUAAU
>b22 ACUUCACCUGGUCCACUAGCCGU
>b22 AGGUUGUCUGUGAUGAGUUCG

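Since a file object is its own iterator, when you reach a line that starts with >b22 you can call next() on it to read the following line, which is the sequence. The for loop then continues from the line after that.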
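One caveat: if a matching header happens to be the last line of the file, next(f) raises StopIteration. A possible variation, sketched here as a generator, passes a default to next() so that case is handled quietly. The function name sequences_for and its prefix parameter are hypothetical, and io.StringIO stands in for the real file.

```python
import io


def sequences_for(lines, prefix='>b22'):
    """Yield 'header sequence' strings for headers starting with prefix."""
    it = iter(lines)  # a file object or StringIO is its own iterator
    for line in it:
        if line.startswith(prefix):
            seq = next(it, None)  # None if the header is the last line
            if seq is None:
                break
            yield '{0} {1}'.format(line.strip(), seq.strip())


# In-memory stand-in for data.txt; the last header has no sequence line.
sample = io.StringIO('>a11\nUCUUUGG\n>b22\nUGGUCGA\n>b22\n')
results = list(sequences_for(sample))
```

To use it on the real file, open it with `with open('data.txt') as f:` and loop over sequences_for(f), printing each item.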
