
So I'm trying to extract some data from a text file. Currently I'm able to get the correct lines that contain the data, which in turn gives me an output looking like this:

[   0.2      0.148  100.   ]
[   0.3      0.222  100.   ]
[   0.4      0.296  100.   ]
[   0.5     0.37  100.  ]
[   0.6      0.444  100.   ]

So basically I have 5 lists with one string in each. However, as you can imagine, I would like to get all of this into a numpy array, with each string split into its 3 values. Like this:

[[0.2, 0.148, 100],
[0.3, 0.222, 100],
[0.4, 0.296, 100],
[0.5, 0.37, 100],
[0.6, 0.444, 100]]

But since the separator in the output is not fixed, i.e. I don't know whether it will be 3 spaces, 5 spaces or a tab, I'm kind of lost as to how to do this.

UPDATE:

So the data looks a bit like this:

data_file = 

Equiv. Sphere Diam. [cm]: 6.9
Conformity Index: N/A
Gradient Measure [cm]: N/A

Relative dose [%]           Dose [Gy] Ratio of Total Structure Volume [%]
                0                   0                       100
              0.1               0.074                       100
              0.2               0.148                       100
              0.3               0.222                       100
              0.4               0.296                       100
              0.5                0.37                       100
              0.6               0.444                       100
              0.7               0.518                       100
              0.8               0.592                       100

Uncertainty plan: U1 X:+3.00cm   (variation of plan: CT1)
Dose Cover.[%]: 100.0
Sampling Cover.[%]: 100.0

Relative dose [%]           Dose [Gy] Ratio of Total Structure Volume [%]
                0                   0                       100
              0.1               0.074                       100
              0.2               0.148                       100
              0.3               0.222                       100
              0.4               0.296                       100
              0.5                0.37                       100
              0.6               0.444                       100

And the code to get the lines is:

import numpy as np

with open(data_file) as input_data:
    # Skip text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Relative dose [%]           Dose [Gy] Ratio of Total Structure Volume [%]':  # or whatever test is needed
            break
    # Read text until the end of the block:
    for line in input_data:  # this keeps reading the same file object
        if line.strip() == 'Uncertainty plan: U1 X:+3.00cm   (variation of plan: CT1)':
            break
        text_line = np.fromstring(line, sep='\t')
        print(text_line)

The text before the data itself varies, so I can't just say "skip the first 5 lines", but the header line is always the same, and the block always ends the same way as well (right before the next block of data begins). So I just need a way to get the raw data out, put it into a numpy array, and then I can play with it from there.

Hopefully it makes more sense now.

Use a regex to split on \s+ – BlackBear yesterday
    
The input lacks quotes in case it should be strings? – languitar yesterday
    
It doesn't have quotes, that's for sure. What is the correct term then if it is not a string ? – Denver Dang yesterday
    
@DenverDang The string also contains the brackets [, ]? – Szabolcs yesterday
    
The first code is how I get the output when I print the lines. So by using append my idea was to put every list into a numpy array. But as stated, each list only contains "one" item, which is actually what I want to split up into 3 values for each list, and in turn end up with the array structure as seen in the second snippet of code. – Denver Dang yesterday
Accepted answer

With print(text_line) you are seeing each array formatted as a string. They are formatted individually, so the columns don't line up:

[   0.2      0.148  100.   ]
[   0.3      0.222  100.   ]
[   0.4      0.296  100.   ]
[   0.5     0.37  100.  ]
[   0.6      0.444  100.   ]

Instead of printing you could collect the values in a list, and concatenate that at the end.

Without actually testing, I think this would work:

import numpy as np

data = []
with open(data_file) as input_data:
    # Skip text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Relative dose [%]           Dose [Gy] Ratio of Total Structure Volume [%]':  # or whatever test is needed
            break
    # Read text until the end of the block:
    for line in input_data:  # this keeps reading the same file object
        if line.strip() == 'Uncertainty plan: U1 X:+3.00cm   (variation of plan: CT1)':
            break
        arr_line = np.fromstring(line, sep='\t')
        # may need a test on len(arr_line) to weed out blank lines
        data.append(arr_line)
data = np.vstack(data)
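As a quick sanity check (using made-up sample lines rather than the real file), the collect-and-stack step behaves like this:

```python
import numpy as np

# Made-up sample lines standing in for the block between the two marker
# lines; the real file is assumed to look similar.
lines = ["0.2\t0.148\t100", "0.3\t0.222\t100", "0.4\t0.296\t100"]

data = []
for line in lines:
    arr_line = np.fromstring(line, sep='\t')
    if len(arr_line):  # weed out blank lines
        data.append(arr_line)
data = np.vstack(data)
print(data.shape)  # (3, 3)
```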

Another option is to collect the lines without parsing them and pass them to np.genfromtxt. In other words, use your code as a filter that feeds the numpy function just the right lines. np.genfromtxt takes its input from anything that yields lines: a file, a list, a generator.

import numpy as np

def filter_lines(input_data):  # renamed so it doesn't shadow the built-in filter
    # Skip text before the beginning of the interesting block:
    for line in input_data:
        if line.strip() == 'Relative dose [%]           Dose [Gy] Ratio of Total Structure Volume [%]':  # or whatever test is needed
            break
    # Yield lines until the end of the block:
    for line in input_data:  # this keeps reading the same file object
        if line.strip() == 'Uncertainty plan: U1 X:+3.00cm   (variation of plan: CT1)':
            break
        yield line

with open(data_file) as f:
    data = np.genfromtxt(filter_lines(f))  # default delimiter: any whitespace
print(data)
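To confirm that np.genfromtxt really does accept any iterable of lines, here is a small in-memory example (made-up values, using the default whitespace delimiter):

```python
import numpy as np

# A plain list of strings stands in for the filtered file object.
sample = ["0.2 0.148 100",
          "0.3 0.222 100",
          "0.5 0.37  100"]
data = np.genfromtxt(sample)
print(data.shape)  # (3, 3)
```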

Given a text file called tmp.txt like this:

   0.2      0.148  100.   
   0.3      0.222  100.   
   0.4      0.296  100.   
   0.5     0.37  100.  
   0.6      0.444  100.   

The snippet:

with open('tmp.txt') as in_file:
    print([list(map(float, line.split())) for line in in_file])

Will output:

[[0.2, 0.148, 100.0], [0.3, 0.222, 100.0], [0.4, 0.296, 100.0], [0.5, 0.37, 100.0], [0.6, 0.444, 100.0]]

Which is hopefully your desired output.

    
The problem (I think) is, that I parse through an entire .txt file with a lot of content that are not just the values as seen. So I'm not quite sure if that procedure will work? (I've updated my question so it might make more sense) – Denver Dang yesterday

1) Add before with open:

import re
d_input = []

2) replace

        text_line = np.fromstring(line, sep='\t')
        print(text_line)

to

        d_input.append([float(x) for x in re.sub(r'\s+', ',', line.strip()).split(',')])

3) Add at the end:

d_array = np.array(d_input)
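For what it's worth, substituting commas for runs of whitespace and then splitting on commas gives the same numbers as line.split() with no arguments, since str.split() already treats any run of whitespace as one separator; a quick check on one made-up line:

```python
import re

# One made-up data line with uneven spacing, like the rows in the file.
line = "              0.5                0.37                       100\n"

via_regex = [float(x) for x in re.sub(r'\s+', ',', line.strip()).split(',')]
via_split = [float(x) for x in line.split()]
print(via_regex == via_split)  # True; both are [0.5, 0.37, 100.0]
```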
