Stack Overflow is a community of 4.7 million programmers, just like you, helping each other.

Join them; it only takes a minute:

Sign up
Join the Stack Overflow community to:
  1. Ask programming questions
  2. Answer and help your peers
  3. Get recognized for your expertise

I'm a 'newbie' in python (started learning 2 weeks ago) and I'm trying to plot a file that looks like this:

"1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

Using any of the following: pandas, matplotlib and numpy. But I am not having much of a success. I tried searching for examples but none applied to my data format.

Can somebody help me find out how to load this file in a pandas dataframe or (what would be even better) show me how to plot this?

Details:

  • The number of lines in each series is different for different datasets that I have, but in the same dataset(file) the number of rows are the same (as pictured in the code excerpt).
  • There is a blank line between each serie exactly as reproduced in the code excerpt.
  • The title of the serie is the whole string, but with one word (if it is easier to import using two words/colums) I can change the titles.

UPDATE 1:

After the help from @Goyo I changed my method convert() to be something like this:

#!/usr/bin/env python3
def convert(in_file, out_file):
name = ""
for line in in_file:
    line = line.strip()
    print(line)
    if line == "":
        continue
    if line.startswith('"'):
        name = line.strip('"')
        print("NAME:: " + name)
    else:
        out_file.write("{0}\n".format(','.join([name] + line.split("\t")) ) )

To plot I'm using the following code:

with open('nro_caribou.dat') as in_file:
    with open('output.txt', 'w+') as out_file:
       convert(in_file, out_file)
    df = pd.read_csv('output.txt', header=None,names=['Methods', 'Param', 'Time'], sep=",", )
    print(df)
    df.pivot(values='Time', index='Param', columns='Methods').plot()

My original data: https://gist.github.com/pedro-stanaka/c3eda0aa2191950a8d83

And my plot:

the final chart

share|improve this question
1  
In this sample are headers "1st Serie" and "1st Serie" and then 3 rows of data. Has this file constant number of rows of data? Are this series separated by empty row ? – jezrael Feb 14 at 17:40
    
Just edited the post to answer your questions. – PedroTanaka Feb 14 at 17:57
    
Names of Series are 1st Series and 2nd Series or can be a , b , it means is not created by two string - 1st and Series? – jezrael Feb 14 at 18:01
up vote 2 down vote accepted

AFAIK there's no builtin features in pandas, matplotlib or numpy to read files like that one. If you have some control on the data format I encourage you to change it.

If you have no options but using that format, you can parse the data yourself using just the python I/O and string manipulation features (I do not think pandas can make this easier, it is not designed to deal with these kind of files).

This function can convert data from your format to another more suitable for pandas:

def convert(in_file, out_file):
    for line in in_file:
        line = line.rstrip(' \n\r') 
        if not line:
            continue
        if line.startswith('"'):
            name = line.strip('"')
        else:
            out_file.write('{}\n'.format(','.join([name] + line.split())))

If your original file is 'input.txt' you would use it this way:

with open('input.txt') as in_file:
    with open('output.txt', 'w') as out_file:
        convert(in_file, out_file)
df = pd.read_csv('output.txt', header=None,
                 names=['Series', 'X', 'Y'])
print(df)

      Series  X    Y
0  1st Serie  2   23
1  1st Serie  4   12
2  1st Serie  6   12
3  2nd Serie  2   51
4  2nd Serie  4   90
5  2nd Serie  6  112

df.pivot(index='X', columns='Series', values='Y').plot()

enter image description here

share|improve this answer
    
Man, your solution was the best to me because it was more generic. I had to fix the convert method to be something like that: gist.github.com/pedro-stanaka/9f00abcc9efb820bb9d9 – PedroTanaka Feb 14 at 20:47
    
is there a way for you to see if you can reproduce the error I'm getting? The plot is not showing for me. Thanks in advance. – PedroTanaka Feb 14 at 21:12
    
@PedroTamaka My answer assumed that each series had a unique name. After the conversion there is no way to tell apart the two series called "CRE". That could be achieved by adding a counter to convert so to give each series a unique identifier. I won't be able to give it a try until tomorrow. – Goyo Feb 14 at 22:56
    
I had an error in my data (it was not supposed to have two 'CRE' series). I fixed and your solution worked. Now everything is fine. – PedroTanaka Feb 15 at 11:02

I think you can read_csv only once and then post processing create dataframe:

import pandas as pd
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112
"""

s = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
                 sep="\s+", 
                 engine='python', #because ParserWarning
                 squeeze=True,
                 header=None) #try convert output to series

print s

"1stSerie"    NaN
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
Name: 0, dtype: float64

df = s.reset_index()
#set column names
df.columns = ['idx','val']
#try convert column idx to numeric, if string get NaN
print pd.to_numeric(df['idx'], errors='coerce')
0   NaN
1     2
2     4
3     6
4   NaN
5     2
6     4
7     6
Name: idx, dtype: float64

#find NaN - which values are string
print pd.isnull(pd.to_numeric(df['idx'], errors='coerce'))
0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
Name: idx, dtype: bool

#this values get to new column names
df.loc[pd.isnull(pd.to_numeric(df['idx'], errors='coerce')), 'names'] = df['idx']

#forward fill NaN values
df['names'] = df['names'].ffill()

#remove values, where column val in NaN
df = df[pd.notnull(df['val'])]
print df
  idx  val       names
1   2   23  "1stSerie"
2   4   12  "1stSerie"
3   6   12  "1stSerie"
5   2   51  "2ndSerie"
6   4   90  "2ndSerie"
7   6  112  "2ndSerie"

df.pivot(index='idx', columns='names', values='val').plot()

graphs

Or you can use read_csv and plot. If you need set names of Series to legend, use figure and legend:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112"""
s1 = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
                 sep="\s+", 
                 engine='python', #because ParserWarning
                 nrows=3, #read only 3 rows of data
                 squeeze=True) #try convert output to series

print s1
2    23
4    12
6    12
Name: "1stSerie", dtype: int64

#after testing replace io.StringIO(temp) to filename
s2 = pd.read_csv(io.StringIO(temp), 
                 sep="\s+", 
                 header=4, #read row 4 to header - series name
                 engine='python',      
                 nrows=3,
                 squeeze=True)

print s2
2     51
4     90
6    112
Name: "2ndSerie", dtype: int64

plt.figure()
s1.plot()
ax = s2.plot()
ax.legend(['1stSerie','2ndSerie'])

graph

Or you can read file only once and then cut Serie s to Series s1, s2 and s3 and then create DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
import io

temp=u""""1stSerie"    
2 23
4 12
6 12

"2ndSerie"
2 51
4 90
6 112

"3rdSerie"
2 51
4 90
6 112
"""
s = pd.read_csv(io.StringIO(temp), #after testing replace io.StringIO(temp) to filename
                 sep="\s+", 
                 engine='python', #because ParserWarning
                 squeeze=True) #try convert output to series

print s
2              23
4              12
6              12
"2ndSerie"    NaN
2              51
4              90
6             112
"3rdSerie"    NaN
2              51
4              90
6             112
Name: "1stSerie", dtype: float64
s1 = s[:3]
print s1
2    23
4    12
6    12
Name: "1stSerie", dtype: float64

s2 = s[4:7]
s2.name='2ndSerie'
print s2
2     51
4     90
6    112
Name: 2ndSerie, dtype: float64

s3 = s[8:]
s3.name='3rdSerie'
print s3
2     51
4     90
6    112
Name: 3rdSerie, dtype: float64

print pd.DataFrame({'a': s1, 'b': s2, 'c': s3})
    a    b    c
2  23   51   51
4  12   90   90
6  12  112  112
share|improve this answer
    
I edit answer, please check it. – jezrael Feb 14 at 20:55

You can step through the file using itertools.groupby. The LastHeader class below checks each line for a sentinal character. If the character is there, the headerline is updated, and itertools.groupby starts a new segment. The only place this runs into trouble with your dataset is where you have two series labeled "CRE". My workaround was to just delete the second one from the textfile, but you'll probably want to do something else.

The upshot here is that you can just injest the data in a single pass. No writing out and reading back in required.

from itertools import groupby
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pandas import DataFrame, Series

class LastHeader():
    """Checks for new header strings. For use with groupby"""
    def __init__(self, sentinel='#'):
        self.sentinel = sentinel
        self.lastheader = ''
        self.index=0

    def check(self, line):
        self.index += 1
        if line.startswith(self.sentinel):
            self.lastheader = line
        return self.lastheader

fname = 'dist_caribou.dat'

with open(fname, 'r') as fobj:
    lastheader = LastHeader('"')
    data = []
    for headerline, readlines in groupby(fobj, lastheader.check):
        name = headerline.strip().strip('"')
        thisdat = np.loadtxt(readlines, comments='"')
        data.append(Series(thisdat[:, 1], index=thisdat[:, 0], name=name))
data = pd.concat(data, axis=1)
data.plot().set_yscale('log')
plt.show()
share|improve this answer
    
Thank you very much for your help. The code ran faster with this modification. And is very good to know the groupby function. – PedroTanaka Feb 16 at 12:39

Given the appropriate parameters for read_csv in pandas, this is relatively trivial to plot.

s1 = pd.read_csv('series1.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)
>>> s1
tSerie
2    23
4    12
6    12
Name: Unnamed: 1, dtype: int64

s2 = pd.read_csv('series2.txt', 
                 index_col=0, 
                 sep=" ", 
                 squeeze=True, 
                 header=0, 
                 skipinitialspace=True)

%matplotlib inline  # If not already enabled.
s1.plot();s2.plot()

enter image description here

share|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.