
I have an array (in numpy, or in pandas) containing (non-unique) strings. Some of them are ints written as strings, some consist of both digits and letters. What I would like to do is to map these strings onto (some) int or float values, in order to process them further.

I don't mean simple int(string,base). I mean a procedure that would, say, go through all the strings, and then say "Aha, so let's assign to this string such and such 'int/float-key'".

What's the most efficient way of doing that?

How would getting an int or float from a string containing digits and letters work? Ignore the letters? Parse them in some way? You haven't told us enough to answer this. Also, you should show us your current code and where it fails (doesn't produce the result you need, or throws an exception). – Lattyware Jun 26 at 17:03
@Lattyware At a stretch, "some of them are ints written as strings" could even cover "twelve" :) – Jon Clements Jun 26 at 17:04
Or maybe you just want to create a dictionary? – Elazar Jun 26 at 17:04
It's not clear from your question if you're asking how to convert a string to an int or how to get a unique integer for each arbitrary string. For example, let's say you have ['1', 'a5', 'cde9', '1', 'cde9']. Do you want the result to be [1, 5, 9, 1, 9] or [0, 1, 2, 0, 2]? – Joe Kington Jun 26 at 17:05
@SimonRighley - Sorry, the edits are still unclear. Can you give a concrete example? – Joe Kington Jun 26 at 17:07

put on hold as unclear what you're asking by Lattyware, Gerrat, Henry Keiter, C. Ross, Graviton 22 hours ago

Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. If this question can be reworded to fit the rules in the help center, please edit the question.

2 Answers

up vote 2 down vote accepted

It sounds like you have a pandas DataFrame with various strings that you want to convert to indexed values such that each unique string has a unique integer value.

numpy.unique does what you need. (You already mentioned that you were using numpy, so I'm going to post a numpy solution.)

For example:

import numpy as np
import pandas

df = pandas.DataFrame(dict(x=['1', 'a5', 'cde9', '1', 'cde9']))

# return_inverse=True also returns, for every row, the index of its value in unique_vals
unique_vals, df['keys'] = np.unique(df.x, return_inverse=True)

print(df)
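To make it easier to see what the two return values hold, here is a minimal sketch on a plain NumPy array rather than a DataFrame. The names codes and restored are only illustrative; np.unique sorts the unique values, so the integer keys follow sorted order rather than order of first appearance.

```python
import numpy as np

x = np.array(['1', 'a5', 'cde9', '1', 'cde9'])

# unique_vals: the sorted distinct strings
# codes: for each element of x, its index into unique_vals
unique_vals, codes = np.unique(x, return_inverse=True)

print(unique_vals.tolist())  # ['1', 'a5', 'cde9']
print(codes.tolist())        # [0, 1, 2, 0, 2]

# Indexing unique_vals with the codes recovers the original strings
restored = unique_vals[codes]
print(restored.tolist())     # ['1', 'a5', 'cde9', '1', 'cde9']
```

Keeping unique_vals around is what lets you translate the integer keys back to the original strings later.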
Thank you! That's what I wanted. – Simon Righley Jun 26 at 17:18

In case anyone viewing this has a similar need but with a normal list of strings like:

x = ['1', 'a5', 'cde9', '1', 'cde9']

You can use a dictionary comprehension to build a dictionary mapping strings to a unique id like so:

x_set = set(x)
# "mapping" and "idx" avoid shadowing the built-ins dict and id
mapping = {z: idx for z, idx in zip(x_set, range(len(x_set)))}

set(x) gets you the unique values in x and range(len(x_set)) provides unique ids from 0 through len(x_set)-1. Use any sequence of ids you want. Note that a set has no defined order, so which id each string receives is arbitrary.

Example:

>>> x = ['1', 'a5', 'cde9', '1', 'cde9']
>>> x_set = set(x)
>>> x_set
set(['1', 'cde9', 'a5'])
>>> mapping = {z: idx for z, idx in zip(x_set, range(len(x_set)))}
>>> mapping
{'1': 0, 'cde9': 1, 'a5': 2}
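Iterating over a set gives no guaranteed order, so the ids produced this way can differ between runs or Python versions. If you want ids assigned in order of first appearance, and the encoded list at the same time, something along these lines may work. This is only a sketch; encode_in_order is a hypothetical helper name, not part of any library.

```python
def encode_in_order(strings):
    """Map each distinct string to an id, numbered by first appearance.

    Returns the mapping and the list of ids for the input sequence.
    """
    mapping = {}
    codes = []
    for s in strings:
        # setdefault stores len(mapping) as the id only if s is new,
        # and returns the stored id either way
        codes.append(mapping.setdefault(s, len(mapping)))
    return mapping, codes


x = ['1', 'a5', 'cde9', '1', 'cde9']
mapping, codes = encode_in_order(x)
print(mapping)  # {'1': 0, 'a5': 1, 'cde9': 2}
print(codes)    # [0, 1, 2, 0, 2]
```

The printed dictionary order shown assumes Python 3.7 or later, where dictionaries preserve insertion order; the ids themselves do not depend on that.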
