# variable - python convert numeric to categorical

## numpy convert categorical string arrays to an integer array

... years later....

For completeness (because this isn't mentioned in the answers) and personal reasons (I always have `pandas` imported in my modules but not necessarily `sklearn`), this is also quite straightforward with `pandas.get_dummies()`:

``````
import numpy as np
import pandas

In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

In [2]: b = pandas.get_dummies(a)

In [3]: b
Out[3]:
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  0  1  0
5  0  0  1

In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])
``````
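Written as a script rather than an IPython session, the same idea might look like the sketch below. The names `dummies`, `codes`, and `labels` are only illustrative, and it assumes every row of the indicator table has exactly one hit, which is the case for a 1-D input without missing values.

```python
import numpy as np
import pandas as pd

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])

# One indicator column per distinct value.
dummies = pd.get_dummies(a)

# The position of the single hit in each row is the integer code.
# This works whether the indicators are stored as 0/1 or as booleans.
codes = dummies.to_numpy().argmax(axis=1)

# The column labels map the codes back to the original strings.
labels = dummies.columns[codes].to_numpy()

print(codes)
print(labels)
```

Keeping `dummies.columns` around is what makes the encoding reversible; the codes alone do not say which string each integer stands for.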

I'm trying to convert a string array of categorical variables to an integer array of categorical variables.

For example:

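``````
import numpy as np
a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
print(a.dtype)
# |S1  (under Python 2; Python 3 reports <U1)

b = np.unique(a)
print(b)
# ['a' 'b' 'c']

# desired_function does not exist; it stands for the function being asked for
c = a.desired_function(b)
print(c, c.dtype)
# [1,2,3,1,2,3] int32
``````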

I realize this can be done with a loop but I imagine there is an easier way. Thanks.

...some more years pass...

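Thought I would provide a pure Python solution for completeness:

``````
def count_unique(a):
    # The mutable defaults act as per-call state: `counter` is redefined on
    # every call to count_unique, so `c` and `items` start fresh each time.
    def counter(item, c=[0], items={}):
        if item not in items:
            items[item] = c[0]
            c[0] += 1
        return items[item]
    return list(map(counter, a))

a = [0, 2, 6, 0, 2]
print(count_unique(a))
# [0, 1, 2, 0, 1]
``````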
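The same first-appearance encoding can probably be written a little more directly with a plain dictionary. This is only a sketch, and `encode_in_order` is a made-up name. Note that, unlike `np.unique`, codes are handed out in order of first appearance rather than in sorted order.

```python
def encode_in_order(values):
    """Map each distinct value to an integer in order of first appearance."""
    codes = {}
    # setdefault stores len(codes) only the first time a value is seen,
    # so new values receive 0, 1, 2, ... in order of appearance.
    return [codes.setdefault(v, len(codes)) for v in values]

print(encode_in_order([0, 2, 6, 0, 2]))       # [0, 1, 2, 0, 1]
print(encode_in_order(['b', 'a', 'b', 'c']))  # [0, 1, 0, 2]
```

The values only need to be hashable, so this also works for mixed or non-sortable inputs where a sort-based approach would fail.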

Another option is to use a categorical pandas Series:

``````
>>> import pandas as pd
>>> pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category").cat.codes.values
array([0, 1, 2, 0, 1, 2], dtype=int8)
``````
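If the original labels are needed again later, the categorical Series also carries the lookup table. A minimal sketch, with illustrative variable names, assuming there are no missing values (pandas codes those as -1, which would index the wrong label here):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a', 'b', 'c'], dtype="category")

codes = s.cat.codes.to_numpy()   # integer code per element
categories = s.cat.categories    # the distinct labels, one per code

# Indexing the categories with the codes rebuilds the original labels.
restored = categories[codes]

print(codes)
print(list(restored))
```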

One way is to use the `categorical` function from scikits.statsmodels (the package has since been renamed `statsmodels`). For example:

``````
In [60]: from scikits.statsmodels.tools import categorical

In [61]: a = np.array( ['a', 'b', 'c', 'a', 'b', 'c'])

In [62]: b = categorical(a, drop=True)

In [63]: b.argmax(1)
Out[63]: array([0, 1, 2, 0, 1, 2])
``````

The return value from `categorical` (`b`) is actually a design matrix, hence the call to `argmax` above to get it close to your desired format.

``````
In [64]: b
Out[64]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
``````
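To see why `argmax` recovers the codes from such a matrix, here is a NumPy-only sketch that builds an equivalent indicator matrix by hand. It does not use statsmodels at all, and the names `design` and `recovered` are just for illustration.

```python
import numpy as np

codes = np.array([0, 1, 2, 0, 1, 2])

# Build the indicator matrix: row i has a single 1 in column codes[i].
design = np.eye(3)[codes]

# argmax along axis 1 finds that column again, recovering the codes.
recovered = design.argmax(axis=1)

print(design)
print(recovered)
```

Because each row contains exactly one 1, the position of the maximum is unambiguous.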

`np.unique` has some optional return values.

`return_inverse` gives the integer encoding, which I use very often:

``````
>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'],
      dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c+1
array([1, 2, 3, 1, 2, 3])
``````

The inverse index can also be used to recreate the original array from the unique values:

``````
>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'],
      dtype='|S1')
>>> (b[c] == a).all()
True
``````
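Wrapped up as a helper, the `return_inverse` approach could look something like this. `encode` and its `start` parameter are hypothetical names, added here only to get the 1-based codes the question asks for.

```python
import numpy as np

def encode(values, start=1):
    """Return (uniques, codes) with codes counted from `start`."""
    uniques, inverse = np.unique(values, return_inverse=True)
    return uniques, inverse + start

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
uniques, c = encode(a)

print(c)               # [1 2 3 1 2 3]
print(uniques[c - 1])  # ['a' 'b' 'c' 'a' 'b' 'c']
```

Returning `uniques` alongside the codes keeps the mapping reversible; subtract `start` before indexing back into it.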