python find - Group by pandas dataframe and select most common string factor


Strictly speaking, @eumiro's solution is the correct one. The problem with @HYRY's solution is that it gives the wrong result for a sequence of distinct numbers such as [1, 2, 3, 4]; that is, it does not return the mode. Example:

``````import pandas as pd
df = pd.DataFrame({'client' : ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E','E','E','A'], 'total' : [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 'bla':[10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]})
``````

If you compute the result the way @HYRY does:

``````df.groupby(['client']).agg(lambda x: x.value_counts().index[0])
``````

the result is clearly wrong: the value for client A should be 1, not 4, because this approach does not handle groups in which every value is unique.

Thus, the other solution is correct:

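``````import scipy.stats
# Written for older SciPy, where mode() returns arrays; newer releases may
# return scalars by default, in which case a single [0] is enough.
df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0])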
``````

When several values are tied, `scipy.stats.mode` returns the smallest one, so client A gets 1.
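For comparison, here is a minimal sketch that swaps in pandas' own `Series.mode()` instead of `scipy.stats.mode`. This is not the approach from either answer; it assumes a pandas version that provides `Series.mode()`, which returns all modes in sorted order, so `.iloc[0]` picks the smallest value when a group has ties. The name `group_modes` is only illustrative.

```python
import pandas as pd

# Same data as the example above.
df = pd.DataFrame({
    'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
    'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
    'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40],
})

# Series.mode() returns every mode, sorted, so .iloc[0] is the smallest
# of the tied values (client A has four distinct totals, all tied).
group_modes = df.groupby('client').agg(lambda s: s.mode().iloc[0])
print(group_modes)
```

With ties resolved this way, client A gets the smallest of its four distinct values, which matches the behaviour the answer expects from `scipy.stats.mode`.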

I have a data frame with three string columns. I know that only one value in the third column is valid for each combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

``````import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name' : ['NY','New','Spb','NY']})

# This is the line that fails (see the error described below).
print(source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0]))
``````

The last line of code doesn't work: it reports "Key error 'Short name'", and if I try to group only by City, I get an AssertionError. What can I do to fix it?

You can use `value_counts()` to get a count series, and get the first row:

``````import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
``````
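A likely reason the code in the question fails is that `agg` hands the function one column at a time as a `Series`, so `x['Short name']` is treated as a label lookup rather than a column selection. The sketch below selects the column before aggregating; the helper name `most_common` and the variable `cleaned` are hypothetical, chosen only for illustration.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

def most_common(series):
    # value_counts() sorts by frequency, highest first,
    # so the first index label is the most frequent value.
    return series.value_counts().index[0]

cleaned = (
    source.groupby(['Country', 'City'])['Short name']
    .agg(most_common)
    .reset_index()
)
print(cleaned)
```

Calling `reset_index()` turns the group keys back into ordinary columns, which is convenient if the cleaned values need to be merged back into the original frame.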

A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

There's also an additional solution that supports multiple modes.

On a scale test that's representative of the data I'm working with, this reduced runtime from 37.4s to 0.5s!

Here's the code for the solution, some example usage, and the scale test:

``````import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=['key', 'value'],
                          data=[[1, 'A'],
                                [1, 'B'],
                                [1, 'B'],
                                [1, np.nan],
                                [2, np.nan],
                                [3, 'C'],
                                [3, 'C'],
                                [3, 'D'],
                                [3, 'D']])

def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

# Time the single-mode version.
start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

# Time the multiple-modes version.
start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

# Time the value_counts() lambda approach for comparison.
start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)
``````

Running this code will print something like:

``````   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537
``````

Hope this helps!

The problem here is performance: with a lot of rows, the lambda-based approach becomes slow.

If that is your case, try this instead:

``````import pandas as pd

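source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short_name' : ['NY','New','Spb','NY']})

# Slow on large frames: the lambda runs once per group.
source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

# Faster: count once, then keep the first (most frequent) row of each group.
# head(1) is used so that the Short_name label stays in the index.
source.groupby(['Country','City']).Short_name.value_counts().groupby(level=['Country','City']).head(1)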
``````

A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts highest to lowest, and then de-duplicating on a subset to only retain the largest cases.

``````import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name' : ['NY','New','Spb','NY']})

# Count each (Country, City, Short name) combination.
grouped_df = (source.groupby(['Country','City','Short name'])[['Short name']]
              .count()
              .rename(columns={'Short name':'count'})
              .reset_index())
# Sort so the most frequent combination comes first, then keep one row per group.
grouped_df = grouped_df.sort_values('count',ascending=False)
grouped_df = grouped_df.drop_duplicates(subset=['Country','City']).drop('count', axis=1)
grouped_df
``````

Super simple column assignment

A pandas dataframe is implemented as an ordered dict of columns.

This means that the `__getitem__` `[]` can not only be used to get a certain column, but `__setitem__` `[] =` can be used to assign a new column.

For example, a column can be added to this dataframe simply by using the `[]` accessor:

``````    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes
``````

Note that this works even if the index of the dataframe is not in the default order, because a plain list is assigned by position.

``````df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes
``````

[]= is the way to go, but watch out!

However, if you have a `pd.Series` and try to assign it to a dataframe whose index does not match, you will run into trouble. See this example:

``````df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no
``````

This is because a `pd.Series` by default has an index enumerated from 0 to n - 1, and the pandas `[] =` method tries to be "smart".

What actually is going on.

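When you use the `[] =` method, as in `df['column'] = series`, pandas quietly aligns the right-hand series to the index of the left-hand dataframe. This behaves like a left join on the index rather than an outer join: the dataframe keeps its own rows, series values whose labels are not in the dataframe's index are dropped, and dataframe rows without a matching label get `NaN`.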
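The alignment can be checked with a small sketch. The frame, the `note` column, and the `partial` series below are made up for illustration; the series deliberately carries one label (99) that the frame lacks and omits two labels (1 and 2) that the frame has.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['rose', 'violet', 'tulip', 'harebell']},
    index=[3, 2, 1, 0],
)

# Labels 0 and 3 exist in df; label 99 does not.
partial = pd.Series(['zero', 'three', 'ninety-nine'], index=[0, 3, 99])

# Values are matched to rows by index label, not by position.
df['note'] = partial

print(df)
```

The frame keeps its four rows: the rows labelled 3 and 0 receive the matching values, the rows labelled 2 and 1 are left missing, and the value for label 99 does not appear anywhere.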

Side note

This quickly causes cognitive dissonance, since the `[]=` method does different things depending on the input, and the outcome cannot be predicted unless you know how pandas works. I would therefore advise against `[]=` in code bases, but when exploring data in a notebook, it is fine.

Going around the problem

If you have a `pd.Series` and want it assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.

You could convert the `pd.Series` to a `np.ndarray` or a `list`; this will do the trick.

``````df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values
``````

or

``````df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
``````

But this is not very explicit.

Some coder may come along and say "Hey, this looks redundant, I'll just optimize this away".

Explicit way

Setting the index of the `pd.Series` to be the index of the `df` is explicit.

``````df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)
``````

Or more realistically, you probably have a `pd.Series` already available.

``````protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes
``````

It can now be assigned:

``````df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes
``````

Alternative way with `df.reset_index()`

Since the index mismatch is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.

``````# reset_index() returns a new object, so the result must be assigned back.
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes
``````

Note on `df.assign`

While `df.assign` makes it more explicit what you are doing, it has all the same problems as `[]=` above:

``````df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no
``````

Just watch out with `df.assign` that your column is not called `self`; it will cause errors. This makes `df.assign` smelly, since the function has this kind of artifact.

``````df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'
``````
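As a rough check of the point above, the sketch below tries both ways of adding a column literally named `self`. It assumes a pandas version in which `assign` takes the new columns as keyword arguments; the two-row frame and the `assign_failed` flag are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['rose', 'violet']})

try:
    df.assign(self=['a', 'b'])
    assign_failed = False
except TypeError:
    # The keyword 'self' collides with the method's own first parameter.
    assign_failed = True

# Item assignment takes the column name as a plain string key,
# so there is no clash with any parameter name.
df['self'] = ['a', 'b']

print(assign_failed, list(df.columns))
```

Because `[]=` receives the column name as data rather than as a keyword, it is not affected by whatever parameter names the method signature happens to use.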

You may say, "Well, I'll just not use `self` then". But who knows how this function will change in the future to support new arguments? Your column name may become an argument in a later pandas release, causing problems when upgrading.