Group by pandas DataFrame and select most common string factor





3 Answers

For agg, the lambda function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changes:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
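
To see what the double indexing does, here is a minimal sketch (assuming an older SciPy release, where stats.mode still accepts non-numeric data and returns arrays rather than scalars):

from scipy import stats

# stats.mode returns a pair: an array of modal values and an array of their counts
result = stats.mode(['NY', 'New', 'NY'])
print(result[0])     # array of modal values, e.g. array(['NY'], ...)
print(result[0][0])  # the mode itself: 'NY'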

I have a data frame with three string columns. I know that exactly one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work; it says "KeyError: 'Short name'", and if I try to group only by City, I get an AssertionError. What can I do to fix it?




Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that when you have a sequence of numbers like [1,2,3,4] the result is wrong, i.e., you don't get the mode. Example:

import pandas as pd
df = pd.DataFrame({'client' : ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
                   'total' : [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
                   'bla' : [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]})

If you compute it the way @HYRY does:

df.groupby(['client']).agg(lambda x: x.value_counts().index[0])

you obtain:

which is clearly wrong (see the value for A, which should be 1 and not 4), because it can't handle unique values.
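
A small sketch of where the two approaches diverge; the a_totals Series below just reproduces client A's total values from the frame above:

import pandas as pd
from scipy import stats

# client A's totals all occur exactly once, so there is no single most common value
a_totals = pd.Series([1, 3, 2, 4])

print(a_totals.value_counts().index[0])  # order among equal counts is not guaranteed; it need not be 1
print(stats.mode(a_totals)[0][0])        # scipy breaks the tie by returning the smallest value: 1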

Thus, the other solution is correct:

import scipy.stats
df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0])

getting:




The problem here is performance: if you have a lot of rows, it will be slow.

If that is your case, try this:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby(['Country','City']).first()
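
For reference, groupby(...).value_counts() sorts the counts in descending order within each group, so the most common Short_name is the first index entry of every (Country, City) block. A minimal sketch (the counts and most_common names are just illustrative) of pulling that label out of the index:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short_name' : ['NY','New','Spb','NY']})

# counts per (Country, City, Short_name), sorted descending within each group
counts = source.groupby(['Country', 'City']).Short_name.value_counts()

# keep only the top row per group; the winning Short_name sits in the index
most_common = (counts.groupby(['Country', 'City']).head(1)
                     .index.to_frame(index=False)
                     .set_index(['Country', 'City'])['Short_name'])
print(most_common)

On large frames this stays vectorized instead of calling a Python lambda once per group, which is where the agg-based versions spend most of their time.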


