1

I want to pick two biggest values on each row, sort them and get column names as values. Other values are dropped from dataframe.

import pandas as pd
d = {'col1': [1, 2, np.nan], 'col2': [2,3,3], 'col3': [3,6,5], 'col4': [4,9,10], 'col5': [5,1, np.nan], 'col6': [7,np.nan,2], 'col7': [np.nan, 5,6]}
df = pd.DataFrame(data=d)

I'm now able to get two biggest values of each row but reshaping dataframe based on column values is another task. Code below leaves rest of the values as Nan.

lasttwo = df.stack().sort_values(ascending=True).groupby(level=0).tail(2).unstack()

There is this example from another thread in here Find the column names which have top 3 largest values for each row Example code is down below and it works nearly fine except it doesn't work with Nan-values.

last = pd.DataFrame(df.apply(lambda x:list(df.columns[np.array(x).argsort()[::-1][:2]]), axis=1).to_list(),  columns=['Last', 'Second last'])

How to deal with these nan's?

Example:

--- col1 col2 col3 col4 col5 col6 col7
a 1 2 3 4 5 7 Nan
b 2 3 6 9 1 Nan 5
c Nan 3 5 10 Nan 2 6

Intended result:

--- Last Second last
a col6 col5
b col4 col3
c col4 col7
Eraseri
  • 15
  • 3

1 Answers1

2

You can use another alternative solutions:

lasttwo = df.apply(lambda x: pd.Series(x.nlargest(2).index[:2]), axis=1)
lasttwo.columns = ['Last',  'Second last']
print (lasttwo)
   Last Second last
0  col6        col5
1  col4        col3
2  col4        col7

Or:

lasttwo = (df.stack()
             .sort_values(ascending=False)
             .reset_index(level=1)
             .groupby(level=0)['level_1']
             .apply(lambda x: pd.Series(list(x)[:2]))
             .unstack())
lasttwo.columns = ['Last',  'Second last']
print (lasttwo)
   Last Second last
0  col6        col5
1  col4        col3
2  col4        col7

If performnace is important is possible use masked numpy arrays:

a = df.to_numpy()
mask = np.isnan(a)
ma = np.ma.masked_array(a, mask=mask)
print (ma)
[[1.0 2.0 3.0 4.0 5.0 7.0 --]
 [2.0 3.0 6.0 9.0 1.0 -- 5.0]
 [-- 3.0 5.0 10.0 -- 2.0 6.0]]

arr = df.columns.to_numpy()[ma.argsort(endwith=False, axis=1)[:, ::-1][:, :2]]
lasttwo = pd.DataFrame(arr,  columns=['Last', 'Second last'])
print (lasttwo)
   Last Second last
0  col6        col5
1  col4        col3
2  col4        col7
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I really like the first solution. It's so compact. Thank you. Am I thinking the right way if I think of this solution as following: - x.nlargest(2).index[:2] <- Get two largest numbers from dataframe and return their index names - (lambda x: pd.Series(), axis=1) <- Loop through each row by defining axis 1 - df.apply() apply to entire dataframe. Only question I'm wondering is that how pandas is able to construct dataframe from this line of code? I assume it returns dataframe but is it the apply-method? – Eraseri Feb 02 '21 at 09:32