
I have a DataFrame with multiple columns. I am trying to normalize all the columns except for one, price.

I found code that works perfectly on a sample DataFrame I created, but when I run it on my original DataFrame, it raises ValueError: Columns must be same length as key

Here is the code I am using:

df_final_1d_normalized = df_final_1d.copy()

cols_to_norm = df_final_1d.columns[df_final_1d.columns!='price']
df_final_1d_normalized[cols_to_norm] = df_final_1d_normalized[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

The issue is with reassigning the columns to themselves in the third line of code.

Specifically, computing the normalized values on their own works: df_final_1d_normalized[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

But assigning the result back does not work: df_final_1d_normalized[cols_to_norm] = df_final_1d_normalized[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

Here is a sample DataFrame, in case you want to confirm that the code works on other DataFrames:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['A'] = [1,2,3,4, np.nan, np.nan]
df['B'] = [2,4,2,4,5,np.nan]
df['C'] = [np.nan, np.nan, 4,5,6,3]
df['D'] = [np.nan, np.nan, np.nan, 5,4,9]

df_norm = df.copy()
cols_to_norm = df.columns[df.columns!="D"]
df_norm[cols_to_norm] = df_norm[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

What could the error be?

MathMan 99

1 Answer


If I understand correctly, you don't need a lambda function. You can just write:

df_final_1d_normalized[cols_to_norm] = (df_final_1d_normalized[cols_to_norm] - df_final_1d_normalized[cols_to_norm].min())/(df_final_1d_normalized[cols_to_norm].max() - df_final_1d_normalized[cols_to_norm].min())

This will do the work.

Here is the example from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['A'] = [1,2,3,4, np.nan, np.nan]
df['B'] = [2,4,2,4,5,np.nan]
df['C'] = [np.nan, np.nan, 4,5,6,3]
df['D'] = [np.nan, np.nan, np.nan, 5,4,9]

df_norm = df.copy()
cols_to_norm = df.columns[df.columns!="D"]
df_norm[cols_to_norm] = (df_norm[cols_to_norm] - df_norm[cols_to_norm].min()) / (df_norm[cols_to_norm].max() - df_norm[cols_to_norm].min())
df_norm

The result is then:

    A           B           C           D
0   0.000000    0.000000    NaN         NaN
1   0.333333    0.666667    NaN         NaN
2   0.666667    0.000000    0.333333    NaN
3   1.000000    0.666667    0.666667    5.0
4   NaN         1.000000    1.000000    4.0
5   NaN         NaN         0.000000    9.0
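
If the same ValueError still shows up on the real DataFrame, one possible cause worth ruling out is duplicated column names. This is only an assumption, since the columns of df_final_1d are not shown in the question. With repeated labels, selecting by label returns every column carrying that label, so the selection can be wider than the key it is assigned to. A minimal diagnostic sketch, using a hypothetical frame df_dup that is not from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a duplicated column name "A" (an assumption:
# the original df_final_1d is not shown in the question).
df_dup = pd.DataFrame(
    [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]],
    columns=["A", "A", "price"],
)

cols_to_norm = df_dup.columns[df_dup.columns != "price"]

# Diagnostics: are the labels unique, and which ones repeat?
labels_unique = df_dup.columns.is_unique
repeated = df_dup.columns[df_dup.columns.duplicated()].tolist()

# Each repeated label selects every column carrying it, so the selection
# ends up wider than the key used to select it.
n_key = len(cols_to_norm)
n_selected = df_dup[cols_to_norm].shape[1]

print(labels_unique, repeated, n_key, n_selected)
```

Here the key holds 2 labels while the selection has 4 columns, which is the kind of mismatch the error message describes. If the real DataFrame turns out to have repeated labels, renaming them or keeping only the first occurrence with df.loc[:, ~df.columns.duplicated()] before normalizing is one option to try.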
coco18
  • I am still getting the same error `ValueError: Columns must be same length as key` – MathMan 99 Feb 10 '23 at 19:28
  • @MathMan99 I am using your code and it is working: `df_norm[cols_to_norm] = (df_norm[cols_to_norm] - df_norm[cols_to_norm].min()) / (df_norm[cols_to_norm].max() - df_norm[cols_to_norm].min())` – coco18 Feb 10 '23 at 19:31
  • I'll dig deeper on my end to see what the issue could be. The code works on the sample DataFrame. It's the reassigning part that's giving me the error. – MathMan 99 Feb 10 '23 at 19:32
  • @MathMan99 here is an answer for a similar problem: https://stackoverflow.com/questions/46585193/pandas-error-in-python-columns-must-be-same-length-as-key – coco18 Feb 10 '23 at 19:36