
Similar to the questions "How to add an empty column to a dataframe?" and "Adding a new column to a df each cycle of a for loop", I would like to add new labels to a column, initially set to null, on each cycle of a for loop.

I have an initial dataset of 10 rows. On every iteration of a for loop, I add more rows. I would like to label the new rows 0, to distinguish them from the original rows already in the dataset (labelled 1).

For example:

df = pd.DataFrame({'a': [1,2,3], 'b': [5,6,7]}) # Sample DataFrame

>>> df
   a  b
0  1  5
1  2  6
2  3  7

Before starting the for loop, I am creating a new column, initializing its values to 1:

   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1

After the first run, the loop adds new rows to the df. How can I assign Label=0 to those rows? Expected output:

   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
3  4  8      0
4  5  9      0

...

I tried as follows:

df['Label'] = 1
labels = df['Label']

# I need to assign label 0 to rows not included in my original df.
# Since 5, 6 and 7 are not in a, the first run is for x in (5, 6, 7).
# I need to skip this first run, otherwise I assign 0 to my first
# three rows, which I had initialised to 1.
for x in difference:

    # omitted steps

    labels = 0

df = pd.DataFrame({"a": a_list, "b": b_list, "Labels": labels})

As mentioned, difference includes all the values in b not included in a. Instead of the expected output, I am getting the following:

   a  b  Label
0  1  5      0
1  2  6      0
2  3  7      0
3  4  8      0
4  5  9      0

...

The problem is that labels = 0 is currently also assigned to my original rows, because the loop also runs for those rows, so the 1s assigned initially are overwritten.
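A likely contributing cause, sketched below under assumptions: inside the loop, labels = 0 rebinds the name to a scalar, and a scalar passed to the DataFrame constructor is broadcast to every row. The contents of a_list and b_list here are hypothetical stand-ins, since the code that builds them is omitted.

```python
import pandas as pd

# Hypothetical stand-ins for the lists built inside the loop.
a_list = [1, 2, 3, 4, 5]
b_list = [5, 6, 7, 8, 9]

labels = pd.Series([1, 1, 1])  # the labels of the original rows
labels = 0                     # rebinding: the name now refers to the scalar 0

# A scalar value in the constructor dict is broadcast to every row,
# so the original 1s are not carried over.
out = pd.DataFrame({"a": a_list, "b": b_list, "Label": labels})
print(out["Label"].tolist())
```

If this is what happens in the real code, the fix is to build a label per row rather than reuse one scalar.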

I think one approach is to look at the length of the initial dataframe (where Label=1) and assign 0 to every row beyond it: define threshold=len(df) at the beginning and, before creating the df with the new values, assign 1 to rows below the threshold and 0 otherwise. But I do not know how to work with row numbers to try this approach. I think that .loc could solve the problem, but I do not know how to write the condition (maybe rows below the initial length, defined before the for loop).

I was thinking of something like this:

  • for those rows within the initial threshold (i.e., len of my df), then assign 1;
  • otherwise 0.
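The threshold idea can be sketched as follows. This is a minimal illustration, assuming new rows are always appended at the end and the index is reset; new_rows is a stand-in for whatever the loop produces, with values taken from the expected output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
threshold = len(df)  # number of original rows, recorded before the loop

# Stand-in for the rows the loop would add.
new_rows = pd.DataFrame({"a": [4, 5], "b": [8, 9]})
df = pd.concat([df, new_rows], ignore_index=True)

# Row positions below the threshold are original (1); the rest are new (0).
df["Label"] = (np.arange(len(df)) < threshold).astype(int)
print(df)
```

Comparing positions rather than index labels keeps this working even if the index is not a plain 0..n-1 range.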

This should probably be set after defining df in my code, in order to create a column that takes into account the position of the value (row index). I tried df.iloc[0:int(len(df)), "Label"]=1, but it gives me an error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
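The error is consistent with .iloc accepting integer positions only, so the string "Label" is rejected as a column indexer. A minimal sketch of a positional assignment follows, assuming the first three rows are the original ones; threshold and label_pos are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 6, 7, 8, 9]})
threshold = 3  # assumed number of original rows

df["Label"] = 0  # default: every row is treated as new

# .iloc needs the integer position of the column, not its name.
label_pos = df.columns.get_loc("Label")
df.iloc[:threshold, label_pos] = 1
print(df)
```

A label-based equivalent would be df.loc[df.index[:threshold], "Label"] = 1, which takes the column name directly.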

V_sqrt
  • how're you adding new rows? can you post the code – Nk03 Apr 17 '21 at 18:26
  • Unfortunately I cannot post the whole code for that. But it could be also fine to think to add randomly generated values: `np.random.randint(0,10)`, if the values in b are not in a. – V_sqrt Apr 17 '21 at 18:29
  • I think another approach might be to look at the length of the initial dataframe (where Label=1) and assign 0 to rows beyond it: define threshold=len(df) at the beginning and, before creating the df with the new values, assign 1 to rows below the threshold and 0 otherwise. But I do not know how to work with row numbers to try this approach – V_sqrt Apr 17 '21 at 19:42

1 Answer

Keep a copy of the original index. After adding new rows to the dataframe, use boolean indexing to set the Label column of the new rows to 0.

import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [5,6,7]}) # Sample DataFrame

df['Label'] = 1

origin_index = df.index.tolist()

df = pd.concat([df, df], ignore_index=True) # DataFrame.append was removed in pandas 2.0

df.loc[~df.index.isin(origin_index), 'Label'] = 0
print(df)

   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
3  1  5      0
4  2  6      0
5  3  7      0
Ynjxsjmh
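As a possible variation on the answer above, the label can be set on each batch of new rows before it is concatenated, so nothing needs to be relabelled afterwards. This is only a sketch: batches is a hypothetical stand-in for whatever each loop cycle produces, with values taken from the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
df["Label"] = 1  # original rows

# Hypothetical batches standing in for the rows each loop cycle produces.
batches = [{"a": [4], "b": [8]}, {"a": [5], "b": [9]}]

pieces = [df]
for batch in batches:
    new_rows = pd.DataFrame(batch)
    new_rows["Label"] = 0  # mark rows as new at creation time
    pieces.append(new_rows)

# Concatenate once after the loop instead of growing df on every cycle.
df = pd.concat(pieces, ignore_index=True)
print(df)
```

Collecting the pieces in a list and concatenating once also avoids copying the whole frame on every iteration.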