Similar to the questions "How to add an empty column to a dataframe?" and "Adding a new column to a df each cycle of a for loop", I would like to add new labels to a column, initially set to null, on each cycle of a for loop.
I have an initial dataset of 10 rows. On every iteration of a for loop, I add more rows. I would like to label the new rows 0, to distinguish them from the original rows already in the dataset (labeled 1).
For example:
import pandas as pd
df = pd.DataFrame(data={'a': [1,2,3], 'b': [5,6,7]}) # Sample DataFrame
>>> df
   a  b
0  1  5
1  2  6
2  3  7
Before starting the for loop, I am creating a new column, initializing its values to 1:
   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
After the first run, the loop adds new rows to the df. How can I assign Label=0 to those rows? Expected output:
   a  b  Label
0  1  5      1
1  2  6      1
2  3  7      1
3  4  8      0
4  5  9      0
...
I tried as follows:
df['Label']=1
labels=df['Label']
# I need to assign label 0 to rows not included in my original df. Since 5, 6 and 7 are not in a,
# the first run is for x in (5, 6, 7). I need to skip this first run, otherwise I will assign 0
# to my first three rows, which I had initialised to 1.
for x in difference:
    # omitted steps
    labels=0
    df = pd.DataFrame({"a": a_list, "b": b_list, "Label": labels})
As mentioned, difference includes all the values in b not included in a.
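For concreteness, difference could be built as in the sketch below. This assumes it is simply the values of column b that do not appear in column a of the sample DataFrame; the real code may construct it differently.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})

# Values of column b that do not appear anywhere in column a.
difference = df.loc[~df["b"].isin(df["a"]), "b"].tolist()
print(difference)  # [5, 6, 7]
```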
Instead of the expected output, I am getting the following:
   a  b  Label
0  1  5      0
1  2  6      0
2  3  7      0
3  4  8      0
4  5  9      0
...
The problem is that labels = 0 is currently also assigned to my original rows, because the loop also runs for those rows, so the values of 1 assigned initially are overwritten.
I think one approach would be to record the length of the initial dataframe (whose rows get Label=1) and assign 0 to every row beyond it: define threshold=len(df) at the beginning and, before creating the df with the new values, assign 1 to rows whose position is below the threshold and 0 otherwise. However, I do not know how to work with row numbers to try this approach. I think that .loc could solve the problem, but I do not know how to write the condition (maybe rows below the initial length, defined before the for loop).
I was thinking of something like this:
- for those rows within the initial threshold (i.e., len of my df), then assign 1;
- otherwise 0.
This should probably be set after defining df in my code, in order to create a column that takes into account the position of the value (row index).
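The threshold idea described above could look roughly like the following sketch. It assumes the loop's new rows are appended with pd.concat; new_rows is a hypothetical stand-in for whatever the omitted steps produce, and np.where is one of several ways to turn row positions into labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
threshold = len(df)  # number of original rows, recorded before the loop

# Hypothetical stand-in for the rows that the loop adds.
new_rows = pd.DataFrame({"a": [4, 5], "b": [8, 9]})
df = pd.concat([df, new_rows], ignore_index=True)

# Rows at positions below the threshold are original (1); the rest are new (0).
df["Label"] = np.where(np.arange(len(df)) < threshold, 1, 0)
print(df)
```

Because the label is derived from the row position rather than set inside the loop, it does not matter whether the loop also visits the original rows.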
I tried df.iloc[0:int(len(df)), "Label"]=1, but it gives me an error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
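The error most likely comes from passing a column name to .iloc, which is purely position-based. A minimal sketch of a position-based assignment follows, assuming threshold holds the number of original rows (3 in the sample data); the column position is looked up with df.columns.get_loc.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 6, 7, 8, 9]})
threshold = 3  # assumed number of original rows

df["Label"] = 0  # default: every row starts out labeled as new

# .iloc needs an integer column position, so look it up from the column name.
label_pos = df.columns.get_loc("Label")
df.iloc[:threshold, label_pos] = 1

# Equivalent with .loc, which takes index labels and a column name:
# df.loc[df.index[:threshold], "Label"] = 1
print(df)
```

Note that slicing up to len(df), as in the attempt above, would cover every row; the slice has to stop at the length recorded before the new rows were added.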