
I am doing the first Kaggle challenge and I am stupefied by this behaviour.

combine is a list of two pd.DataFrame objects: one is the training set, the other the test set. I wanted to drop two columns, so I wrote a for loop that iterates over the items in combine.

for dataset in combine:
    dataset = dataset.drop(['Ticket', 'Cabin'], axis=1)
    print(dataset.columns)
for dataset in combine:
    print(dataset.columns) 

For some reason, the assignment only takes effect locally, and running a second for loop reveals that the actual data has not changed. The output is as follows.


Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

As you can see, in the second for loop the columns are back. Where is the problem? Am I misunderstanding how the for loop works in Python?


edit:

@kaya3 It is not the case with pandas.Series.map

for dataset in combine:  
    dataset['Name'] = dataset['Name'].map(name_map)  
    dataset['Name'] = dataset['Name'].fillna(0)

This code changes the original DataFrames in combine. The docs say that map returns a Series (not None). How do I tell whether a function will mutate the value?

  • You dropped the columns from your local copy of the data set. You never changed the original data sets. – Prune Feb 18 '20 at 19:05
  • According to [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html), the `drop` method returns a new dataframe, so it presumably doesn't also mutate the old one. – kaya3 Feb 18 '20 at 19:07
  • @kaya3, so does the new dataframe have the updated table without `Ticket` and `Cabin`...just curious – de_classified Feb 18 '20 at 19:08
  • Obligatory reading: https://nedbatchelder.com/text/names.html – chepner Feb 18 '20 at 19:22
  • @chepner thank you for the link. Now it makes clear sense. I was assigning the local variable to a value, rather than mutating the values. – Helvijs Sebris Feb 20 '20 at 10:41
  • @kaya3 How do you tell that it is returning a new dataframe? It just says 'DataFrame without the removed index or column labels.' – Helvijs Sebris Feb 20 '20 at 10:47
  • Because if it wasn't a new one, there would be no need to return it. See [this answer](https://stackoverflow.com/a/7301481/12299000) for example. – kaya3 Feb 20 '20 at 10:50
  • @kaya3 It is not the case with [pandas.Series.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) `name_map = {'Mr': 1, 'Mrs':2, 'Miss':3, 'Master':4, 'Rare':5} for dataset in combine: dataset['Name'] = dataset['Name'].map(name_map) dataset['Name'] = dataset['Name'].fillna(0)` This code changes the original dataFrames in combine. The docs say that it returns series (not None). – Helvijs Sebris Feb 20 '20 at 10:59
  • Well if you do `dataset['Name'] = ...` then of course it changes the dataframe. If `map` changed it in-place then you wouldn't need to write `dataset['Name'] = ...`. See this part of chepner's link: https://nedbatchelder.com/text/names.html#presto_chango – kaya3 Feb 20 '20 at 11:43
  • Also this part: *"Note that “i = x” assigns to the name i, but “i[0] = x” doesn’t, it assigns to the first element of i’s value. It’s important to keep straight what exactly is being assigned to."* In your map example, the `dataset['Name'] = ...` mutates the dataset, in the other case `dataset = ...` doesn't mutate the dataset, it just rebinds the local variable named `dataset`. – kaya3 Feb 20 '20 at 11:49
  • You can also see from the docs for `drop`, *"**inplace bool, default False** If True, do operation inplace and return None."* So the fact that `inplace` is `False` by default, and you didn't make it `True` explicitly, means the `drop` is not done in-place. `map` has no `inplace` option at all; it always returns a new Series. – kaya3 Feb 20 '20 at 11:52
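
The distinction drawn in the comments above (rebinding a name versus mutating the object it refers to) can be sketched with two tiny frames. The data below is a hypothetical stand-in for the Titanic sets, not the real files, and `original` is a helper name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test sets (assumption).
train_df = pd.DataFrame({"Name": ["Mr", "Mrs"], "Ticket": ["A1", "B2"], "Cabin": ["C85", None]})
test_df = pd.DataFrame({"Name": ["Miss", "Dr"], "Ticket": ["C3", "D4"], "Cabin": [None, "E46"]})
combine = [train_df, test_df]
name_map = {"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4, "Rare": 5}

for dataset in combine:
    # Item assignment mutates the object that `dataset` refers to,
    # which is the same object that is stored in `combine`.
    dataset["Name"] = dataset["Name"].map(name_map).fillna(0)

for dataset in combine:
    original = dataset
    # Plain assignment only rebinds the loop variable to the new
    # DataFrame returned by drop; the list still holds the old one.
    dataset = dataset.drop(["Ticket", "Cabin"], axis=1)
    assert dataset is not original

print(list(combine[0].columns))   # 'Ticket' and 'Cabin' are still present
print(combine[0]["Name"].tolist())  # the mapped values are visible in the list
```

A rough rule of thumb: `name = ...` never changes an existing object, while `name[key] = ...` or a method called with `inplace=True` does.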

1 Answer


Inside the loop, dataset is just a name that is bound to each DataFrame in combine in turn; it is not a copy. drop returns a new DataFrame, and dataset = ... rebinds the name to that new object, so the list still holds the original DataFrame. To replace the DataFrame stored in the list, assign to the list element by index:

for ii in range(len(combine)):
    combine[ii] = combine[ii].drop(['Ticket', 'Cabin'], axis=1)

Now you are assigning to the list element itself rather than rebinding the loop variable.
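
An equivalent way to write the index loop is to rebuild the list with a comprehension. This is only a sketch on hypothetical one-row frames (the column values are made up), and it assumes a pandas version that accepts the `columns=` keyword of `drop`.

```python
import pandas as pd

# Hypothetical one-row stand-ins for the two data sets (assumption).
train_df = pd.DataFrame({"Fare": [7.25], "Ticket": ["A1"], "Cabin": ["C85"]})
test_df = pd.DataFrame({"Fare": [8.05], "Ticket": ["B2"], "Cabin": [None]})
combine = [train_df, test_df]

# Build a new list whose elements are the trimmed DataFrames.
combine = [df.drop(columns=["Ticket", "Cabin"]) for df in combine]

print([list(df.columns) for df in combine])
# The separate name train_df still refers to the untrimmed frame.
print(list(train_df.columns))
```

If the frames are also used under their own names afterwards, those names need rebinding too, for example with `train_df, test_df = combine`.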

Denver
  • Or use the `inplace=True` option, without assigning the result back, because `drop` then returns `None`: `dataset.drop(['Ticket', 'Cabin'], axis=1, inplace=True)` – Croves Feb 18 '20 at 19:15
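
A small sketch of the `inplace=True` route, again on hypothetical one-row frames. The name `result` exists only to show what the call returns; in real code the return value would simply be ignored.

```python
import pandas as pd

# Hypothetical one-row stand-ins for the two data sets (assumption).
train_df = pd.DataFrame({"Fare": [7.25], "Ticket": ["A1"], "Cabin": ["C85"]})
test_df = pd.DataFrame({"Fare": [8.05], "Ticket": ["B2"], "Cabin": [None]})
combine = [train_df, test_df]

for dataset in combine:
    # With inplace=True the existing object is modified and the call
    # returns None, so the result must not be assigned back to `dataset`.
    result = dataset.drop(["Ticket", "Cabin"], axis=1, inplace=True)

print(result)                  # None
print(list(train_df.columns))  # the original frame itself was changed
```

Because the objects are modified directly, both the list elements and the separate names `train_df` and `test_df` see the change.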