I am doing the first Kaggle challenge and I am stupefied by this behaviour.
combine holds two pd.DataFrame objects: one is the training set, the other the test set. I wanted to drop two columns from each, so I wrote a for loop that iterates over the items in combine.
for dataset in combine:
    dataset = dataset.drop(['Ticket', 'Cabin'], axis=1)
    print(dataset.columns)
for dataset in combine:
    print(dataset.columns)
For some reason, the assignment only takes effect inside the loop body: a second for loop over combine shows that the DataFrames it holds have not changed. The output is as follows.
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
As you can see, in the second for loop the columns are back. Where is the problem? Am I misunderstanding how the for loop works in Python?
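For reference, the behaviour can be reproduced without the Kaggle data. The sketch below is only an illustration: it assumes combine is a plain Python list, and the two tiny frames (train, test) are made-up stand-ins for the real data. It contrasts rebinding the loop variable with writing the result back into the list by index.

```python
import pandas as pd

# Hypothetical stand-ins for the Kaggle train/test frames.
train = pd.DataFrame({"PassengerId": [1, 2], "Ticket": ["A", "B"], "Cabin": ["C1", "C2"]})
test = pd.DataFrame({"PassengerId": [3, 4], "Ticket": ["C", "D"], "Cabin": ["C3", "C4"]})
combine = [train, test]  # assumption: combine is a list

# drop() returns a new DataFrame. Assigning it to the loop variable only
# rebinds the name `dataset`; the list still holds the original objects.
for dataset in combine:
    dataset = dataset.drop(["Ticket", "Cabin"], axis=1)

rebinding_kept_columns = all("Ticket" in df.columns for df in combine)

# Writing the result back into the list replaces the stored objects.
for i, dataset in enumerate(combine):
    combine[i] = dataset.drop(["Ticket", "Cabin"], axis=1)

index_assignment_dropped = all("Ticket" not in df.columns for df in combine)

print(rebinding_kept_columns, index_assignment_dropped)
```

Note that in this sketch the names train and test still refer to the original frames with all their columns; only the entries of combine are replaced.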
edit:
@kaya3 It is not the case with pandas.Series.map
for dataset in combine:
    dataset['Name'] = dataset['Name'].map(name_map)
    dataset['Name'] = dataset['Name'].fillna(0)
This code does change the original DataFrames in combine, even though the docs say that map returns a Series (not None). How do I tell whether a given call will mutate the original object?
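One way to look at the difference is a sketch like the following, where the frame and name_map are made up for illustration and combine is assumed to be a list. Here map itself leaves the frame untouched and returns a new Series; what changes the frame is the item assignment dataset['Name'] = ..., which writes a column into the existing object instead of rebinding the name dataset.

```python
import pandas as pd

# Hypothetical data and mapping, not taken from the Kaggle set.
df = pd.DataFrame({"Name": ["Mr", "Mrs", "Dr"]})
combine = [df]  # assumption: combine is a list
name_map = {"Mr": 1, "Mrs": 2}

for dataset in combine:
    # The loop variable refers to the very object stored in the list.
    same_object = dataset is combine[0]

    # map() returns a new Series; the frame is unchanged at this point.
    mapped = dataset["Name"].map(name_map)
    untouched_after_map = dataset["Name"].tolist() == ["Mr", "Mrs", "Dr"]

    # Item assignment stores a column in the existing DataFrame object,
    # so the change is visible through every reference to that object.
    dataset["Name"] = mapped.fillna(0)

changed = combine[0]["Name"].tolist()
print(same_object, untouched_after_map, changed)
```

So the question is less about which method is called and more about the left-hand side: a bare name (dataset = ...) is rebound, while an item or attribute target (dataset['Name'] = ...) modifies the object the name refers to.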