I was trying sklearn pipeline for the first time and using Titanic dataset. I want to first impute missing value in Embarked and then do one hot encoding. While in Sex attribute, I just want to do one hot encoding. So, I have the below steps in which two steps are for Embarked. But it is not working as expected as the Embarked column remains in addition to its one hot encoding as shown in the output(column having 'S').
If I do imputation and one hot encoding for Embarked in single step, it is working as expected.
What is the reason behind this or I am doing something wrong? Also, I didn't find any information related to this.
categorical_cols_impute = ['Embarked']
categorical_impute = Pipeline([
("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
# ("one_hot", OneHotEncoder(sparse=False))
])
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
("one_hot", OneHotEncoder(sparse=False))
])
preprocesor = ColumnTransformer([
("cat_impute", categorical_impute, categorical_cols_impute),
("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")
pipe = Pipeline([
("preprocessor", preprocesor),
# ("model", RandomForestClassifier(random_state=0))
])
