[Machine Learning] Classification on the Titanic data: Feature Engineering (missing-value interpolation) + Hyperparameter Tuning (RandomizedSearchCV)
train["Age_fill"] = train["Age"]
train["Age_fill"] = train["Age"].fillna(method="ffill")
train["Age_fill"] = train["Age"].fillna(method="bfill")
# Interpolation
train["Age_interpol"] = train["Age"].interpolate(method="linear", limit_direction="both")
train[["Age", "Age_ffill", "Age_bfill", "Age_interpol"]].head()
# Fill the test set in the same way
# (ESC + F: Jupyter's find & replace shortcut makes it easy to reuse these lines for test)
test["Age_ffill"] = test["Age"].fillna(method="ffill")
test["Age_bfill"] = test["Age"].fillna(method="bfill")
test["Age_interpol"] = test["Age"].interpolate(method="linear", limit_direction="both")
test[["Age", "Age_ffill", "Age_bfill", "Age_interpol"]]ㅠ.
label_name = "Survived"
feature_names = ["Pclass", "Sex", "Age_interpol", "Fare_fill", "Embarked"]
X_train = pd.get_dummies(train[feature_names])
X_test = pd.get_dummies(test[feature_names])
y_train = train[label_name]
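One thing worth checking after running get_dummies on train and test separately: if a category appears in only one of the two sets, the encoded columns will not match. A small defensive sketch (the reindex line is an extra safeguard, not part of the original code):

# do the encoded feature columns match between train and test?
print(X_train.columns.equals(X_test.columns))

# if not, align X_test to X_train's columns and fill any missing dummy columns with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)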
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
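Before tuning, it can help to record the untuned model's cross-validation accuracy as a baseline to compare the search result against (a sketch; cv=5 to match the search below):

from sklearn.model_selection import cross_val_score

# 5-fold CV accuracy of the untuned random forest, as a baseline
baseline_score = cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1).mean()
print(baseline_score)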
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
# candidate hyperparameter values to sample from
param_distributions = {"max_depth": np.random.randint(3, 100, 10),
                       "max_features": np.random.uniform(0, 1, 100)}
✅ NumPy provides the numpy.random module for generating random samples efficiently.
np.random.uniform(low, high, size): random samples from a continuous uniform distribution
np.random.randint(low, high, size): integer samples from a discrete uniform distribution
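For instance, a quick look at what these two calls return (a sketch; the exact numbers depend on the random seed):

import numpy as np

np.random.seed(42)                     # optional, only to make the example reproducible
print(np.random.uniform(0, 1, 3))      # 3 floats drawn uniformly from [0, 1)
print(np.random.randint(3, 100, 3))    # 3 integers drawn uniformly from [3, 100)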
clf = RandomizedSearchCV(estimator=model,
                         param_distributions=param_distributions,
                         n_iter=5,
                         n_jobs=-1,
                         verbose=1,  # prints the "Fitting 5 folds ..." progress message shown below
                         random_state=42)
clf.fit(X_train, y_train)
clf.best_estimator_
>>> RandomForestClassifier(max_depth=58, max_features=0.7754808538715331, n_jobs=-1,
                           random_state=42)
clf.best_score_
>>> 0.8081413596133326
pd.DataFrame(clf.cv_results_).sort_values("rank_test_score").head()
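cv_results_ has many columns; a sketch that keeps only the ones usually worth reading (these are the standard column names scikit-learn produces):

cv_results = pd.DataFrame(clf.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
cv_results[cols].sort_values("rank_test_score").head()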
Fitting 5 folds for each of 5 candidates, totalling 25 fits
❓ Question ❓ Why 25 fits? RandomizedSearchCV evaluates n_iter=5 candidate parameter sets with 5-fold cross-validation each (cv defaults to 5), so 5 × 5 = 25 fits.
best_model = clf.best_estimator_
# best_estimator_ was already refit on the full training data (refit=True is the default), so fitting again is redundant but harmless
y_predict = best_model.fit(X_train, y_train).predict(X_test)
import seaborn as sns
# plot which features the tuned model relies on most
sns.barplot(x=best_model.feature_importances_, y=best_model.feature_names_in_)
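The bars above come out in the model's internal column order; sorting the importances first can make the plot easier to read (a small sketch reusing best_model and pandas from above):

import matplotlib.pyplot as plt

# pair each importance with its feature name and sort in descending order
importances = pd.Series(best_model.feature_importances_,
                        index=best_model.feature_names_in_).sort_values(ascending=False)
sns.barplot(x=importances.values, y=importances.index)
plt.show()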