Practice Problem 1 Solution
STEP1. Load the datasets
In [1]:
import pandas as pd
X_train = pd.read_csv('data/연습문제/FIFA_X_train.csv', encoding = 'cp949')
X_test = pd.read_csv('data/연습문제/FIFA_X_test.csv', encoding = 'cp949')
y_train = pd.read_csv('data/연습문제/FIFA_y_train.csv', encoding = 'cp949')
STEP2. Inspect the datasets
STEP2-1. Preview the first rows
In [2]:
print(X_train.head())
print(X_test.head())
print(y_train.head())
ID Age Nationality Overall Club Preferred_Foot
0 190972 2* Argentina 81 SL Benfica Right \
1 179646 29 Denmark 66 Aarhus GF Left
2 225440 23 Guinea Bissau 68 Palermo Left
3 212642 22 Sweden 61 IF Brommapojkarna Right
4 245804 18 Turkey 57 Alanyaspor Right
Work_Rate Position Position_Class Jersey_Number
0 High/ Medium RW Forward 18 \
1 High/ Medium LCB Defender 18
2 High/ Low LW Forward 11
3 Medium/ Medium LW Forward 10
4 High/ Low ST Forward 47
Contract_Valid_Until Height Height_cm Weight_lb Release_Clause Wage
0 2019 5'8 170.0 170.0 37000 19
1 2019 6'0 180.0 179.0 625 4
2 2020 5'9 172.5 163.0 2000 2
3 2020 5'10 175.0 154.0 431 1
4 2022 5'10 175.0 132.0 578 1
ID Age Nationality Overall Club Preferred_Foot
0 137351 34 Germany 79 FC Augsburg Right \
1 207707 29 Brazil 77 Olympique Lyonnais Left
2 242101 22 Korea Republic 64 Pohang Steelers Right
3 200104 25 Korea Republic 84 Tottenham Hotspur Right
4 244229 2* Italy 66 Cittadella Right
Work_Rate Position Position_Class Jersey_Number
0 Medium/ High CDM Defender 10 \
1 Medium/ Medium LB Defender 20
2 High/ Medium RW Forward 18
3 High/ High LM Midfielder 7
4 Medium/ Medium CB Defender 15
Contract_Valid_Until Height Height_cm Weight_lb Release_Clause Wage
0 2019 5'9 172.5 172 6500 28
1 2021 5'10 175.0 163 12400 54
2 2021 6'1 182.5 170 1000 2
3 2023 6'0 180.0 143 71200 125
4 2020 6'2 NaN 170 1400 1
ID Value
0 190972 18500
1 179646 500
2 225440 1200
3 212642 325
4 245804 220
STEP2-2. Check the dataset summary information
In [3]:
print(X_train.info())
print(X_test.info())
print(y_train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600 entries, 0 to 3599
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 3600 non-null int64
1 Age 3600 non-null object
2 Nationality 3600 non-null object
3 Overall 3600 non-null int64
4 Club 3600 non-null object
5 Preferred_Foot 3600 non-null object
6 Work_Rate 3600 non-null object
7 Position 3600 non-null object
8 Position_Class 2842 non-null object
9 Jersey_Number 3600 non-null int64
10 Contract_Valid_Until 3600 non-null int64
11 Height 3600 non-null object
12 Height_cm 3308 non-null float64
13 Weight_lb 3528 non-null float64
14 Release_Clause 3600 non-null int64
15 Wage 3600 non-null int64
dtypes: float64(2), int64(6), object(8)
memory usage: 450.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2400 non-null int64
1 Age 2400 non-null object
2 Nationality 2400 non-null object
3 Overall 2400 non-null int64
4 Club 2400 non-null object
5 Preferred_Foot 2400 non-null object
6 Work_Rate 2400 non-null object
7 Position 2400 non-null object
8 Position_Class 1891 non-null object
9 Jersey_Number 2400 non-null int64
10 Contract_Valid_Until 2400 non-null int64
11 Height 2400 non-null object
12 Height_cm 2212 non-null float64
13 Weight_lb 2400 non-null int64
14 Release_Clause 2400 non-null int64
15 Wage 2400 non-null int64
dtypes: float64(1), int64(7), object(8)
memory usage: 300.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600 entries, 0 to 3599
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 3600 non-null int64
1 Value 3600 non-null int64
dtypes: int64(2)
memory usage: 56.4 KB
None
STEP2-3. Check the descriptive statistics
In [4]:
# Descriptive statistics for the numeric columns
print(X_train.describe())
print(X_test.describe())
print(y_train.describe())
ID Overall Jersey_Number Contract_Valid_Until
count 3600.000000 3600.000000 3600.000000 3600.000000 \
mean 214137.233611 66.112500 19.623333 2020.267500
std 29536.484864 6.903121 15.665325 1.314569
min 2147.000000 46.000000 1.000000 2018.000000
25% 199536.500000 62.000000 8.000000 2019.000000
50% 221322.500000 66.000000 17.000000 2020.000000
75% 236814.000000 71.000000 26.000000 2021.000000
max 246606.000000 89.000000 99.000000 2026.000000
Height_cm Weight_lb Release_Clause Wage
count 3308.000000 3528.000000 3600.000000 3600.000000
mean 178.276149 166.061508 4414.664722 9.325000
std 6.514353 15.590009 9993.717205 19.607412
min 155.000000 115.000000 13.000000 1.000000
25% 172.500000 154.000000 536.000000 1.000000
50% 177.500000 165.000000 1200.000000 3.000000
75% 182.500000 176.000000 3600.000000 9.000000
max 202.500000 229.000000 165800.000000 250.000000
ID Overall Jersey_Number Contract_Valid_Until
count 2400.000000 2400.000000 2400.000000 2400.000000 \
mean 213967.265417 66.200000 19.720000 2020.251667
std 31874.993242 7.053302 16.201985 1.287383
min 16.000000 47.000000 1.000000 2018.000000
25% 200245.500000 62.000000 8.750000 2019.000000
50% 222183.000000 66.000000 17.000000 2020.000000
75% 237457.750000 71.000000 26.000000 2021.000000
max 246617.000000 92.000000 99.000000 2025.000000
Height_cm Weight_lb Release_Clause Wage
count 2212.000000 2400.000000 2400.000000 2400.000000
mean 178.297920 165.845417 4796.633750 10.177500
std 6.638623 15.311598 12610.117139 25.544364
min 157.500000 121.000000 15.000000 1.000000
25% 172.500000 154.000000 521.500000 1.000000
50% 177.500000 165.000000 1100.000000 3.000000
75% 182.500000 176.000000 3400.000000 9.000000
max 200.000000 220.000000 228100.000000 455.000000
ID Value
count 3600.000000 3600.000000
mean 214137.233611 2348.593056
std 29536.484864 5120.938813
min 2147.000000 10.000000
25% 199536.500000 325.000000
50% 221322.500000 675.000000
75% 236814.000000 2100.000000
max 246606.000000 78000.000000
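STEP3. Preprocess the datasets
STEP3-1. Drop unnecessary columns
In [5]:
# The ID column is a unique key for each player and carries no information for the model
# The X_test IDs are needed for the submission file, so save them separately first
ID = X_test['ID'].copy()
# Drop the ID column from every dataset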
X_train = X_train.drop(columns = 'ID')
X_test = X_test.drop(columns = 'ID')
y_train = y_train.drop(columns = 'ID')
STEP3-2. Handle missing values
In [6]:
# Count missing values per column
X_train.isna().sum()
Out[6]:
Age 0
Nationality 0
Overall 0
Club 0
Preferred_Foot 0
Work_Rate 0
Position 0
Position_Class 758
Jersey_Number 0
Contract_Valid_Until 0
Height 0
Height_cm 292
Weight_lb 72
Release_Clause 0
Wage 0
dtype: int64
In [7]:
X_test.isna().sum()
Out[7]:
Age 0
Nationality 0
Overall 0
Club 0
Preferred_Foot 0
Work_Rate 0
Position 0
Position_Class 509
Jersey_Number 0
Contract_Valid_Until 0
Height 0
Height_cm 188
Weight_lb 0
Release_Clause 0
Wage 0
dtype: int64
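To put the X_train counts above in proportion, here is a small sketch that turns them into percentages of the 3,600 training rows. The counts are copied from Out[6]; the variable names are only illustrative.

```python
# Missing-value counts for X_train, copied from Out[6] above.
n_rows = 3600
missing = {'Position_Class': 758, 'Height_cm': 292, 'Weight_lb': 72}

# Share of rows with a missing value, as a percentage rounded to 2 decimals.
ratios = {col: round(100 * n / n_rows, 2) for col, n in missing.items()}
print(ratios)  # {'Position_Class': 21.06, 'Height_cm': 8.11, 'Weight_lb': 2.0}
```

Only about 2% of the rows lack Weight_lb, which may be why this solution simply drops those rows later, while the two larger gaps are rebuilt from other columns.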
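Filling in the missing Position_Class values (preprocessing)
In [8]:
####### Position_Class column
# The class was probably lost while the Position categories were being consolidated
# Use the existing Position column to fill in the missing values
X_train['Position_Class'].value_counts() # missing values are not counted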
Out[8]:
Position_Class
Defender 1400
Midfielder 790
Forward 652
Name: count, dtype: int64
In [9]:
# Replace missing values with a placeholder label (spelled 'unknwon' in this notebook)
X_train['Position_Class'] = X_train['Position_Class'].fillna('unknwon')
X_train['Position_Class'].value_counts()
Out[9]:
Position_Class
Defender 1400
Midfielder 790
unknwon 758
Forward 652
Name: count, dtype: int64
In [10]:
# pandas.crosstab(index, columns) builds a cross-tabulation
# Positions 'CM', 'GK', 'LF', 'RDM', and 'RWB' do not belong to any Position_Class
pd.crosstab(index = X_train['Position'], columns = X_train['Position_Class'])
Out[10]:
Position_Class | Defender | Forward | Midfielder | unknwon |
---|---|---|---|---|
Position | ||||
CAM | 0 | 0 | 212 | 0 |
CB | 349 | 0 | 0 | 0 |
CDM | 178 | 0 | 0 | 0 |
CF | 0 | 12 | 0 | 0 |
CM | 0 | 0 | 0 | 295 |
GK | 0 | 0 | 0 | 391 |
LAM | 0 | 0 | 4 | 0 |
LB | 283 | 0 | 0 | 0 |
LCB | 136 | 0 | 0 | 0 |
LCM | 0 | 0 | 86 | 0 |
LDM | 48 | 0 | 0 | 0 |
LF | 0 | 0 | 0 | 3 |
LM | 0 | 0 | 208 | 0 |
LS | 0 | 42 | 0 | 0 |
LW | 0 | 71 | 0 | 0 |
LWB | 24 | 0 | 0 | 0 |
RAM | 0 | 0 | 5 | 0 |
RB | 263 | 0 | 0 | 0 |
RCB | 119 | 0 | 0 | 0 |
RCM | 0 | 0 | 65 | 0 |
RDM | 0 | 0 | 0 | 44 |
RF | 0 | 4 | 0 | 0 |
RM | 0 | 0 | 210 | 0 |
RS | 0 | 36 | 0 | 0 |
RW | 0 | 62 | 0 | 0 |
RWB | 0 | 0 | 0 | 25 |
ST | 0 | 425 | 0 | 0 |
In [11]:
# Fill in the missing categories for X_train
PC_train = X_train['Position_Class'].copy()
PC_train[X_train['Position'] == 'LF'] = 'Forward'
PC_train[X_train['Position'] == 'CM'] = 'Midfielder'
PC_train[X_train['Position'] == 'RDM'] = 'Defender'
PC_train[X_train['Position'] == 'RWB'] = 'Defender'
PC_train[X_train['Position'] == 'GK'] = 'Goalkeeper'
X_train['Position_Class'] = PC_train
# Fill in the missing categories for X_test
PC_test = X_test['Position_Class'].copy()
PC_test[X_test['Position'] == 'LF'] = 'Forward'
PC_test[X_test['Position'] == 'CM'] = 'Midfielder'
PC_test[X_test['Position'] == 'RDM'] = 'Defender'
PC_test[X_test['Position'] == 'RWB'] = 'Defender'
PC_test[X_test['Position'] == 'GK'] = 'Goalkeeper'
X_test['Position_Class'] = PC_test
In [12]:
# Check again
pd.crosstab(index = X_train['Position'], columns = X_train['Position_Class'])
Out[12]:
Position_Class | Defender | Forward | Goalkeeper | Midfielder |
---|---|---|---|---|
Position | ||||
CAM | 0 | 0 | 0 | 212 |
CB | 349 | 0 | 0 | 0 |
CDM | 178 | 0 | 0 | 0 |
CF | 0 | 12 | 0 | 0 |
CM | 0 | 0 | 0 | 295 |
GK | 0 | 0 | 391 | 0 |
LAM | 0 | 0 | 0 | 4 |
LB | 283 | 0 | 0 | 0 |
LCB | 136 | 0 | 0 | 0 |
LCM | 0 | 0 | 0 | 86 |
LDM | 48 | 0 | 0 | 0 |
LF | 0 | 3 | 0 | 0 |
LM | 0 | 0 | 0 | 208 |
LS | 0 | 42 | 0 | 0 |
LW | 0 | 71 | 0 | 0 |
LWB | 24 | 0 | 0 | 0 |
RAM | 0 | 0 | 0 | 5 |
RB | 263 | 0 | 0 | 0 |
RCB | 119 | 0 | 0 | 0 |
RCM | 0 | 0 | 0 | 65 |
RDM | 44 | 0 | 0 | 0 |
RF | 0 | 4 | 0 | 0 |
RM | 0 | 0 | 0 | 210 |
RS | 0 | 36 | 0 | 0 |
RW | 0 | 62 | 0 | 0 |
RWB | 25 | 0 | 0 | 0 |
ST | 0 | 425 | 0 | 0 |
In [13]:
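# The same fill written as a loop
lbl_pos= ['LF', 'CM', 'RDM', 'RWB', 'GK']
lbl_pc = ['Forward', 'Midfielder', 'Defender', 'Defender', 'Goalkeeper']
for r, s in zip(lbl_pos, lbl_pc):
    PC_train[X_train['Position'] == r] = s
    PC_test[X_test['Position'] == r] = s
# Write the filled Series back to the DataFrames
X_train['Position_Class'] = PC_train
X_test['Position_Class'] = PC_test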
In [14]:
# Once the fill is done, drop the Position column
X_train = X_train.drop(columns = 'Position')
X_test = X_test.drop(columns = 'Position')
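As an alternative to assigning one position at a time, the same fill can be sketched with a dictionary and `Series.map`. This is not what the notebook does. The toy DataFrame and the name `fill_map` are made up for illustration, and the mapping copies the five assignments used in In [11].

```python
import pandas as pd

# Toy data standing in for X_train; the rows are illustrative only.
toy = pd.DataFrame({
    'Position': ['ST', 'GK', 'CM', 'RWB', 'CB'],
    'Position_Class': ['Forward', None, None, None, 'Defender'],
})

# The five positions that had no class, with the classes assigned in In [11].
fill_map = {'LF': 'Forward', 'CM': 'Midfielder', 'RDM': 'Defender',
            'RWB': 'Defender', 'GK': 'Goalkeeper'}

# Map Position to a class and use it only where Position_Class is missing.
toy['Position_Class'] = toy['Position_Class'].fillna(toy['Position'].map(fill_map))
print(toy['Position_Class'].tolist())
```

Because `fillna` only touches missing entries, classes that were already present are left as they were.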
Filling in the missing Height_cm values from the Height column (preprocessing)
In [15]:
# Fill in the missing Height_cm values in X_train
# Make copies
train_Height_cm = X_train['Height_cm'].copy()
train_Height = X_train['Height'].copy()
In [16]:
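# Convert feet and inches to cm, splitting on the ' character
split_train_Height = train_Height.str.split("'", expand = True).astype('float64')
# Feet (before the ') are multiplied by 30 and inches (after it) by 2.5,
# which matches the existing Height_cm values (e.g. 5'8 -> 170.0)
train_Height_cm = (split_train_Height[0] * 30) + (split_train_Height[1] * 2.5)
# Write the converted values back to the 'Height_cm' column of the original data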
X_train['Height_cm'] = train_Height_cm
In [17]:
# Do the same for X_test
test_Height_cm = X_test['Height_cm'].copy()
test_Height = X_test['Height'].copy()
In [18]:
split_test_Height = test_Height.str.split("'", expand = True).astype('float64')
test_Height_cm = (split_test_Height[0] * 30) + (split_test_Height[1] * 2.5)
X_test['Height_cm'] = test_Height_cm
In [19]:
# Once the conversion is done, drop the Height column
X_train = X_train.drop(columns = 'Height')
X_test = X_test.drop(columns = 'Height')
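The conversion rule can be checked on single values with a small helper. This is a sketch: `height_to_cm` is a hypothetical name, and the default factors of 30 and 2.5 are the rounded ones this dataset appears to use, not the exact 30.48 cm per foot and 2.54 cm per inch.

```python
def height_to_cm(height: str, cm_per_foot: float = 30.0, cm_per_inch: float = 2.5) -> float:
    """Convert a feet'inches string such as "5'8" to centimetres."""
    feet, inches = height.split("'")
    return int(feet) * cm_per_foot + int(inches) * cm_per_inch

# With the dataset's rounded factors these reproduce the Height_cm values in the preview.
print(height_to_cm("5'8"))   # 170.0
print(height_to_cm("6'0"))   # 180.0
print(height_to_cm("5'9"))   # 172.5

# With the exact factors the result differs by a couple of centimetres.
exact = height_to_cm("5'8", cm_per_foot=30.48, cm_per_inch=2.54)
print(round(exact, 2))       # 172.72
```

Using the rounded factors keeps the recomputed column consistent with the values that were already present in Height_cm.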
Handling the missing Weight_lb values (preprocessing)
In [20]:
cond_na = X_train['Weight_lb'].isna()
# Drop from y_train the 72 rows where X_train's Weight_lb is missing
y_train = y_train[~ cond_na]
# Do the same for X_train
X_train = X_train[~ cond_na]
print(X_train.shape, y_train.shape)
(3528, 13) (3528, 1)
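The point of this step is that the same boolean mask is applied to both the features and the target, so the rows stay paired. A toy sketch of the idea, with invented values and the hypothetical names `X_toy` and `y_toy`:

```python
import pandas as pd

# Toy features and target sharing the same default index.
X_toy = pd.DataFrame({'Weight_lb': [170.0, None, 163.0, None], 'Wage': [19, 4, 2, 1]})
y_toy = pd.DataFrame({'Value': [18500, 500, 1200, 325]})

# One mask, computed on the features, filters both objects.
keep = X_toy['Weight_lb'].notna()
X_kept = X_toy[keep]
y_kept = y_toy[keep]

print(X_kept.shape, y_kept.shape)         # (2, 2) (2, 1)
print(X_kept.index.equals(y_kept.index))  # True
```

This is also why the index of X_train later reads "3528 entries, 0 to 3599": the dropped rows leave gaps in the original index.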
STEP3-3. Preprocess the categorical columns
In [21]:
# Count unique values in the string (object) columns
print(X_train.select_dtypes('object').nunique())
Age 29
Nationality 129
Club 648
Preferred_Foot 2
Work_Rate 9
Position_Class 4
dtype: int64
In [22]:
print(X_test.select_dtypes('object').nunique())
Age 31
Nationality 116
Club 634
Preferred_Foot 2
Work_Rate 9
Position_Class 4
dtype: int64
Age column
- The leading digit 1, 2, 3 stands for teens, twenties, thirties
In [23]:
# Create a derived variable from the first character of Age
X_train['Age_gp'] = X_train['Age'].str[0]
X_test['Age_gp'] = X_test['Age'].str[0]
X_train = X_train.drop(columns = 'Age')
X_test = X_test.drop(columns = 'Age')
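Age is stored as text because some values are masked, such as the `2*` visible in the preview, so taking the first character still gives a usable decade group. A minimal sketch on invented values:

```python
import pandas as pd

# Ages as strings, including masked entries like the '2*' seen in the preview.
ages = pd.Series(['2*', '29', '18', '3*'])

# .str[0] takes the first character of every string, i.e. the decade digit.
age_gp = ages.str[0]
print(age_gp.tolist())  # ['2', '2', '1', '3']
```

The result is still a string column, so it is picked up later together with the other categorical columns.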
Club column
In [24]:
# Club is the player's current club; assume it is not needed for prediction and drop the column
X_train = X_train.drop(columns = 'Club')
X_test = X_test.drop(columns = 'Club')
Preferred_Foot column
In [25]:
print(X_train['Preferred_Foot'].value_counts())
print(X_test['Preferred_Foot'].value_counts())
Preferred_Foot
Right 2682
Left 846
Name: count, dtype: int64
Preferred_Foot
Right 1854
Left 546
Name: count, dtype: int64
Work_Rate column
- Format: attacking work rate / defensive work rate
- There is a space after the slash, so split on "/ " to remove it
- expand = True is required so that each part lands in its own column
In [26]:
# Make a copy
train_Work_Rate = X_train['Work_Rate'].copy()
# Split on the slash and the space together (train data)
X_train['WR_Attack'] = train_Work_Rate.str.split("/ ", expand = True)[0]
X_train['WR_Defend'] = train_Work_Rate.str.split("/ ", expand = True)[1]
# Make a copy
test_Work_Rate = X_test['Work_Rate'].copy()
# Split on the slash and the space together (test data)
X_test['WR_Attack'] = test_Work_Rate.str.split("/ ", expand = True)[0]
X_test['WR_Defend'] = test_Work_Rate.str.split("/ ", expand = True)[1]
# Drop the Work_Rate column
X_train = X_train.drop(columns = 'Work_Rate')
X_test = X_test.drop(columns = 'Work_Rate')
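A slightly more defensive variant of the split, shown only as a sketch: split on the slash alone, do it once, and strip whitespace from each part, so it no longer matters whether a space follows the slash. The sample values are invented.

```python
import pandas as pd

work_rate = pd.Series(['High/ Medium', 'Medium/ Low', 'Low/High'])

# Split once; expand=True returns a DataFrame with one column per part.
parts = work_rate.str.split('/', expand=True)

# Strip any surrounding spaces from both parts.
wr = pd.DataFrame({
    'WR_Attack': parts[0].str.strip(),
    'WR_Defend': parts[1].str.strip(),
})
print(wr)
```

Without `expand=True` the result would be a single column of Python lists, which cannot be assigned to two separate columns this directly.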
STEP 3-4. Preprocess the numeric columns
In [27]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3528 entries, 0 to 3599
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Nationality 3528 non-null object
1 Overall 3528 non-null int64
2 Preferred_Foot 3528 non-null object
3 Position_Class 3528 non-null object
4 Jersey_Number 3528 non-null int64
5 Contract_Valid_Until 3528 non-null int64
6 Height_cm 3528 non-null float64
7 Weight_lb 3528 non-null float64
8 Release_Clause 3528 non-null int64
9 Wage 3528 non-null int64
10 Age_gp 3528 non-null object
11 WR_Attack 3528 non-null object
12 WR_Defend 3528 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 385.9+ KB
Jersey_Number column
In [28]:
# Drop the column as unnecessary
X_train = X_train.drop(columns = 'Jersey_Number')
X_test = X_test.drop(columns = 'Jersey_Number')
Contract_Valid_Until column
In [29]:
print(X_train['Contract_Valid_Until'].unique())
print(X_test['Contract_Valid_Until'].unique())
# The years act as categories, so convert the dtype
X_train['CVU_gp'] = X_train['Contract_Valid_Until'].astype('object')
X_test['CVU_gp'] = X_test['Contract_Valid_Until'].astype('object')
# Drop the Contract_Valid_Until column
X_train = X_train.drop(columns = 'Contract_Valid_Until')
X_test = X_test.drop(columns = 'Contract_Valid_Until')
[2019 2020 2022 2021 2018 2023 2024 2025 2026]
[2019 2021 2023 2020 2022 2018 2024 2025]
Correlation between the numeric columns
- Moderate correlation: [Overall - Release_Clause] (0.63), [Overall - Wage] (0.59)
- Strong correlation: [Release_Clause - Wage] (0.84)
- Keep Wage as the input feature
In [30]:
columns_conti = ['Overall', 'Height_cm', 'Weight_lb', 'Release_Clause', 'Wage']
X_train[columns_conti].corr()
Out[30]:
 | Overall | Height_cm | Weight_lb | Release_Clause | Wage |
---|---|---|---|---|---|
Overall | 1.000000 | 0.033070 | 0.142888 | 0.630162 | 0.593469 |
Height_cm | 0.033070 | 1.000000 | 0.763351 | -0.007983 | -0.001661 |
Weight_lb | 0.142888 | 0.763351 | 1.000000 | 0.033941 | 0.053856 |
Release_Clause | 0.630162 | -0.007983 | 0.033941 | 1.000000 | 0.835107 |
Wage | 0.593469 | -0.001661 | 0.053856 | 0.835107 | 1.000000 |
In [31]:
# Exclude the Release_Clause column
X_train = X_train.drop('Release_Clause', axis = 1)
X_test = X_test.drop('Release_Clause', axis = 1)
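Reading a correlation matrix by eye gets harder as the number of columns grows, so the strongly correlated pairs can also be listed programmatically. This is a sketch on invented data; the 0.8 threshold and the column names `a`, `b`, `c` are assumptions for illustration, not values taken from the notebook.

```python
from itertools import combinations

import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 5.9, 8.2, 10.0],  # roughly 2 * a, so strongly correlated with a
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],   # no clear relationship to a or b
})

corr = toy.corr()
threshold = 0.8  # assumed cut-off for "strong"

# Visit each unordered pair of columns once and keep the strong ones.
strong_pairs = [(x, y) for x, y in combinations(corr.columns, 2)
                if abs(corr.loc[x, y]) >= threshold]
print(strong_pairs)  # [('a', 'b')]
```

For each strong pair, one of the two columns is a candidate for removal, which is the decision made here for Release_Clause versus Wage.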
STEP 3-5. Split the data
- Split X_train into training data (X_TRAIN) and validation data (X_VAL)
- Split y_train into training data (y_TRAIN) and validation data (y_VAL)
In [32]:
from sklearn.model_selection import train_test_split
X_TRAIN, X_VAL, y_TRAIN, y_VAL = train_test_split(X_train,y_train,test_size = 0.3,
random_state = 1234)
print(X_TRAIN.shape)
print(X_VAL.shape)
print(y_TRAIN.shape)
print(y_VAL.shape)
(2469, 11)
(1059, 11)
(2469, 1)
(1059, 1)
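The printed shapes can be reproduced with simple arithmetic. The sketch below assumes that the held-out share is rounded up and the remainder goes to training, which is consistent with the shapes shown above.

```python
import math

n_samples = 3528  # rows left after dropping the missing Weight_lb rows
test_size = 0.3

# Assumption: the validation share is rounded up, the rest is used for training.
n_val = math.ceil(n_samples * test_size)
n_train = n_samples - n_val
print(n_train, n_val)  # 2469 1059
```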
STEP 3-6. Encoding
In [33]:
# Apply one-hot encoding to the categorical columns
from sklearn.preprocessing import OneHotEncoder
# Keep only the categorical columns that will be encoded
X_TRAIN_category = X_TRAIN.select_dtypes('object').copy()
X_VAL_category = X_VAL.select_dtypes('object').copy()
X_TEST_category = X_test.select_dtypes('object').copy()
# The number of unique Nationality values differs between the datasets
# With handle_unknown = 'ignore', a label that appears in the test data but not in the training data is encoded as all zeros
enc = OneHotEncoder(handle_unknown = 'ignore',
                    sparse_output = False).fit(X_TRAIN_category)
# One-hot encode each dataset with the fitted encoder
X_TRAIN_OH = enc.transform(X_TRAIN_category)
X_VAL_OH = enc.transform(X_VAL_category)
X_TEST_OH = enc.transform(X_TEST_category)
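The effect of `handle_unknown = 'ignore'` is easiest to see on a tiny example. In this sketch the encoder is fitted on two nationalities and then asked to encode one it has never seen; the data is invented, and `.toarray()` is used instead of `sparse_output` so the snippet does not depend on that argument.

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical column; 'Spain' appears only in the data to transform.
train = [['Korea Republic'], ['Brazil'], ['Brazil']]
test = [['Korea Republic'], ['Spain']]

enc_demo = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = enc_demo.transform(test).toarray()

print(enc_demo.categories_[0].tolist())  # ['Brazil', 'Korea Republic']
print(encoded.tolist())                  # [[0.0, 1.0], [0.0, 0.0]]
```

The unseen label becomes a row of zeros rather than raising an error, and the output keeps the same number of columns as the training encoding, which is why all three arrays in the notebook end up with 145 columns.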
STEP 3-7. Scaling
In [34]:
from sklearn.preprocessing import StandardScaler
# StandardScaler standardizes each column to mean 0 and standard deviation 1
colnm_conti = ['Overall', 'Height_cm', 'Weight_lb', 'Wage']
# Make copies
X_TRAIN_conti = X_TRAIN[colnm_conti].copy()
X_VAL_conti = X_VAL[colnm_conti].copy()
X_TEST_conti = X_test[colnm_conti].copy()
# Fit the scaler on the training split only
scale = StandardScaler().fit(X_TRAIN_conti)
# Apply z-score standardization
X_TRAIN_STD = scale.transform(X_TRAIN_conti)
X_VAL_STD = scale.transform(X_VAL_conti)
X_TEST_STD = scale.transform(X_TEST_conti)
In [35]:
print(X_TRAIN_OH.shape, X_VAL_OH.shape, X_TEST_OH.shape)
print(X_TRAIN_STD.shape, X_VAL_STD.shape, X_TEST_STD.shape)
(2469, 145) (1059, 145) (2400, 145)
(2469, 4) (1059, 4) (2400, 4)
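The scaling step fits on the training split and reuses those statistics for validation and test data. Here is a NumPy-only sketch of that idea on invented numbers; it mirrors the z-score formula with the population standard deviation rather than calling StandardScaler itself.

```python
import numpy as np

train = np.array([[60.0], [70.0], [80.0]])
val = np.array([[70.0], [90.0]])

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to every split.
train_std = (train - mean) / std
val_std = (val - mean) / std

print(train_std.ravel())
print(val_std.ravel())
```

The validation values are not guaranteed to have mean 0 and standard deviation 1 after this, and that is expected: the goal is to put every split on the scale learned from the training data.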
STEP 3-8. Prepare the input datasets
In [36]:
import numpy as np
# Concatenate the encoded and scaled NumPy ndarrays
X_TRAIN = np.concatenate([X_TRAIN_OH, X_TRAIN_STD], axis = 1)
X_VAL = np.concatenate([X_VAL_OH, X_VAL_STD], axis = 1)
# Flatten the targets into 1-D NumPy ndarrays
y_TRAIN = y_TRAIN.values.ravel()
y_VAL = y_VAL.values.ravel()
STEP 4. Train the models
In [37]:
from sklearn.tree import DecisionTreeRegressor # regressor for a continuous target
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor
STEP 4-1. Random forest
In [38]:
# Create the random forest model
rf = RandomForestRegressor(n_estimators = 500, max_depth = 3,
                           min_samples_leaf = 10, max_features = 50, random_state = 2022)
# Train the random forest model
model_rf = rf.fit(X_TRAIN, y_TRAIN)
STEP 4-2. Bagging
- max_depth: maximum depth of each tree
- min_samples_leaf: minimum number of samples required at a leaf node
In [39]:
dtr = DecisionTreeRegressor(max_depth = 3, min_samples_leaf = 10)
bag = BaggingRegressor(estimator = dtr, n_estimators = 500,
random_state = 2022)
model_bag = bag.fit(X_TRAIN, y_TRAIN)
STEP 4-3. AdaBoost
In [40]:
dtr = DecisionTreeRegressor(max_depth = 3, min_samples_leaf = 10)
ada = AdaBoostRegressor(estimator = dtr, n_estimators = 500,
learning_rate = 0.5, random_state = 2022)
model_ada = ada.fit(X_TRAIN, y_TRAIN)
STEP 4-4. Select a model by performance (metric: RMSE)
In [41]:
from sklearn.metrics import mean_squared_error
# Predict on the validation dataset
pred_rf = model_rf.predict(X_VAL)
pred_bag = model_bag.predict(X_VAL)
pred_ada = model_ada.predict(X_VAL)
In [42]:
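# RMSE is the square root of MSE; np.sqrt avoids the squared argument,
# which is deprecated in recent scikit-learn releases
rmse_rf = np.sqrt(mean_squared_error(y_VAL, pred_rf))
rmse_bag = np.sqrt(mean_squared_error(y_VAL, pred_bag))
rmse_ada = np.sqrt(mean_squared_error(y_VAL, pred_ada))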
print(rmse_rf)
print(rmse_bag)
print(rmse_ada)
1846.2123354846974
1502.89475016663
2396.352145550122
- Select the bagging model, which has the lowest RMSE
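For reference, RMSE is the square root of the mean squared difference between actual and predicted values, and the model with the smallest value is chosen. The sketch below writes that out in plain Python; the `rmse` helper and its sample inputs are illustrative, while the three scores are the validation results printed above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Small made-up example: errors of 10, 10 and 30.
example = rmse([100, 200, 300], [110, 190, 330])
print(round(example, 4))  # 19.1485

# Validation RMSE values reported above; lower is better.
scores = {'random_forest': 1846.2123354846974,
          'bagging': 1502.89475016663,
          'adaboost': 2396.352145550122}
best = min(scores, key=scores.get)
print(best)  # bagging
```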
STEP 5. Submit the results
- In the actual exam, implement only the single best-performing model in the submitted answer!
In [43]:
X_TEST = np.concatenate([X_TEST_OH, X_TEST_STD], axis = 1)
y_pred = model_bag.predict(X_TEST)
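# Build the result in the format the problem requires
# (the target column in y_train is named 'Value'; use whichever column name the problem statement specifies)
obj = {'ID' : ID, 'Purchase' : y_pred}
result = pd.DataFrame(obj)
# Save as 12345.csv in the current working directory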
result.to_csv("12345.csv", index = False)
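Before submitting, it can help to look at exactly what the CSV will contain. The sketch below writes a two-row result to an in-memory buffer instead of a file. The IDs come from the X_test preview, the predictions are invented, and the `Value` column name is an assumption, since the required header depends on the problem statement.

```python
import io

import pandas as pd

# Two IDs from the X_test preview with made-up predictions.
ids = pd.Series([137351, 207707], name='ID')
preds = [6100.5, 9800.0]

# 'Value' is an assumed column name for illustration.
result_demo = pd.DataFrame({'ID': ids, 'Value': preds})

# Write to an in-memory buffer so the exact file contents can be inspected.
buffer = io.StringIO()
result_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With `index = False` the file has only the two requested columns; leaving it out would add an unnamed index column at the front.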
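STEP 6. Evaluate against the scoring data (bonus)
In [51]:
actual = pd.read_csv('data/연습문제/FIFA_y_test.csv', encoding = 'cp949')
# to_numpy() returns the values as a 1-D NumPy array
actual = actual['Value'].to_numpy()
In [53]:
# Metric value that would serve as the scoring criterion
np.sqrt(mean_squared_error(actual, y_pred))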
Out[53]:
2656.2235458310656