'''


Overview
The data has been split into two groups:

1) training set (train.csv) - 891 rows
2) test set (test.csv)     - 418 rows


Contents:
1) Import Necessary Libraries
2) Read In and Explore the Historic Data
3) Data Analysis
4) Data Visualization
5) Cleaning Data
6) Choosing the Best Model
7) Creating Submission File

'''

#1) Import Necessary Libraries
#First off, we need to import several Python libraries such as numpy, pandas,
# matplotlib and seaborn.

#data analysis libraries
import numpy as np
import pandas as pd
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 16)
pd.set_option('display.precision', 2)

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sbn

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

#2) Read in and Explore the Data
#*********************************************
#It's time to read in our training and testing data using pd.read_csv,
# and take a first look at the training data using the describe() function.

#import train and test CSV files
train = pd.read_csv('train.csv') #12 columns
test = pd.read_csv('test.csv')   #11 columns

#take a look at the training data
print("A look at the training data : \n", train.describe() )

print( "\n" )

print( train.describe(include="all") )

print( "\n" )

#3) Data Analysis
#**************************************************
#We're going to consider the features in the
# dataset and how complete they are.

#get a list of the features within the dataset
print( "\n\n" , train.columns )

#OUTPUT
#Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
#       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
#      dtype='object')


#see a sample of the dataset to get an idea of the variables
print()
print( train.head() )

print()
print( train.sample(5) )


#Observations from above output
#-----------------------------
#Numerical Features: Age (Continuous), Fare (Continuous), SibSp (Discrete), Parch (Discrete)
#Categorical Features: Survived, Sex, Embarked, Pclass
#Alphanumeric Features: Name, Ticket, Cabin


print( "Data types for each feature :" )
print( train.dtypes )


#Now that we have an idea of what kinds of features we're working with,
# we can see how much information we have about each of them.

#see a summary of the training dataset
print( train.describe(include = "all") )


#Some Observations from above output
#------------------------------------
#1) There are a total of 891 passengers in our training set.

#2) The Age feature is missing approximately 19.8% of its values.
#   Since Age is likely to be important to survival,
#   we should probably attempt to fill these gaps.

#3) The Cabin feature is missing approximately 77.1% of its values.
#   Since so much of the feature is missing, it would be hard to fill in the missing values.
#   We'll probably drop this feature from our dataset.

#4) The Embarked feature is missing values for only 2 passengers,
#   which should be relatively harmless.

#check for any other unusable values
print()
print( pd.isnull(train).sum() )


#We can see that except for the above mentioned missing values,
# no NaN values exist.

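# The missing-value percentages quoted above can be reproduced directly.
# A quick sketch (not part of the original flow): divide the null counts
# by the number of rows in the training set.
print( (pd.isnull(train).sum() / len(train) * 100).round(1) )
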
#Relationship between Features and Survival
#In this section, we analyze the relationship between the different features
# and Survival. We see how different feature values
# show different survival chances. We also plot different kinds of
# diagrams to visualize our data and findings.


#4) Data Visualization
#*************************************
#It's time to visualize our data so we can make a few initial predictions.

#-----------------
#4.A) Sex Feature
#-----------------
#draw a bar plot of survival by sex
sbn.barplot(x="Sex", y="Survived", data=train)
plt.show()


print( "Percentages of females vs. males who survived" )
print( "Percentage of females who survived:", train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)[1]*100 )


print( "------------------\n\n" )
print( train )


print( "------------------\n\n" )
print( train["Survived"] )

print( "------------------\n\n" )
print( train["Sex"] == 'female' )


print( "**********\n\n" )
print( train["Survived"][ train["Sex"] == 'female' ] )


print( "*****************\n\n" )
print( train["Survived"][train["Sex"] == 'female'].value_counts() )


print( "====================================\n\n" )
print( train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True) )


print( train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)[1] )


print( "Percentage of females who survived:", train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)[1]*100 )
print( "Percentage of males who survived:", train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True)[1]*100 )


#Percentage of females who survived: 74.2038216561
#Percentage of males who survived: 18.8908145581


#Some Observations from above output
#------------------------------------
# As predicted, females have a much higher chance of survival than males.
# The Sex feature is essential in our predictions.
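
# A more concise way to get the same numbers (an optional sketch): the mean of
# the 0/1 Survived column within each Sex group is the survival rate.
print( train.groupby("Sex")["Survived"].mean() * 100 )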


#--------------------
#4.B) Pclass Feature
#--------------------
#draw a bar plot of survival by Pclass
sbn.barplot(x="Pclass", y="Survived", data=train)
plt.show()


#print percentage of people by Pclass that survived
print("Percentage of Pclass = 1 who survived:", train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)[1]*100)

print("Percentage of Pclass = 2 who survived:", train["Survived"][train["Pclass"] == 2].value_counts(normalize = True)[1]*100)

print("Percentage of Pclass = 3 who survived:", train["Survived"][train["Pclass"] == 3].value_counts(normalize = True)[1]*100)
#Percentage of Pclass = 1 who survived: 62.962962963
#Percentage of Pclass = 2 who survived: 47.2826086957
#Percentage of Pclass = 3 who survived: 24.2362525458

print()
print( "Survived counts for Pclass = 1:\n\n", train["Survived"][train["Pclass"] == 1].value_counts() )

print()
print( "Survived proportions for Pclass = 1:\n\n", train["Survived"][train["Pclass"] == 1].value_counts(normalize = True) )

print()
print( "Proportion of Pclass = 1 who survived:\n\n", train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)[1] )


#Some Observations from above output
#------------------------------------
#As predicted, people with a higher socioeconomic class had a higher rate of survival. (62.9% vs. 47.3% vs. 24.2%)
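
# The same comparison viewed together with Sex (an optional sketch): a pivot
# table of the mean survival rate by Pclass and Sex.
print( pd.pivot_table(train, values="Survived", index="Pclass", columns="Sex") )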


#----------------------
#4.C) SibSp Feature
#----------------------
#draw a bar plot for SibSp vs. survival
sbn.barplot(x="SibSp", y="Survived", data=train)

#I won't be printing individual percent values for all of these.
print("Percentage of SibSp = 0 who survived:",
      train["Survived"][train["SibSp"] == 0].value_counts(normalize = True)[1]*100)

print("Percentage of SibSp = 1 who survived:",
      train["Survived"][train["SibSp"] == 1].value_counts(normalize = True)[1]*100)

print("Percentage of SibSp = 2 who survived:",
      train["Survived"][train["SibSp"] == 2].value_counts(normalize = True)[1]*100)
#OUTPUT:-
#Percentage of SibSp = 0 who survived: 34.5394736842
#Percentage of SibSp = 1 who survived: 53.5885167464
#Percentage of SibSp = 2 who survived: 46.4285714286

plt.show()


#Some Observations from above output
#------------------------------------
#In general, it's clear that people with more siblings or
# spouses aboard were less likely to survive.
# However, contrary to expectations, people with no siblings
# or spouses were less likely to survive than those with one or two. (34.5% vs. 53.6% vs. 46.4%)


#--------------------
#4.D) Parch Feature
#--------------------

#draw a bar plot for Parch vs. survival
sbn.barplot(x="Parch", y="Survived", data=train)
plt.show()


#Some Observations from above output
#------------------------------------
#People with fewer than four parents or children aboard are more likely to survive than those with four or more.
# As with SibSp, people traveling alone are less likely to survive than those with 1-3 parents or children.


#-----------------
#4.E) Age Feature
#-----------------

#sort the ages into logical categories
train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)

bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = labels)
print( train )
#draw a bar plot of Age vs. survival
sbn.barplot(x="AgeGroup", y="Survived", data=train)
plt.show()

#Done********************************************************


#Some Observations from above output
#------------------------------------
#Babies are more likely to survive than any other age group.
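
# Optional sanity check (a quick sketch): the size of each age bin and its
# survival rate, to back up the observation above.
print( train.groupby("AgeGroup")["Survived"].agg(["count", "mean"]) )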


#--------------------
#4.F) Cabin Feature
#--------------------

#I think the idea here is that people with recorded cabin numbers are of higher socioeconomic class,
# and thus more likely to survive.

train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

print( "###################################\n\n" )
print( train )


#calculate percentages of CabinBool vs. survived
print("Percentage of CabinBool = 1 who survived:",
      train["Survived"][train["CabinBool"] == 1].value_counts(normalize = True)[1]*100)

print("Percentage of CabinBool = 0 who survived:",
      train["Survived"][train["CabinBool"] == 0].value_counts(normalize = True)[1]*100)

#draw a bar plot of CabinBool vs. survival
sbn.barplot(x="CabinBool", y="Survived", data=train)
plt.show()


#OUTPUT :-
#Percentage of CabinBool = 1 who survived: 66.6666666667
#Percentage of CabinBool = 0 who survived: 29.9854439592

#Some Observations from above output
#------------------------------------
#People with a recorded Cabin number are, in fact,
#more likely to survive. (66.6% vs 29.9%)


#5) Cleaning Data
#*********************************

#Time to clean our data to account for missing values and unnecessary information!

#Looking at the Test Data
#Let's see how our test data looks!

print( test.describe(include="all") )


#Some Observations from above output for test.csv data
#----------------------------------------------------
#1) We have a total of 418 passengers.
#2) 1 value from the Fare feature is missing.
#3) Around 20.5% of the Age values are missing in the test file;
#   we will need to fill those in.


#Cabin Feature
#We'll start off by dropping the Cabin feature, since little additional useful information can be extracted from it.
train = train.drop(['Cabin'], axis = 1)
test = test.drop(['Cabin'], axis = 1)

#Ticket Feature
#We can also drop the Ticket feature, since it's unlikely to yield any useful information.
train = train.drop(['Ticket'], axis = 1)
test = test.drop(['Ticket'], axis = 1)


#Embarked Feature
#Now we need to fill in the missing values in the Embarked feature.
print( "Number of people embarking in Southampton (S):" , train[train["Embarked"] == "S"] )


print( "\n\nSHAPE = " , train[train["Embarked"] == "S"].shape )
print( "SHAPE[0] = " , train[train["Embarked"] == "S"].shape[0] )


southampton = train[train["Embarked"] == "S"].shape[0]
print( southampton )


print( "Number of people embarking in Cherbourg (C):" )
cherbourg = train[train["Embarked"] == "C"].shape[0]
print( cherbourg )

print( "Number of people embarking in Queenstown (Q):" )
queenstown = train[train["Embarked"] == "Q"].shape[0]
print( queenstown )


#It's clear that the majority of people embarked in Southampton (S).
# Let's go ahead and fill in the missing values with S.

#replacing the missing values in the Embarked feature with S
train = train.fillna({"Embarked": "S"})
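
# The same fill without hard-coding 'S' (a sketch): take the most frequent
# value of the column directly; 'S' is indeed the mode here.
most_common_port = train["Embarked"].mode()[0]
print( "Most common embarkation port:", most_common_port )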


#Age Feature
#Next we'll fill in the missing values in the Age feature.
# Since a higher percentage of values are missing,
# it would be illogical to fill all of them with the same value (as we did with Embarked).
# Instead, let's try to find a way to predict the missing ages.

#create a combined group of both datasets
combine = [train, test]
print( "combined data : \n", combine[0] )


#extract a title for each Name in the train and test datasets
# (a space before the capture group, so forms such as "the Countess." are also caught)
for dataset in combine:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)


print( "\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n" )
print( train )
print()

# crosstab builds a cross-tabulation table that shows the frequency with which certain groups of data appear.
print( pd.crosstab(train['Title'], train['Sex'] ) )


# replace various titles with more common names
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'],
        'Rare')

    dataset['Title'] = dataset['Title'].replace(['Countess', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

print( "\n\nAfter grouping rare titles : \n" , train )


print( train[['Title', 'Survived']].groupby(['Title'],
       as_index=True).count() )


print( "\nMap each of the title groups to a numerical value." )
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royal": 5, "Rare": 6}

for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)


print( "\n\nAfter replacing titles with numeric values.\n" )
print( train )


#NOTICE the values of the last newly added column 'Title'


#Next, we'll try to predict the missing Age values from the most common age group for each Title.

# fill missing age with the mode age group for each title
mr_age = train[train["Title"] == 1]["AgeGroup"].mode()        # Mr. = Young Adult  (mode of AgeGroup for rows whose Title is 1)
print( "mode() of mr_age : ", mr_age )

print( "\n\n" )

miss_age = train[train["Title"] == 2]["AgeGroup"].mode()      # Miss. = Student
print( "mode() of miss_age : ", miss_age )
print( "\n\n" )

mrs_age = train[train["Title"] == 3]["AgeGroup"].mode()       # Mrs. = Adult
print( "mode() of mrs_age : ", mrs_age )
print( "\n\n" )

master_age = train[train["Title"] == 4]["AgeGroup"].mode()    # Master. = Baby
print( "mode() of master_age : ", master_age )
print( "\n\n" )

royal_age = train[train["Title"] == 5]["AgeGroup"].mode()     # Royal = Adult
print( "mode() of royal_age : ", royal_age )
print( "\n\n" )

rare_age = train[train["Title"] == 6]["AgeGroup"].mode()      # Rare = Adult
print( "mode() of rare_age : ", rare_age )


print( "\n\n**************************************************\n\n" )
print( train.describe(include="all") )
print( train )


print( "\n\n******** train[AgeGroup][0..9] : \n\n" )

for x in range(10):
    print( train["AgeGroup"][x] )


age_title_mapping = {1: "Young Adult", 2: "Student",
                     3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}

for x in range(len(train["AgeGroup"])):
    if train["AgeGroup"][x] == "Unknown":          # e.g. x = 5 means the 6th record
        train["AgeGroup"][x] = age_title_mapping[ train["Title"][x] ]

for x in range(len(test["AgeGroup"])):
    if test["AgeGroup"][x] == "Unknown":
        test["AgeGroup"][x] = age_title_mapping[test["Title"][x]]
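
# A vectorized equivalent of the two loops above (an illustrative sketch; by
# this point every 'Unknown' has already been replaced, so re-running it is a no-op):
for dataset in combine:
    unknown_mask = dataset["AgeGroup"] == "Unknown"
    dataset.loc[unknown_mask, "AgeGroup"] = dataset.loc[unknown_mask, "Title"].map(age_title_mapping)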


print( "\n\nAfter replacing Unknown values from AgeGroup column : \n" )
print( train )


#Now that we've filled in the missing values at least somewhat accurately,
# it is time to map each age group to a numerical value.


# map each AgeGroup value to a numerical value
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3,
               'Student': 4, 'Young Adult': 5,
               'Adult': 6, 'Senior': 7}

train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)
print()
print( train )


# dropping the Age feature for now, might change
train = train.drop(['Age'], axis=1)
test = test.drop(['Age'], axis=1)

print( "\n\nAge column dropped." )
print( train )


#Name Feature
#We can drop the Name feature now that we've extracted the titles.

#drop the Name feature since it contains no more useful information.
train = train.drop(['Name'], axis = 1)
test = test.drop(['Name'], axis = 1)


#Sex Feature
#map each Sex value to a numerical value
sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)

print( train )


#Embarked Feature
#map each Embarked value to a numerical value
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)
print()
print( train.head() )


#Fare Feature
#It is time to separate the fare values into some logical groups, as well as
# to fill in the single missing value in the test dataset.

#fill in the missing Fare value in the test set based on the mean fare for that Pclass
for x in range(len(test["Fare"])):
    if pd.isnull(test["Fare"][x]):
        pclass = test["Pclass"][x]    #Pclass = 3
        test["Fare"][x] = round(train[ train["Pclass"] == pclass ]["Fare"].mean(), 2)
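
# A vectorized sketch of the same fill, using the training-set mean fare per
# class (the single missing Fare is already filled above, so this is a no-op here):
mean_fare_by_class = train.groupby("Pclass")["Fare"].mean().round(2)
test["Fare"] = test["Fare"].fillna(test["Pclass"].map(mean_fare_by_class))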


#map Fare values into groups of numerical values
train['FareBand'] = pd.qcut(train['Fare'], 4,
                            labels = [1, 2, 3, 4])

test['FareBand'] = pd.qcut(test['Fare'], 4,
                           labels = [1, 2, 3, 4])


#drop Fare values
train = train.drop(['Fare'], axis = 1)
test = test.drop(['Fare'], axis = 1)
#check train data
print( "\n\nFare column dropped\n" )
print( train )


#check test data
print()
print( test.head() )


#****************************************
#6) Choosing the Best Model
#****************************************

#Splitting the Training Data
#We will use part of our training data (20% in this case) to test the accuracy of our different models.

from sklearn.model_selection import train_test_split

input_predictors = train.drop(['Survived', 'PassengerId'], axis=1)
output_target = train["Survived"]

x_train, x_val, y_train, y_val = train_test_split(
    input_predictors, output_target, test_size = 0.20, random_state = 7)
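
# Quick check of the split sizes (roughly 80% train / 20% validation):
print( "Training rows:", x_train.shape[0], " Validation rows:", x_val.shape[0] )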


#Testing Different Models
#I will be testing the following models with my training data:

#1) Logistic Regression
#2) Gaussian Naive Bayes
#3) Support Vector Machines
#4) Linear SVC
#5) Perceptron
#6) Decision Tree Classifier
#7) Random Forest Classifier
#8) KNN or k-Nearest Neighbors
#9) Stochastic Gradient Descent
#10) Gradient Boosting Classifier


#For each model, we fit it with 80% of our training data,
# predict on the remaining 20%, and check the accuracy.

from sklearn.metrics import accuracy_score

#MODEL-1) LogisticRegression
#------------------------------------------
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_val)
acc_logreg = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-1: Accuracy of LogisticRegression : ", acc_logreg )

#OUTPUT:-
#MODEL-1: Accuracy of LogisticRegression : 77.09


#MODEL-2) Gaussian Naive Bayes
#------------------------------------------
from sklearn.naive_bayes import GaussianNB

gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
y_pred = gaussian.predict(x_val)
acc_gaussian = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-2: Accuracy of GaussianNB : ", acc_gaussian )

#OUTPUT:-
#MODEL-2: Accuracy of GaussianNB : 78.68


#MODEL-3) Support Vector Machines
#------------------------------------------
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_val)
acc_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-3: Accuracy of Support Vector Machines : ", acc_svc )

#OUTPUT:-
#MODEL-3: Accuracy of Support Vector Machines : 82.74


#MODEL-4) Linear SVC
#------------------------------------------
from sklearn.svm import LinearSVC

linear_svc = LinearSVC()
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_val)
acc_linear_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-4: Accuracy of LinearSVC : ", acc_linear_svc )

#OUTPUT:-
#MODEL-4: Accuracy of LinearSVC : 78.68


#MODEL-5) Perceptron
#------------------------------------------
from sklearn.linear_model import Perceptron

perceptron = Perceptron()
perceptron.fit(x_train, y_train)
y_pred = perceptron.predict(x_val)
acc_perceptron = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-5: Accuracy of Perceptron : ", acc_perceptron )

#OUTPUT:-
#MODEL-5: Accuracy of Perceptron : 79.19


#MODEL-6) Decision Tree Classifier
#------------------------------------------
from sklearn.tree import DecisionTreeClassifier

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
y_pred = decisiontree.predict(x_val)
acc_decisiontree = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-6: Accuracy of DecisionTreeClassifier : ", acc_decisiontree )

#OUTPUT:-
#MODEL-6: Accuracy of DecisionTreeClassifier : 81.22


#MODEL-7) Random Forest
#------------------------------------------
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)
acc_randomforest = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-7: Accuracy of RandomForestClassifier : ", acc_randomforest )

#OUTPUT:-
#MODEL-7: Accuracy of RandomForestClassifier : 83.25


#MODEL-8) KNN or k-Nearest Neighbors
#------------------------------------------
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_val)
acc_knn = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-8: Accuracy of k-Nearest Neighbors : ", acc_knn )

#OUTPUT:-
#MODEL-8: Accuracy of k-Nearest Neighbors : 77.66


#MODEL-9) Stochastic Gradient Descent
#------------------------------------------
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_val)
acc_sgd = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-9: Accuracy of Stochastic Gradient Descent : ", acc_sgd )

#OUTPUT:-
#MODEL-9: Accuracy of Stochastic Gradient Descent : 71.07


#MODEL-10) Gradient Boosting Classifier
#------------------------------------------
from sklearn.ensemble import GradientBoostingClassifier

gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)
y_pred = gbk.predict(x_val)
acc_gbk = round(accuracy_score(y_pred, y_val) * 100, 2)
print( "MODEL-10: Accuracy of GradientBoostingClassifier : ", acc_gbk )

#OUTPUT:-
#MODEL-10: Accuracy of GradientBoostingClassifier : 84.77


#Let's compare the accuracies of each model!

models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Gaussian Naive Bayes', 'Support Vector Machines',
              'Linear SVC', 'Perceptron', 'Decision Tree',
              'Random Forest', 'KNN', 'Stochastic Gradient Descent',
              'Gradient Boosting Classifier'],
    'Score': [acc_logreg, acc_gaussian, acc_svc,
              acc_linear_svc, acc_perceptron, acc_decisiontree,
              acc_randomforest, acc_knn, acc_sgd, acc_gbk]
})


print()
print( models.sort_values(by='Score', ascending=False) )
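
# A single 80/20 split can be noisy; k-fold cross-validation usually gives a
# more stable comparison (an optional sketch, not part of the original flow):
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestClassifier(), input_predictors, output_target, cv=5)
print( "RandomForest 5-fold CV accuracy:", round(cv_scores.mean() * 100, 2) )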


#Based on the comparison above, I decided to use the Random Forest model for the test data.


#7) Creating Submission Result File
#***********************************

#It is time to create a submission.csv file which includes our predictions for the test data.

#set ids as PassengerId and predict survival
ids = test['PassengerId']
predictions = randomforest.predict(test.drop('PassengerId', axis=1))

#set the output as a dataframe and convert to a csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)

print( "All survival predictions done." )
print( "All predictions exported to submission.csv file." )

print( "output : \n", output )