JustCaused

IS - Scripts 8 and 9

Jun 6th, 2023 (edited)
###########################################
# Data preparation and feature engineering
###########################################

###################
# Titanic data set
###################

# load the Titanic train ("data/train.csv") and test ("data/test.csv") sets

# print the structure of the train set

# print the structure of the test set

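# A minimal sketch of the loading step, assuming the files are ordinary
# comma-separated files. To stay self-contained it reads a tiny temporary CSV
# (toy.path and its two rows are hypothetical, not real Titanic records);
# in the actual script the path would be "data/train.csv".

```r
# Write a toy CSV that mimics the layout of the train set (hypothetical rows)
toy.path <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Survived,Pclass,Name,Sex,Age,Embarked",
             "1,0,3,\"Doe, Mr. John\",male,22,S",
             "2,1,1,\"Roe, Mrs. Jane\",female,38,"),
           toy.path)

# stringsAsFactors = FALSE keeps text columns as character vectors,
# so empty strings can be inspected before any conversion to factors
titanic.toy <- read.csv(toy.path, stringsAsFactors = FALSE)

# str() prints the type and the first few values of every column
str(titanic.toy)
```

# Note that the missing Embarked value in the second row is read as an empty
# string, not as NA, which is why the next section checks for empty strings.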
###########################
# Detecting missing values
###########################

# print the summary of the train set

# print the summary of the test set

# check for empty strings or other irregular values in character variables in the train set,
# first for one variable, and then for all

# do the same in the test set

# check the irregular values present in the Embarked variable

# set the empty Embarked values to NA in the train set

# get indices of observations with no Cabin value from the first class, in the train set

# get indices of observations with no Cabin value from the first class, in the test set

# set the Cabin value for identified passengers to NA in the train and test sets


# print the number of missing Cabin values in the train and test sets


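# The checks above could be sketched as follows. This is one possible approach
# on a toy data frame (train.toy and its values are hypothetical); it treats
# empty and whitespace-only strings as the "irregular" values.

```r
train.toy <- data.frame(
  Pclass   = c(1, 1, 3, 3),
  Cabin    = c("C85", "", "", "E46"),
  Embarked = c("S", "", "C", "Q"),
  stringsAsFactors = FALSE
)

# Count empty or whitespace-only strings in a single character variable
sum(trimws(train.toy$Embarked) == "", na.rm = TRUE)

# Do the same for every character variable at once
char.vars <- names(train.toy)[sapply(train.toy, is.character)]
empty.counts <- sapply(train.toy[char.vars],
                       function(x) sum(trimws(x) == "", na.rm = TRUE))
empty.counts

# Mark the empty Embarked values as missing
train.toy$Embarked[train.toy$Embarked == ""] <- NA

# Indices of first-class passengers with no Cabin value; set those to NA
no.cabin.1st <- which(train.toy$Pclass == 1 & train.toy$Cabin == "")
train.toy$Cabin[no.cabin.1st] <- NA

# Number of missing Cabin values after the replacement
sum(is.na(train.toy$Cabin))
```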
#install.packages('Amelia')
# load Amelia library


# set the display area to show two plots in the same row


# use the missmap function to visualise the missing data in the train set


# use the missmap function to visualise the missing data in the test set


# revert the plotting area to the default (one plot per row)


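# A hedged sketch of the visualisation step on a toy data frame (toy and the
# plot titles are hypothetical). The Amelia calls are guarded so the snippet
# also runs where the package is not installed; colSums(is.na(...)) gives the
# same information as a plain count per column.

```r
toy <- data.frame(Age = c(22, NA, 35), Fare = c(7.25, 71.3, NA))

if (requireNamespace("Amelia", quietly = TRUE)) {
  old.par <- par(mfrow = c(1, 2))   # two plots in the same row
  Amelia::missmap(toy, main = "Toy train set", legend = FALSE)
  Amelia::missmap(toy, main = "Toy test set", legend = FALSE)
  par(old.par)                      # back to one plot per row
}

# Package-free check: number of missing values in each column
na.per.column <- colSums(is.na(toy))
na.per.column
```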
###########################
# Handling missing values
###########################

###############################################################
## Categorical variables with a small number of missing values
###############################################################

# create the contingency table for the values of the Embarked variable


# replace all NA values for the Embarked variable with 'S' in the train set


# print the contingency table for the values of the Embarked variable


# transform the Embarked variable into a factor in both sets


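# A small sketch of imputing a categorical variable with its most frequent
# value. The embarked vector is hypothetical toy data; the script hard-codes
# 'S', while this sketch derives the most frequent value from the table,
# which gives 'S' for this toy vector.

```r
embarked <- c("S", "C", NA, "S", "Q", "S", NA)

# useNA = "ifany" keeps the NA count visible in the contingency table
table(embarked, useNA = "ifany")

# table() drops NAs by default, so which.max picks the most frequent real value
most.frequent <- names(which.max(table(embarked)))

# Replace the missing values with the most frequent one
embarked[is.na(embarked)] <- most.frequent

# Convert to a factor once no missing values remain
embarked <- factor(embarked)
table(embarked)
```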
###############################################################
## Numerical variables with a small number of missing values
###############################################################

# test the Fare variable for normality


# get the class of the observation with missing Fare variable


# calculate the median value for the Fare variable of all passengers from the 3rd class


# set the median value to the Fare variable of the passenger with a missing Fare


# print the summary of the test set


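# One way to sketch the Fare imputation, on a toy data frame (test.toy and its
# values are hypothetical). It assumes the Shapiro-Wilk test is used as the
# normality check; when normality is doubtful, the median of the passenger's
# class is a common replacement for the mean.

```r
test.toy <- data.frame(
  Pclass = c(1, 3, 3, 3, 2, 3),
  Fare   = c(80.0, 7.9, NA, 8.1, 13.0, 7.3)
)

# Shapiro-Wilk test; a small p-value suggests the variable is not normal
shapiro.test(test.toy$Fare)

# Class of the passenger whose Fare is missing
missing.class <- test.toy$Pclass[is.na(test.toy$Fare)]

# Median Fare of all passengers from that class, ignoring the NA itself
class.median <- median(test.toy$Fare[test.toy$Pclass == missing.class],
                       na.rm = TRUE)

# Fill in the missing Fare
test.toy$Fare[is.na(test.toy$Fare)] <- class.median
summary(test.toy$Fare)
```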
####################
# Feature selection
####################

#################################################################
## Examining the predictive power of variables from the data set
#################################################################

# transform the Sex variable into factor


# get the summary of the Sex variable


# compute the proportions table of the Sex variable


# create a contingency table for Sex vs. Survived


# transform the Survived variable into factor


# create the table again, now labels for Survived will be available


# compute the proportions for Sex vs. Survived


# transform the Pclass variable into factor


# plot the number of passengers for different classes and Survived values


# add the Sex facet to the plot

# instead of counts, plot the proportions


# plot the number of passengers for different ports and Survived values


# examine the relation between Embarked and Survived, but with proportions


# examine the relation between Fare and Survived


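# A sketch of the contingency and proportion tables used in this section, on
# toy data (toy and the "No"/"Yes" labels are assumptions, not taken from the
# script). margin = 1 gives row-wise proportions, i.e. the survival rate
# within each sex. The plot is only built if ggplot2 happens to be installed.

```r
toy <- data.frame(
  Sex      = c("male", "female", "female", "male", "male", "female"),
  Survived = c(0, 1, 1, 0, 1, 0)
)

toy$Sex <- factor(toy$Sex)
# Labels make the table columns readable instead of showing 0 and 1
toy$Survived <- factor(toy$Survived, levels = c(0, 1), labels = c("No", "Yes"))

# Counts of Sex vs. Survived
sex.surv <- table(Sex = toy$Sex, Survived = toy$Survived)
sex.surv

# Row-wise proportions: each row sums to 1
sex.surv.prop <- prop.table(sex.surv, margin = 1)
round(sex.surv.prop, 2)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # position = "fill" stacks the bars to height 1, showing proportions
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = Sex, fill = Survived)) +
    ggplot2::geom_bar(position = "fill") +
    ggplot2::ylab("Proportion")
  # print(p) would draw the plot
}
```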
######################
# Feature engineering
######################

# add the Survived variable to the test set


# transform the Pclass variable into factor (in the test set)


# transform the Sex variable into factor (in the test set)


# merge train and test sets


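# A minimal sketch of the merge, assuming the two sets are stacked row-wise
# with rbind. train.toy and test.toy are hypothetical; the point is that the
# test set needs a Survived column (all NA) with the same factor levels
# before the two data frames can be combined.

```r
train.toy <- data.frame(
  PassengerId = 1:3,
  Survived    = factor(c("No", "Yes", "No")),
  Pclass      = factor(c(3, 1, 2), levels = 1:3)
)
test.toy <- data.frame(
  PassengerId = 4:5,
  Pclass      = factor(c(1, 3), levels = 1:3)
)

# Add Survived to the test set as an all-NA factor with the train levels
test.toy$Survived <- factor(rep(NA, nrow(test.toy)),
                            levels = levels(train.toy$Survived))

# Reorder the test columns to match the train set, then stack the rows
titanic.all <- rbind(train.toy, test.toy[, names(train.toy)])
str(titanic.all)
```

# Keeping Survived as NA in the test rows makes it possible to split the
# merged data back into train and test sets later on.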
##################################
## Creating an age proxy variable
##################################

# print a sample of the Name variable


# split the name of the first observation on , or . characters

# split the name of the first observation on , or . characters and unlist


# split the name of the first observation on , or . characters, unlist and take the 2nd element


# create a variable Title based on the value of the Name variable


# remove the leading space character from the Title


# print the contingency table for the Title values


# create a vector of all women (adult female) titles


# create a vector of all girl (young female) titles


# create a vector of all men (adult male) titles


# create a vector of all boy (young male) titles


# introduce a new character variable AgeGender


# set the AgeGender value based on the vector the Title value belongs to

# print the contingency table for the AgeGender values


# plot the distribution of the Age attribute in the Young_Female group


# plot the distribution of the Age attribute in the Adult_Male group


# set the AgeGender to 'Adult_Female' for all 'girls' with age over 18


# print the number of adult males who have the Age value set


# set the AgeGender to 'Young_Male' for all 'Adult_Male' with age under 18


# print the contingency table for the AgeGender variable


# print the proportions table for the AgeGender variable


# transform the AgeGender to factor


# plot the AgeGender against Survived attribute


# plot the AgeGender vs. Survived but as proportions


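# The title extraction above could be sketched like this, on four names in
# the "Surname, Title. Given names" format. The title groupings (adult.women,
# girls, adult.men, boys) are illustrative assumptions; the full data set
# contains more titles than are listed here.

```r
titanic.toy <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina",
           "Palsson, Master. Gosta Leonard"),
  Age  = c(22, 38, 26, 2),
  stringsAsFactors = FALSE
)

# strsplit returns a list; "[,.]" splits on either a comma or a period
strsplit(titanic.toy$Name[1], split = "[,.]")
unlist(strsplit(titanic.toy$Name[1], split = "[,.]"))[2]

# Apply the same split to every name and keep the 2nd piece (the title)
titanic.toy$Title <- sapply(titanic.toy$Name,
                            function(x) unlist(strsplit(x, split = "[,.]"))[2],
                            USE.NAMES = FALSE)
titanic.toy$Title <- trimws(titanic.toy$Title)   # drop the leading space

# Assumed title groups (not exhaustive)
adult.women <- c("Mrs", "Mme", "Lady")
girls       <- c("Miss", "Mlle")
adult.men   <- c("Mr", "Dr", "Rev")
boys        <- c("Master")

titanic.toy$AgeGender <- NA_character_
titanic.toy$AgeGender[titanic.toy$Title %in% adult.women] <- "Adult_Female"
titanic.toy$AgeGender[titanic.toy$Title %in% girls]       <- "Young_Female"
titanic.toy$AgeGender[titanic.toy$Title %in% adult.men]   <- "Adult_Male"
titanic.toy$AgeGender[titanic.toy$Title %in% boys]        <- "Young_Male"

# Correct the proxy wherever the actual Age is known; which() skips NAs
older.girls <- which(titanic.toy$AgeGender == "Young_Female" & titanic.toy$Age > 18)
titanic.toy$AgeGender[older.girls] <- "Adult_Female"
young.men <- which(titanic.toy$AgeGender == "Adult_Male" & titanic.toy$Age < 18)
titanic.toy$AgeGender[young.men] <- "Young_Male"

table(titanic.toy$AgeGender)
```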
###################################
## Creating the FamilySize variable
###################################

# examine the values of the SibSp variable

# examine the values of the Parch variable


# create a new variable FamilySize based on the SibSp and Parch values


# print the contingency table for the FamilySize


# compute the proportion of FamilySize >= 3 in all passengers


# set the FamilySize to 3 for all observations where FamilySize > 3

# transform FamilySize into factor


# plot the FamilySize vs. Survived


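# A sketch of the FamilySize steps on toy data (toy is hypothetical). It
# assumes FamilySize is the plain sum SibSp + Parch, i.e. the passenger is
# not counted; the "3+" label for the capped level is also an assumption.

```r
toy <- data.frame(SibSp = c(0, 1, 3, 0, 4),
                  Parch = c(0, 0, 2, 2, 1))

# Number of relatives travelling with the passenger
toy$FamilySize <- toy$SibSp + toy$Parch
table(toy$FamilySize)

# Share of passengers with three or more relatives on board
share.large <- mean(toy$FamilySize >= 3)

# Large families are rare, so cap the value at 3 to avoid sparse levels
toy$FamilySize[toy$FamilySize > 3] <- 3

# The capped level is labelled "3+" to signal that it means "3 or more"
toy$FamilySize <- factor(toy$FamilySize, levels = 0:3,
                         labels = c("0", "1", "2", "3+"))
table(toy$FamilySize)
```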
#####################################
## Making use of the Ticket variable
#####################################

# print a sample of Ticket values


# compute the number of distinct values of the Ticket variable

# use tapply to compute the number of passengers on the same ticket


# create a data frame with ticket name and ticket count as variables


# print first few rows of the new data frame


# print the contingency table of the count variable


# merge titanic.all and ticket.count.df datasets on the Ticket variable


# change the name of the newly added column to PersonPerTicket


# print the contingency table of the PersonPerTicket variable


# set the PersonPerTicket to 4 for all observations where PersonPerTicket > 4


# convert PersonPerTicket to factor


# print the contingency table for the PersonPerTicket


# print the contingency table for the PersonPerTicket vs. FamilySize


# plot PersonPerTicket vs. Survived for passengers with a known Survived value (without NAs)


# plot the PersonPerTicket vs. Survived using proportions


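# The ticket counting above can be sketched as follows. titanic.toy is a
# hypothetical stand-in for titanic.all, and the assignment of passengers to
# tickets is made up. Note that merge() reorders rows by the key, so the
# sketch restores the PassengerId order afterwards.

```r
titanic.toy <- data.frame(
  PassengerId = 1:5,
  Ticket = c("A/5 21171", "PC 17599", "A/5 21171", "113803", "A/5 21171"),
  stringsAsFactors = FALSE
)

# Number of distinct tickets
length(unique(titanic.toy$Ticket))

# tapply groups PassengerId by Ticket and counts the members of each group
ticket.count <- tapply(titanic.toy$PassengerId,
                       INDEX = titanic.toy$Ticket,
                       FUN = length)

# Turn the named result into a two-column data frame
ticket.count.df <- data.frame(Ticket = names(ticket.count),
                              count = as.integer(ticket.count),
                              stringsAsFactors = FALSE)
head(ticket.count.df)

# Attach the count to every passenger; all.x keeps all passengers
titanic.toy <- merge(titanic.toy, ticket.count.df, by = "Ticket", all.x = TRUE)
titanic.toy <- titanic.toy[order(titanic.toy$PassengerId), ]

# Rename the new column
names(titanic.toy)[names(titanic.toy) == "count"] <- "PersonPerTicket"

# Cap at 4 and convert to a factor
titanic.toy$PersonPerTicket[titanic.toy$PersonPerTicket > 4] <- 4
titanic.toy$PersonPerTicket <- factor(titanic.toy$PersonPerTicket)
table(titanic.toy$PersonPerTicket)
```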
##################################
# Save the augmented data set
##################################

# split into train and test set based on whether the Survived is present


# save both data sets to a file


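# A sketch of the final split and save, assuming the test rows are exactly
# those with a missing Survived value. The data frame, the output directory
# (a temporary one here) and the file names are hypothetical; saveRDS is one
# of several ways to store a data frame with its factor levels intact.

```r
titanic.all <- data.frame(
  PassengerId = 1:5,
  Survived    = factor(c("No", "Yes", "No", NA, NA))
)

# Rows with a known outcome form the train set, the rest the test set
train.new <- titanic.all[!is.na(titanic.all$Survived), ]
test.new  <- titanic.all[is.na(titanic.all$Survived), ]

# Save each set to its own file (hypothetical names in a temp directory)
out.dir <- tempdir()
train.file <- file.path(out.dir, "train_new.rds")
test.file  <- file.path(out.dir, "test_new.rds")
saveRDS(train.new, train.file)
saveRDS(test.new, test.file)

# readRDS restores the object, including factor levels
train.restored <- readRDS(train.file)
```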