Advertisement
Guest User

Untitled

a guest
Oct 19th, 2019
164
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 17.50 KB | None | 0 0
  1. ---
  2. title: "Data Visualization"
  3. author: "Group C"
  4. date: "September 27, 2019"
  5. output:
  6. html_document: default
  7. pdf_document: default
  8. ---
  9.  
  10. ```{r setup, include = FALSE, eval = TRUE, message = FALSE, warning = FALSE, echo = FALSE}
  11. knitr::opts_chunk$set(echo = TRUE)
  12. knitr::opts_chunk$set(out.width='750px', dpi=200)
  13. ```
  14.  
  15. ```{r pressure, include = FALSE, eval = TRUE, message = FALSE, warning = FALSE, echo = FALSE}
  16. if(!'data.table'%in%installed.packages()){
  17. install.packages('data.table')}
  18. if(!'treemapify'%in%installed.packages()){
  19. install.packages('treemapify')}
  20. if(!'waffle'%in%installed.packages()){
  21. install.packages('waffle')}
  22. if(!'RColorBrewer'%in%installed.packages()){
  23. install.packages('RColorBrewer')}
  24. if(!'gtable'%in%installed.packages()){
  25. install.packages('gtable')}
  26. if(!'tidyverse'%in%installed.packages()){
  27. install.packages('tidyverse')}
  28. if(!'maps'%in%installed.packages()){
  29. install.packages('maps')}
  30. if(!'viridis'%in%installed.packages()){
  31. install.packages('viridis')}
  32. if(!'grid'%in%installed.packages()){
  33. install.packages('grid')}
  34. if(!'readxl'%in%installed.packages()){
  35. install.packages('readxl')}
  36. if(!'mosaic'%in%installed.packages()){
  37. install.packages('mosaic')}
  38. if(!'ggthemes'%in%installed.packages()){
  39. install.packages('ggthemes')}
  40.  
  41. library(data.table)
  42. library(treemapify)
  43. library(waffle)
  44. library(RColorBrewer)
  45. library(gtable)
  46. library(tidyverse)
  47. require(maps)
  48. require(viridis)
  49. library(grid)
  50. library(readxl)
  51. library(mosaic)
  52. library(ggthemes)
  53.  
  54. df <- data.table(readxl::read_xlsx("C:/Users/PC/Desktop/IE/Data Visualization/Assignment 1/AMEX 2016 Spending.xlsx"),
  55. row.names = seq(1,110))
  56. ```
  57.  
  58.  
  59. ## Theme Setting:
  60.  
  61. A custom theme is created to be used throughout the visualizations. The main colors are defined beforehand to avoid hard coding.
  62.  
  63. ```{r, eval=T, fig.width=7, fig.height=6, echo = FALSE}
  64.  
  65. # Defining the main colors:
  66. fill_color = '#C7522B'
  67. decoration_color = '#686868'
  68. main1_color = '#61795B'
  69. main2_color = '#9DB469'
  70.  
  71. #Create a theme we can apply everywhere
  72. group_c_theme<-theme_bw() + theme(
  73. panel.grid.major = element_blank(),
  74. panel.grid.minor = element_blank(),
  75. axis.title = element_text(size = 10, hjust = 0.5, color = decoration_color),
  76. axis.text = element_text(colour = decoration_color, size = 8),
  77. axis.ticks = element_blank(),
  78. axis.line = element_line(colour = decoration_color, size=0.3, linetype = '1F'),
  79. panel.border = element_blank(),
  80. panel.grid = element_blank(),
  81. strip.text = element_text(size = 12, color = decoration_color),
  82. strip.background =element_blank(),
  83. legend.position="None",
  84. plot.background = element_rect(fill = "#f5f5f2", color = NA),
  85. panel.background = element_rect(fill = "#f5f5f2", color = NA),
  86. plot.title = element_text(size = 26, hjust = 0.5, color = decoration_color)
  87. )
  88.  
  89. #Now we set the new defined theme to the default option
  90. theme_set(group_c_theme)
  91.  
  92. ```
  93.  
  94. # **An investigation into US citizen 'Andy Dufresne' conducted by the FBI, with specific focus on American Express blue card expenditures**
  95.  
  96.  
  97. ## Exploratory Data Analysis:
  98.  
  99. We first became aware of the suspect when our partners at American Express flagged an anomalously large transaction on his American Express blue card. We proceeded to pull his annual data in order to investigate for further suspicious behaviour.
  100.  
  101. The first step is to get a summary of the data as well as information regarding the data types. This sets the stage for all visualizations, as a good understanding of the data at hand allows for proper investigation.
  102.  
  103. ```{r, fig.width=7, fig.height=6, message = FALSE, warning = FALSE, echo = FALSE}
  104. head(df)
  105. summary(df)
  106. str(df)
  107. ```
  108.  
  109.  
  110.  
  111. ## Baseline Expenditures Analysis:
  112.  
  113. Starting this section, we take a look at Andy's expenses. The first chart we create plots time vs money spent. This is done in order to assess the full shape of the data, and to see if the large transaction - that of a Nissan automobile in December of the dataset - represented an outlier or fit with the suspect's usual purchasing activity. This chart reveals that the purchase was indeed an outlier.
  114.  
  115. To place this outlier in perspective we create a cumulative chart. This helps decrease the effect of the outlier on the data, and allows a clearer analysis of the rest of the year's purchasing data.
  116.  
  117. ```{r, eval=T, fig.width=7, fig.height=6, echo = FALSE}
  118.  
  119. # Spendings per time
  120. (amount_vs_time <- ggplot(df, aes(x=Date, y=Amount)) + geom_line(color = main1_color)
  121. + labs(title = "Spendings over time")+ xlab(""))+ ylab("Spendings")
  122.  
  123. # Accumulated spendings over the year
  124. (cumul_amount_vs_time <- ggplot(df, aes(x=Date)) + geom_ribbon(aes(ymin = 0, ymax = cumsum(Amount)),
  125. fill = main2_color, alpha = 0.7) +
  126. geom_line(aes(y = cumsum(Amount)),color = main1_color)+
  127. labs(title = "Accumulated spendings over time") + xlab("") + ylab("Cumulated Spendings"))
  128. ```
  129.  
  130. In order to dig deeper into the suspect's behaviour, an analysis of his purchased items is necessary; to see if they were in keeping with that of a citizen matching his profile. A treemap is assembled, showing that (taking account of the outlier) the majority of his purchases were on mundane everyday items one would expect from a law-abiding citizen. The only suspicious aspect revolved around an unusually large travel spend over the year.
  131.  
  132. ```{r, echo = FALSE}
  133. # Building a simple barplot overviewing the number of purchases per category
  134.  
  135. ggplot(df, aes(x=Category)) + geom_bar(stat='count', fill = fill_color, alpha = 0.8)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ylab("Number of transactions") + xlab("") + scale_x_discrete(limits = c("Communications", "Entertainment", "Business Services", "Other", "Restaurant", "Transportation", "Travel", "Merchandise & Supplies"))
  136.  
  137.  
  138. # Calculating total amount of spendings per category and subcategory
  139. df_treemap <- df[,list(Total_amount = sum(Amount)), by = c('Category', 'Subcategory')]
  140.  
  141. # Creating a treemap with custom colours divided by spendings per category and subcategory
  142. (categories_treemap <- ggplot(df_treemap, aes(area=Total_amount, label=Subcategory,
  143. fill = Category, subgroup=Category, subgroup2 = Subcategory)) + geom_treemap(alpha = 0.9)
  144. + geom_treemap_text(place='centre')) +
  145. geom_treemap_subgroup_border(color = "#303030", size = 2 ) +
  146. geom_treemap_subgroup2_border(color = "#303030", size = 0.5) +
  147. theme(legend.position="bottom",
  148. legend.background = element_rect(fill = "#f5f5f2", color = NA)) +
  149. scale_fill_manual(values = c("#A46B2C", "#C7522B", "#8A9B77", "#F7D086",
  150. "#E2E0B3", "#D07D41", "#61795B", "#9DB469"))
  151. ```
  152.  
  153.  
  154.  
  155. ## Removing the Outlier:
  156.  
  157. We take a closer look at Andy's expenses on a daily level, excluding abnomal auto expenses. We begin to track the suspect's movements around the US, UK and Europe in order to discover if there has been any suspicious increases.
  158.  
  159. ```{r, eval=T, fig.width=7, fig.height=6, echo = FALSE}
  160.  
  161. ### Transforming the data for Tufte chart
  162.  
  163. df_tufte <- df %>%
  164. filter(Subcategory != "Auto Services") %>%
  165. mutate(Country_group = derivedFactor(
  166. "Europe" = Country %in% c('Luxembourg', 'Germany', 'Sweden', 'Denmark', 'Netherlands', 'Ireland',
  167. 'Czech Republic','Portugal','France'),
  168. "USA" = Country %in% c('United States'),
  169. "UK" = Country %in% c('United Kingdom')
  170. ))
  171.  
  172.  
  173. # Tufte boxplot without outlier purchase of the car
  174. ggplot(df_tufte, aes(Country_group,Amount)) +
  175. geom_tufteboxplot(outlier.colour="transparent", size=0.7, color=main1_color) +
  176. labs(title = 'Spendings per region without the car') + xlab("")
  177.  
  178. ```
  179.  
  180.  
  181.  
  182. ## Quarterly Expenditure Analysis:
  183.  
  184. Our focus shifts towards Andy's expenses over time for different categories. We look at how much money he has spent on specific categories during the year. A quarterly analysis is then undertaken in order to determine if there had been a pattern to the suspect's purchases leading up to the anomalous transaction. This reveals a sudden unexplained shift in behaviour, as the suspect appears to have developed an overnight interest in September of the dataset in purchasing expensive tickets for music and entertainment events.
  185.  
  186. The data is presented in a waffle chart.
  187.  
  188. ```{r, eval=T, fig.width=9, fig.height=12, echo = FALSE}
  189.  
  190. ##### WAFFLE
  191.  
  192. # Transforming data to get percentage of spending per category and quarter
  193. seasonal_data <- df %>%
  194. mutate(quarter = quarters(Date))
  195.  
  196. business_services <- seasonal_data %>%
  197. filter(Category == 'Business Services') %>%
  198. group_by(quarter) %>%
  199. summarise(num_obs = n()) %>%
  200. mutate(percent = round(num_obs/sum(num_obs)*100))
  201.  
  202. # Defining the labels for the legend
  203. business_services_percent <- business_services$percent
  204. names(business_services_percent) <- business_services$quarter
  205.  
  206. communications <- seasonal_data %>%
  207. filter(Category == 'Communications') %>%
  208. group_by(quarter) %>%
  209. summarise(num_obs = n()) %>%
  210. mutate(percent = round(num_obs/sum(num_obs)*100))
  211.  
  212. communications_percent <- communications$percent
  213. names(communications_percent) <- communications$quarter
  214.  
  215. entertainment <- seasonal_data %>%
  216. filter(Category == 'Entertainment') %>%
  217. group_by(quarter) %>%
  218. summarise(num_obs = n()) %>%
  219. mutate(percent = round(num_obs/sum(num_obs)*100))
  220.  
  221. entertainment_percent <- entertainment$percent
  222. names(entertainment_percent) <- entertainment$quarter
  223.  
  224. merch <- seasonal_data %>%
  225. filter(Category == 'Merchandise & Supplies') %>%
  226. group_by(quarter) %>%
  227. summarise(num_obs = n()) %>%
  228. mutate(percent = round(num_obs/sum(num_obs)*100))
  229.  
  230. merch_percent <- merch$percent
  231. names(merch_percent) <- merch$quarter
  232.  
  233. other <- seasonal_data %>%
  234. filter(Category == 'Other') %>%
  235. group_by(quarter) %>%
  236. summarise(num_obs = n()) %>%
  237. mutate(percent = round(num_obs/sum(num_obs)*100))
  238.  
  239. other_percent <- other$percent
  240. names(other_percent) <- other$quarter
  241.  
  242. restaurant <- seasonal_data %>%
  243. filter(Category == 'Restaurant') %>%
  244. group_by(quarter) %>%
  245. summarise(num_obs = n()) %>%
  246. mutate(percent = round(num_obs/sum(num_obs)*100))
  247.  
  248. restaurant_percent <- restaurant$percent
  249. names(restaurant_percent) <- restaurant$quarter
  250.  
  251. transportation <- seasonal_data %>%
  252. filter(Category == 'Transportation') %>%
  253. group_by(quarter) %>%
  254. summarise(num_obs = n()) %>%
  255. mutate(percent = round(num_obs/sum(num_obs)*100))
  256.  
  257. transportation_percent <- transportation$percent
  258. names(transportation_percent) <- transportation$quarter
  259.  
  260. travel <- seasonal_data %>%
  261. filter(Category == 'Travel') %>%
  262. group_by(quarter) %>%
  263. summarise(num_obs = n()) %>%
  264. mutate(percent = round(num_obs/sum(num_obs)*100))
  265.  
  266. travel_percent <- travel$percent
  267. names(travel_percent) <- travel$quarter
  268.  
  269. iron(
  270. waffle::waffle(business_services_percent, rows=5, xlab = "Business Services",
  271. colors = c(main1_color, "#F7D086", fill_color), keep = FALSE,legend = "none"),
  272. waffle::waffle(communications_percent, rows=5, xlab = "Communications",
  273. colors = c(main1_color), keep = FALSE, legend = "none"),
  274. waffle::waffle(entertainment_percent, rows=5, xlab = "Entertainment",
  275. colors = c("#F7D086", fill_color), keep = FALSE, legend = "none"),
  276. waffle::waffle(merch_percent, rows=5, xlab = "Merchandise & Supplies",
  277. colors = c(main1_color, main2_color, "#F7D086", fill_color), keep = FALSE, legend = "none"),
  278. waffle::waffle(other_percent, rows=5, xlab = "Other",
  279. colors = c(main1_color, "#F7D086", fill_color), keep = FALSE, legend = "none"),
  280. waffle::waffle(restaurant_percent, rows=5, xlab = "Restaurant",
  281. colors = c(main1_color, main2_color, "#F7D086", fill_color), keep = FALSE, legend = "none"),
  282. waffle::waffle(transportation_percent, rows=5, xlab = "Transportation",
  283. colors = c(main1_color, main2_color, "#F7D086", fill_color), keep = FALSE, legend = "none"),
  284. waffle::waffle(travel_percent, rows=5, xlab = "Travel",
  285. colors = c(main1_color, main2_color, "#F7D086", fill_color), keep = FALSE, legend_pos = "bottom")
  286. )
  287.  
  288.  
  289. ```
  290.  
  291.  
  292.  
  293. ## Analyzing the Travel Patterns:
  294.  
  295. In order to investigate the suspect's travel activities, and specifically to see if any purchases were made in countries deemed detrimental to America's international position, we charted the suspect's movements on a global map. A world map is constructed displaying Andy's expenses in each country. The countries Andy spent money in are colored, whereas the ones he did not spend money in are left blank. Different shades of color are used to highlight the amount of money spent in each country.
  296.  
  297. This analysis reveals that the suspect's movements and international spending activities took place in low-threat states.
  298.  
  299. ```{r, eval=T, fig.width=12, fig.height=8, echo = FALSE}
  300.  
  301. ############# MAP
  302.  
  303. # Creating a vector with the countries that Andy traveled to
  304. europe <- c('UK', 'Luxembourg', 'Germany', 'Sweden', 'Denmark', 'Netherlands', 'Ireland',
  305. 'Czech Republic','Portugal','France', "USA")
  306. # Taking a set with coordinates of all the countries from the library maps
  307. world_map <- map_data('world')
  308.  
  309. # Transforming our data: calculating total spendings per country, replacing some of the names of the countries
  310. # so that they correspond to the world map dataset with coordinates, joining our data with the world map
  311. # dataset so that we have coordinates for all the countries, filtering longtitude and latitude to zoom in
  312. # Replacing NAs, so that countries Andy didn't spend money in, will be displayed
  313. data_replaced <- df %>%
  314. group_by(Country) %>%
  315. summarise(total_spending = sum(Amount)) %>%
  316. mutate(Country = replace(Country, Country =="United States", "USA")) %>%
  317. mutate(Country = replace(Country, Country =="United Kingdom", "UK")) %>%
  318. right_join(world_map, by = c("Country" = "region")) %>%
  319. filter(lat >= 20, long <= 30) %>%
  320. mutate(total_spending = replace_na(total_spending, -999))
  321.  
  322.  
  323. # Transforming total spendings to a discrete scale
  324. # Here we define equally spaced pretty breaks. They will be surrounded by the minimum value at the beginning
  325. # and the maximum value at the end.
  326.  
  327. pretty_breaks <- c(0,100,500,1000,5000,10000)
  328. # Find the extremes
  329. minVal <- min(data_replaced$total_spending, na.rm = T)
  330. maxVal <- round((max(data_replaced$total_spending, na.rm = T)),0)
  331. # Compute labels
  332. labels <- c()
  333. brks <- c(minVal, pretty_breaks, maxVal)
  334.  
  335. for(idx in 1:length(brks)){
  336. labels <- c(labels,round(brks[idx + 1], 2))
  337. }
  338. labels <- labels[1:length(labels)-1]
  339.  
  340. # Define a new variable on the data set just as above
  341. data_replaced$brks <- cut(data_replaced$total_spending,
  342. breaks = brks,
  343. include.lowest = TRUE,
  344. labels = labels)
  345. # Set sclaes for our breaks
  346. brks_scale <- levels(data_replaced$brks)
  347. labels_scale <- rev(brks_scale)
  348.  
  349.  
  350. # Creating custom theme for the map
  351. theme_map <- function(...) {
  352. theme_minimal() +
  353. theme(
  354. text = element_text(color = "#22211d"),
  355. axis.line = element_blank(),
  356. axis.text.x = element_blank(),
  357. axis.text.y = element_blank(),
  358. axis.ticks = element_blank(),
  359. axis.title.x = element_blank(),
  360. axis.title.y = element_blank(),
  361. panel.grid.major = element_blank() ,
  362. panel.grid.minor = element_blank(),
  363. plot.background = element_rect(fill = "#f5f5f2", color = NA),
  364. panel.background = element_rect(fill = "#f5f5f2", color = NA),
  365. panel.border = element_blank(),
  366. legend.position = c(0.15, 0.1),
  367. legend.text.align = 0,
  368. legend.text = element_text(size = 10, hjust = 0, color = "#4e4d47"),
  369. plot.title = element_text(size = 28, hjust = 0.8, color = "#4e4d47"),
  370. plot.subtitle = element_text(size = 22, hjust = 0.8, face = "italic", color = "#61795B"),
  371. legend.title = element_text(size = 18),
  372. plot.margin = unit(c(.5,.5,.2,.5), "cm"),
  373. panel.spacing = unit(c(-.1,0.2,.2,0.2), "cm"),
  374. plot.caption = element_text(size = 12, hjust = 0.92, color = "#61795B"),
  375. ...
  376. )
  377. }
  378.  
  379. # Creating map
  380. (p <- ggplot(data = data_replaced, aes(x = long,
  381. y = lat,
  382. group = group)) +
  383. # Country polygons
  384. geom_tile(aes(fill = brks)) +
  385. scale_alpha(name = "", range = c(0.6, 0), guide = F) +
  386. geom_polygon(aes(fill = brks)) +
  387. # Country outline
  388. geom_path(data = data_replaced, aes(x = long,
  389. y = lat,
  390. group = group),
  391. color = "white", size = 0.1) +
  392. theme_map() +
  393. labs(x = NULL,
  394. y = NULL,
  395. title = "Andy's total spendings per country",
  396. subtitle = "Andy spent the majority of his money in UK and USA",
  397. caption = "Total spendings in Europe were 5 times less than in UK and USA" ) +
  398. scale_fill_manual(
  399. values = c("#303030", "#61795B","#9DB469", "#E2E0B3", "#F7D086","#D07D41","#C7522B"),
  400. name = "Total Spendings",
  401. drop = FALSE,
  402. guide = guide_legend(
  403. direction = "horizontal",
  404. keyheight = unit(4, units = "mm"),
  405. keywidth = unit(100/length(labels), units = "mm"),
  406. title.position = 'top',
  407. title.hjust = 0.5,
  408. label.hjust = 1,
  409. nrow = 1,
  410. byrow = T,
  411. reverse = F,
  412. label.position = "bottom"
  413. )
  414. ))
  415.  
  416.  
  417. ```
  418.  
  419.  
  420.  
  421. ## Conclusion:
  422.  
  423. The investigation into Suspect Andy Dusfresne concluded as follows. While some of the suspect's activities were unusual, the car purchase was the only true outlier in the data. As such, there was not enough evidence to pursue the investigation further, and the case is now closed pending further discrepancies.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement