- ---- Probability Distributions ----
- # Normal Distribution: plot(seq(-3,3,.1),dnorm(seq(-3,3,.1)))
- 1. #Central Limit Theorem ( https://github.com/victoirv/CLTgif )
- Pick any data set of numbers (preferably > 100 meaningful numbers) and write a program that samples 10 numbers from it for a "sample-mean". Compute 10000 sample means and plot them as a histogram. The distribution should look like a normal distribution
- rands <- runif(10000,0,100)
- hist(sapply(1:10000, function(i){ mean(sample(rands,10)) }))
- 2. # Z-score to p: http://ncalculators.com/statistics/z-score-calculator.htm ; in R, use pnorm
- Given a normal distribution with mean 5 and standard deviation 3, what is the probability of drawing a value less than 2? (We will do this problem in class)
- > pnorm(2,5,3)
- [1] 0.1586553
- 3. # P-score to z: in R, use qnorm
- Given a normal distribution with mean 5 and standard deviation 3, 99.9% of values are greater than X. What is X? (Do in class)
- > qnorm(0.001 , 5, 3)
- [1] -4.270697
- 4. # 2-sided P-score to z
- Given a normal distribution with mean 5 and standard deviation 3, 80% of the values are between A and B. What are A and B?
- > qnorm(0.1 , 5, 3)
- [1] 1.155345
- > qnorm(0.9 , 5, 3)
- [1] 8.844655
- 5. You can use the binomial distribution to model how many of 10 fair coin flips would result in 4 or fewer heads (pbinom(4,10,0.5)). Similarly (although not needed for this problem), you can find that in 99% of cases you would see at most how many heads: qbinom(0.99,10,0.5). Finally, you can use dbinom to get the probability of exactly one count, e.g. dbinom(1,3,0.5) is 3/8, the probability of getting exactly 1 head after flipping a coin 3 times.
- For this problem, there is a 37% probability of rain on any given day. Given 10 random days, what is the probability that exactly 3 of the days have rain? What is the probability that at most 3 of the days have rain? What is the probability that at least 3 of the days have rain?
- Exactly 3?
- > pbinom(3,10,0.37) - pbinom(2,10,0.37)  # or simply dbinom(3,10,0.37)
- [1] 0.2394254
- At most 3?
- > pbinom(3,10,0.37)
- [1] 0.4599962
- At least 3?
- > 1-pbinom(2,10,0.37)
- [1] 0.7794292
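- The three answers are related, which gives a quick sanity check. A minimal sketch with the same parameters as above (variable names are illustrative):

```r
# Check the three rain-probability answers against each other (37% chance, 10 days)
p <- 0.37; n <- 10

exactly3 <- dbinom(3, n, p)       # P(X = 3)
atmost3  <- pbinom(3, n, p)       # P(X <= 3)
atleast3 <- 1 - pbinom(2, n, p)   # P(X >= 3) = 1 - P(X <= 2)

# dbinom and the pbinom difference agree on "exactly 3"
stopifnot(all.equal(exactly3, pbinom(3, n, p) - pbinom(2, n, p)))

# "at most 3" and "at least 3" overlap only at exactly 3, so together they cover everything once
stopifnot(all.equal(atmost3 + atleast3 - exactly3, 1))
```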
- ---- Data Formats ----
- 6. This API matches IP addresses to countries: https://api.ip2country.info/ip?8.8.8.8 . The site asks clients to throttle to 2 requests / second: https://ip2country.info/humans.txt . Additionally, according to XKCD, the IP address space is mapped like this: https://xkcd.com/195/ . Write a program that queries 0-255.{66,127,191,255}.0.0 (a total of 256*4 = 1024 queries, which should take just over 30 minutes to run at that rate). Compile a histogram of which countries have the most IP address space registered to them. If you do not have enough time, then just explore the 0-255.0.0.0 address space.
- require(RCurl)     # getURL
- require(jsonlite)  # fromJSON, used below
- ips <- as.character(sapply(0:255,function(x){ paste(x,c(66,127,191,255),"0.0",sep=".") }))
- ipdata <- list()
- for(ip in ips) {
- data <- getURL(paste("https://api.ip2country.info/ip?",ip,sep=""))
- message(ip);
- message(data);
- ipdata[[ip]] <- data
- Sys.sleep(2)
- }
- ipdataList <- lapply(ipdata, fromJSON)  # parse the JSON responses
- countryByIP <- data.frame(cbind(sapply(ipdataList, function(x) { x[[2]] })))
- > sort(table(countryByIP))
- countryByIP
- BEN BGD BGR BHS BIH CHL CIV CRI DOM DZA ECU ETH GHA IRQ KWT LBY LKA LTU MLT MYS PAN PRY ROU SRB SVK TZA UKR VEN AZE BLR EST GRC PAK
- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
- SAU AUT EGY ESP EU IRL ISR NGA NZL PER PHL HUN MAR CHE DNK FIN IRN NOR TUR ARG BEL COL HKG MEX SWE VNM IDN IND SGP ZAF POL TWN NLD
- 2 3 3 3 3 3 3 3 3 3 3 4 4 5 5 5 5 5 5 6 6 6 6 6 6 6 7 7 7 7 9 9 11
- ITA AUS CAN RUS FRA BRA KOR DEU GBR JPN CHN USA
- 12 14 15 15 22 23 24 25 26 41 80 158 362
- 7. Pick a dataset from http://www.stat.ufl.edu/~winner/datasets.html , and do a linear regression on it. Write a sentence on what your finding implies. (Include the link to the dataset and your code in the answer)
- Data description: http://www.stat.ufl.edu/~winner/data/birthrate.txt
- Data: http://www.stat.ufl.edu/~winner/data/birthrate.dat
- require(RCurl)  # getURL
- data <- getURL("http://www.stat.ufl.edu/~winner/data/birthrate.dat")
- data <- gsub("[ ]{2,}","\t",data)
- data <- read.table(textConnection(data), sep="\t")
- data <- data.frame(data)
- names(data) <- c("nation","birthrate","capincome","poponfarms","infantmort")
- > summary(lm(data$birthrate ~ data$infantmort))
- Call:
- lm(formula = data$birthrate ~ data$infantmort)
- Residuals:
- Min 1Q Median 3Q Max
- -15.583 -4.995 -1.893 5.131 18.190
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 13.43237 2.75145 4.882 3.83e-05 ***
- data$infantmort 0.21574 0.04587 4.703 6.25e-05 ***
- ---
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Residual standard error: 7.285 on 28 degrees of freedom
- Multiple R-squared: 0.4413, Adjusted R-squared: 0.4214
- F-statistic: 22.12 on 1 and 28 DF, p-value: 6.245e-05
- Birth rates are positively correlated with infant mortality, with high significance.
- 8. Find a JSON dataset from https://github.com/toddmotto/public-apis that does not require any authentication AND where a linear regression returns a high correlation coefficient.
- require(RCurl); require(jsonlite)  # getURL / fromJSON
- pokemonjson <- list()
- for(id in 1:721) {
- poke <- getURL(paste("http://pokeapi.co/api/v2/pokemon/",id,"/",sep=""))
- message(poke)
- message(id)
- pokemonjson[[id]] <- poke
- Sys.sleep(1)
- }
- pokemonList <- list()
- for(id in 1:length(pokemonjson)) {
- message(id)
- pokemonList[[id]] <- fromJSON(pokemonjson[[id]])
- }
- pbody <- data.frame(t(sapply(pokemonList, function(x) { c(x$weight,x$height,length(x$moves),x$base_experience ) })))
- names(pbody) <- c("w","h","moves","expbase")
- summary(lm(w ~ h + moves + expbase + I(h*expbase) + I(moves*expbase), data=pbody))
- 9. Yahoo Finance provides CSV time series data on the "Adjusted Close" price of various financial instruments. For example, Google's holding company, "Alphabet", is here: http://finance.yahoo.com/q/hp?s=GOOG+Historical+Prices . At the bottom of the page, there is a download-CSV link. For this problem, find 2 stocks whose prices have a correlation > 0.6. Show the code that you used to perform your analysis.
- snp <- read.csv("http://chart.finance.yahoo.com/table.csv?s=^GSPC&a=5&b=14&c=2016&d=6&e=14&f=2016&g=d&ignore=.csv")
- nasdaq <- read.csv("http://chart.finance.yahoo.com/table.csv?s=^IXIC&a=5&b=14&c=2016&d=6&e=14&f=2016&g=d&ignore=.csv")
- > summary(lm(snp$Adj.Close ~ nasdaq$Adj.Close))$r.squared
- [1] 0.9691687
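- Note that summary(lm(...))$r.squared reports r^2, while the problem asks for the correlation r; for a simple regression of one series on another, r^2 is exactly the squared correlation, so cor() is the direct check. A small sketch with synthetic series (the variable names and data are illustrative, not real stock prices):

```r
# For a simple regression y ~ x, R-squared equals cor(x, y)^2
set.seed(1)
x <- cumsum(rnorm(250))           # stand-in "price" series (random walk)
y <- 0.8 * x + rnorm(250, sd = 2) # a second, correlated series

r  <- cor(x, y)
r2 <- summary(lm(y ~ x))$r.squared
stopifnot(all.equal(r^2, r2))
```

So for the answer above, the correlation between the two indices is sqrt(0.9691687), comfortably above 0.6.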
- 10. weather.gov provides an XML interface given longitude and latitude: http://forecast.weather.gov/MapClick.php?lat=29.803&lon=-82.411&FcstType=digitalDWML . After saving the XML file to your workspace (check getwd() to see where your workspace is), you can use the following code to parse it:
- require(XML)
- data <- xmlParse("file.xml")
- xml_data <- xmlToList(data)
- location <- as.list(xml_data[["data"]][["location"]][["point"]])
- start_time <- unlist(xml_data[["data"]][["time-layout"]][
- names(xml_data[["data"]][["time-layout"]]) == "start-valid-time"])
- temps <- xml_data[["data"]][["parameters"]]
- temps <- temps[names(temps) == "temperature"]
- temps <- temps[sapply(temps, function(x) any(unlist(x) == "hourly"))]
- temps <- unlist(temps[[1]][sapply(temps, names) == "value"])
- out <- data.frame(
- as.list(location),
- "start_valid_time" = start_time,
- "hourly_temperature" = temps)
- #answer starts here:
- windspeed <- xml_data[["data"]][["parameters"]]
- windspeed <- windspeed[names(windspeed)=="wind-speed"]
- windspeed <- windspeed[sapply(windspeed, function(x) any(unlist(x) == "sustained"))]
- windspeed <- as.numeric(unlist(windspeed[[1]][sapply(windspeed, names) == "value"]))
- out$windspeed <- windspeed
- #out$windspeed <- as.numeric(windspeed[[1]][1:dim(out)[1]])
- precip <- xml_data[["data"]][["parameters"]]
- precip <- precip[names(precip)=="probability-of-precipitation"]
- precip <- precip[sapply(precip, function(x) any(unlist(x) == "floating"))]
- precip <- as.numeric(unlist(precip[[1]][sapply(precip, names) == "value"]))
- out$precip <- precip
- head(out)
- For this problem, add the "wind-speed", "probability-of-precipitation", and humidity columns to the data frame.
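- A sketch for the humidity column, continuing the snippet above and following the same pattern as the wind-speed and precipitation columns. It assumes the DWML file tags humidity as a "humidity" element of type "relative"; check names(xml_data[["data"]][["parameters"]]) if your file differs.

```r
# Humidity column, same extraction pattern as wind-speed above.
# Assumes a "humidity" element with type "relative" (an assumption; verify in your XML).
humidity <- xml_data[["data"]][["parameters"]]
humidity <- humidity[names(humidity) == "humidity"]
humidity <- humidity[sapply(humidity, function(x) any(unlist(x) == "relative"))]
humidity <- as.numeric(unlist(humidity[[1]][sapply(humidity, names) == "value"]))
out$humidity <- humidity
head(out)
```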
- ---- ISLR Problems ----
- 11. From the ISLR textbook page 121 (Section 3.7), Do problem #8 (a.i, a.ii, a.iii, a.iv, b, and c)
- http://blog.princehonest.com/stat-learning/ch3/applied.html
- 12. From the ISLR textbook page 122 (Section 3.7), Do problem #9 (c.i,c.ii,c.iii)
- 13. From the ISLR textbook page 123 (Section 3.7), Do problem #10 (a,b)
- ---- Simple Regression in Java ----
- 14. (Java Exercise) Use this linear regression class and any data to find the slope & intercept, and determine the r^2: http://introcs.cs.princeton.edu/java/97data/LinearRegression.java.html
- 15. (Java Exercise) The Apache Math library can be used for linear regression. For this problem, we will use their library to perform a simple linear regression.
- a. Create a new Maven project in Eclipse: http://www.tech-recipes.com/rx/39279/create-a-new-maven-project-in-eclipse/
- b. Add their dependency from https://mvnrepository.com/artifact/org.apache.commons/commons-math3/3.6.1 to your pom.xml:
- <dependency>
- <groupId>org.apache.commons</groupId>
- <artifactId>commons-math3</artifactId>
- <version>3.6.1</version>
- </dependency>
- c. Create a simple project that uses the SimpleRegression class as outlined by http://commons.apache.org/proper/commons-math/userguide/stat.html#a1.4_Simple_regression
- ---- Problems on Regression ----
- 1. Given x, y, and the model y_est = P_0 + P_1 * x, compute the Residual Sum of Squares ("RSS"). Recall that RSS = (e_1)^2 + (e_2)^2 + (e_3)^2 + ... + (e_n)^2, where e_i = y_i - (P_0 + P_1 * x_i).
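- A minimal sketch of the RSS computation, using made-up x, y, P_0, P_1 values just to illustrate:

```r
# Residual Sum of Squares for y_est = P0 + P1 * x
# (x, y, P0, P1 are illustrative toy values, not from a specific dataset)
x  <- c(1, 2, 3, 4)
y  <- c(2.1, 3.9, 6.2, 7.8)
P0 <- 0.1; P1 <- 1.95

e   <- y - (P0 + P1 * x)  # residuals e_i = y_i - (P0 + P1 * x_i)
rss <- sum(e^2)
rss
```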
- 2. In a simple linear regression with the model y_estimate = P_0 + P_1 * x, the parameters ("coefficients") P_0 and P_1 can be calculated with a formula that minimizes the Residual Sum of Squares. The equation for this "least squares" approach is P_1 (the slope) = sum_over_i( (x_i - mean(x)) * (y_i - mean(y)) ) / sum_over_i( (x_i - mean(x))^2 ), and P_0 (the intercept) = mean(y) - P_1 * mean(x). Note that the derivation of these two equations requires calculus and is beyond the scope of this course. Given arrays xdiff = x_i - mean(x) and ydiff = y_i - mean(y), as well as mean(x) and mean(y), compute P_0 and P_1.
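- A sketch of the least-squares formulas on toy data, checked against R's lm() (variable names are illustrative):

```r
# Least-squares slope and intercept from the closed-form formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

xdiff <- x - mean(x)
ydiff <- y - mean(y)

P1 <- sum(xdiff * ydiff) / sum(xdiff^2)  # slope
P0 <- mean(y) - P1 * mean(x)             # intercept

# lm() computes the same coefficients
fit <- lm(y ~ x)
stopifnot(all.equal(P0, unname(coef(fit)[1])))
stopifnot(all.equal(P1, unname(coef(fit)[2])))
```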
- 3. The Residual Standard Error ("RSE") is computed as sqrt(RSS/(n-2)), and it can be used to calculate the standard errors of the parameters P_0 and P_1. Specifically, SE(P_0) = sqrt( RSE^2 * (1/n + (mean(x)^2) / sum_over_i((x_i - mean(x))^2)) ), and SE(P_1) = sqrt( RSE^2 / sum_over_i((x_i - mean(x))^2) ). Please compute SE(P_0) and SE(P_1) for the linear least squares model y_est = P_0 + x*P_1.
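- A sketch computing the standard errors by hand on toy data and comparing them to the Std. Error column of summary(lm()):

```r
# Standard errors of the least-squares coefficients (toy data, illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)
rse <- sqrt(rss / (n - 2))       # Residual Standard Error

sxx   <- sum((x - mean(x))^2)
se_P0 <- sqrt(rse^2 * (1/n + mean(x)^2 / sxx))
se_P1 <- sqrt(rse^2 / sxx)

# Agrees with the Std. Error column of summary(lm())
stopifnot(all.equal(c(se_P0, se_P1),
                    unname(summary(fit)$coefficients[, "Std. Error"])))
```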
- 4. One way to test whether a linear model (y = P_0 + P_1 * x) was effective is to look at the "t-statistic" of the slope parameter P_1. For this problem, rather than manually computing t_1 = (P_1 - 0)/SE(P_1), use R to compute the t-statistic. Also use confint to compute the resulting 95% confidence interval.
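- A sketch reading the t-statistic straight out of summary(lm()) and calling confint(), on toy data:

```r
# t-statistic of the slope from summary(), plus the 95% confidence interval
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)
t1  <- summary(fit)$coefficients["x", "t value"]

# The same number computed by hand: estimate / standard error
est <- summary(fit)$coefficients["x", "Estimate"]
se  <- summary(fit)$coefficients["x", "Std. Error"]
stopifnot(all.equal(t1, est / se))

confint(fit, level = 0.95)  # 95% CI for both intercept and slope
```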