{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applied Data Science Capstone - Part I"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Which is the best neighborhood in NYC to run my new business ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main objective of this project is to create a solution that get location data from New York City and find some recommendation of where are the best places to run a new business, according to density of business, visitation in the area and users rating for this categories  in the neighborhoods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Business problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we try to answer the main question *Which is the best neighborhood in NYC to run my new business ?*, we need answer another questions:\n",
    "\n",
    "3. Which neighborhoods receive more visitors/customers ? \n",
    "1. Which neighborhoods there is few business like of yours ?\n",
    "2. Which neighborhoods has worst average rates from customers for this business category ?\n",
    "\n",
    "Of course, there are lots of questions that an entrepreneur should answer before we can start a new business, but we are data scientists so we will find the best answers with available data for them !"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Audience"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The audience for the results of this project is everyone that are analyzing neighborhoods and venues data and are looking for insights. The results will be presented in a straight and clear manner, so anyone that is interested in this subject like students, entrepreneurs, business analysts and so on can understand them and get some value, not only *number-crunchers*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data needed to solve the problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to create a solution this problem, we need information of venues based on geographical data (latitude, longitude), ratings for that venues, # of visitors and its consumption profiles for each neighborhood. For our luck, some of that information we can get from FOURSQUARE API !\n",
    "For question 1, we will use the # of ratings posted about venues as a proxy for the # of visitors in the neighborhoods, then we can obtain it from FOURSQUARE API too !"
   ]
  },
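  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of how such a request might be built, the cell below assembles a URL for the legacy FOURSQUARE v2 venue search endpoint. The choice of endpoint, the helper name `build_search_url`, the default radius and limit, and the placeholder credentials and category id are all assumptions made for illustration; the cell only builds the URL and does not send a request."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urlencode\n",
    "\n",
    "# Legacy FOURSQUARE v2 venue search endpoint (an assumption: this notebook\n",
    "# does not state which endpoint or API version it relies on).\n",
    "SEARCH_URL = 'https://api.foursquare.com/v2/venues/search'\n",
    "\n",
    "\n",
    "def build_search_url(lat, lng, category_id, client_id, client_secret,\n",
    "                     radius=500, limit=50, version='20180605'):\n",
    "    '''Return a request URL for venues of one category around a point.'''\n",
    "    params = {\n",
    "        'client_id': client_id,\n",
    "        'client_secret': client_secret,\n",
    "        'v': version,            # API version date, YYYYMMDD\n",
    "        'll': f'{lat},{lng}',    # latitude,longitude of the search center\n",
    "        'categoryId': category_id,\n",
    "        'radius': radius,        # search radius in meters\n",
    "        'limit': limit,          # maximum number of venues to return\n",
    "    }\n",
    "    return SEARCH_URL + '?' + urlencode(params)\n",
    "\n",
    "\n",
    "# Placeholder credentials and category id, for illustration only.\n",
    "url = build_search_url(40.7128, -74.0060, 'CATEGORY_ID', 'MY_ID', 'MY_SECRET')\n",
    "print(url)"
   ]
  },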
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Sources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main source of data is FOURSQUARE API. However, we need to preprocessing and transform them to produce new datasets that will be used on machine learning models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Methodology"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic ideas underlying this solution are:\n",
    "1. Users must choose a business category based on FOURSQUARE venue´s categories\n",
    "\n",
    "2. I will build a dataframe to be the input of machine learning model with the following columns:\n",
    "    - NEIGHBORHOOD\n",
    "    - VENUE_COUNT\n",
    "    - USER_COUNT\n",
    "    - CATEGORY_AVG\n",
    "    \n",
    "3. I will process FOURSQUARE data and:\n",
    "\n",
    "    a. Count business for selected category for each neighborhood and write in the VENUE_COUNT column.\n",
    "    \n",
    "    b. Count user ratings for all business for each neighborhood for this category and write in the USER_COUNT column.\n",
    "    \n",
    "    c. Get the average rating for all business in this category for each neighborhood and write in CATEGORY_AVG column.\n",
    "    \n",
    "    d. Get the absolute minimum and maximum count of user ratings for this category and store in MIN_USER_COUNT and MAX_USER_COUNT variables.\n",
    "    \n",
    "    e. Get the absolute minimum and maximum # of venues for this category and store in MIN_VENUE_COUNT and MAX_VENUE_COUNT variables.\n",
    "    \n",
    "    f. Get the absolute minimum and maximum average ratings for this category and store in MIN_CATEGORY_AVG and MAX_CATEGORY_AVG variables.\n",
    "\n",
    "4. Normalize data for columns VENUE_COUNT, USER_COUNT and CATEGORY_AVG.\n",
    "\n",
    "5. Run models for all neighborhoods with the above calculated variables (VENUE_COUNT, USER_COUNT, CATEGORY_AVG)\n",
    "\n",
    "6. Predict the best places for a new business (VENUE_COUNT = 0, USER_COUNT = 1, CATEGORY_AVG = 0)\n",
    "\n",
    "7. The results will be a rank of neighborhoods with minimal distances, ascendenting ordered."
   ]
  },
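  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aggregation and normalization steps above can be sketched as follows. This is a minimal illustration, assuming min-max scaling to the range [0, 1] (which matches the stored minimum and maximum values, but is not stated explicitly). The venue records are made-up values rather than real FOURSQUARE data, and a plain dictionary stands in for the dataframe so that the cell has no third-party dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# Hypothetical venue records: (neighborhood, number of ratings, rating).\n",
    "# These values are made up for illustration, not real FOURSQUARE data.\n",
    "venues = [\n",
    "    ('A', 120, 8.0),\n",
    "    ('A', 80, 7.0),\n",
    "    ('B', 300, 9.0),\n",
    "    ('C', 50, 6.0),\n",
    "    ('C', 30, 5.0),\n",
    "    ('C', 20, 7.0),\n",
    "]\n",
    "\n",
    "# Step 3 (a-c): aggregate the venues of each neighborhood.\n",
    "groups = defaultdict(list)\n",
    "for neighborhood, rating_count, rating in venues:\n",
    "    groups[neighborhood].append((rating_count, rating))\n",
    "\n",
    "table = {}\n",
    "for neighborhood, rows in groups.items():\n",
    "    table[neighborhood] = {\n",
    "        'VENUE_COUNT': len(rows),\n",
    "        'USER_COUNT': sum(count for count, _ in rows),\n",
    "        'CATEGORY_AVG': sum(rating for _, rating in rows) / len(rows),\n",
    "    }\n",
    "\n",
    "\n",
    "def min_max_normalize(table, column):\n",
    "    '''Rescale one column to [0, 1] in place (steps 3 d-f and 4).'''\n",
    "    values = [row[column] for row in table.values()]\n",
    "    lowest, highest = min(values), max(values)\n",
    "    span = highest - lowest\n",
    "    for row in table.values():\n",
    "        # If every neighborhood has the same value, map the column to 0.\n",
    "        row[column] = (row[column] - lowest) / span if span else 0.0\n",
    "\n",
    "\n",
    "for column in ('VENUE_COUNT', 'USER_COUNT', 'CATEGORY_AVG'):\n",
    "    min_max_normalize(table, column)\n",
    "\n",
    "for neighborhood, row in sorted(table.items()):\n",
    "    print(neighborhood, row)"
   ]
  },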
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Machine Learning algorithms selected"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will try K-Nearest Neighborhood with K = 1 (a single neighborhood) and Multiple Linear Regression and compare results."
   ]
  },
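  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the nearest-neighbor idea concrete, the cell below is a small sketch that ranks neighborhoods by their distance to the ideal profile (VENUE_COUNT = 0, USER_COUNT = 1, CATEGORY_AVG = 0). It assumes Euclidean distance, which is a common default but is not fixed by the methodology, and it uses made-up, already normalized feature values. With K = 1, the answer is simply the first entry of the ranking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Hypothetical normalized features per neighborhood, in the order\n",
    "# (VENUE_COUNT, USER_COUNT, CATEGORY_AVG), each already scaled to [0, 1].\n",
    "features = {\n",
    "    'A': (0.5, 0.5, 0.5),\n",
    "    'B': (0.0, 1.0, 1.0),\n",
    "    'C': (1.0, 0.0, 0.0),\n",
    "    'D': (0.1, 0.8, 0.2),\n",
    "}\n",
    "\n",
    "# Ideal profile: no competitors, the most visitors, and the lowest\n",
    "# ratings for the existing businesses.\n",
    "IDEAL = (0.0, 1.0, 0.0)\n",
    "\n",
    "\n",
    "def distance(point, target=IDEAL):\n",
    "    '''Euclidean distance between two feature vectors (an assumed metric).'''\n",
    "    return math.sqrt(sum((p - t) ** 2 for p, t in zip(point, target)))\n",
    "\n",
    "\n",
    "# Rank every neighborhood by distance to the ideal profile, closest first.\n",
    "ranking = sorted(features, key=lambda name: distance(features[name]))\n",
    "\n",
    "# With K = 1, the nearest neighbor is the first entry of the ranking.\n",
    "best = ranking[0]\n",
    "print(ranking, best)"
   ]
  },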
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}