- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Applied Data Science Capstone - Part I"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Which is the best neighborhood in NYC to run my new business?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Introduction"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The main objective of this project is to build a solution that takes location data for New York City and recommends the best places to open a new business, based on the density of businesses, visits to the area, and user ratings for the chosen category in each neighborhood."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Business problem"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To answer the main question, *Which is the best neighborhood in NYC to run my new business?*, we first need to answer three others:\n",
- "\n",
- "1. Which neighborhoods receive the most visitors/customers?\n",
- "2. Which neighborhoods have few businesses like yours?\n",
- "3. Which neighborhoods have the worst average customer ratings for this business category?\n",
- "\n",
- "Of course, an entrepreneur should answer many other questions before starting a new business, but as data scientists we will look for the best answers the available data can give."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Audience"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The audience for this project is anyone who analyzes neighborhood and venue data and is looking for insights. The results will be presented in a clear and straightforward manner, so that students, entrepreneurs, business analysts and anyone else interested in the subject can understand them and get value from them, not only *number-crunchers*."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Data"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Data needed to solve the problem"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To solve this problem, we need venue information based on geographical data (latitude, longitude), the ratings for those venues, the number of visitors, and their consumption profiles for each neighborhood. Luckily, we can get some of that information from the FOURSQUARE API!\n",
- "For question 1, we will use the number of ratings posted about venues as a proxy for the number of visitors in each neighborhood, so we can obtain that from the FOURSQUARE API too."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Data Sources"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The main source of data is the FOURSQUARE API. However, we need to preprocess and transform the data to produce new datasets that will be used in the machine learning models."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Methodology"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The basic ideas underlying this solution are:\n",
- "1. Users must choose a business category from the FOURSQUARE venue categories.\n",
- "\n",
- "2. I will build a dataframe to serve as input to the machine learning models, with the following columns:\n",
- " - NEIGHBORHOOD\n",
- " - VENUE_COUNT\n",
- " - USER_COUNT\n",
- " - CATEGORY_AVG\n",
- " \n",
- "3. I will process FOURSQUARE data and:\n",
- "\n",
- " a. Count the businesses in the selected category for each neighborhood and write the result to the VENUE_COUNT column.\n",
- " \n",
- " b. Count the user ratings for all businesses in this category for each neighborhood and write the result to the USER_COUNT column.\n",
- " \n",
- " c. Compute the average rating of all businesses in this category for each neighborhood and write it to the CATEGORY_AVG column.\n",
- " \n",
- " d. Get the overall minimum and maximum user rating counts for this category and store them in the MIN_USER_COUNT and MAX_USER_COUNT variables.\n",
- " \n",
- " e. Get the overall minimum and maximum number of venues for this category and store them in the MIN_VENUE_COUNT and MAX_VENUE_COUNT variables.\n",
- " \n",
- " f. Get the overall minimum and maximum average ratings for this category and store them in the MIN_CATEGORY_AVG and MAX_CATEGORY_AVG variables.\n",
- "\n",
- "4. Normalize the VENUE_COUNT, USER_COUNT and CATEGORY_AVG columns.\n",
- "\n",
- "5. Run the models for all neighborhoods with the variables calculated above (VENUE_COUNT, USER_COUNT, CATEGORY_AVG).\n",
- "\n",
- "6. Predict the best places for a new business by querying the ideal point (VENUE_COUNT = 0, USER_COUNT = 1, CATEGORY_AVG = 0): no competitors, the most visitors, and the lowest ratings for existing businesses.\n",
- "\n",
- "7. The result will be a ranking of neighborhoods by distance to this ideal point, in ascending order."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Machine Learning algorithms selected"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We will try K-Nearest Neighbors with K = 1 (a single nearest neighborhood) and Multiple Linear Regression, and compare the results."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.8"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
- }
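The Methodology steps in the notebook above (aggregate per neighborhood, normalize, rank by distance to the ideal point) might be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions, not the notebook's implementation: the `VENUES` records are made-up values rather than Foursquare data, the helper names `aggregate`, `normalize` and `rank` are hypothetical, CATEGORY_AVG is assumed to be an unweighted mean of venue ratings, normalization is assumed to be min-max scaling, and the distance is assumed to be Euclidean.

```python
import math
from collections import defaultdict

# Hypothetical venue records for one business category:
# (neighborhood, number of user ratings, average rating).
# Made-up values for illustration only, not Foursquare data.
VENUES = [
    ("Astoria", 120, 8.1),
    ("Astoria", 80, 7.5),
    ("Chelsea", 300, 9.0),
    ("Chelsea", 250, 8.6),
    ("Chelsea", 150, 8.8),
    ("Harlem", 200, 6.0),
]

COLUMNS = ["VENUE_COUNT", "USER_COUNT", "CATEGORY_AVG"]

# Ideal point from step 6: no competitors, most visitors, worst ratings.
IDEAL = {"VENUE_COUNT": 0.0, "USER_COUNT": 1.0, "CATEGORY_AVG": 0.0}


def aggregate(venues):
    """Steps 2-3: build one row per neighborhood."""
    grouped = defaultdict(list)
    for neighborhood, rating_count, rating in venues:
        grouped[neighborhood].append((rating_count, rating))
    rows = {}
    for neighborhood, items in grouped.items():
        rows[neighborhood] = {
            "VENUE_COUNT": len(items),
            "USER_COUNT": sum(count for count, _ in items),
            # Assumption: unweighted mean of the venue ratings.
            "CATEGORY_AVG": sum(r for _, r in items) / len(items),
        }
    return rows


def normalize(rows, columns):
    """Step 4: min-max scale each column to the range [0, 1]."""
    scaled = {name: {} for name in rows}
    for col in columns:
        values = [row[col] for row in rows.values()]
        low, high = min(values), max(values)
        span = high - low
        for name, row in rows.items():
            # A constant column carries no information; map it to 0.
            scaled[name][col] = (row[col] - low) / span if span else 0.0
    return scaled


def rank(scaled, ideal):
    """Steps 6-7: sort neighborhoods by Euclidean distance to the ideal point."""
    distances = {
        name: math.sqrt(sum((row[col] - ideal[col]) ** 2 for col in ideal))
        for name, row in scaled.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


ranking = rank(normalize(aggregate(VENUES), COLUMNS), IDEAL)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

With K = 1, a nearest-neighbor query for the ideal point would return the first entry of this ranking; the full sorted list corresponds to the ranking described in step 7.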