a guest Feb 17th, 2019 74 Never
- Python Data Engineer Assignment
- Hello! We are very happy that you decided to work on our programming homework. It will be fun, we promise! ;-)
- The idea of the exercise is not to write the perfect solution, but rather to get a better feel for your problem-solving skills, your understanding of the dataset you are working with, your coding style and the tools you decide to use.
- This will make the following interviews much easier for you, since you will have already shown your technical skills on a realistic problem.
- The exercise uses the famous New York City taxi dataset. It contains data for 1.1 billion taxi rides which occurred in New York City between 2009 and 2015.
- Additionally, you will use a weather-related dataset, with meteorological measurements from a National Climatic Data Center weather station located in Central Park.
- The compulsory part of the exercise is composed of the following parts:
- ● Select the rides which started from Manhattan and ended in John Fitzgerald Kennedy International Airport. The filtering does not need to be perfect but 80%
- ● Identify whether there is a correlation between the number of rides within a day and the weather, and more specifically the precipitation. The answer can be provided as free text, graphs, spreadsheets, csv files or any other way you find the most appropriate.
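- As an illustration only, the two compulsory steps could be sketched as follows. This is a minimal sketch, not a reference solution: the bounding-box coordinates are rough assumptions, the field names (pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, pickup_date) are hypothetical, and Pearson correlation is just one possible measure of the relationship.

```python
from collections import Counter
from math import sqrt

# Rough, assumed bounding boxes as (min_lat, max_lat, min_lon, max_lon).
# They are approximations; the brief says the filter need not be perfect.
MANHATTAN_BOX = (40.70, 40.88, -74.02, -73.91)
JFK_BOX = (40.62, 40.67, -73.83, -73.74)


def in_box(lat, lon, box):
    """Return True if the point lies inside the rectangular box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def is_manhattan_to_jfk(ride):
    """Check pickup in the Manhattan box and drop-off in the JFK box.

    `ride` is a dict with hypothetical keys; adapt them to the real columns.
    """
    return (
        in_box(ride["pickup_lat"], ride["pickup_lon"], MANHATTAN_BOX)
        and in_box(ride["dropoff_lat"], ride["dropoff_lon"], JFK_BOX)
    )


def rides_per_day(rides):
    """Count the filtered rides for each pickup date."""
    return Counter(r["pickup_date"] for r in rides if is_manhattan_to_jfk(r))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def rides_vs_precipitation(rides, precipitation_by_day):
    """Correlate daily precipitation with the daily filtered ride count.

    Days present in the weather data but without filtered rides count as 0.
    """
    counts = rides_per_day(rides)
    days = sorted(precipitation_by_day)
    precipitation = [precipitation_by_day[d] for d in days]
    ride_counts = [counts.get(d, 0) for d in days]
    return pearson(precipitation, ride_counts)
```

- A rectangular box is the simplest possible filter and will include some points outside Manhattan; a polygon-based test would be more precise at the cost of extra dependencies.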
- The exercise also has a bonus part, where you will need to:
- ● Visualize all of the filtered rides.
- ● Provide a way in your visualization to distinguish the filtered rides by precipitation level.
- You may answer none, one or both of the questions in the bonus part.
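- For the second bonus item, one possible preparation step is to bucket each ride by the precipitation measured on its pickup day, so that each bucket can be drawn in its own colour by whatever plotting tool is chosen. The sketch below assumes hypothetical level names, a hypothetical threshold, and a hypothetical pickup_date field; none of these come from the assignment.

```python
from collections import defaultdict

# Hypothetical threshold separating "light" from "heavy" days, in the
# same unit as the weather file's precipitation column (an assumption).
LIGHT_MAX = 0.3


def precipitation_level(amount):
    """Map a daily precipitation amount to a coarse level name."""
    if amount <= 0:
        return "dry"
    if amount < LIGHT_MAX:
        return "light"
    return "heavy"


def group_rides_by_level(rides, precipitation_by_day):
    """Group rides by the precipitation level of their pickup day.

    Rides on days with no weather measurement are skipped.
    """
    groups = defaultdict(list)
    for ride in rides:
        amount = precipitation_by_day.get(ride["pickup_date"])
        if amount is None:
            continue
        groups[precipitation_level(amount)].append(ride)
    return dict(groups)
```

- Each group can then be plotted as a separate layer or series, which keeps the level-to-colour mapping in one place.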
- Getting the data
- You can get the weather data directly from this link. For the taxi rides data, you will need to download the script download_raw_data.sh and the text file raw_data_urls.txt into the same folder, then run the script.
- The dataset will be downloaded from S3 to your local folder. Its size is 120 GB, so the download may take a while. If the dataset is too big for you, please download only a portion of the files to work on, and make sure you document in your submission which portion of the dataset’s volume you used, either in gigabytes or in number of trips.
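- If only part of the dataset is downloaded, the volume used can be measured rather than estimated. The following is a minimal sketch that assumes the downloaded files are CSV files with one header row each; the function name and the default file pattern are hypothetical.

```python
from pathlib import Path


def summarize_download(folder, pattern="*.csv"):
    """Report file count, size in gigabytes, and trip count for a folder.

    Assumes each matching file has exactly one header line, so the number
    of trips is the number of lines minus one.
    """
    files = sorted(Path(folder).glob(pattern))
    total_bytes = 0
    total_trips = 0
    for path in files:
        total_bytes += path.stat().st_size
        # Read in binary mode and stream line by line to keep memory flat.
        with path.open("rb") as handle:
            line_count = sum(1 for _ in handle)
        total_trips += max(line_count - 1, 0)
    return {
        "files": len(files),
        "gigabytes": total_bytes / 1e9,
        "trips": total_trips,
    }
```

- The returned numbers can be pasted into the readme to state which portion of the data was used.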
- You will need to send your code to Leonardo at email@example.com , along with instructions on installing and running your code and an explanation of your choices.
- The minimum expected is a readme file, but you can make it as polished as you want; it will definitely earn you points.
- You are allowed to use any programming language, framework and library you want. You will be judged on the quality of your code and not on the language or framework you decided to use.
- You are not allowed to use any code you might find online, nor to publish your solution or this document without our explicit consent. For any questions that might arise, please contact Leonardo at firstname.lastname@example.org.
- Good luck!