a guest, May 24th, 2019
- Alternative data (AltData) is data from sources that are considered non-traditional for a given industry. That means that what counts as alternative data depends on which traditional data sources you and your competitors already use. The purpose of analyzing AltData is to identify unique insights and actions beyond those provided by regular or traditional data. As a result, your company could develop a strong competitive differentiator for some period of time. [2: www.import.io]
- It’s simple, really: To beat the market, just have insights before everyone else. 
- For the financial market, the Eagle Alpha alternative data report defines 24 categories of AltData. See the picture below:
- 24 Categories of Alternative Data [ 1 ]
- But keep this warning in mind: “Having a wealth of data is great, but only if you really believe it is going to improve your ability to forecast and capture market inefficiencies or risk premia. Available information is not synonymous with useful information.”
- — Ray Iwanowski, Managing principal, and co-founder Secor Asset Management LP 
- The QUIZ
- Using alternative data such as tweets, web traffic, and Google Trends, could we predict a long or a short signal for the stock price of company X?
- STAGE 1
- Step 1: Import files into the workspace
- We upload the datasets from the local device into the main workspace of the Google Colab notebook, using the io library to handle the uploaded files. The pandas library then reads the CSV files and puts them into DataFrames.
- Import and read files with data
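- The original code was an image, so here is a minimal sketch of the idea rather than the exact notebook cell. It assumes the upload arrives as a mapping of file names to raw bytes, which is what `files.upload()` from `google.colab` returns; the file name and the columns are made up for illustration.

```python
import io

import pandas as pd


def read_uploaded_csvs(uploaded):
    """Turn a {filename: raw bytes} mapping into a {filename: DataFrame} mapping."""
    return {name: pd.read_csv(io.BytesIO(raw)) for name, raw in uploaded.items()}


# In Colab the mapping would come from the upload widget:
#     from google.colab import files
#     uploaded = files.upload()
# Here a tiny in-memory stand-in is used (hypothetical file name and columns).
uploaded = {"web_traffic.csv": b"date,visits\n2019-01-01,120\n2019-01-02,95\n"}

frames = read_uploaded_csvs(uploaded)
print(frames["web_traffic.csv"].head())
```

- Keeping the DataFrames in a dictionary keyed by file name makes it easy to loop over all sources in the later steps.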
- Step 2: Engineering each dataset and its features
- For example, we take the web traffic dataset and drop the columns that we don’t need or that are not informative in this situation. In the next step we will merge several of these datasets, and the “date” column will be the join key, so the “date” columns must all have the same data type. There are two reasons to transform them: the column contains a time component that we don’t need, and the datasets were collected from different sources, so their “date” columns could use different date formats. We convert them all to one common format.
- Drop the uninformative columns
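- A possible way to do this cleanup, sketched under assumptions: the column names (`visits`, `source_id`, `interest`) and the two date formats are invented, and `clean_dates` is a hypothetical helper, not the function from the original notebook.

```python
import pandas as pd


def clean_dates(df, drop_cols=(), date_col="date", date_format=None):
    """Drop unneeded columns and reduce the date column to midnight timestamps."""
    out = df.drop(columns=list(drop_cols))  # drop() returns a new frame
    # Parse the strings, then normalize() resets the time part to 00:00:00.
    out[date_col] = pd.to_datetime(out[date_col], format=date_format).dt.normalize()
    return out


# Two made-up sources with different date formats.
web = pd.DataFrame({
    "date": ["2019-01-01 13:45:00", "2019-01-02 08:10:00"],
    "visits": [120, 95],
    "source_id": ["a", "a"],  # stands in for an uninformative column
})
trends = pd.DataFrame({"date": ["01/01/2019", "01/02/2019"], "interest": [40, 55]})

web_clean = clean_dates(web, drop_cols=["source_id"])
trends_clean = clean_dates(trends, date_format="%m/%d/%Y")
print(web_clean.dtypes)
```

- Passing the format explicitly for ambiguous strings such as `01/02/2019` avoids silently swapping day and month.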
- Plot the web traffic numbers in a graph.
- Web Traffic Graph
- The graph above shows some strange points at around 60000 and 70000. They seem to be outliers, and it is a good idea to delete them because they could distort the rest of the dataset in later processing.
- As we can see, the majority of points are below 30000, so we delete every value in the dataset that is bigger than 30000.
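- The filter itself can be a single boolean mask. This is only a sketch: the 30000 cut-off comes from the text, while the column name `visits` and the sample values are invented.

```python
import pandas as pd

THRESHOLD = 30000  # cut-off read off the graph, as described in the text

# Made-up sample with two obvious spikes.
web = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=5, freq="D"),
    "visits": [12000, 65000, 18000, 70000, 25000],
})

# Keep only the rows at or below the threshold.
filtered = web[web["visits"] <= THRESHOLD].reset_index(drop=True)
print(f"removed {len(web) - len(filtered)} rows")
```

- A fixed threshold chosen by eye is the simplest option; it works here because the spikes sit far away from the bulk of the values.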
- New Web Traffic graph
- Step 3: Merge several datasets into one big dataset
- Join four datasets together
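- One way to join several frames on “date” is to chain `pd.merge` calls. The sketch below assumes an outer join, which keeps every date from every source and fills the gaps with NaN; the source names and values are invented, and `merge_on_date` is a hypothetical helper.

```python
from functools import reduce

import pandas as pd


def merge_on_date(frames, how="outer"):
    """Merge a list of DataFrames pairwise on their shared 'date' column."""
    merged = reduce(lambda left, right: pd.merge(left, right, on="date", how=how), frames)
    return merged.sort_values("date").reset_index(drop=True)


def make_frame(col, start, periods):
    """Build a small made-up dataset covering `periods` consecutive days."""
    dates = pd.date_range(start, periods=periods, freq="D")
    return pd.DataFrame({"date": dates, col: list(range(periods))})


# Four made-up sources that cover different periods.
tweets = make_frame("tweets", "2019-01-01", 4)
web = make_frame("visits", "2019-01-02", 4)
trends = make_frame("interest", "2019-01-03", 4)
stock = make_frame("Close", "2019-01-03", 2)

merged = merge_on_date([tweets, web, trends, stock])
print(merged.shape)
print(merged.dropna().shape)  # rows where every source has a value
```

- With `how="inner"` only the dates present in all sources would survive, which hides how much each source is missing.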
- After joining, we see that the datasets cover very different periods of time, and only 24 days have values in all of them. Unfortunately, we can’t fill the NaN values because they are not missing at random.
- Range of dates in every dataset
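- A small sketch of how the date range of each dataset, and the window they share, could be computed. The helper names and the sample data are assumptions for illustration.

```python
import pandas as pd


def date_ranges(named_frames, date_col="date"):
    """Return one row per dataset with its first date, last date and row count."""
    rows = {
        name: {"start": df[date_col].min(), "end": df[date_col].max(), "rows": len(df)}
        for name, df in named_frames.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


def common_window(ranges):
    """Latest start and earliest end; the sources overlap only if start <= end."""
    return ranges["start"].max(), ranges["end"].min()


# Made-up sources covering different periods.
named_frames = {
    "tweets": pd.DataFrame({"date": pd.date_range("2019-01-01", periods=4, freq="D")}),
    "web": pd.DataFrame({"date": pd.date_range("2019-01-02", periods=4, freq="D")}),
    "stock": pd.DataFrame({"date": pd.date_range("2019-01-03", periods=2, freq="D")}),
}

ranges = date_ranges(named_frames)
start, end = common_window(ranges)
print(ranges)
print(start, end)
```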
- So, let’s take only the small part of the dataset where all the sources have values and the fewest values are missing.
- As we can see, the final dataset has a shape of 24 rows and 16 columns.
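- Cutting the merged frame down to a date window could look like the sketch below. The window bounds, the column names and `restrict_to_window` are hypothetical; the real notebook ended up with 24 rows and 16 columns.

```python
import numpy as np
import pandas as pd


def restrict_to_window(df, start, end, date_col="date"):
    """Keep only the rows whose date lies in [start, end], both ends included."""
    mask = df[date_col].between(start, end)
    return df.loc[mask].reset_index(drop=True)


# Made-up merged frame with gaps at both ends.
merged = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=6, freq="D"),
    "visits": [1, 2, 3, 4, np.nan, np.nan],
    "Close": [np.nan, np.nan, 5, 6, 7, 8],
})

window = restrict_to_window(merged, pd.Timestamp("2019-01-03"), pd.Timestamp("2019-01-04"))
print(window.shape)
print(window.isna().sum())  # remaining missing values per column
```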
- Step 4: Plotting data
- In the correlation matrix, the cells where the stock prices ‘Open, High, Low, Close’ intersect with ‘count_comments’ show coefficients between 0.82 and 0.90, which indicates a strong positive correlation. Let’s graph them and see what is there!
- Positive correlation between the data, but misleading information
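- The coefficients can be read from a correlation matrix. The sketch below assumes pandas’ default Pearson correlation and uses invented numbers; only the column names ‘Open, High, Low, Close’ and ‘count_comments’ come from the text.

```python
import pandas as pd

PRICE_COLS = ["Open", "High", "Low", "Close"]


def price_correlations(df, target="count_comments", price_cols=PRICE_COLS):
    """Correlation of each price column with the target column (Pearson by default)."""
    corr = df[list(price_cols) + [target]].corr()
    return corr.loc[list(price_cols), target]


# Made-up values; the real dataset had 24 rows.
df = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0, 13.0, 15.0],
    "High": [11.0, 12.0, 13.0, 14.0, 16.0],
    "Low": [9.0, 10.5, 11.0, 12.5, 14.0],
    "Close": [10.5, 11.5, 12.5, 13.5, 15.5],
    "count_comments": [100, 150, 180, 260, 300],
})

print(price_correlations(df))
```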
- WOW! We found a strong correlation between the comment count and the price in the market. But BE AWARE: even if the price goes up in the future, it will sometimes decrease, by more or by less, whereas the comment count is a cumulative number, so its values will always increase. So this graph is misleading.
- We plotted a few more graphs hoping to find something informative, but found nothing.
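- The problem with a cumulative column can be checked in code. The sketch below is not part of the original analysis: it tests whether ‘count_comments’ only ever increases and, as one possible alternative, compares day-to-day changes instead of levels. All numbers are invented and say nothing about the real data.

```python
import pandas as pd

# Made-up values: the price wobbles, the comment count is a running total.
df = pd.DataFrame({
    "Close": [10.0, 10.5, 10.2, 11.0, 10.8, 11.5],
    "count_comments": [100, 130, 150, 200, 215, 270],
})

# A running total never decreases, so it trends upward by construction.
is_cumulative = df["count_comments"].is_monotonic_increasing

# Day-to-day changes remove that built-in trend (the first row becomes NaN).
df["new_comments"] = df["count_comments"].diff()
df["close_change"] = df["Close"].diff()

level_corr = df["Close"].corr(df["count_comments"])
change_corr = df["close_change"].corr(df["new_comments"])  # NaN pairs are skipped
print(is_cumulative, level_corr, change_corr)
```

- Any column that only grows will correlate with any price that trends upward over the same window, so the correlation of levels alone is weak evidence.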
- Stage 1 conclusion
- Sometimes when we deal with alternative data, we face a lot of challenges: a lot of missing data, small datasets, etc., and that is the case here. The initial datasets are too different, some too big and others too small. As a result, we could extract only a small dataset and attempt a simple analysis to find something informative.