Tuesday, May 31, 2016
A little story. Recently, in a job interview, I was asked to explain how to derive the principal components from the eigenvectors of a matrix. Although PCA can be useful for certain types of data, there are many standard libraries that do the calculations, and that is how I do it. The last time I actually calculated eigenvectors directly was 1981. That was my answer. Well, since this answer was less than satisfactory, I did not get the job offer.
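For what it's worth, here is a minimal sketch of the calculation they were asking about, done the way I said I would do it: compute the eigenvectors of the covariance matrix with NumPy, then check the result against a standard library (scikit-learn here). The toy data is made up purely for illustration.

import numpy as np
from sklearn.decomposition import PCA

# Toy data, purely for illustration: 200 samples, 5 features
rng = np.random.RandomState(42)
X = rng.randn(200, 5)

# By hand: eigenvectors of the covariance matrix, sorted by eigenvalue
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
components_by_hand = eigvecs[:, order].T

# The standard-library way
pca = PCA(n_components=5).fit(X)

# The two agree, up to the arbitrary sign of each component
for a, b in zip(components_by_hand, pca.components_):
    assert np.allclose(a, b) or np.allclose(a, -b)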
As a side note, after that interview, the company changed the job description from "Machine Learning Engineer" to "Machine Learning Research Scientist". They decided they wanted someone to do research instead of building production-quality ML systems. They were apparently following my "Machine Learning Skills Pyramid" from 2014.
http://www.anlytcs.com/2014/01/machine-learning-skills-pyramid-v10.html
Anyway, back to the point. Upon arriving home, I did a quick search on YouTube and uncovered an awesome resource for Machine Learning. The fellow's name is Victor Lavrenko. His page is here:
https://www.youtube.com/user/victorlavrenko
The specific playlist for the eigenvectors question is here:
https://www.youtube.com/playlist?list=PLBv09BD7ez_5_yapAg86Od6JeeypkS4YM
Be sure to watch all twelve videos in order for the full rundown on PCA (about 1.5 hours total).
Enjoy!
Thursday, May 26, 2016
How To Work For Free - The Phony Job Interview Scam
I just returned home from a job interview in Boston. It wasn't until I had sat in the Logan Airport gate area for four hours, thinking about how frustrating and annoying the interviews had been, that I suddenly realized I had been scammed. (I won't mention the company name for fear of a lawsuit.)
When searching for a new job opportunity, beware of this 'phony' job interview scam.
It goes like this:
- A recruiter calls with a job opportunity at an unusually high dollar amount and really great perks, like working remotely. They tell you they have had a really hard time finding qualified candidates with the specific skills you possess, and they think you are a great fit.
- You go through a few steps in the process (a phone interview, a Skype call, etc.), and then they bring you in for a series of face-to-face interviews with other members of the team. In my case, they shelled out money for an airline ticket, so I figured not only that they were serious, but that my chances were really good.
- When you arrive, there is a group of people waiting for you, all are extremely friendly (You think to yourself, this seems like a nice place to work!).
- They show you the problem they are having and pump you for ideas on how you would solve the problem.
- Feeling pressured to show your creative and technical abilities, you dig deep to pull out every idea that you can to help solve their problem.
- Each time you explain an idea, they respond with, "But how would you solve it?" They never acknowledge that any of your ideas are good. In fact, they act as if it is not good enough. You feel compelled to try harder and dig deeper.
- After three or so hours of this continual 'pumping' for ideas, they abruptly end the meeting, thank you for your time and hustle you out the door.
In hindsight, the red flags were there all along:
- No one had seen or read my resume.
- Your ideas are never good enough, and they keep pumping you for more.
- Thinking you need to prove your value, you deliver more and more ideas to solve their problem.
- There is apparently no shortage of the required skills. In this case, everyone was knowledgeable about Python, algorithms, machine learning, etc.
- There is no discussion of job terms, working conditions, the team, equipment, logistics, etc.
So, for the cost of a one-day travel ticket (under $300 total), they received a wealth of information and ideas about how to solve their problem. As for me, I got $0 for my free consulting, plus I sat for four hours on a plane and over six hours in the waiting area at Boston's Logan Airport (flight delays, etc.).
Have you had a similar experience? Does anyone have ways to combat this dirty trick?
All comments are welcome. Comment below or on twitter @anlytcs
Monday, May 16, 2016
Comparing Daily Stock Market Returns to a Coin Flip
In this post, we examine the Random Walk Hypothesis as applied to daily stock market returns. Background information on this can be found here: https://en.wikipedia.org/wiki/Random_walk_hypothesis .
So I applied my skills to this problem and came up with a bit of code that attempts to use some feature engineering to predict a coin flip, then applies the same approach to the daily S&P 500 (where 1 is an up day and 0 is a down day). The next day's outcome is the classification label for the current day's features.
Program Setup and Feature Generation
You can find my code and data file on GitHub, where you can read it, download it and tweak it until you feel satisfied with the results.
https://github.com/anlytcs/blog_code/tree/master/coin_flip
The features designed for this experiment consisted of:
1. Previous: was yesterday up or down?
2. A count of heads while in a heads streak.
3. A count of tails while in a tails streak.
4. A count of heads and tails in a set of lookback periods. That is, in the last 5, 10, 20, 30, etc. days, how many heads and how many tails were there? This is meant to capture any observable trends (not necessarily valid for coin flips, but believed to be a valuable tool in stock trading). Here is a bit of code where I define the feature labels:
# Feature labels: one heads count and one tails count per lookback window
LOOKBACKS = [5, 10, 20, 30, 40, 50, 100]
HEADER_LINE = ['label', 'previous', 'heads_streak', 'tails_streak']
for i in LOOKBACKS:
    HEADER_LINE.append('heads_' + str(i))
for i in LOOKBACKS:
    HEADER_LINE.append('tails_' + str(i))
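And here is a sketch of how the features themselves could be built from a list of 0/1 flips. The helper name make_features is mine, for illustration only; the actual implementation is in the GitHub repo above.

def make_features(flips, lookbacks=LOOKBACKS):
    # One row per day t; the label is the *next* day's flip.
    rows = []
    heads_streak = tails_streak = 0
    for t in range(len(flips) - 1):
        if flips[t] == 1:
            heads_streak += 1
            tails_streak = 0
        else:
            tails_streak += 1
            heads_streak = 0
        row = [flips[t + 1], flips[t], heads_streak, tails_streak]
        for lb in lookbacks:
            window = flips[max(0, t - lb + 1):t + 1]
            row.append(sum(window))                # heads in the last lb days
        for lb in lookbacks:
            window = flips[max(0, t - lb + 1):t + 1]
            row.append(len(window) - sum(window))  # tails in the last lb days
        rows.append(row)
    return rows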
The experiment consisted of three runs:
1. 100,000 pseudo random number coin flips.
2. 4600 daily observations of the S&P 500 index, going back to 1995, transformed into coin flips. That is the up_or_down column in the data file.
https://github.com/anlytcs/blog_code/blob/master/coin_flip/GSPC_cleaned.csv
Data Source: Yahoo finance.
3. 4600 pseudo random number coin flips.
Each run used 5-fold cross-validation; for each run, I plotted the ROC curve for each fold and averaged the five AUCs into a final 'score'.
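For reference, here is a minimal sketch of that evaluation loop, assuming the feature rows from above have been split into a NumPy feature matrix X and label vector y. The choice of classifier (a random forest) is illustrative, not necessarily what the repo uses; see the GitHub code for the real thing.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def score_run(X, y, n_splits=5):
    # Cross-validate and collect one AUC per fold
    aucs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], probs))
    print(np.array(aucs))
    print('Average:', np.mean(aucs))

Here are the results: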
Output of Run
=======================================================
Start: Fri May 13 16:18:50 2016
Do 100000 Coin Flips
Counts: Counter({1: 50053, 0: 49947})
Heads: 50.05300 percent
Tails: 49.94700 percent
Build Features: Fri May 13 16:18:50 2016
Build Model: Fri May 13 16:19:45 2016
Train and Do Cross Validation: Fri May 13 16:19:45 2016
[ 0.49705482 0.496599 0.50169547 0.49761318 0.49758308]
Average: 0.498109110082
Accuracy: 0.4981 (+/- 0.003664)
=======================================================
Do SP500 (4600 days)
Build Features: Fri May 13 16:20:09 2016
Build Model: Fri May 13 16:20:10 2016
Train and Do Cross Validation: Fri May 13 16:20:10 2016
[ 0.53477333 0.54759786 0.53681652 0.55128712 0.53932995]
Average: 0.54196095771
Accuracy: 0.5420 (+/- 0.012769)
=======================================================
Do 4600 Coin Flips
Counts: Counter({1: 2322, 0: 2278})
Heads: 50.47826 percent
Tails: 49.52174 percent
Build Features: Fri May 13 16:20:33 2016
Build Model: Fri May 13 16:20:34 2016
Train and Do Cross Validation: Fri May 13 16:20:34 2016
[ 0.48760588 0.52287443 0.52944808 0.5124338 0.50364302]
Average: 0.511201041299
Accuracy: 0.5112 (+/- 0.029456)
End: Fri May 13 16:20:51 2016
=======================================================
Commentary and Conclusions
1. The 100,000 coin flips run shows exactly what you would expect. With an average AUC of 0.4981, the model is essentially a random classifier. https://en.wikipedia.org/wiki/Receiver_operating_characteristic . For 100k flips and all the generated features, the 5 ROC curves basically follow the 45-degree line, so there was no benefit over randomly guessing the next flip.
2. For 4600 days of S&P 500 'flips', there appears to be a very slight edge from the model, with an average AUC of 0.542. Not enough to risk actual money.
3. Now, the 4600 coin flips run raises some interesting questions. The average AUC was 0.5112, and the five ROC curves only loosely hug the 45-degree line. Since a fair coin should show no edge at all, this casts doubt on the validity of the 0.542 found in run #2 (4600 days of S&P 500): at this sample size, an apparent edge of that magnitude can arise by chance. I wonder what would happen if we had 100,000 days of S&P 500 data; that is about 380 years' worth. Maybe someone motivated enough could dig up 100,000 hourly readings and try this experiment again. I would be curious to see the results.
Have we disproven the Random Walk Hypothesis? No. Much better mathematical minds than mine have effectively put that theory to rest. An interesting and thoroughly enjoyable read on this subject is The Misbehavior of Markets by Benoit Mandelbrot. http://www.amazon.com/Misbehavior-Markets-Fractal-Financial-Turbulence/dp/0465043577/ref=asap_bc?ie=UTF8
You can also read the abstract here: http://users.math.yale.edu/users/mandelbrot/web_pdfs/getabstract.pdf
What I think we have shown:
1. Machine learning cannot predict a coin toss (but we knew that already).
2. Next day stock price forecasting is a hard problem.
Feedback? Hit me up on Twitter @anlytcs
Wednesday, April 20, 2016
Stock Forecasting with Machine Learning - Are Stock Prices Predictable?
In the last two posts, I offered a "Pop-Quiz" on predicting stock prices. Today, I would like to address the most important question you face when attempting to use any form of predictive analytics in the financial markets: do you even have a chance of getting reliable results, or are you wasting your time? Back in 2003, when I first built the Neural Network solution described in those posts, it was my first naive take on the problem, and I wasted a lot of time.
Today, with the expansion of machine learning research and mathematical techniques, combined with the proliferation of open-source tools, we are in a much better position to answer these questions directly. A few months back, a new algorithm came to my attention via an interesting post on the FastML blog entitled "Are stocks predictable?". Check this link: http://fastml.com/are-stocks-predictable/
The short story is this: a PhD student at Carnegie Mellon University named Georg Goerg developed an algorithm and published his findings under the name 'Forecastable Component Analysis'. The algorithm looks at a time series and tries to determine how much of it is noise vs. how much is signal. The answer is expressed as an 'Omega score'. The algorithm is also available as an R package, ForeCA.
In plain English: if the data contains too much noise, attempts to predict the series will fail. This is really useful for stock prices. FastML shows that next-day percentage changes for stock indexes have ridiculously low Omega scores, between 1.25% and 6%. Not enough to bank on.
I discovered a similar effect in my research. No matter how much you torture the input data, forecasting the next day's close is a fool's errand; it is analogous to attempting to predict the flip of a coin. However, what I have discovered (assuming I am interpreting the results correctly) is that as you go further out in time, the results start to become more meaningful. So, what would happen if you fed the ForeCA algorithm percentage-change values for 1, 5, 10, 15, 20, 25, and 30 days in the future?
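Since my original R code was lost (see the note at the end of this post), here is a Python sketch of how that input matrix could be built; the column names match the X1_DAY through X30_DAY labels in the output below. The actual analysis was run through the ForeCA package in R.

import pandas as pd

HORIZONS = [1, 5, 10, 15, 20, 25, 30]

def future_pct_changes(close):
    # close: a pandas Series of daily closing prices.
    # Percent change from today's close to the close h trading days ahead.
    cols = {}
    for h in HORIZONS:
        cols['X%d_DAY' % h] = (close.shift(-h) - close) / close * 100.0
    return pd.DataFrame(cols).dropna()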
Here are the results. Note: ForeCA reorders the columns from most to least forecastable (after transformation), so for the sake of simplicity, just pay attention to the 'Orig' series' Omega scores and the top-right bar chart (bars labeled X1_DAY through X30_DAY). As you can see, the signal-to-noise ratio, and with it your ability to forecast, improves as the number of days increases.
$Omega
Series 7 Series 4 Series 5 Series 6 Series 3 Series 2 Series 1
31.998529 28.954507 25.660565 23.572059 20.275582 11.857304 4.612705
$Omega.orig
X1_DAY X5_DAY X10_DAY X15_DAY X20_DAY X25_DAY X30_DAY
1.632106 11.253286 18.363721 22.831144 26.353855 29.138379 31.560240
Again, assuming I am interpreting the results correctly, the 30-day-ahead series scores 31.56%: roughly a one-in-three chance of getting the forecast right 30 days in the future. Still not enough to bank on. In the end, stock market success is not about the perfect algorithm or forecast or formula; it is about managing risk when your signal goes wrong.
(Note: I would have provided the R source code and input data, but they were left on my work laptop when I recently finished up a project with Cisco.)