My Views on Technology, Machine Learning, Finance, Jobs, and Everything Else
Wednesday, October 29, 2014
Stock Forecasting With Machine Learning - Pop-Quiz
A few years back, I decided that machine learning algorithms could be designed to forecast the next day's Open, High, Low, and Close for the S&P 500 index. Armed with that information, it would be a cinch to make millions!
The following chart shows the initial design of the Neural Network:
As it turns out, this ML model did not work. In fact, this approach is completely wrong! Can you think it through and come up with reasons why?
(Note: This slide was taken from a recent presentation entitled "Building Effective Machine Learning Applications").
Tuesday, October 28, 2014
Hire a Data Scientist for only $5.00
Got a Machine Learning problem? Got $5.00? Consider it solved!
Visit this page on Fiverr.com:
https://www.fiverr.com/aliabbasjp/solve-a-machine-learning-and-intelligence-problem-to-gather-insights-from-your-data?context=adv.cat_10.subcat_143&context_type=auto&funnel=2014102811022631011277080
While I don't have any knowledge of the quality of their work, I suspect this price reflects the cost of living in the seller's location, minus a discount for the promotional benefit.
The Internet is the great equalizer.
Thursday, September 25, 2014
Understanding Online Learning - Part 1
Online learning is a form of machine learning with the following characteristics:
1. Supervised learning
2. Operates on streams of big data
3. Fast, lightweight models
4. Small(er) RAM footprint
5. Updated continuously
6. Adaptable to changes in the environment
Many machine learning algorithms train in batch mode: the model requires the entire batch of training data to be fed in at one time. To train, you select an algorithm, prepare your batch of data, train the model on the entire batch, and check the accuracy of your predictions. You then fine-tune your model by iterating the process and tweaking your data, inputs, and parameters. Most algorithms do not allow new batches of data to update and refine old models, so periodically you may need to retrain your models on the old and new data combined.
There are a number of benefits to the batch approach:
- Many ML algorithms to choose from. You have many more algorithms to pick from, because batch is typically how they are developed at universities, and the batch approach aligns with traditional statistical practice.
- Better accuracy. Since the batch represents the "known universe", many mathematical techniques have been developed to improve model accuracy.
- Can be effective with smaller data sets. Hundreds or thousands of rows can result in good ML models. (Internally, many algorithms iterate over the data set to learn the desired characteristics and improve the results.)
There are some advantages and a few drawbacks to the online learning approach.
Advantages:
- Big Data: Extremely large data sets are difficult to work with; model development and algorithm training are cumbersome. With online learning, you can wrestle the data down to manageably sized chunks and feed them in.
- Small(er) RAM footprint. Obvious benefits of using less RAM.
- Fast: Because they have to be.
- Adaptive: As new data arrives, the learning algorithm adjusts the model and automatically adapts to changes in the environment. This is useful for keeping your model in sync with shifts in human behavior, such as click-through patterns or financial markets. With traditional batch algorithms, the newer behavior is blended in with the older data, so these subtle changes are lost. With online learning, the model continuously moves toward the latest version of reality.
Drawbacks:
- It requires a lot of data. Since the learning happens as the data streams by, model accuracy is developed over millions of rows, not thousands. (You should pre-train your model before production use, of course.)
- Predictions are not as accurate. You give up some predictive accuracy as a trade-off for the speed and size of the solution.
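To make this concrete, here is a minimal sketch of the online approach using scikit-learn's SGDClassifier and its partial_fit method. (The library choice and the toy data stream are my own illustration, not from any particular production system.)

import numpy as np
from sklearn.linear_model import SGDClassifier

# A hypothetical stream: each chunk is a small batch of rows from "big data"
def stream_chunks(n_chunks=100, chunk_size=1000, n_features=20, seed=0):
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] + 0.1 * rng.randn(chunk_size) > 0).astype(int)
        yield X, y

model = SGDClassifier()  # a linear model trained incrementally via SGD
classes = np.array([0, 1])  # all class labels must be declared up front
for X_chunk, y_chunk in stream_chunks():
    # each call nudges the model with one chunk; the full data set
    # never has to sit in RAM, and the model is usable at any point
    model.partial_fit(X_chunk, y_chunk, classes=classes)

X_new, y_new = next(stream_chunks(n_chunks=1, seed=42))
print("holdout accuracy:", model.score(X_new, y_new))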
Meanwhile, here are some interesting links to learn more:
http://en.wikipedia.org/wiki/Online_machine_learning
http://www.youtube.com/watch?v=HvLJUsEc6dw
http://www.microsoft.com/en-us/showcase/details.aspx?uuid=436006d9-4cd5-44d4-b582-a4f6282846ee
Enjoy!
Monday, August 11, 2014
Naked Short Selling - An Introduction
Imagine an Asset that can be created from thin air, costs nothing to produce, can be created in unlimited quantities, and can be sold with the push of a button for big money, with the seller keeping all the money, forever. Does this sound like an ideal way to make a lot of money? Does it sound illegal? As a matter of fact, it is!
Welcome to the world of Naked Short Selling.
Without getting into the basics or the ethics of Short Selling, I'll just say that it is a common practice in most financial markets. In essence, the "Short Seller" is betting the price will go down (i.e., they Sell first and Buy back later). This is usually perfectly legal, although it can be risky. (For detailed information go here: http://en.wikipedia.org/wiki/Naked_short_selling ). However, before you can understand the crime, you first need to understand some distinctions:
With Futures and Options:
- You are buying and selling a legal Contract (with specific rights and obligations).
- These Contracts are created at will between the buyers and the sellers (by design).
- The clearing house or the exchange keeps track of the Open Interest (the number of Contracts outstanding).
- There is a time limit (the expiration date) by which everything needs to be settled.
- There is a clearing house to hold the trader legally accountable to the terms of their Contract(s).
- There is a mechanism to take money out or put money into your account on a nightly basis (futures) or at expiration (options).
- To short a Futures or Option Contract, all you need to do is push a button. This is perfectly legal and by design.
- To close the position, you simply hit the Buy button or wait for expiration.
With Stocks:
- You are Buying and Selling an Asset (representing a fraction of a corporation or a limited partnership, etc.).
- The number of shares outstanding is controlled by the corporation.
- There is no time limit. You could hold your IBM shares for 20 years.
- Stock trades are typically settled in three days (US). Money changes hands and so do the stock shares.
- To Short a share of stock, you must first "borrow" the stock from someone else and then sell the stock.
- To close the position, you simply hit the Buy button (which then theoretically returns the shares to the lender).
However, the 'big boys' play by a different set of rules. There are legal exemptions for certain market participants (market makers, etc.), and there is also a lack of SEC enforcement against other participants. They can short stock which they have not yet borrowed, as long as they promise to deliver the borrowed shares before settlement (three days). Sometimes they don't deliver.
Think of the implications of this practice (especially if this behavior is performed by criminals and/or psychopaths):
- They are creating new shares out of thin air (which are supposed to be a limited Asset)
- They sell them to unsuspecting buyers who think they have a real Asset in their account.
- Nobody knows who did it.
- The Buyer may not know about the fraud for many years (or ever). All the Buyer ever knows is that the price keeps going down.
- If they create and sell enough shares they can drive the stock price down to $0 (and then they never have to buy it back !).
Does illegal Naked Short Selling occur today? Probably, yes, but I am no expert in these things, just an outside observer. So, out of curiosity, I picked a few stock tickers from the NYSE website which had "Failure to Deliver" reports (http://www1.nyse.com/regulation/memberorganizations/Threshold_Securities.shtml ) and viewed their price charts.
Here are a few stocks which appear to be diving relentlessly into the ground: END, USU, WLT. I cannot tell whether these stocks are dropping because of deteriorating business conditions or due to naked short selling. But it is interesting that they have all been dropping fairly consistently for three years straight. (Keep in mind there MUST be bounces along the way in order to trick more victims into thinking the bottom is in and committing to buying some/more shares.)
WARNING: I WOULD NOT BUY OR SHORT THESE STOCKS. This is not investment advice, just a bit of education on some of the dirty tricks you need to know about in order to protect yourself.
It's Been Quiet on the Blog Lately
It's been quiet around here lately, so I have decided to expand the scope of this blog beyond the basic theme of Machine Learning.
There are a lot of interesting technology topics out there, sometimes related to machine learning and data science (but sometimes not). In coming months, some areas which I plan to cover include:
- Financial Markets
- Black Box and Adaptive Trading Systems
- Job Markets
- Business Optimization with Data Science
- Open Source Software
Cheers,
Steve
Wednesday, March 5, 2014
Top Ten Reasons To "Kaggle"
Do you aspire to do Machine Learning, Data Science, or Big Data Analytics? If so, you have probably studied, taken courses, read a bunch of blog postings, and can code up some R, Python, or MATLAB.
Are you ready to start solving real world problems? Probably not. It is one thing to know some things about data; it is a very different situation altogether to effectively solve real world problems. So how do you improve your skills?
I highly recommend you take a look at Kaggle Competitions. Kaggle.com hosts Data Science/Machine Learning competitions on their site. They offer a wide range of challenging problems with a fixed deadline and the element of competition. Also, there is usually a modest financial incentive for the winner(s), although I am surprised at how meager most of the prizes are considering the amount of work involved and the benefit they derive from crowdsourcing their problems. But the real benefit is not in the prize money, it is in the learning process.
Kaggle has been a tremendous learning experience to expand my depth and breadth of knowledge. Here's why:
1. Kaggle exposes you to a wide range of Machine Learning problems: Forecasting, Sentiment Analysis, Natural Language Processing, Image Recognition, etc. This motivates you to learn as much as you can about the problem domain, the type of data involved, and the various algorithms which might be applicable.
2. Kaggle is under a time limit. This "forces" you to work very efficiently, developing and testing alternative ideas quickly. When under pressure and motivated to score highly in a competition, you will focus and learn more techniques in a very short time frame.
3. Kaggle competitions "force" you to code and recode your solution in the most resource-efficient manner possible, making tradeoffs between programmer time, CPU time, RAM, etc. In order to compete, you need to discover and remove performance bottlenecks quickly. This enables you to improve turnaround time for subsequent iterations.
4. Each competition uses a different scoring mechanism. You will learn about the various scoring metrics and when they are used. You will probably code some of these yourself.
5. You will surely learn the value of cross-validation: re-sampling and retraining your model multiple times to validate that your solution is working and not overfitting the data. (See the sketch after this list.)
6. You will learn new methods for dealing with dirty data: cleaning, filtering, handling missing values, etc. Sometimes the competition planners intentionally throw garbage into the data sets in order to make the challenge harder.
7. You will sometimes be handling massive file sizes, putting you to the challenge of slicing, sampling, splitting, extracting and zipping useful subsets of the data.
8. Each competition has a forum where competitors help each other tackle the problem. There is a really supportive atmosphere for learning and exploring in the Kaggle forums. At the conclusion of the competition, there is a massive learning opportunity as the participants "open their kimonos" and share their best work for solving the problem. The more intimate knowledge you have of the problem, the better you will understand the thought processes they went through, and you will take notes for the next competition.
9. You will be competing against some of the best Data Scientists in the world. This competition brings out the best in you. If you are mediocre in your approach, it will show in your results. Your Kaggle leaderboard ranking is immediate feedback on how well you have broken down and solved the problem. You can't lie to yourself; the final leaderboard shows where you stand.
10. You will come to realize there is more to machine learning than just pushing data through a library algorithm. If all Kaggle competitors have access to the same libraries of algorithms and tools, what differentiates the solutions? How do you win? You can do your best work and still find 200-300 people with higher scores on the Kaggle leaderboard. The leaderboard scoring focuses all your energy on the primary objective: Improving the overall score of your solution. It can be tough. Kaggle competitors are some of the most brilliant minds on the planet.
11. After you have scored highly in a number of competitions (a top-ten finish and a top 10% placement), you can earn the coveted "Kaggle Master" badge.
12. Recruiters are scouring the Kaggle boards looking for talented Data Scientists. You could find a new position.
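As a concrete illustration of point 5, here is a minimal cross-validation sketch. (The toy data set and model are my own choices for brevity; this assumes a recent scikit-learn.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for a competition data set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Train and validate on 5 different splits; a wide spread across the
# folds is a warning sign that the model is overfitting
scores = cross_val_score(model, X, y, cv=5)
print("fold scores:", scores)
print("mean: %.3f  std: %.3f" % (scores.mean(), scores.std()))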
For all of these reasons, Kaggle works to bring out the best talent within you. If you really want to become expert in Data Science and Machine Learning, you should consider Kaggle competitions.
(Yes, I know that was actually twelve reasons, but Ten makes a better headline :-) )
Here is a screen shot of the Loan Default Prediction Leaderboard
Tuesday, January 28, 2014
Machine Learning Skills Pyramid V1.0
While the exact definition of "Data Scientist" continues to elude us, the job requirements seem to lean heavily on machine learning skills. They also include a wide range of other skills, ranging from specific languages, frameworks, and databases to data cleaning, web scraping, visualization, mathematical modeling, and subject matter expertise. (This breakdown will be the subject of a future post, as I was having some trouble with my web scraper ;))
So for the typical "Data Scientist" role, many organizations want PhD-level academic training plus an assortment of nuts-and-bolts programming and database skills. Most of these job requirements describe a candidate so rare they can't be found (aka the Unicorn). So, as an extension to the Data Science Venn Diagram V2.0, I thought it would be helpful to clarify and make some important distinctions regarding Machine Learning skills.
Back in the 2002-2003 time frame, I spent a bunch of time trying to code my own Neural Networks. This was a very frustrating experience, because bugs in these algorithms can be especially difficult to find, and it took time away from what I really wanted to do, which was building applications using machine learning. So I decided back then to use well-tested and fully debugged library algorithms over clunky home-grown algorithms whenever possible. These days there are so many powerful and well-tested ML libraries, why would anyone write one from scratch? The answer: sometimes a new algorithm is needed.
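To show what I mean, here is roughly how little code a working neural network takes with a modern library. (A sketch using scikit-learn's MLPClassifier on toy data; the library and data are my own illustration, not the stack I used back in 2002.)

import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy data standing in for a real problem
rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = (X.sum(axis=1) > 0).astype(int)

# A well-tested backprop network, with no hand-rolled gradient code to debug
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))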
First, some definitions will help clarify:
- ML Algorithm: A well defined, mathematically based tool for learning from inputs, typically found in ML libraries. Take the example of sorting algorithms: BubbleSort, HeapSort, InsertionSort, etc. As a software developer, you do not want or need to create a new type of sort. You should know which works best for your situation and use it. The same applies to Machine Learning: Random Forests, Support Vector Machines, Logistic Regression, Backprop Neural Networks, etc., are all algorithms which are well known, have certain strengths and limitations, and are available in many ML libraries and languages. These are a bit more complicated than sorting, so there is more skill required to use them effectively.
- ML Solution: An application which uses one or more ML Algorithms to solve a business problem for an organization (business, government etc).
- ML Researcher/Scientist: PhDs are at the top of the heap. They have been trained to work on leading-edge problems in Machine Learning, Robotics, etc. These skills are hard-won and well suited for tackling problems with no known solution. When you have a new class of problem which requires insight and new mathematics to solve, you need an ML Researcher. When they solve the problem, a new ML Algorithm will likely emerge.
- ML Engineer: A sharp software engineer with experience building ML Solutions (or solving Kaggle problems). The ML Engineer's skills are different from the ML Researcher's: there is less abstract mathematics and more programming, database, and business acumen involved. An ML Engineer analyzes the data available, the organizational objectives, and the ML Algorithms known to operate on this type of problem and this type of data. You can't just feed any data into any ML Algorithm and expect a good result. Specialized skills are required in order to create high scoring ML solutions. These include: data analysis, algorithm selection, feature engineering, cross-validation, appropriate scoring, and troubleshooting the solution.
- Data Engineer: A software engineer with platform- and language-specific skills. The Data Engineer is a vital part of the ML Solution team. This person or group does the heavy lifting when it comes to building data driven systems. There are so many languages, databases, scripting tools, and operating systems, each with its own set of quirks, secret incantations, and performance gotchas. A Data Engineer needs to know a broad set of tools and be effective at getting the data extracted, scraped, cleaned, joined, merged, and sliced for input to the ML Solution. Many of the skills needed to manage Big Data belong in the Data Engineer category.
(Machine Learning Skills Pyramid V1.0 diagram)
Sunday, January 26, 2014
Stock Forecasting with Machine Learning
Almost everyone would love to predict the Stock Market for obvious reasons. People have tried everything from Fundamental Analysis, Technical Analysis, and Sentiment Analysis to Moon Phases, Solar Storms and Astrology.
However, unless you are in a position to front-run other people's trades, as in High Frequency Trading, there is no such thing as a guaranteed profit in the markets. The problem with human stock analysis is that there is so much data and so many variables that it is easy for the average human to become overwhelmed, get sucked down the rabbit hole, and continue to make sub-optimal choices.
This sounds like a job for Machine Learning, and there is no shortage of people and companies trying it. One major pitfall is that most ML algorithms do not work well with stock market data, which results in a lot of people wasting a lot of time. But in order to share some of the concepts and get the conversation started, I am posting some of my findings regarding financial and stock forecasting using Machine Learning.
I trained 8,000 machine learning algorithms to develop a probabilistic map of the stock market's short-term future (5-30 days) and have compiled a list of the stocks most likely to bounce in this time frame. There is no single future prediction. Instead, there is a large set of future probabilities which someone can use to evaluate their game plan and portfolio. My exact methods remain proprietary at this time (though I might consider institutional licensing).
Here are the "Stock Picks" based on how they closed on Friday (Jan 24, 2014) based on the stock's individual trading behavior:
GE - General Electric
GM - General Motors
HON - Honeywell
DIS - Disney
MET - MetLife
NKE - Nike
OXY - Occidental Petroleum
BK - Bank of New York Mellon
EMR - Emerson Electric
TWX - Time Warner Inc.
FCX - Freeport-McMoRan Copper and Gold
Disclaimer: This is not trading or investing advice. It is simply the output of my ML system. If you lose money, do not come crying. Trade at your own risk!
Since the market got pummeled this week, there are a lot of stocks that look like 'buys' right now. But the overall (US) market is coming off a very prolonged euphoric period and has not had a significant correction in over two years. So it is possible that the current downswing is either a minor pullback, a.k.a. a "dip", or the start of a major correction.
Here are the charts. For the most part they look like a big sell-off in a larger uptrend. It is always interesting to see how the future unfolds, especially with respect to these predictions. Also, keep in mind that even if a stock does bounce, it could then run out of steam and drop again. Ah... life in the uncertainty zone ;).
Enjoy!
Monday, January 6, 2014
Data Science Venn Diagram v2.0
There have been a number of attempts to get our collective brains around all the skill sets needed to effectively do Data Science.
Here are two...
1. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
2. http://upload.wikimedia.org/wikipedia/commons/4/44/DataScienceDisciplines.png
Below is my take on the subject. The center is marked "Unicorn". This is a reference to the recent discussions in the press and blogosphere indicating that Data Scientists are as hard to find as unicorns. Finally the mindset is changing: a team of people with complementary skills is the right course of action for most data-driven organizations. Certainly some individuals might possess Computer Science, Statistics, and Subject Matter Expertise; they are just very hard to find. Many Data Scientist job descriptions don't reflect this reality, and so these positions go unfilled for six months or more.
Let me know what you think...
(Data Science Venn Diagram v2.0)
This is an Adaptation of the original Data Science Venn Diagram which is licensed under Creative Commons Attribution-NonCommercial.
Lightning Fast Python?!?
This is how I reduced my data crunching process time from 12 hours down to only 20 minutes using Python.
Feature generation is the process of taking raw data and boiling it down to a "feature matrix" that machine learning algorithms typically require. In Kaggle's Biometric Accelerometer Competition ( http://www.kaggle.com/c/accelerometer-biometric-competition ), the train data was 29.5 million rows and the test data was just over 27 million rows, bringing the total raw data to about 56,500,000 rows of smartphone accelerometer readings.
My initial feature generation code used the standard approach to machine learning in Python: Pandas, Scikit-Learn, etc. It was taking about 12 hours (train and test). Ugh.
I went searching for a faster solution. What were my options?
1. Dust off my rusty C language skills.
2. Learn another language: Julia, which is supposed to be very fast (still on my todo list).
3. Try Cython, a form of Python that "sort of" compiles to C.
4. What else was there?
The answer was PyPy: http://www.pypy.org/
PyPy got its start as a version of Python written in Python. At first, this seemed kind of interesting for compiler people but not what I needed. Then I learned that the PyPy team has been putting a lot of effort into their JIT compiler. A Just-In-Time (JIT) compiler converts your code to machine language the first time it executes it; after that, it runs at machine speed. The result is blazingly fast Python! See http://speed.pypy.org/
There is a drawback: many machine learning libraries do not run on it. I had to remove all my Pandas, NumPy, and Scikit code. So I broke the problem into two steps: feature generation in PyPy, and machine learning in Python/Pandas/SciKit. After that I was slicing and dicing accelerometer readings like crazy. More importantly, I was iterating my solution faster, allowing me to finish 26th out of 633 teams (top 4%)!
Hopefully over time, more ML libraries will be ported to PyPy (I think NumPy is working on it). For now, here is a list of packages which are known to work or not work with PyPy: https://bitbucket.org/pypy/compatibility/wiki/Home
Below is a code snippet for those who want to try it. What you need to run this code:
1. Install PyPy
2. Change file permissions to allow execution
3. Run it from the command line: ./gen_features.py
#!/usr/bin/pypy
import os
import csv

# Parse a numeric token as float or int
parseStr = lambda x: float(x) if '.' in x else int(x)

allData = []

def feature_gen():
    global allData
    # do something here
    return

os.nice(9)  # lower this process's priority so the machine stays responsive

f = open('train.csv')
reader = csv.reader(f)
header = reader.next()  # strip off the csv header line
for row in reader:
    convertedRow = []
    for token in row:
        try:
            newTok = parseStr(token)
        except ValueError:
            print token  # show the offending token, then abort
            raise
        convertedRow.append(newTok)
    allData.append(convertedRow)
f.close()

feature_gen()
An Easy Way to Bridge Between Python and Vowpal Wabbit
Python is a great programming language. It has a clean syntax, tremendous user community support, and excellent machine learning libraries. Unfortunately, it is SLOW! So, when the situation calls for it, I prefer to drop down to machine code to run the actual machine learning algorithm.
One fast and amazing Machine Learning tool that I have used on a number of projects is Vowpal Wabbit. It was developed by researchers at Yahoo! Research and later at Microsoft Research. It has support for many types of learning problems, automatically consumes/vectorizes text, can do recommendations, predictions, and classifications (single and multi-class), supports namespaces and instance weighting, and the list goes on.
VW Homepage: http://hunch.net/~vw/
VW Wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki
(The Wiki is better for finding all the functions and how to use it)
There are also a few Python wrappers for Vowpal Wabbit:
1. Vowpal Porpoise: https://github.com/josephreisinger/vowpal_porpoise
2. PyVowpal: https://github.com/shilad/PyVowpal
The problem with wrappers is that they don't always expose all the features you want to use, and Vowpal has a lot of features. So, after a bit of hemming and hawing, I did a "slash and burn" and wrote what I needed. This is how I currently use Vowpal Wabbit with Python. Instead of a wrapper, I offer you code snippets which can be tailored to your specific needs.
This code assumes you know how to use Python and Pandas. It runs on linux and uses the matrix factorization feature (recommendation engine) of Vowpal.
Performance: With over 43 million rows, it took about 16 minutes to generate the inputs in the Pandas DataFrame, but only 9 minutes to train with 20 passes (i7-2600K).
Enjoy!
Steve Geringer
##########################################################################
# Here are the essential ingredients. You'll have to fill in the rest...;)
##########################################################################
import os
from time import asctime, time
import subprocess
import csv
import numpy as np
import pandas as pd
.
.
.
#############################################################
# Parameters and Globals
#############################################################
environmentDict=dict(os.environ, LD_LIBRARY_PATH='/usr/local/lib')
# Hat Tip to shrikant-sharat for this secret incantation
# Note: only needed if you rebuilt vowpal and the new libvw.so is in /usr/local/lib
parseStr = lambda x: float(x) if '.' in x else int(x)
#############################################################
# Vowpal Wabbit commands
#############################################################
"""
WARNING: MAKE SURE THERE ARE NO EXTRA SPACES IN THESE COMMAND STRINGS...IT GIVES A BOOST::MULTIPLE OPTIONS ERROR
"""
trainCommand = ("vw --rank 3 -q ui --loss_function=squared --l2 0.001 \
--learning_rate 0.015 --passes 20 --decay_learning_rate 0.97 --power_t 0 \
-d train_vw.data --cache_file vw.cache -f vw.model -b 20").split(' ')
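# -t runs VW in test-only mode (no learning), -i loads the trained
# model, and -p writes the predictions to vw.predict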
predictCommand = ("vw -t -d test_vw.data -i vw.model -p vw.predict").split(' ')
.
.
.
#############################################################
# Generate the VW Train/Test data format in a Pandas DataFrame using the apply method
#############################################################
def genTrainInstances(aRow):
    userid = str(aRow['userid'])
    urlid = str(aRow['urlid'])
    y_row = str(int(float(aRow['rating'])))
    rowtag = userid + '_' + urlid
    rowText = (y_row + " 1.0 " + rowtag + "|user " + userid + " |item " + urlid)
    return rowText

def genTestInstances(aRow):
    y_row = str(0)
    userid = str(aRow['userid'])
    urlid = str(aRow['urlid'])
    rowtag = userid + '_' + urlid
    rowText = (y_row + " 1.0 " + rowtag + "|user " + userid + " |item " + urlid)
    return rowText
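For reference, here is the shape of the line these functions emit. A hypothetical row with userid 1001, urlid 2002, and rating 4 (made-up values, just to show the format) becomes:

4 1.0 1001_2002|user 1001 |item 2002

That is: the label, an importance weight of 1.0, a row tag, and the 'user' and 'item' namespaces, which the -q ui flag in trainCommand above crosses to form the interaction features.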
.
.
.
#############################################################
# Function to read the VW predict file, strip off the desired value and return a vector with results
#############################################################
def readPredictFile():
    y_pred = []
    with open('vw.predict', 'rb') as csvfile:
        predictions = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in predictions:
            pred = parseStr(row[0])
            y_pred.append(pred)
    return np.asarray(y_pred)
.
.
.
#############################################################
# Function to train a VW model using DataFrame called df_train
# - Apply genTrainInstances
# - Write newly create column to flat file
# - Invoke Vowpal Wabbit for training
#############################################################
def train_model():
    global df_train, trainCommand, environmentDict
    print "Generating VW Training Instances: ", asctime()
    df_train['TrainInstances'] = df_train.apply(genTrainInstances, axis=1)
    print "Finished Generating Train Instances: ", asctime()
    print "Writing Train Instances To File: ", asctime()
    trainInstances = list(df_train['TrainInstances'].values)
    f = open('train_vw.data', 'w')
    f.writelines(["%s\n" % row for row in trainInstances])
    f.close()
    print "Finished Writing Train Instances: ", asctime()
    subprocess.call(trainCommand, env=environmentDict)
    print "Finished Training: ", asctime()
    return
.
.
.
#############################################################
# Function to test a VW model using DataFrame df_test
# - Apply genTestInstances
# - Write new column to flat file
# - Invoke Vowpal Wabbit for prediction
#############################################################
def predict_model():
    global environmentDict, predictCommand, df_test
    print "Building Test Instances: ", asctime()
    df_test['TestInstances'] = df_test.apply(genTestInstances, axis=1)
    print "Finished Generating Test Instances: ", asctime()
    print "Writing Test Instances: ", asctime()
    testInstances = list(df_test['TestInstances'].values)
    f = open('test_vw.data', 'w')
    f.writelines(["%s\n" % row for row in testInstances])
    f.close()
    print "Finished Writing Test Instances: ", asctime()
    subprocess.call(predictCommand, env=environmentDict)
    df_test['y_pred'] = readPredictFile()
    return
Welcome
Hello and Welcome!
I am a software consultant and have been involved with Machine Learning since 2002. A friend of mine and fellow Machine Learning enthusiast, Rohit Sivaprasad of http://www.DataTau.com, suggested I start a blog to share some of my ideas and tips with the data science community.
Here you go!
Steve Geringer