Predictive Modeling from the Trenches
More Data Beats Better Algorithms
Apr 07, 2008 | /jeff |
Link
Anand Rajaraman argues that more data > better algorithm. Hear, hear.
Good software is still required to crunch all this additional data, though.
New site design
Apr 02, 2008 | /jeff |
Link
We're rolling out the redesigned ArrowModel web site today.
ArrowModel 0.2
Jan 11, 2008 | /jeff |
Link
Second beta of ArrowModel is out. Registered users can download it now.
If you are not a registered user, but would like to give ArrowModel a try, please sign up.
Highlights of the new version include:
- CVS import on Windows is at least 2x faster
- Help files and Assistant (help viewer) GUI translations improved
- Can check/uncheck all predictors using main menu or keyboard shortcuts
- Lots of small usability improvements and bugfixes
ARM file format remains unchanged. You will be getting warning messages when opening models created with previous versions, but they should work.
Thank you for the feedback and support.
Wir sprechen Deutsch
Sep 17, 2007 | /jeff |
Link
Second beta (build 888)
Jul 19, 2007 | /jeff |
Link
Second beta of ArrowModel is out. The biggest new feature is ODBC connectivity.
New hosting
May 09, 2007 | /jeff |
Link
ArrowModel moved to new hosting. We apologize for the downtime and inconvenience.
ArrowModel beta FAQ
May 04, 2007 | /jeff |
Link
ArrowModel goes through its first beta testing. Here are some of the frequently asked questions so far:
- How can I get my data from SAS to ArrowModel?
In SAS, export the data to CSV:
proc export data=mydataset outfile='c:\temp\mydataset.csv' dbms=csv replace; run;
Then import the CSV file in ArrowModel.
- Why is my KS so low, and why does the ROC look like a bow string (left picture) rather than like a bow (right picture)?


You probably decided to override ArrowModel's recommendation in the Stratify step and told it to keep 100% of events and non-events. After all, it is usually a good idea to use all the available data rather than to throw it away.
But there's also rounding. The output of logistic regression is the estimated probability of the event in the training sample (or non-event, if you choose the high value of score to indicate low probability of event, but it does not matter in the end). If the event is rare, this value is going to be close to zero for most observations. To get a score, which in the case of ArrowModel is an integer between 0 and 99, the output of logistic regression is multiplied by 100 and then rounded down to the nearest integer. For many observations small differences in estimated probability will be lost due to this rounding.
Check the score distribution in the Test step. If it is severely skewed, try going back to the Stratify step and restoring the defaults.
- How can I insert a chart from ArrowModel in my presentation?
Right-click on a chart, select "Save image as..." from the pop-up menu, then use the resulting PNG file.
[UPDATE 5/4] Pictures added to illustrate the differences in ROC curves.
Not Quite Normal
Apr 19, 2007 | /jeff |
Link
A lot of statistical magic relies on the premise that stuff is normally distributed.
Normal distribution
The normal distribution has nice properties that make things easy analytically, but chances are that, most of the time, you'll see distributions that look like this:

Not quite normal distribution
Of course I'm generalizing and there are exceptions, but it's clear that the good old normal distribution belongs on the endangered species list.
There are several reasons why:
- Counts. We often count events and the count cannot be negative, hence not really normal either: the number of accidents somewhere and some place will tend to be Poisson distributed, the number of an account number will tend to be uniformly distributed and the waiting time for your next e-mail will tend to be exponential.
- Long tail aka outliers. If income was normally distributed there would be no Bill Gates or Warren Buffett.
- Six degrees of separation, or everything is connected, making the law of large numbers and central limit theorem not really applicable.
So what is the poor modeler to do?
- Look at the distributions before plugging your data straight into the model. Even if you don't have time.
- Winsorize. ArrowModel does it automatically for you.
- Transform variables when needed, e.g., use log(income+K), where K is a constant, or √√income instead of income. I dislike log(income+K) because of the arbitrariness of K.
- Look for a natural break in the distribution to see
if a continuous variable can be transformed into an indicator
(dummy variable) like this:
CASE WHEN foo > 9 THEN 1 ELSE 0 END
There are more elaborate ways of dealing with not quite normally distributed data such as Johnson's SU functions and multivariate adaptive regression splines (MARS) which this margin is too narrow to contain.
Spanish translation
Apr 16, 2007 | /jeff |
Link
ArrowModel speaks Spanish, too.
Site translations
Apr 01, 2007 | /jeff |
Link
The main ArrowModel site now has French and Russian translations. Really. More i18n is underway.
Information Value
Feb 25, 2007 | /jeff |
Link
Deciding which predictors to use is one of the key steps in model building. A good place to start is to examine predictors individually to see how good they are in a univariate sense.
Information value is a metric that is often used to tell how good a predictor is. Let's follow the calculations step by step.
- Start by ranking the data by the predictor in question. The number of ranks is not very critical and, in most cases, deciles will do just fine.
- Calculate the total number of goods (total_good_ct)
and the total number of bads (total_bad_ct); - For each rank
- Calculate the number of goods (good_ct)
and the number of bads (bad_ct); - good_pct = good_ct / total_good_ct,
bad_pct = bad_ct / total_bad_ct; - diff_pct = good_pct - bad_pct;
- info_odds = good_pct / bad_pct;
- Weight of evidence: woe = log(info_odds);
- Information value: inf_val = diff_pct * woe;
- Calculate the number of goods (good_ct)
- Finally, sum up inf_val for all the ranks. This is the predictor's information value.
As you can see, the information value for each rank reflects log odds, but the order of ranks does not have any effect. This nicely takes care of nonlinearity and outliers.
Ordering predictors by information value and taking the top N is a tempting strategy, but not a very prudent approach. The predictors selected this way can turn out to be redundant, regression is rather sensitive to outliers, and we haven't done anything about nonlinearity yet. But it's a good way to screen out the least likely candidates.
Receiver Operating Characteristic
Feb 24, 2007 | /jeff |
Link
ROC curves were first used during World War II to graphically show the separation of radar signals from background noise. They are commonly used to graphically show the added value of any predictive model. To plot the receiver operating characteristic, or ROC curve, one plots B(s) vs. G(s) for all values of s. This curve goes from (0, 0) to (1, 1). The curve of an ideal model (complete separation) goes through (0, 1), while the curve of a totally useless model (no separation) is a straight diagonal line. The curve looks like a banana, hence the nickname banana chart.
![]() |
![]() |
| Very strong separation | Weak separation |
| Excellent model | Mediocre model |
The KS query from this post can be easily modified to return coordinates of the points on the ROC curve:
SELECT s
, cdf.b "Sensitivity"
, cdf.g "1-Specificity"
FROM ( SELECT a.s "s"
, SUM(distr.bad_cnt) /
( SELECT COUNT(*) FROM t WHERE outcome = 1 ) "b"
, SUM(distr.good_cnt) /
( SELECT COUNT(*) FROM t WHERE outcome = 0 ) "g"
FROM ( SELECT DISTINCT s FROM t ) a
JOIN (
SELECT s "s"
, SUM(outcome) "bad_cnt"
, SUM(1 - outcome) "good_cnt"
FROM t
GROUP BY s
) distr
ON distr.s <= a.s
GROUP BY a.s
) cdf
;
In the context of an ROC plot, B(s) is often called sensitivity or true positive fraction, and G(s) is called 1-specificity or false positive fraction.
Kolmogorov-Smirnov Test
Feb 23, 2007 | /jeff |
Link
One of the most widely (mis)used measures of scorecard performance is the Kolmogorov-Smirnov test (KS), colloquially known as the vodka test. In this post, I'll explain what KS is, and show a way to calculate it in SQL.
Given two samples of a continuous random variable, the two sample K-S test is used answer the following question: did these two samples come from the same distribution or didn't they? The idea is simply to compute the largest absolute difference between the two empirical cumulative distributions and to conclude that there is a significant difference if the difference is large enough.
Consider a risk score that predicts the probability of a customer defaulting (we'll call that 'going bad'). KS is the greatest difference between the cumulative distribution functions of the scores of the good and the bad populations:
where
- s is the score,
- B(s) is the number of bads with a score less than or equal to s divided by the total number of bads, and
- G(s) is the number of goods with a score less than or equal to s divided by the total number of goods.
KS is often multiplied by 100 for convenience. In many contexts 40 is considered to be a good KS.
Let's try an example. Start with the table t that contains initial data:
| Column | Description |
| id | Unique identifier |
| s | Score |
| outcome | 1 is bad, 0 is good |
The following query calculates the KS:
SELECT MAX(cdf.b - cdf.g) * 100 "KS"
FROM ( SELECT a.s "s"
, SUM(distr.bad_cnt) /
( SELECT COUNT(*) FROM t WHERE outcome = 1 ) "b"
, SUM(distr.good_cnt) /
( SELECT COUNT(*) FROM t WHERE outcome = 0 ) "g"
FROM ( SELECT DISTINCT s FROM t ) a
JOIN (
SELECT s "s"
, SUM(outcome) "bad_cnt"
, SUM(1 - outcome) "good_cnt"
FROM t
GROUP BY s
) distr
ON distr.s <= a.s
GROUP BY a.s
) cdf
;
The easiest way to understand how the query works is by decomposing it into smaller pieces. In this case there are five uncorrelated subqueries.
This subquery returns distribution of goods and bads by score:
SELECT s "s"
, SUM(outcome) "bad_cnt"
, SUM(1 - outcome) "good_cnt"
FROM t
GROUP BY s
Note how it relies on the fact that outcome can be either 0 or 1.
This subquery returns the list of all possible score values:
SELECT DISTINCT s FROM t
This subquery returns the total number of bads:
SELECT COUNT(*) FROM t WHERE outcome = 1
This subquery returns the total number of goods:
SELECT COUNT(*) FROM t WHERE outcome = 0
Finally, this subquery (abbreviated for clarity) makes the distributions cumulative:
SELECT a.s "s"
, SUM(distr.bad_cnt) / total_bad "b"
, SUM(distr.good_cnt) / total_good "g"
FROM a
JOIN distr
ON distr.s <= a.s
GROUP BY a.s
Note that it is rather inefficient because the join results in a partial Cartesian product. There's a better way to do the cumulation if your flavor of SQL supports online analytical processing (OLAP) functions:
SELECT s "s"
, SUM(FLOAT(bad_cnt)) OVER (ORDER BY s) / total_bad "b"
, SUM(FLOAT(good_cnt)) OVER (ORDER BY s) / total_bad "g"
FROM distr
Now the only thing left to do is to pick the maximum difference. This is the KS.
I'm new at predictive modeling. Help!
Feb 22, 2007 | /jeff |
Link
It's true that there does not seem to be a lot of information on scoring and predictive modeling available online, and that many articles are written in rather heavy language, peppered with statistical jargon. But don't panic. To help you navigate the unchartered waters, here are some good places to start.
- Using Predictive Models by Brian Teasley (part 2, part 3) gives a very nice non-technical introduction;
- Wikipedia is hard to beat when you need to know what generalized linear model or logistic regression is.
There are also a few exceptionally good books. My favorites are:
- Regression Modeling Strategies by Frank E. Harrell, Jr.
- Applied Logistic Regression (Second Edition) by David Hosmer and Stanley Lemeshow
Finally, these two classes by the SAS Institute are worth taking:

