Predictive Modeling from the Trenches
Reject Inference
Sep 05, 2011 | /jeff |
ArrowModel doesn't do reject inference, at least not automatically. This post tries to explain why.
Let's look at an example. Suppose we are building an acquisition score for a bank. The bank does not give loans to everybody who asks for one. Instead it has a credit policy, which for this example will be very simple: approve everyone with FICO > 640. FICO here is a generic risk score that the bank buys from a credit bureau.
The problem is, both training and validation samples are censored - they have no people with FICO below 640. A model trained and validated on such samples might fail spectacularly when used in credit policy instead of FICO - after all, it never saw the really risky cases (rejects).
Reject inference is supposed to solve this problem. There are several methods. For example, parceling, where rejects are assigned good/bad outcome randomly with probability that corresponds to their FICO. Or fuzzy reject inference, where each reject is added twice, once as good, and once as bad, with weights equal to probabilities for this FICO band.
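To make the two methods concrete, here is a minimal Python sketch. The band labels, the bad rates per band, and the function names are made up for illustration; in practice the probabilities would have to come from somewhere (for example, from the bureau's odds chart), and this is not something ArrowModel does for you.

```python
import random

def parcel(rejects, p_bad_by_band, seed=0):
    """Parceling: assign each reject one random outcome (1 = bad, 0 = good)."""
    rng = random.Random(seed)
    return [
        {**r,
         "outcome": 1 if rng.random() < p_bad_by_band[r["band"]] else 0,
         "weight": 1.0}
        for r in rejects
    ]

def fuzzy(rejects, p_bad_by_band):
    """Fuzzy reject inference: add each reject twice, as bad and as good, weighted."""
    rows = []
    for r in rejects:
        p_bad = p_bad_by_band[r["band"]]
        rows.append({**r, "outcome": 1, "weight": p_bad})
        rows.append({**r, "outcome": 0, "weight": 1.0 - p_bad})
    return rows

# Hypothetical bad rates for FICO bands below the 640 cutoff.
p_bad_by_band = {"600-619": 0.20, "620-639": 0.15}
rejects = [{"id": 1, "band": "600-619"}, {"id": 2, "band": "620-639"}]

parceled = parcel(rejects, p_bad_by_band)   # one row per reject
augmented = fuzzy(rejects, p_bad_by_band)   # two weighted rows per reject
```

Either way, the resulting rows are appended to the approved accounts before training. Note that every inferred outcome is only as good as the assumed probabilities.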
Unfortunately there is no way to know how well these methods work before rolling out the new model (it was trained and validated on inferences, not real data), and by then it's already too late.
So we usually recommend 1/N testing instead: approve some accounts below the existing score cuts anyway and use their performance, oversampling if needed.
To those unable or unwilling to do so, we recommend keeping the existing score cut in policy and adding the new score cut as an overlay.
Finally, reject inference can still be done in ArrowModel by feeding it data with rejects added manually.
Site translations
Oct 18, 2010 | /jeff |
The main ArrowModel site now has French and Russian translations. Really. More i18n is underway.
New hosting
Oct 18, 2010 | /jeff |
ArrowModel moved to new hosting. We apologize for the downtime and inconvenience.
Spanish translation
Oct 18, 2010 | /jeff |
ArrowModel speaks Spanish, too.
New Hosting
Oct 17, 2010 | /jeff |
Arrowmodel.com changed hosting. Some things might still be broken, but we'll fix them as soon as we can.
ArrowModel can now import Excel files on Windows
Oct 10, 2009 | /jeff |
It was always possible to get the data from Microsoft Excel to ArrowModel by exporting it to CSV first, but now it's even easier because ArrowModel can import Excel files directly. The data is taken from the first worksheet. First row should contain variable names.
Unfortunately this doesn't work on Mac OS X because in contrast to Windows there is no pre-installed Excel ODBC driver.
Britney Spears Algorithm
Oct 01, 2009 | /jeff |
One of the neat things ArrowModel does for you is automatic creation of dummy variables from categorical predictors. For example, if the dataset contains a field sex with two possible values, M and F, ArrowModel will create a dummy variable that is 1 if sex=F and 0 otherwise.
Or maybe the other way, depending on the distribution.
Getting that distribution for each categorical variable by brute force can take a while.
Luckily, other people have encountered this problem before. An American Scientist article calls it the Britney Spears problem because it appears, among other things, in finding the most popular search terms.
A stream algorithm similar to the one described in that article is implemented in ArrowModel starting with build 1167.
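For the curious, here is a sketch of one well-known algorithm from this family, the Misra-Gries frequent-items summary. It is shown only to illustrate the idea of finding the most common values in a single pass with a fixed number of counters; it is not taken from ArrowModel, and the function name is made up.

```python
def frequent_candidates(stream, k):
    """Misra-Gries summary: one pass, at most k - 1 counters.

    Any value occurring more than len(stream) / k times is guaranteed to
    survive as a key. The stored counts are underestimates, so a second
    pass is needed to get exact frequencies.
    """
    counters = {}
    for value in stream:
        if value in counters:
            counters[value] += 1
        elif len(counters) < k - 1:
            counters[value] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

sex = ["F", "M", "F", "F", "U", "F", "M", "F"]
candidates = frequent_candidates(sex, k=2)   # looks for a majority value
```

With k=2 this reduces to the classic majority-vote algorithm: if one value makes up more than half of the records, it is the one left standing.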
There are also a couple of bugfixes, most related to handling a large number of predictors. If you're working with 10,000 variables or more, get the new beta.
Important: Make sure to uninstall the previous version of ArrowModel before installing the new one. One of the fixes is in a third-party component (QtSql4.dll). Its internal version remains the same, though, so the ArrowModel installer does not realize that this DLL needs to be updated.
Folded root
Oct 01, 2009 | /jeff |
ArrowModel build 1207 is available now. Changes:
The Unreasonable Effectiveness of Data by Peter Norvig et al.
May 06, 2009 | /jeff |
"...invariably, simple models and a lot of data trump more elaborate models based on less data."
Full article (PDF)
Wir sprechen Deutsch
May 06, 2009 | /jeff |
Memory management
May 06, 2009 | /jeff |
When working with large datasets ArrowModel has to decide what to keep in memory and what to leave on the hard drive. Most of the time this decision can be left to the operating system, but there are always exceptions.
After running some benchmarks we tweaked this algorithm a little to improve performance. Now this is not an exact science. There are just too many moving parts. So if you noticed that build 1153 is faster or slower than the one you used before, please let us know. The changes only affect handling of large datasets (with design matrix over 256 MB).
More Data Beats Better Algorithms
May 06, 2009 | /jeff |
Anand Rajaraman argues that more data > better algorithm. Hear, hear.
Good software is still required to crunch all this additional data, though.
When three hundred zeroes is not enough
May 06, 2009 | /jeff |
One of our beta testers encountered an interesting and rare corner case. On a large dataset with very strong predictors, ArrowModel's decision tree algorithm did not pick the best top split. The situation improved a couple of levels down, but the tree was still suboptimal.
A quick look through the log revealed that p values of several predictors were very close to zero. In fact, so close to zero that they were indistinguishable from each other. ArrowModel picked the first one among them, which wasn't the best.
This was interesting on many levels. First, floating point underflow is a very rare problem. After all, the smallest absolute non-zero double precision float is approximately 10^-308. Which leads me to the next point: our beta testers are great. There's no way we could have caught it on existing test cases. You need real-world data to tickle bugs like that.
Anyway, the problem is now resolved. When p values are very low, exponents in the χ² distribution approximation are compared instead. We tried different approximations and jumped through some hoops to handle degrees of freedom and Bonferroni adjustments correctly, but now everything should work fine.
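Here is a small Python sketch of why comparing exponents helps. It uses the χ² distribution with 2 degrees of freedom only because its upper tail has the exact closed form exp(-x/2); the statistics are made up, and ArrowModel's actual approximations and adjustments are not shown here.

```python
import math

def chi2_sf_df2(x):
    """Upper-tail p value of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def chi2_log_sf_df2(x):
    """Natural log of the same p value, computed without exponentiating."""
    return -x / 2.0

# Hypothetical test statistics for two very strong predictors.
stat_a, stat_b = 1500.0, 1600.0

# Both p values underflow to exactly 0.0, so they cannot be ranked.
p_a, p_b = chi2_sf_df2(stat_a), chi2_sf_df2(stat_b)

# The logarithms are still perfectly distinguishable.
log_p_a, log_p_b = chi2_log_sf_df2(stat_a), chi2_log_sf_df2(stat_b)
best = "b" if log_p_b < log_p_a else "a"
```

Picking the first predictor with p = 0.0 would choose "a" here, while the comparison in log space correctly prefers "b".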
Bottom line: if you build decision trees on large datasets with multiple strong predictors, get the new beta.
ArrowModel 0.2
May 06, 2009 | /jeff |
Second beta of ArrowModel is out. Registered users can download it now.
If you are not a registered user, but would like to give ArrowModel a try, please sign up.
Highlights of the new version include:
- CSV import on Windows is at least 2x faster
- Help files and Assistant (help viewer) GUI translations improved
- Can check/uncheck all predictors using main menu or keyboard shortcuts
- Lots of small usability improvements and bugfixes
ARM file format remains unchanged. You will be getting warning messages when opening models created with previous versions, but they should work.
Thank you for the feedback and support.
Second beta (build 888)
May 06, 2009 | /jeff |
Second beta of ArrowModel is out. The biggest new feature is ODBC connectivity.
New site design
May 06, 2009 | /jeff |
We're rolling out the redesigned ArrowModel web site today.
Decision trees
May 06, 2009 | /jeff |
Decision trees are not as popular in credit scoring (the most typical ArrowModel use case) as logistic regression, which is why we didn't initially plan to have them in version 1.
But it turns out that many beta testers wanted trees, and we listened.
The bad news is that adding a rather large new feature so late in the development cycle is guaranteed to push back the release date. And it did.
The good news is that ArrowModel can now do decision trees, and it's still 2008!
If you are a beta tester, download build 1101 now. If you are not a beta tester yet, consider signing up. It's free.
The algorithm used by ArrowModel to create decision trees is based on the venerable CHAID, and it's fast.
Notes:
- As of now decision trees are built completely automatically. There's no way to force a split or trim extra leaves by hand;
- Indicator variables created for logistic regression are not included in the tree. Original (unmodified) variables are used instead.
Not Quite Normal
Jul 26, 2008 | /jeff |
A lot of statistical magic relies on the premise that stuff is normally distributed.
Normal distribution
The normal distribution has nice properties that make things easy analytically, but chances are that, most of the time, you'll see distributions that look like this:

Not quite normal distribution
Of course I'm generalizing and there are exceptions, but it's clear that the good old normal distribution belongs on the endangered species list.
There are several reasons why:
- Counts. We often count events, and a count cannot be negative, hence not really normal either: the number of accidents at a given place and time will tend to be Poisson distributed, an account number will tend to be uniformly distributed, and the waiting time for your next e-mail will tend to be exponential.
- Long tail aka outliers. If income was normally distributed there would be no Bill Gates or Warren Buffett.
- Six degrees of separation, or everything is connected, making the law of large numbers and central limit theorem not really applicable.
So what is the poor modeler to do?
- Look at the distributions before plugging your data straight into the model. Even if you don't have time.
- Winsorize. ArrowModel does it automatically for you.
- Transform variables when needed, e.g., use log(income+K), where K is a constant, or √√income instead of income. I dislike log(income+K) because of the arbitrariness of K.
- Look for a natural break in the distribution to see if a continuous variable can be transformed into an indicator (dummy variable) like this:
CASE WHEN foo > 9 THEN 1 ELSE 0 END
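If you have never winsorized by hand, here is a minimal Python sketch of the idea: values beyond chosen quantiles are pulled back to those quantiles. The quantile levels, the simple rank-based quantile rule, and the function name are assumptions made for this example; this is not ArrowModel's implementation.

```python
def winsorize(values, lower=0.01, upper=0.99):
    """Clamp values to the given lower and upper quantiles (simple rank-based rule)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower * (n - 1))]
    hi = ordered[int(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

# Made-up incomes (in thousands) with one extreme outlier.
income = [20, 25, 30, 35, 40, 45, 50, 55, 60, 1_000_000]
clipped = winsorize(income, lower=0.1, upper=0.9)
```

The outlier is replaced by the largest value inside the chosen range, so it can no longer dominate the fit, while the order of the observations is preserved.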
There are more elaborate ways of dealing with not quite normally distributed data such as Johnson's SU functions and multivariate adaptive regression splines (MARS) which this margin is too narrow to contain.
First post
Jul 26, 2008 | |
Welcome to the newly-minted practical predictive modeling blog. We'll share tips, tricks, and techniques to make your life as a modeler easier.
I'm new at predictive modeling. Help!
Jul 26, 2008 | /jeff |
It's true that there does not seem to be a lot of information on scoring and predictive modeling available online, and that many articles are written in rather heavy language, peppered with statistical jargon. But don't panic. To help you navigate the uncharted waters, here are some good places to start.
- Using Predictive Models by Brian Teasley (part 2, part 3) gives a very nice non-technical introduction;
- Wikipedia is hard to beat when you need to know what generalized linear model or logistic regression is.
There are also a few exceptionally good books. My favorites are:
- Regression Modeling Strategies by Frank E. Harrell, Jr.
- Applied Logistic Regression (Second Edition) by David Hosmer and Stanley Lemeshow
Finally, these two classes by the SAS Institute are worth taking:
- Predictive Modeling Using Logistic Regression
- Advanced Predictive Modeling Using SAS Enterprise Miner
Kolmogorov-Smirnov Test
Jul 26, 2008 | /jeff |
One of the most widely (mis)used measures of scorecard performance is the Kolmogorov-Smirnov test (KS), colloquially known as the vodka test. In this post, I'll explain what KS is, and show a way to calculate it in SQL.
Given two samples of a continuous random variable, the two-sample K-S test is used to answer the following question: did these two samples come from the same distribution or didn't they? The idea is simply to compute the largest absolute difference between the two empirical cumulative distributions and to conclude that there is a significant difference if the difference is large enough.
Consider a risk score that predicts the probability of a customer defaulting (we'll call that 'going bad'). KS is the greatest difference between the cumulative distribution functions of the scores of the good and the bad populations:
$$\mathrm{KS} = \max_s \left| B(s) - G(s) \right|$$
where
- s is the score,
- B(s) is the number of bads with a score less than or equal to s divided by the total number of bads, and
- G(s) is the number of goods with a score less than or equal to s divided by the total number of goods.
KS is often multiplied by 100 for convenience. In many contexts 40 is considered to be a good KS.
Let's try an example. Start with the table t that contains initial data:
| Column | Description |
| --- | --- |
| id | Unique identifier |
| s | Score |
| outcome | 1 is bad, 0 is good |
The following query calculates the KS:
-- If your SQL dialect truncates integer division, cast the sums to a floating-point type first.
SELECT MAX(cdf.b - cdf.g) * 100 "KS"
  FROM ( SELECT a.s "s"
              , SUM(distr.bad_cnt) /
                ( SELECT COUNT(*) FROM t WHERE outcome = 1 ) "b"
              , SUM(distr.good_cnt) /
                ( SELECT COUNT(*) FROM t WHERE outcome = 0 ) "g"
           FROM ( SELECT DISTINCT s FROM t ) a
           JOIN (
                  SELECT s "s"
                       , SUM(outcome) "bad_cnt"
                       , SUM(1 - outcome) "good_cnt"
                    FROM t
                   GROUP BY s
                ) distr
             ON distr.s <= a.s
          GROUP BY a.s
       ) cdf
;
The easiest way to understand how the query works is by decomposing it into smaller pieces. In this case there are five uncorrelated subqueries.
This subquery returns distribution of goods and bads by score:
SELECT s "s"
, SUM(outcome) "bad_cnt"
, SUM(1 - outcome) "good_cnt"
FROM t
GROUP BY s
Note how it relies on the fact that outcome can be either 0 or 1.
This subquery returns the list of all possible score values:
SELECT DISTINCT s FROM t
This subquery returns the total number of bads:
SELECT COUNT(*) FROM t WHERE outcome = 1
This subquery returns the total number of goods:
SELECT COUNT(*) FROM t WHERE outcome = 0
Finally, this subquery (abbreviated for clarity) makes the distributions cumulative:
SELECT a.s "s"
, SUM(distr.bad_cnt) / total_bad "b"
, SUM(distr.good_cnt) / total_good "g"
FROM a
JOIN distr
ON distr.s <= a.s
GROUP BY a.s
Note that it is rather inefficient because the join results in a partial Cartesian product. There's a better way to do the cumulation if your flavor of SQL supports online analytical processing (OLAP) functions:
SELECT s "s"
     , SUM(FLOAT(bad_cnt)) OVER (ORDER BY s) / total_bad "b"
     , SUM(FLOAT(good_cnt)) OVER (ORDER BY s) / total_good "g"
  FROM distr
Now the only thing left to do is to pick the maximum difference. This is the KS.
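If you would rather check the result outside the database, the same calculation can be sketched in a few lines of Python. The function name and the sample data are made up; the sketch takes the absolute difference, in line with the definition of the test, and accumulates the counts in a single pass over the sorted scores instead of joining.

```python
from collections import Counter

def ks(rows):
    """rows: list of (score, outcome) pairs, where outcome 1 is bad and 0 is good."""
    bad = Counter(s for s, o in rows if o == 1)
    good = Counter(s for s, o in rows if o == 0)
    total_bad, total_good = sum(bad.values()), sum(good.values())
    cum_bad = cum_good = 0
    best = 0.0
    for s in sorted(set(bad) | set(good)):
        cum_bad += bad[s]      # running count of bads with score <= s
        cum_good += good[s]    # running count of goods with score <= s
        best = max(best, abs(cum_bad / total_bad - cum_good / total_good))
    return 100 * best

# Made-up sample: bads tend to have lower scores.
rows = [(10, 1), (20, 1), (30, 0), (40, 1), (50, 0), (60, 0)]
ks_value = ks(rows)
```

Like the SQL version, this assumes there is at least one good and one bad in the data.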
ArrowModel beta FAQ
Jul 26, 2008 | /jeff |
ArrowModel goes through its first beta testing. Here are some of the frequently asked questions so far:
- How can I get my data from SAS to ArrowModel?
In SAS, export the data to CSV:
proc export data=mydataset outfile='c:\temp\mydataset.csv' dbms=csv replace; run;
Then import the CSV file in ArrowModel.
- Why is my KS so low, and why does the ROC look like a bow string (left picture) rather than like a bow (right picture)?


You probably decided to override ArrowModel's recommendation in the Stratify step and told it to keep 100% of events and non-events. After all, it is usually a good idea to use all the available data rather than to throw it away.
But there's also rounding. The output of logistic regression is the estimated probability of the event in the training sample (or non-event, if you choose the high value of score to indicate low probability of event, but it does not matter in the end). If the event is rare, this value is going to be close to zero for most observations. To get a score, which in the case of ArrowModel is an integer between 0 and 99, the output of logistic regression is multiplied by 100 and then rounded down to the nearest integer. For many observations small differences in estimated probability will be lost due to this rounding.
Check the score distribution in the Test step. If it is severely skewed, try going back to the Stratify step and restoring the defaults.
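The effect of rounding is easy to see in a tiny Python sketch. The probabilities below are made up, and capping the score at 99 is an assumption added here to keep the result in the 0 to 99 range; only the multiply-by-100-and-round-down step comes from the description above.

```python
import math

def to_score(p):
    """Map an estimated probability to an integer score between 0 and 99."""
    return min(int(math.floor(p * 100)), 99)

# Hypothetical estimated probabilities for a rare event.
probs = [0.002, 0.004, 0.008, 0.009, 0.012, 0.019, 0.031]
scores = [to_score(p) for p in probs]
distinct_scores = len(set(scores))
```

Seven different probabilities collapse into just three score values, and the first four observations become indistinguishable.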
- How can I insert a chart from ArrowModel in my presentation?
Right-click on a chart, select "Save image as..." from the pop-up menu, then use the resulting PNG file.
[UPDATE 5/4] Pictures added to illustrate the differences in ROC curves.
Why look at histograms?
Jul 26, 2008 | /jc |
Statisticians look at histograms the way generals look at maps. There is just no way around it. But if you're not a statistician, what are you supposed to look for?
It's hard to answer, but it is easy to look at a few simple examples.
The first dataset we'll use contains data from the real customers of a bank. One variable, SAVBAL, contains the balance of the customers' savings account.
This is what the histogram of the raw variable looks like:

We clearly have lots of zero values, more than half in fact, since the median is zero.
We want to remember that the vast majority of savings is below $20,000.
We also have a very long tail to the right, which immediately makes us want to take the logarithm of SAVBAL + 1. We do this in SQL with one line after the "select":
log(savbal + 1) "lsav"
Why the + 1? Because then we don't have to deal with log(0), which you may remember is minus infinity. This way, since log(1) = 0, a zero remains a zero.

Again, we have the same large number of zero values.
But now, we can see them more easily as a completely separate bunch.
And that takes us to the main point worth remembering:
Break the population in two and conduct two analyses, one for each population.
In our case, this means separating the customers who save anything from those who do not save at all, an easy step to take with this SQL line:
CASE WHEN savbal > 0 THEN log(savbal) ELSE NULL END "lsav2"

Now, even though the resulting graph is not completely symmetric, nor very close to normal, it is much better than the raw SAVBAL, and this is what we want to use.
If we invested a lot more time, we'd notice that savbal raised to the power 0.1 gives a better approximation of the normal distribution. The SQL needed is:
CASE WHEN savbal > 0 THEN pow(savbal, 0.1) ELSE NULL END "s01"

But the added work needed to find 0.1 is not worth it.
In conclusion, we have identified two populations, savers and non-savers, and the savings are log-normal.
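For readers who work outside SQL, the split can be sketched in Python as follows. The balances and the function name are made up for illustration.

```python
import math

def split_and_log(balances):
    """Separate non-savers from savers and log-transform the savers' balances."""
    non_savers = [b for b in balances if b <= 0]
    log_savings = [math.log(b) for b in balances if b > 0]
    return non_savers, log_savings

# Made-up savings balances; half of the customers save nothing.
savbal = [0, 0, 0, 150.0, 2_500.0, 18_000.0]
non_savers, lsav2 = split_and_log(savbal)
```

Each of the two groups can then be analyzed on its own, and the histogram of the second one is the one worth plotting.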
Receiver Operating Characteristic
Jul 26, 2008 | /jeff |
ROC curves were first used during World War II to graphically show the separation of radar signals from background noise. They are commonly used to graphically show the added value of any predictive model. To plot the receiver operating characteristic, or ROC curve, one plots B(s) vs. G(s) for all values of s. This curve goes from (0, 0) to (1, 1). The curve of an ideal model (complete separation) goes through (0, 1), while the curve of a totally useless model (no separation) is a straight diagonal line. The curve looks like a banana, hence the nickname banana chart.
| Very strong separation | Weak separation |
| --- | --- |
| Excellent model | Mediocre model |
The KS query from the Kolmogorov-Smirnov Test post can be easily modified to return coordinates of the points on the ROC curve:
SELECT s
     , cdf.b "Sensitivity"
     , cdf.g "1-Specificity"
  FROM ( SELECT a.s "s"
              , SUM(distr.bad_cnt) /
                ( SELECT COUNT(*) FROM t WHERE outcome = 1 ) "b"
              , SUM(distr.good_cnt) /
                ( SELECT COUNT(*) FROM t WHERE outcome = 0 ) "g"
           FROM ( SELECT DISTINCT s FROM t ) a
           JOIN (
                  SELECT s "s"
                       , SUM(outcome) "bad_cnt"
                       , SUM(1 - outcome) "good_cnt"
                    FROM t
                   GROUP BY s
                ) distr
             ON distr.s <= a.s
          GROUP BY a.s
       ) cdf
;
In the context of an ROC plot, B(s) is often called sensitivity or true positive fraction, and G(s) is called 1-specificity or false positive fraction.
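As a cross-check, here is a minimal Python sketch that produces the same points, with G(s) on the horizontal axis and B(s) on the vertical axis. The function name and the sample data are made up, and the starting point (0, 0) is added explicitly. It rescans the rows for every score, which is fine for a small example but slow on real data.

```python
def roc_points(rows):
    """Return (G(s), B(s)) pairs for each distinct score s, in ascending order of s."""
    total_bad = sum(1 for _, o in rows if o == 1)
    total_good = len(rows) - total_bad
    points = [(0.0, 0.0)]
    cum_bad = cum_good = 0
    for s in sorted({score for score, _ in rows}):
        cum_bad += sum(1 for score, o in rows if score == s and o == 1)
        cum_good += sum(1 for score, o in rows if score == s and o == 0)
        points.append((cum_good / total_good, cum_bad / total_bad))
    return points

# Made-up sample: (score, outcome), where outcome 1 is bad and 0 is good.
rows = [(10, 1), (20, 1), (30, 0), (40, 1), (50, 0), (60, 0)]
points = roc_points(rows)
```

Feeding these pairs to any plotting tool gives the banana chart.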
SAS language idiosyncrasies
Jul 26, 2008 | /jc |
Bjarne Stroustrup once said that there are only two kinds of programming languages:
- those people always bitch about and
- those nobody uses.
I searched SAS-L for any criticism of SAS and found almost none! That's kind of strange since I know that SAS is widely used.
I have used SAS since it came out on the market in the early 70's. At the time, I was delighted with the DATA step which saved me from writing silly little FORTRAN programs to manipulate my data into the form expected by BMDP. That DATA step is the main reason SAS blew all its competitors out of the water. The rest, as the saying goes, is history. Alas, when a product becomes dominant, it often endows its developers with an undesirable arrogance and a tendency to respond "That's the way we do it!" to all suggestions for improvement.
I only realized the problem with SAS many years later, when I closely studied other programming languages:
SAS is probably among the worst widely used languages I know of.
Here are just a few examples:
There are two ways to write comments in SAS:
/* C-like */
* and Fortran-like (with trailing semicolon);
Neither of those can be nested. To comment out a block of code, one needs to resort to the following trick:
%macro skip;
Stuff here is not executed
%mend skip;
Why? We know it can be fixed. But SAS won't do it.
There's a concept of NULL (missing value) in SAS, but it is not universally applied. For example, a logical operation between a missing value and anything else results in a missing value, which is perfectly logical. But if you compare a missing value to a numeric variable — surprise — the result is NOT a missing value.
Got that? In a comparison with a number, a missing value is treated as if it is, of all things, minus infinity. Why?
The notion of naming convention seems to elude SAS language designers. Compare proc import and proc export:
proc import
datafile='/somewhere/myfile.csv'
out=mydataset
dbms=csv;
run;
proc export
data=mydataset
outfile='/somewhere/myfile.csv'
dbms=csv;
run;
But why not this:
proc import
in='/somewhere/myfile.csv'
out=mydataset
dbms=csv;
run;
proc export
in=mydataset
out='/somewhere/myfile.csv'
dbms=csv;
run;
Which one is easier to remember?
And speaking of proc import, SAS will never finish if launched on UNIX from the command line. Why? SAS note SN-003610 says:
"When trying to use PROC EXPORT or PROC IMPORT in batch mode on UNIX systems, you may receive the following error:
ERROR: Cannot open X display. Check the name/server access authorization.
This happens because, even in batch mode, these procedures try to display the SAS SESSION MANAGER icon, which requires a valid X display. For any version 8 procedures that you want to run in batch mode without a terminal present you will need to use the -NOTERMINAL option when invoking SAS.
For example:
sas myprogram.sas -noterminal
This will prevent the session manager icon from trying to display."
Translation: "SAS will hang forever on proc export, and you won't even see the error message in the log, because the log is not flushed to disk until you kill SAS, and this is not a bug, it's been like that since the dawn of days, and we won't fix it because it's not a bug, it's perfectly OK to hang, but as a workaround, you can use the -noterminal option."
Here is a third example of a problem with proc import: when using it to read Titanic3.csv, a public dataset describing the 1,309 passengers of the Titanic, SAS truncates hundreds of values of name, cabin and home destination without any warning or error. You can get the file here.
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/DataSets
Of course, it is easy to fix, and it will not affect your analysis, but still, is this what you expect from a leading product?
Arbitrary limits are everywhere. You create a string variable and by default its length is limited to 8. You assign something to it and it gets silently truncated. You import a file and the line length is limited to 256. Of course you can change it by using the lrecl= option, but why can't SAS do it?
Proc sql is just like SQL, but not always. GROUP BY a variable works as expected. But can you guess what GROUP BY any expression does?
Nothing!!!
Error messages are not always helpful in identifying the problem. If logistic regression fails to provide any output except for the cryptic "Error: There are no valid observations", what exactly does it mean? Why not just say "Warning: all values of variable FOO are missing", exclude it from the list of predictors, and go on?
You are sorting a dataset in-place, and it's taking too long. You decide to cancel it. The dataset is still there, but it's now empty. Not unsorted, but empty. As in no observations! Of course, everyone knows that you should have used the out= option to redirect the output to another dataset, so that your data can take twice as much disk space.
Proc sql again. Guess what will be the name of the second variable in the new_table:
proc sql;
CREATE TABLE new_table AS
SELECT foo,
COUNT(*) "cnt"
FROM old_table
GROUP BY foo;
quit;
Of course it's _TEMA001, because cnt is the label, not the variable name. Bizarre, but you can make it work with
proc sql;
CREATE TABLE new_table AS
(SELECT * FROM
(SELECT foo,
COUNT(*)
FROM old_table
GROUP BY foo
) x ( foo, cnt )
);
quit;
I can go on like this for a while, but I think you get the idea.
The strange thing is that people who use SAS on a day to day basis tend not to see how unnatural it is. It looks like the Sapir-Whorf hypothesis in action.
But isn't it true that all the old languages have their quirks?
No! While it's true that Cobol and Basic will rot your brain because of the paucity of their features, many old languages were either done right from the start, or evolved into coherent ones: LISP, for instance, SQL, C or R (an interesting alternative to SAS for doing statistics). Together with more recent languages like Java, or Ruby, they are much more consistent than SAS.
Information Value
Jul 26, 2008 | /jeff |
Deciding which predictors to use is one of the key steps in model building. A good place to start is to examine predictors individually to see how good they are in a univariate sense.
Information value is a metric that is often used to tell how good a predictor is. Let's follow the calculations step by step.
- Start by ranking the data by the predictor in question. The number of ranks is not very critical and, in most cases, deciles will do just fine.
- Calculate the total number of goods (total_good_ct) and the total number of bads (total_bad_ct);
- For each rank
  - Calculate the number of goods (good_ct) and the number of bads (bad_ct);
  - good_pct = good_ct / total_good_ct, bad_pct = bad_ct / total_bad_ct;
  - diff_pct = good_pct - bad_pct;
  - info_odds = good_pct / bad_pct;
  - Weight of evidence: woe = log(info_odds);
  - Information value: inf_val = diff_pct * woe;
- Finally, sum up inf_val for all the ranks. This is the predictor's information value.
As you can see, the information value for each rank reflects log odds, but the order of ranks does not have any effect. This nicely takes care of nonlinearity and outliers.
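The steps above can be sketched in Python as follows. The variable names follow the list; the function name and the counts are made up, and the ranking step itself is assumed to have been done already.

```python
import math

def information_value(ranks):
    """ranks: list of (good_ct, bad_ct) pairs, one per rank (for example, per decile)."""
    total_good_ct = sum(good_ct for good_ct, _ in ranks)
    total_bad_ct = sum(bad_ct for _, bad_ct in ranks)
    total = 0.0
    for good_ct, bad_ct in ranks:
        good_pct = good_ct / total_good_ct
        bad_pct = bad_ct / total_bad_ct
        diff_pct = good_pct - bad_pct
        info_odds = good_pct / bad_pct
        woe = math.log(info_odds)   # fails if a rank has no goods or no bads
        total += diff_pct * woe
    return total

# Made-up counts of goods and bads in three ranks.
ranks = [(50, 30), (30, 10), (20, 60)]
iv = information_value(ranks)
```

Reversing or shuffling the ranks gives the same total, and a predictor whose goods and bads are distributed identically across ranks gets an information value of zero. A rank with no goods or no bads breaks the logarithm, so such ranks need to be merged with a neighbor first.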
Ordering predictors by information value and taking the top N is a tempting strategy, but not a very prudent approach. The predictors selected this way can turn out to be redundant, regression is rather sensitive to outliers, and we haven't done anything about nonlinearity yet. But it's a good way to screen out the least likely candidates.

