Predictive Modeling from the Trenches
SAS language idiosyncrasies
May 08, 2007 | /jc |
Link
Bjarne Stroustrup once said that there are only two kinds of
programming languages:
- those people always bitch about and
- those nobody uses.
I searched SAS-L for any criticism of SAS and found almost
none! That's kind of strange since I know that SAS is widely used.
I have used SAS since it came out on the market in the early 70's. At the time, I was delighted with the DATA step which saved me from writing silly little FORTRAN programs to manipulate my data into the form expected by BMDP. That DATA step is the main reason SAS blew all its competitors out of the water. The rest, as the saying goes, is history. Alas, when a product becomes dominant, it often endows its developers with an undesirable arrogance and a tendancy to respond "That's the way we do it!" to all suggestions for improvement.
I only realized the problem with SAS many years later when I studied
closely other programming languages:
SAS is probably among the worst widely used languages I know of.
Here are just a few examples:
There are two ways to write comments
in SAS:
/* C-like */
* and Fortran-like (with trailing semicolon);
Neither of those can be nested. To comment out a block of code, one needs to resort to the following trick:
%macro skip;
Stuff here is not executed
%mend skip;
Why? We know it can be fixed. But SAS won't do it.
There's a concept of NULL (missing value) in SAS, but it is not universally applied. For example, a logical operation between a missing value and anything else results in a missing value, which is perfectly logical. But if you compare a missing value to a numeric variable — surprise — the result is NOT a missing value.
Got that? In a comparison with a number, a missing value is
treated as if it is, of
all things, minus infinity. Why?
The notion of naming convention seems to elude SAS language designers. Compare proc import and proc export:
proc import
datafile='/somewhere/myfile.csv'
out=mydataset
dbms=csv;
run;
proc export
data=mydataset
outfile='/somewhere/myfile.csv'
dbms=csv;
run;
But why not this:
proc import
in='/somewhere/myfile.csv'
out=mydataset
dbms=csv;
run;
proc export
in=mydataset
out='/somewhere/myfile.csv'
dbms=csv;
run;
Which one is easier to remember?
And speaking of proc import, SAS will never finish if launched on UNIX from the command line. Why? SAS note SN-003610, says:
"When trying to use PROC EXPORT or PROC IMPORT in batch mode on UNIX systems, you may receive the following error:
ERROR: Cannot open X display. Check the name/server access authorization.
This happens because, even in batch mode, these procedures try to display the SAS SESSION MANAGER icon, which requires a valid X display. For any version 8 procedures that you want to run in batch mode without a terminal present you will need to use the -NOTERMINAL option when invoking SAS.
For example:
sas myprogram.sas -noterminalThis will prevent the session manager icon from trying to display."
Translation: "SAS will hang forever on proc export, and you won't even see the error message in the log, because the log is not flushed to disk until you kill SAS, and this is not a bug, it's been like that since the dawn of days, and we won't fix it because it's not a bug, it's perfectly OK to hang, but as a workaround, you can use the -noterminal option."
Here is a third example of a problem with proc import: when using it to read Titanic3.csv, a public dataset describing the 1,309 passengers of the Titanic, SAS truncates hundreds of values of name, cabin and home destination without any warning or error. You can get the file here.
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/DataSets
Of course, it is easy to fix; and it will not
affect your analysis, but still, is this what you expect from a leading product?
Arbitrary limits are everywhere. You create a string variable
and by default its length is limited to 8. You assign something to it
and it gets silently truncated. You import a
file and the line length is limited to 256. Of course you can change
it by using the lrecl=
option, but why can't SAS do it?
Proc sql is just like SQL, but not always.
GROUP
BY a variable works as expected.
But can you guess what GROUP BY any expression does?
Nothing!!!
Error messages are not always helpful in identifying the problem. If logistic regression fails to provide any output except for cryptic
"Error: There are no valid observations",what exactly does it mean? Why not just say
"Warning: all values of variable FOO are missing"exclude it from the list of predictors, and go on?
You are sorting a dataset in-place, and it's taking too long. You decide to cancel it. The dataset is still there, but it's now empty. Not unsorted, but empty. As in no observations! Of course, everyone knows that you should have used the out= option to redirect the output to another dataset, so that your data can take twice as much disk space.
Proc sql again. Guess what will be the name of the second variable in the new_table:
proc sql;
CREATE TABLE new_table AS
SELECT foo,
COUNT(*) "cnt"
FROM old_table
GROUP BY foo;
quit;
Of course it's _TEMA001, because cnt is the label,
not the variable name. Bizarre, but you can make it work with
proc sql;
CREATE TABLE new_table AS
(SELECT * FROM
(SELECT foo,
COUNT(*)
FROM old_table
GROUP BY foo
) x ( foo, cnt )
);
quit;
I can go on like this for a while, but I think you get the idea.
The strange thing is that people who use SAS on a day to day basis
tend not to see
how unnatural it is. It looks like the
Sapir-Whorf
hypothesis
in action.
But isn't it true that all the old
languages have their
quirks?
No! While it s true that Cobol and Basic will rot your brain
because of the paucity of
their features, many old languages were either done right from the start,
or evolved into coherent ones:
LISP, for instance, SQL, C or R (an interesting alternative to SAS for doing statistics).
Together with more recent
languages like Java, or Ruby, they are much more consistent than SAS.
Why look at histograms?
May 07, 2007 | /jc |
Link
Statisticians look at histograms, the way generals look at maps. There is just no way around it. But if you're not a statistician, what are you supposed to look for?
It's hard to answer, but it is easy to look at a few simple examples.
The first dataset we'll use contains data from the real customers of a bank. One variable, SAVBAL, contains the balance of the customers' savings account.
This is what the histogram of the raw variable looks like:

We clearly have lots of zero values, more than half in fact, since the median is zero.
We want to remember that the huge majority of savings is below $ 20,000.
We also have a very long tail to the right, which immediately makes us want to take the logarithm of SAVBAL + 1. We do this in SQL with one line after the "select"
log(savbal + 1) "lsav"
Why the + 1 ? Because we won't have to deal with Log(0), which you may remember is minus infinity. This way, since log(1) = 0 , a zero remains a zero.

Again, we have the same large number of zero values.
But now, we can see them more easily as a completely separate bunch.
And that takes us to the main point worth remembering:
Break the population in two and conduct two analyses, one for each population.
In our case, this means separating the customers who save anything from those who do not save at all, an easy step to take with this SQL line
CASE WHEN savbal > 0 THEN log(savbal) ELSE NULL END "lsav2"

Now, even though the resulting graph is not completely symmetric, nor very close to normal, it is much better than the raw SAVBAL, and this is what we want to use.
If we invested a lot more time, we'd notice that savbal raised to the power 0.1 gives a better approximation of the normal distribution. The SQL needed is
CASE WHEN savbal > 0 THEN pow(savbal, 0.1) ELSE NULL END "s01"

But the added work needed to find 0.1 is not worth it.
In conclusion, we have identified two populations, savers and non-savers, and the savings are log-normal.