2008/10/23

Language

What is a computer language? Vocabulary, syntax, semantics, and comments, for starters. There are low-level and high-level languages, abstract and concrete languages, general-purpose and domain-specific languages, first, second, third, and fourth-generation languages, markup languages, modeling languages, specification languages, etc. etc.

Let's pick a programming language, say C++. Out of the box, it's vocabulary consists of words like "class", "public", "double", and so on. Now you start to program in it. You start adding classes, methods, variables, etc. Now what is the vocabulary? It's the original one plus all the new words you defined. So to those who say "Who wants to climb the learning curve of a new language?" I say whenever you program you're defining a new language, like it or not, and creating a learning curve for others. The reason you climb or create the learning curve is to invest some up-front effort in order to save effort later. So I think it's important to understand how to get the biggest ratio of payback to up-front investment in the language that you're creating.

Here's my view of what a computer language should do. There is something you want the computer to do, so there is a model or set of requirements in your head. (In A.I. we have no problem talking about information that's in your head.) It is stated in terms (a language) that is, if you really know what you want, very good at describing what you want. You know it's a good language to represent your ideas, because you can easily change your mind without having it crash on you (i.e. be seriously invalid). You can see an approximation to this language by simply stating in natural language, to someone who understands the subject matter, what you want. If you are disciplined enough to write it out in a functional requirements document, that should do a fairly good job of capturing it. But you should not regard any such model or document as final. It will evolve, over time, like it or not.

Now suppose you had an autonomous program-writing machine that had a basic understanding of the subject matter (a smart, experienced human would do). You could convey the model to the program-writer and after some time have a working software. What this tells you is that your model contained all the information necessary, so it was the least redundant expression of what you wanted, and it was also an implementation language because you handed it to an autonomous program-writing machine (i.e. a code generator) and got back a working software that met your needs.

So, the ideal state we want to be in is to have a modeling language that matches as closely as possible the subject matter in our heads, so we can say what we want most directly. The test of that directness is when we change our mind about what we want, that change is easily expressed in the language (i.e. the language has low redundancy). Then we want that language to also be an implementation language.

This is not a new idea, of course. This is what domain-specific-languages (DSL) are supposed to do. This is what declarative languages are supposed to do. This is what universal modeling languages (UML) are supposed to do, but I suspect that is an oxymoron, because the very nature of a good modeling language is to be domain-specific, not universal. In any case, the redundancy concept gives a good measure of how specific a language is to any given domain.

Few of these existing domain-specific-languages are implementation languages in the sense that they can be handed to an implementation generator (program or human) and have working software come out the other end, without providing more information.

I would like to show a few techniques I've come up with for reducing the redundancy of source code, thus making the source code more like a DSL. Basically, it involves code generation. One way is to design a language, write a parser for it, and have it generate code in a language you already have, like C or C++. That's a simple form of automatic programmer. An even simpler one has been around forever, the macro preprocessor. I call it the poor man's automatic programmer. (That they ever decided it was a bad thing in Java and C# tells me how myopic this field can be.)

List macros

Suppose you have a list of things, and for each of them you may need to write some more or less predictable code. For example, a list of variables. Define a macro like this:

#define MYLIST \
  DEFVAR(A, int, 6) \
  DEFVAR(B, double, 7.0) \
  DEFARR(C, double, 10) \

Now you can use this macro to generate various codes:

#define DEFVAR(nm, typ, init) typ nm = init;
#define DEFARR(nm, typ, siz) typ nm[siz];
  MYLIST
#undef DEFVAR
#undef DEFARR

Generates

int A = 6;
double B = 7.0;
double C[10];

If you want to print all these variables, you could write a routine:

void PrintTheVars(){int i;
  #define DEFVAR(nm, typ, init) \
    printf("%s = %g\n", #nm, nm);
  #define DEFARR(nm, typ, siz) \
    for(i=0;i<siz;i++) \
      printf("%s[%d] = %g\n", #nm, i, nm[i]);
    MYLIST
  #undef DEFVAR
  #undef DEFARR
}

and it generates a routine to print the variables. There are other kinds of code you could also have it generate. Now the redundancy reduction comes because, if you want to add a variable to the list, it is a 1-line text edit, and all the rest of the code is generated automatically and correctly.

Many times I've seen things that would otherwise result in terribly repetitious code, that were very much simplified with list macros.

Please notice, this doesn't involve objects, at least not explicitly. Something is explicit if you the programmer have to think and write code about it. If it is naturally part of the model then it's a good thing. If not, it's a burden. On the other hand, if there are objects underneath, where you don't have to explicitly write code about or think about them, that's OK. The point is that anything that enters the source code, but is not part of the requirements, is a burden. If for example, "customers" are part of your requirements and you have to retain information about them, then it's natural to write code about customer objects. On the other hand, if you're writing a desk calculator program, and you want to parse numbers and operators, there's no particular need to be defining lots of classes.

Flyweight processes

Differential execution

2008/10/21

Redundancy

Redundancy is an easy concept to grasp. It can apply to data that is modified at run-time (i.e. application data) and at dev-time (i.e. source code). Consider a very simple piece of data, a 7-bit ASCII code word. The ASCII code contains 128 characters, and each character corresponds 1-for-1 with a 7-bit code word. If you change one bit in a code word, it simply stands for a different character. Every 7-bit code stands for some character.

Now suppose you add a bit to the code, such as a parity bit, say even parity, so that the total number of 1 bits is even. Now for each ASCII character there is an 8-bit code word, but half of the possible code words do not have a corresponding ASCII character - the invalid code words. Now, if you start from a valid code word, and you change one bit, you have an invalid code word. This is useful, because it lets you tell if the code word has been changed, at least by one bit.

Suppose you add two parity bits to the code word. That allows you to do some error correction, because if only one bit is changed, you can see which valid code words match it most closely. Now the invalid code words outnumber the valid ones by 3 to 1.

This is both good and bad. The good part is, if you have a way to detect that a code word is invalid, then you can use the redundancy to correct errors that have occurred. The bad part is, if you want to change valid code word A to valid code word B you have to change multiple bits. If you are the least bit unreliable, you may make a mistake and be left with an invalid code word. If you don't have that way to detect the error, you have a problem.

Now let's move up to the world of application data. There are valid data structures, that represent actual information, and there are invalid data structures, that do not represent any actual information. For example, if you have a tree data structure with child pointers and parent pointers, the valid tree structures all have the child nodes pointing back to their actual parents. If any parent pointer does not actually point to the parent, the tree structure is invalid. To change valid tree structure A to valid structure B requires multiple changes, and if any change does not happen, the structure is invalid. So why did you need that back-pointer?

Another example: Suppose you have a simple data record R containing scalar or string fields, and all possible values of those fields are allowed, so no matter what data is put into those fields, record R is valid. Any single change leaves it valid. Now suppose you have two copies of R, called R1 and R2, which equal each other, grouped into a higher-level collection called R1R2. Does R1R2 have invalid values? Yes, any value in which R1 and R2 are not equal. So to change R1R2 in a valid way, two identical changes have to made, one to R1, and one to R2. Now since it is possible for R1R2 to be invalid, you have to be concerned about its validity, which was not a concern when there was no redundancy. If R1 and R2 happen to be separated over a network, keeping them in agreement may be difficult and requires unit testing procedures. On the other hand, R by itself requires no testing because it has no invalid values.

Now let's look at dev-time data (i.e. source code). Let's define a valid source code program as one that compiles and does not crash or hang and satisfies some functional requirements A, even if they are different from the functional requirements we actually want, B. If B differs from A very little, we can define the redundancy of the source code as the number of editing changes we need to convert it into a valid program for requirements B. Clearly, the fewer manual edits required (by us sleepy, distracted programmers) the more likely we will get it right without putting in bugs.

Summarizing, to minimize errors, minimize redundancy. If you must have redundancy, have a way to manage it. These ideas are not new. In the world of databases, it is desirable to normalize a database, so that multiple steps are not required to keep it valid. In the world of programming, the Don't Repeat Yourself (DRY) principle (Hunt & Thomas) is getting at the same idea. The reason all this needs to be said is that people still don't do it, and maybe it isn't understood just how seriously this issue should be taken.

Here are some ways this concept is not followed in practice. People still make spurious copies of things and then start blaming each other when the copies get out of sync. Event-driven programming is (IMHO) a symptom of redundancy because events are usually meant to communicate state changes (say between R1 and R2), and when event messages get dropped or duplicated or out of order, the blaming starts again, and often the need to have the separate copies was never fully established.

Ways to handle redundancy

If some redundancy is truly necessary, and events are a lousy way to handle it, what are some good ways? I like ways that can tolerate some invalidity and can correct errors. For example, in the R1R2 case above, the holder of R2 may keep a backup copy of R1, say R1', so if a new copy of R1 is received, the new and old copies can be compared and the changes detected and used to update R2. Same goes for the holder of R1. New copies of R1 can be sent as often or as seldom as desired, without any problems.

What if R1 is a list of items, and changes need to be detected? For this I like a diff algorithm, and I'll give a simple example of one here. It looks a bit long, but when you see what it's doing it's not so bad. This one searches through two string arrays, saOld and saNew. Its output is an integer array iaDiff, in which the value 1 represents a string is in saOld but not in saNew, 2 means a string is in saNew but not in saOld, and 3 means a string is in both. That array can be used, for example, to update saOld to be equal to saNew, or to do other processing.

The inner loop has indexes iOld into saOld, and iNew into saNew. From there it searches forward to find the next matching string. It searches in a triangular order:
iOld+0, iNew+0

iOld+0, iNew+1
iOld+1, iNew+0

iOld+0, iNew+2
iOld+1, iNew+1
iOld+2, iNew+0

and so on. When it finds a match, it knows how many new and old items it had to skip over to get there, so it reports them, and then it reports the match. The outer loop just repeats this until the ends of both lists are reached.


int iOld = 0, nOld = saOld.Count;
int iNew = 0, nNew = saNew.Count;
while(iOld < nOld || iNew < nNew){
    // start a diagonal search to find lines that match
    int nSkipOld = 0, nSkipNew = 0;
    bool bMatchFound = false;
    int nMaxSkip;
    for (nMaxSkip = 0
        ; nMaxSkip <= (nOld-iOld) + (nNew-iNew)
        ; nMaxSkip++
        )
    {
        for (nSkipOld = 0, nSkipNew = nMaxSkip
            ; nSkipOld <= nMaxSkip
            ; nSkipOld++, nSkipNew--
            )
        {
            int iiOld = iOld + nSkipOld;
            int iiNew = iNew + nSkipNew;
            // if reach ends of both lists, that's a match
            if (iiOld == nOld && iiNew == nNew){
                bMatchFound = true; break;
            }
            // if run off end of either one, ignore that
            else if (iiOld >= nOld || iiNew >= nNew){}
            // if we find a real match, stop
            else if (saOld[iiOld] == saNew[iiNew]){
                bMatchFound = true; break;
            }
            // else keep looking
        } // inner loop of search
        if (bMatchFound) break;
    } // outer loop of search
    // assert: we have a match
    int kk;
    // record unmatched old names
    for (kk = 0; kk < nSkipOld; kk++){
        iaDiff.Add(1);
    }
    // record unmatched new names
    for (kk = 0; kk < nSkipNew; kk++){
        iaDiff.Add(2);
    }
    // advance over the unmatched names
    iOld += nSkipOld;
    iNew += nSkipNew;
    // record the match of old and new
    if (iOld < nOld && iNew < nNew){
        iOld++; iNew++;
        iaDiff.Add(3);
    }
} // end of diff loop

The reason I like this is, even though it's a little hairy to write, I can just pick it up as a piece of canned code and tweak it to suit. If the two lists are equal or nearly so, it is very fast, so I can call it as often as I want to keep the two lists in agreement. I can rely on it to tell me what the changes are. I do not need to rely on the owner of one of the lists to remember to send me messages informing me of the changes. It's guaranteed not to miss any changes, and even if it did, it would be my fault, not the owner of the other list.

Since the owner of the list does not have to have source code to maintain data structure to remember who should be notified of changes, and does not have to detect changes so as to broadcast the notifications, and does not have to have unit test code to verify that all of this works, not only is his/her data structure reduced, but the source code is also very much reduced in redundancy. Since I do not have to write source code to register/unregister myself to receive these notifications, my code is reduced. This follows the KISS principle (Keep it Simple, Stupid).

2008/10/20

How to optimize software performance

I'm not going to discuss compiler code optimization, big-O algorithm design, or whatever is considered the "root of all evil". I'm just going to present two different techniques that can be used to find and remove performance problems. (I call these "slugs" - short for slowness bugs, because they only make software slow, not wrong.)

1. Synchronous programs

The first technique, called deep sampling, is for single-thread programs, and the only tool required is a debugger having a way to interrupt execution, manually or on a timer, at a random time. (It is important that it be a true interrupt. The program cannot be allowed to proceed until it gets to a convenient break point, because then you won't see what it was actually doing.) When the program is halted, the call stack, consisting of the current program counter and all return addresses, is recorded. This may be easy or difficult, but it should always be possible. Then the process is repeated some small number of times, up to about 20, if you need that many.

Then those call stacks are examined for instruction addresses, whether they are call instructions or non-call instructions, that appear on multiple call stack samples. For example, "call _main()" should appear on all samples. The key point is that any instruction that appears on X% of call stack samples is responsible for about X% of execution time. So all you have to do is look at the instructions that show up on multiple samples and see which ones you can get rid of.

This seems superficially like what profilers do, but the differences are crucial. Some profilers do sampling, but only of the program counter or part of the call stack. Not good enough. Most profilers do some kind of "instrumentation", where they keep track of how many times each function is called, or how long a function executes, or what percentage of time is spent in each function. Some summarize at the level of basic blocks. These are summarizing at a too-high level, not the function-call level. The only thing you can edit is statements, and what you need to know is which statements are responsible for the most time. That is what deep sampling tells you directly, with no guesswork.

Profilers express a concern with accuracy of timing, but that's not what you need accuracy of. You need accuracy in pinpointing the slug without guesswork, so you can kill it and move on to the next one. I have seen cases where some precise thing that never could have been guessed was on four out of the first five samples. This does not tell the exact amount of time to be saved, but it does tell exactly what to fix. Then, if anyone wants to know the precise time savings, an overall timer (or stopwatch) suffices.

Then the great thing is, you can do it again. This time you may find a completely different slug. Since the first one has been removed, this one really stands out. You can repeat this until you can't find any more slugs that are easy to remove. Depending on how bad the program was to begin with, it can now be many times faster.

I won't moralize on "premature optimization". I will only point out that, since I discovered this technique, I found that often performance problems were due to "galloping generality" (excessively general data structures) and now I heartily avoid those. Oddly enough, the reason often given for using excessively general data structures is that, in principle, they admit more efficient algorithms! So the lesson is: don't complicate the data structure until you have to, and don't solve performance problems until you have them.

I've found this can be easier to say than to do, in the case that events have become central to the software design. (With OOP, the idea of message-passing has acquired cachet. Or maybe it was A.I. that popularized the idea of when-then programming.) I've come to see message-passing as sometimes necessary, due to architectural constraints, but always problematic. Often the performance problems (and bugs) are directly caused by placing excessive reliance on events, in place of something simple. When it becomes clear that this is so, it is not simple to fix. I'll say more about this in another post.

2. Asynchronous programs

The second technique is for programs involving asynchronous message-passing. If you can sample a program counter, that tells you what the program is doing at a point in time, but it does not tell you why. In synchronous programs, the call stack tells you why, because when a call instruction drops the program counter into a function, hoping to get it back some day, it leaves a record of itself on the stack. However, if program A sends a message to program B, and then hopes to get a reply some day, there is no easy way to halt the universe to figure out who is waiting for whom at that particular microsecond.

I have an effective (but laborious) way to track these down. It requires that each communicating process generate a time-stamped log of events of sending, waiting for, and handling messages. These can be merged together so the messages can be traced between programs. What you look for is delays between sending and handling of messages, or repeated transmission of messages due to bogus timeouts. Usually, this indicates something fairly silly that can be fixed, with a big performance reward. Again, this process can be repeated until the whole system is very responsive.

What this blog is about

Hello, and welcome to this blog. It's purpose is to host discussion of:

1. Software performance.

I am disappointed in the general state of this subject. On the one hand, there is discussion of "big-O" notation, and getting the right algorithm. On the other hand, there is drum-beating for "profilers" and selling of same. In my experience, real software is about 1% algorithms, and the rest is things like UI, data structure, and "plumbing" to get information from here to there. While bad algorithms can certainly hurt performance, the rest of the software can also contain big performance problems. In my experience, that's the bulk of the problem. I will show simple methods to speedup real sofware by big factors in less time than a profiler can be run and its output puzzled over.

2. Programmer performance.

I think the subject of programming paradigms is dominated by "bandwagons" such as OOP and functional programming. Those are good ideas in their own right, but are used without any overall framework to analyze problems and propose alternative solutions with pros and cons. Certainly OOP has a methodology, in which mental nouns become classes, and verbs become methods, but it seems to me this leads to incredibly bloated software. If you go back to Alan Kay's original work on OOP, it was supposed to improve on the prior practice of having raw data structure, where all activities consisted of bonking bits. Well, that's fine as far as it goes, but I'm saying we're missing the bigger picture. We program with way too much explicit data structure. We act as if programming is all about objects and methods - the more the merrier. I'm saying: Forget Objects (for now). The question to ask is: What Needs To Be Said, by Whom, to Whom, and When. If this is done, and there is a good "toolkit" of techniques, such as Domain Specific Languages, source code size and maintainability can be improved by large factors.

I will go into these in great detail, and I hope to see some interesting debate.

I do not talk so much about specific tools and languages. Those are fine, but they are only software. What really matters is brainware, the software that goes into your head and controls what you do. After all, A.I. says your head is a computer, and you can't work any better than the software that you've loaded into it. Then if some computer software assists you, that's fine, but without upgrading brainware, no tool will help very much.