How to Reduce the Contagious Power of ‘const’

For the purpose of optimization, the C++ programming language decided that the users should manually specify ‘const’ annotations for variables that they know will never be modified. While I admit the possible usefulness of this annotation, I often see it misused in the code base I’m working on, and this often results in a contagious phenomenon where lots of variables and parameters need to be annotated as ‘const’.

In this post, I reason philosophically about the true meaning of the ‘const’ annotation, and propose some tentative principles regarding the use of it. By following these principles, I hope we can reduce the contagious power of ‘const’ to the minimum.

First, the rule for ‘const’ annotations for variables:

Rule 1. As a variable, declare yourself as ‘const’ if you want to be inlined, or if you don’t want to be modified.

By this principle, these examples are legitimate:

const int max = 1000;
const char* name = "permanent name";

These declarations can be global variables as well as local variables. Declaring them as ‘const’ can facilitate the inlining opportunities of them, as well as prevent unwanted modification of data.

Now we just have one other rule. This one concerns the ‘const’ annotations on functions:

Rule 2. As a function, be self-disciplined about what you take, but be relaxed on what you give.

This means, put ‘const’ on the parameters whenever you are sure that you will by no chance modify the input argument, but NEVER put ‘const’ on the return type.

The reason behind this may take a little bit of philosophical thinking. By putting ‘const’ on the input parameter, the function is basically declaring: ”Whoever passes me this argument, I promise not to modify it.” “I’m side-effect free on this parameter.” “I’m pure on this parameter.”

This is a good thing (for C++), because you give the callers of the function more freedom of what they can give you. If you don’t declare your “pureness” (on the parameter), some callers may not be able to pass you certain data because they don’t want their data to be accidentally modified by you. They need their original values intact, because they are going to use them after you return.

So in general, this is a good thing to do in C++:

  void foo(const char* in);

But there is a catch here: whether you can put ‘const’ here depends on the function body and any calls you make inside the function. You can’t just try to put ‘const’ annotation on the parameter. The annotation should be a property of the function, and not just to satisfy some caller’s requirements. Whenever you put ‘const’ on the parameter type, you have to make sure that any functions you call don’t modify this parameter either. That is to say this is a transitive property. For example, something like this will not work:

  void foo(const char* in) {
    foo2(in);
  }

  void foo2(char* in2) { ... }

Is this to say that you should declare foo2 as  void foo2(const char* in2)? It all depends on whether foo2 actually modifies its argument or not. If it does, then there is no way you can declare in2 as ‘const’, consequently you cannot declare foo as taking a const char* either. Then the type of foo should be ”void foo(char* in)”, not having the ‘const’.

There is no way you should use ‘const’ in foo then, because the helper foo2 modifies the data. You have to say this honestly: “I may modify the contents of in, because one of the functions I call modifies it.”

This is where it can get painful in C++, because the other functions you call may be written by other people and they may not follow the same principles I’m describing here. They may not have ‘const’ on their parameter types even though they haven’t modified the parameter in their function body. This is not their fault, because writing ‘const’ is so boring. Some automatic inference can certainly be helpful in these cases, but unfortunately inference for ‘const’ is not provided by C++.

On the other hand, why should you NOT put ‘const’ on the return type? This is because if you do, then you are basically announcing: “Whoever calls me cannot modify the value that I return!”

This like saying this in your will: “I will give my house to my son, but he is not allowed to modify it.” Well, the law may give you the right to specify this, but what good can this do to your son? You are limiting the happiness you can provide. More over, can this kind of restriction really be enforced? Who has the energy or right to keep an eye on your son and restrict his action on the house? This is just impossible to work.

 

Coming back to the context of programming, what’s the point of not allowing the callers to modify what you GIVE them? You, the function, will be dead after the return. You will not need the return value any more. By returning it, you should be giving up control of it to the caller. It should be up to the caller, and not you, to decide whether to modify the return value or not. Even if you put ‘const’ annotations in the return type, they are usually ignored by the callers. They just use a const_cast whenever the ‘const’ gets in their way. This makes the ‘const’ annotation on return types completely useless.

If you put ‘const’ on the return type and the callers respect it, then it will contaminate the whole data path wherever the return value goes. This is nasty.

So in general, this is not a good thing to do:

  const char* bar();

Because your caller would have to be like this:

void baz() {
  const char* ret = bar();
  bizaar(ret);
}

And the receiver of the return value (bizaar) must be defined as something like:

  void bizaar(const char* in) {
    bizaar2(in);
  }

And now bizaar2 needs to be declared as:

  void bizaar(const char* in);

And so on… You see how the contagion started? Because bar() returned a const value, which is unnecessarily restrictive. So in general, it is not a good idea to return ‘const’.

Back to the Future of Databases

Why do we need databases? What a stupid question. I already heard some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, we don’t have enough memory to hold all the data. But why can’t we just serialize and store the objects to disk and load them when we need them? The answer is yes we can, but not in Unix, because Unix manages memory as “pages”, not as “objects”. There are systems who lived before Unix that manage memory as objects, and it worked really well. Here are some pictures of them:

system-38

lisp-machine

Oberon

Those systems never needed, and will never need, databases. Data integration is seamless. You don’t need to know the existence of a “disk”, a “file system”, or a “database”. You can just allocate objects and work on them. You can pretend that you can allocate infinite number of objects. Unfortunately most of those systems were terribly expensive and no single person could afford them. Plus they had problems in other aspects of their design. Finally, they died out.

But this is not to say that there is nothing we can learn from their design. On the contrary, their ways are far superior than the current state-of-the-art of Unix-based systems. In comparison, Google’s “Big Table” and Apache’s “HBase” are just clumsy imitations of them. They need much more work. They will never be seamless.

But any how, Unix rules the world. We can live with it, but it is just mediocre. Please don’t believe everything that the modern Operating Systems books tell you. Sometimes you have to look further into the past for the future. As Einstein said, “nothing is more needed to overcome the modernist’s snobbishness.”

Unix used to be free, but you get what you pay for. Although there is a thing called “virtual memory”, your programs can’t just allocate objects and then operate on them without any knowledge about the “file system”. Nothing works seamlessly in Unix. It is more like a “non-operating system” than an “operating system”. It leaves too much work for you to do, and leaves more than enough rope to hang yourself.

That is why people “invented” databases. The combination “Unix + databases” is supposed to be a cheap replacement for those advanced systems where programs don’t need to know the existence of such a second-level data storage layer, whether it is a file system or a database. But because of some irreparable design issues, Unix still can’t achieve that even with the help of databases. However, databases were somehow considered a big thing, and people who made it became the richest men in the world.

Consequently, you have to take database classes if you want a Computer Science degree. So here is an ultimate cheat sheet for those who really want to know what a database is. You will not need to sit through a semester’s course if you remember the few things that I put below. Trust me, many students got A+’s because I told them this ;-)

The so-called “primary keys” of a database table are in essence persistent memory addresses. When serializing an object, you can’t just put the memory address of an object onto disk. The address may not be the same when the object is reloaded into memory. This is why you need primary keys. In a sense, “keys” are a more general notion than “addresses” — addresses are just keys that are integers.

But what are “foreign keys“? Well, they are persistent pointers (i.e. references to primary keys). Whenever you need to dereference a pointer, you do a “join” in the database.

A database “schema” is in essence a “structure type”, like the struct definition in C. For example, the schema created by the following SQL statement

CREATE TABLE Students ( sid CHAR(20),
                        name CHAR(20),
                        login CHAR(20),
                        age INTEGER,
                        gpa REAL )

is equivalent to the C struct

struct student {
  char* sid;
  char* name;
  char* login;
  int age;
  double gpa;
}

That’s almost the whole story. You have addresses, pointers, dereference operation, structure types, so now you can implement things like linked lists, graphs etc. With them, you can implement something like Dijkstra’s shortest path algorithm in a database. You just need to take a data structure class, and then translate what you learned there into a database language like SQL.

What is SQL then? It is a “non-programming language”. SQL wasn’t designed for programmers. It was designed for manual input by human operators (usually non-technical people like accountants). You type in a “query”, and the computer prints out a “response”. That is why it is called a “query language”. This language does its job for human operators, but it was then abused. It was interfaced with computer programs (in place of humans) using band-aids and then used to write serious programs. I doubt if those people knew what they were doing, but it just happened, like many other silly things. The result is a fragile system held together by band-aids. You have to be very careful otherwise you lose blood.

If you really want to learn SQL, here is the cheat sheet for it:

The query

SELECT Book.title
 FROM Book
 WHERE price > 100

is equivalent to the Lisp expression

(map (lambda (b) b.title)
     (filter (lambda (p) (> p 100)) Book)

This program is then sent to the “database engine” for execution. That is, we move the program to the data, instead of loading the data to the program. And that’s also the principle behind Google’s MapReduce.

The problem with SQL is that you need yet another layer of language before programs can operate the database. SQL is a weak and quirky language. It is not Turing-complete and at some places it doesn’t “compose”. You need to combine it with a decent language before you can write serious programs. Your programs send SQL commands to the database to store and load data, just like a human operator would do. This is a very low-tech way of data integration. It is error-prone, inefficient and subject to security risks such as “SQL injection”.

Maybe you have seen, for some weird reasons we are trapped in the Dark Ages of computer programming. We are not supposed to be here since better designed systems already existed. It would be foolish to dismiss them as failures. They are just ahead of their times. By looking to the past, we may see a way back to the future.

Why is Indexing Faster Than Binary Search

We all know that indexing into an array takes only O(1) time, while searching for a number in a sorted array takes O(n) time with linear search, and O(log n) time with binary search. But why is indexing so fast? What is the secret sauce?

The reason is really about the nature of indexing — how it is implemented in a circuit. In order to illustrate this, let me show you an “addressing circuit”.

addressing cuicuit

Here, A and B are the two-bit address lines, they represent the indices: 00, 01, 10, 11. The output Z, Y, X and W are the selectors of the items in the array. Notice that an output selector is enabled only when both of the input lines of the corresponding AND gate is “1″.

Now, ignore the input B and just look at A. See how its signal flows through the direct wires and the inverters. We can make the following observations:

  • When A is “1″, then the AND gate of X and W will receive a “1″ on one of their input ports, while the AND gate of Z and Y will receive a “0″ on one of their input puts.
  • On the other hand, if A is “0″, then the AND gate of X and W will receive a “0″ on one of their input ports, while the AND gate of Z and Y will receive a “1″ on one of their input puts.

From the above, I hope you have seen that the value of A partially selects half of the AND gates — it is either the set {X, W} or {Z, Y}. By “partially select”, I mean they are not fully selected, because we haven’t taken B into account. At this point, I hope you have noticed that A is in fact doing one step of a “binary search”.

Now we do a similar thing, but this time focus on B and ignore A. You should see a similar thing: depending on the value of B, either we partially select {Y, W}, or we partially select {Z, X}. So we can also think of B as doing one step of a “binary search”.

Now, we see that A and B are each a step of a binary search, and it is interesting to see that B’s selection will cut A’s selection evenly, whether A is 0 or 1. We can say the same thing vice versa: A’s selection will cut B’s selection evenly, whether A is 0 or 1.

Also notice that the selection of A and B can happen at the same time. That means, when they work simultaneously, it takes just O(1) for a binary search through an array of length 4. If we generalize this circuit to N bits of input, then within O(1) time, we can do a binary search through an array of length 2N.

This explains why indexing an array is faster than binary search, because it is a parallel binary search where (log n) steps happen at the same time.

Why I Don’t Care About Purely Functional Programming Languages

To write programs in a purely functional programming language is to live in a wired world.

There are no electromagnetic waves nor sound waves in such a world, so you don’t have wifi, cell phones, satellite televisions, radios, etc. You don’t even have light or sound. In other words, everything is blind and deaf.

All information must pass through wires or pipes, connected by switch boxes which we call “monads”. You must carefully route all the wires and pipes and connect them correctly before any information processing device, including your eyes and ears, can properly work.

If you would like to live in such a world, please use purely functional programming languages.

Undecidability Proof of Halting Problem without Diagonalization

As a TA for a graduate level Theory of Computation course, I don’t understand why we have to teach the diagonalization proof of the undecidability of the halting problem. It is much easier if we just use a normal argument about functions in a programming language. Basically, functions correspond to Turing machines in a theory of computation context. In order to improve the situation, I came up with a simpler formulation based on the view of “Turing machines as functions”.

Suppose that we have this function Accept(M,w) which takes two arguments, a function M and its input w, and can tell you whether or not M(w) will output True. Basically, suppose that Accept is the hypothetical solver for the halting problem. To connect to usual theory of computing terminology, the fuction M corresponds to a Turing machine, True corresponds to reaching an accept state, and False corresponds to reject state. Note that although the argument M may go into infinite loops, never will Accept.  Accept always returns True or False in a finite amount of time.  Accept determines this without actually running M because M may go into an infinite loop and never return. This way of finding properties of programs without running them is usually called static analysis. It has the same essence as fortune-telling and all other sciences.

Now we come to our simple contradiction: does the following expression (call it F) return True or False?

Accept(λm.not(Accept(m,m)), λm.not(Accept(m,m)))

It turns out that this question cannot be answered at all! If F returns True, then when we actually apply the function λm.not(Accept(m,m)) to the argument λm.not(Accept(m,m)), it should return True, right? But when we apply it, we get

not(Accept(λm.not(Accept(m,m)), λm.not(Accept(m,m))))

Notice that it is exactly the negation of the original expression F. This means that F should return False (because the “actual run” returns False). On the other hand, if F returns False, then when we actually apply the function on its argument, it will return True. Thus the contradiction. QED.

Some may argue that I’m implicitly using “diagonalization” here. Well, you can say that they are equivalent in effect, but the equivalence doesn’t mean that “diagonalization” is a good word for describing what I’m doing. Actually I’m just making a loop in a “circuit”. I don’t see the diagonal, then why call it that way? The diagonal argument and the “loop argument” may have the same essence, but they are very different in their comprehensibility. I feel that the loop way is a much easier to understand than diagonalization.

An interesting final question may be, what is the connection between the above expressions with the Y combiantor?

λf.(λx.f(x x)) (λx.f(x x))

Hindley-Milner Type System is Fundamentally Flawed

On Oct 19, 2012, I gave a talk on type systems at Indiana University’s Programming Languages Group. In the talk, I briefly described my understanding of the Hindley-Milner type system and its various enhancements. I disclosed a rarely known observation of the so-called let-polymorphism — that it is similar to dynamic scoping, but reincarnated in a type-level language. The ‘forall’ quantifiers in the types are put in random and nonsensical places. Thus I conclude that the Hindley-Milner system is as flawed as early Lisp languages’ value-level semantics.

I believe that the various enhancements to the HM system, such as value restriction, MLF etc. are just patches to a deadly disease. They can never cure it. I propose a way to generalize at lambda’s instead of let’s, and show why it provides a sensible type system without the problems that have been haunting the HM-based systems for decades.

During the talk, I attempted to answer the following questions:

  • What are the key intuitions behind type inference?
  • Why is let-polymorphism fundamentally flawed, and how to fix it?
  • Why are type systems more powerful than Hindley-Milner system seldom used?
  • Why is there always a conflict between expressiveness and effectiveness and how to find a balance point?

The above Youtube video is the slides I used for the talk. It is a video converted from Keynote. You may prefer some static slides from SlideShare (see below) , but the video is more fun to watch because it has all the animation effects.

http://www.slideshare.net/yinwang0/a-fresh-look-at-type-inference

A Mind Map Guide for TeXmacs

TeXmacs was my best kept secret for surviving homework assignments with lots of mathematical formulas. It is a state-of-the-art typesetting system with unparalleled beauty. Many people think that it is just a “frontend” for TeX/LaTeX (like LyX), but it is not. It is very different from TeX, but with the same typographical quality. It is a fully independent software, with TeX’s core algorithms but with all the code rewritten and enhanced for use in a beautiful and ergonomically designed graphical editor.

The editor environment is just amazing. It is 100% WYSIWYG. Yep, 100%. That means, the screen shows exactly what you would print out on paper. I don’t know of any word processor that can do this today. Nope, not even MS Word. If you haven’t used TeXmacs before, it would refresh your mind about what the word WYSIWYG really mean.

But a more amazing thing is that with all the beauty, you don’t lose detailed control of document structure. You will be surprised to find out that it is also WYTIWYG (What You Think Is What You Get), and it has exploited this notion to the extreme. You will sense a “human centered” design in it.

So, ready for a try? Just download it (free) and dive in. Over the years, I have accumulated lots of useful knowledge regarding its use and have collected them into a mind map. Click on the following image and you can see all my best tricks with this software. Have fun!

My PhD Oral Exam Slides

These are my PhD oral exam slides (used in May). They contain some of my most important (and secret) ideas. Now I share them with everybody partly because my head is overflowing with secrets ;-)

What these slides will tell you:

  • Some rarely known intuitions about Hindley-Milner type inference
  • What is fundamentally wrong with let-polymorphism and how to fix it
  • The reason why we have Curry-Howard correspondence
  • The relationship between program analysis and Hoare Logic (Separation Logic)
  • The relationship between Linear Logic and intersection types
  • The relationship between automatic theorem proving and supercompilation

What is the value of this line of work? It seems that I haven’t invented anything new. But I think the academia has been focusing too much on creation. We already have too many concepts, and many of them don’t really have separate essences. It is my job to unify them and simplify them. This is what the value of this line of work lies.

I’m sure some of the ideas here are deep and publishable, but I don’t have much motivation writing academic papers. I don’t even think academic papers are the right way to explain ideas and let people understand them. Animations and mind maps are much better. The slides are probably still too brief to be of use to people who haven’t worked deeply in those topics. If you find some of them interesting and want to use them as a paper/dissertation topic, please don’t hesitate to ask me for help.

Thanks to my new friend Keynote, all the animations in the original PPT file was successfully converted into PDF, and then converted into SlideShare. Have fun!

What is a Program?

In this article I hope to explore possible answers to the question “What is a program?” from a programming language’s perspective. This is in constrast to a usual “machine perspective”. While the machine perspective may be broader, it is often more complicated (think of Turing machines). A programming language’s perspective may be simpler, but it may be overly simplistic to the degree such that it doesn’t capture everything that the machine perspective offer. Later revisions may provide more insight into this matter.

Are programs necessary for computation?

Programs are usually what we use for computation, but are they necessary? I think the answer is No. To understand this, first let us answer the question “What is computation?”

Simply put, computation is the many ways we simulate the world. This simulation could happen in the brain, in a crystal ball, in an electronic circuit, in a chemical reactor, or more. Originally, the purpose of this simulation is to predict the future (or at least the near future) for our advantages or survival. This goal is evident in the use of computation in meteorology, particle physics, biology etc. We want to know how the real world changes with given conditions and time, so we model the world with data structures in a computer, and run simulations on those “models”. To be practical, we have to ignore many apsects of the real word, so the models we have are only abstractions of it. The models must reflect accurately the aspects that we care about, without over-assumptions, that is, constraints or conditions that doesn’t really exist in the real world. Errors may be tolerable, but should be well-controlled. This is why computation is more and more important in sciences, because it lets us predict the future. In today’s sciences, it also lets us experiment with our “guesses” of how the universe was “built”. There are many other aspects of computation, for example as an information processing tool. Information is also part of the world, so we are actually using a computer for the same purpose: manipulating the world’s model.

Now let’s come back to our first question, do we need programs for computation? I said that it was not necessary. The reason is that a “computer” can be a special-purpose computer with only one specific goal, for example calculating the result of 2+3. Such simple circuit may be built using a few logic gates with fixed inputs on the pins. We will get the output as electric signals from the output pins.

So you see, a program is not really necessary for computation. We can build any kind of circuit we want, be it electrical, chemical or biological, as long as they reflect the real world, and then we can utilize the physical charateristics of the circuit to do our “computation”.

Programs are dynamic circuits

Why do most computers run programs then? This is because building circuits is time-consuming and usually very expensive, so we want to build a circuit once and use it for many purposes.

We call those circuits which can serve more than one purpose general-purpose computers. For example, in addition to calculating 2+3, they can also calculate 3+4, 2-15, 124*45, etc and can do more complicated things. So you see, those general-purpose computers have the functionalities of many, even a whole class of, special-purpose computers. In order to achieve this, a general-purpose computer simulates the special-purpose computers. That is, it “pretends” to be one of them at a given time. Now the question is, how does the general-purpose computer know which special-purpose computer to simulate? The answer is to use a data structure — a description of a special-purpose computer which the general-purpose computer can manipulate. This data structure is normally called a program. The general-purpose computer is usually called the interpreter of the program. Once the interpreter gets a program as input, it behaves like the program.

Since a special-purpose computer is a circuit, a program is then a description of a circuit. This is evident from the following example Scheme program which computes the factorial of the input number.

(define fact
  (lambda (n)
    (cond
     [(= n 0) 1]
     [else (* n (fact (- n 1)))])))

As we can see, there is an input n, and the output is either 1 or (* n (fact (- n 1))), depending on whether n is 0. Notice that “fact” refers to the function itself. If we build a circuit for computing this function, it would look like:

This circuit can be physically implemented using electronic devices. I haven’t drawn all the details. You must think of those four n’s as connected together by wires, and have the same values at any time. The triangular shape is a demultiplexer (demux), it tests whether n is equal to one, and then decides which output to activate. The demux is common to all programming languages, sometimes they appear as conditional statements (if, cond, …), sometimes they appear as pattern matching, but in essence they are all the same thing. The reason of using them is simply because different values of input can come in, and we need to distinguish them — we look at them and do different things according to their values.

But notice that this is not a usual circuit. It is a “dynamic circuit”. The botton part denoted by (fact (- n 1)) is not an endpoint. It is an instantiation of the same circuit by passing (- n 1) as input. So the circuit is growing as we compute. For example, when we pass in the initial value 5, we end up with a circuit like the following after one expansion of the endpoint:

Interpreters are circuit simulators

So now you may wonder whether interpreters are circuit simulators. In fact they are. If you look at a simple interpreter, you will find that it is using some tricks to simulate the signal-flow in a circuit.

Usually the interpreters are recursive, they traverse the sub-components and in the meanwhile carrying signal values in what we call an environment or symbol table, so that they may know the values of each wire that may be used. Sometimes the wires in both a component and its sub-component can be labeled with the same name, if the wires in the component don’t also go into the sub-component. This is called “shadowing”. But we should remember at all times that the wires that are labeled by the same name are in fact distinct wires and have absolutely nothing to do with each other. We use the same names just for our notational convenience. If we don’t reuse names, we may run out of unique names very soon in large circuits.

Lexical Scoping v.s. Dynamic Scoping

We can also explain why the notion of lexical scoping (or static scoping) is more reasonable than dynamic scoping. This because when we write the definition of a function that is later used as a value, the free variables inside the function refer to the names in an outer scope of the definition site. When we pass the function around as a value, those free variables (as electronic pins) should still be “connected” to their original wires. If we use dynamic scoping, those variables may be reconnected to a wire labeled with the same name but at the call site. This is not correct, because those two wires with the same name are different wires, and thus have completely different meanings. Our intention is probably to use the wire at the definition site because that is what we see when we write the function. We probably have no intention of connecting to a random wire with the same name somewhere else. This why programming languages should not use dynamic scoping.

This intuition of a program as a circuit is very useful. It can guide us to become aware of the most import aspects of programming languages and help us become better programmers. I will talk about compilers, parallel computing, type inference, logics and theorem provers in later posts using this general framework of thinking.

Understanding the Yin-Yang Puzzle

I have a friend who is a very fast learner. I introduced him to the Scheme programming language a few days ago and he soon digged into this yin-yang puzzle:

(let* ((yin
        ((lambda (cc) (display #\@) cc) (call/cc (lambda (c) c))))
       (yang
        ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
  (yin yang))

This program prints out the infinite string @*@**@***@****@*****@****** …

Why does this happen? This took me almost a full afternoon to figure out. It is always good to have friends who feed you with questions and challenges. I’m going to document my finding here. I may want to make some animated slides (as is always a good idea) later, but for now a short note may suffice.

The Plan

You will probably never understand the program if you examine the contents of the stack. You need their high level semantic meaning to think clearly. The higher level meaning of the stack is the continuation. So instead of frightening you with the details of the stack, I’m going to use functions to show the continuations explicitly.

Also I will use a little bit of compiler optimizations along the way. The optimizations will simplify the program and show its deeper meanings.

Here is an overall plan of this article:

  1. Find the functional representations of the two continuations. Name them c1 and c1.
  2. Plug in c1 and c2 into the places of the two call/cc’s.
  3. Simplify the program using compiler optimizations.
  4. Perform behavioral analysis on the simplified program.

Transform and Remove Call/cc

So here we go. Let’s repeat the original program here:

(let* ((yin
        ((lambda (cc) (display #\@) cc) (call/cc (lambda (c) c))))   ; B1
       (yang
        ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))  ; B2
  (yin yang))

Note that there are two binding clauses, B1 and B2, each prints a char and binds a continuation to a variable.

B1 binds yin to the “current continuation” of the first call/cc. Let’s call this continuation c1. Its function representation is:

(lambda (k)
  (let* ((yin
          ((lambda (cc) (display #\@) cc) k))           
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))          
    (yin yang)))

We can simplify it using some compiler optimization tricks. I will go very slowly here just in case you are not familiar with compiler optimizations.

First, since Scheme is a strict language, k will be a pure value (with no side-effects), so we can lift the side-effect (display #\@) out:

(lambda (k)
  (display #\@)
  (let* ((yin
          ((lambda (cc) cc) k))           
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))          
    (yin yang)))

Now we can “partial evaluate” ((lambda (cc) cc) k) to k:

(lambda (k)
  (display #\@)
  (let* ((yin k)           
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))          
    (yin yang)))

Then we can copy propagate the value of yin, k, to the body of let*, (yin yang). And now we have only one binding left:

(lambda (k)
  (display #\@)
  (let ((yang
         ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
                                         ^^^^^^^^^^^^^^^^^^^^^^^^
    (k yang)))

Now we try to figure out the underlined continuation. It should be:

(lambda (j)
  (let ((yang
         ((lambda (cc) (display #\*) cc) j)))
    (k yang)))

You see why? Since the “current continuation” means “what is going to happen with this value”, it doesn’t include the computation before it, namely (display #\@). Now we do a similar optimization on this continuation: lifting out the side-effect (display #\*), do some partial evaluation and copy propagation:

The result of this inner continuation is:

(lambda (j)
  (display #\*)
  (k j))

Now we can plug it back:

(lambda (k)
  (display #\@)
  (let ((yang
         ((lambda (cc) (display #\*) cc) 
          (lambda (j)
            (display #\*)
            (k j)))))
    (k yang)))

Do a similar sequence of optimizations: lifting out the first (display #\*), partial evaluation, copy propagation. And we have the final result for the value of yin. Let’s name it c1 with a definition:

(define c1
 (lambda (k)
   (display #\@)
   (display #\*)
   (k (lambda (j)
        (display #\*)
        (k j)))))

Now, the original program would look like the following. The program would have by now printed a @ before binding yin to c1, so we make a sequence and include the display term in the front of it.

(begin
  (display #\@)
  (let* ((yin c1)
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
    (yin yang)))

Try it. It will behave the same as the original program. Now we do another copy propagation to get rid of the first binding:

(begin
  (display #\@)
  (let ((yang
         ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
    (c1 yang)))

Since the call/cc here will not cause any side-effects (why?), we can lift (display #\*) out:

(begin
  (display #\@)
  (display #\*)
  (let ((yang
         ((lambda (cc) cc) (call/cc (lambda (c) c)))))
                           ^^^^^^^^^^^^^^^^^^^^^^^^
    (c1 yang)))

Now we just have to find what this underlined continuation is, and it will become the value of yang. It is:

(lambda (k)
  (let ((yang
         ((lambda (cc) cc) k)))
    (c1 yang)))

After our routine sequence of optimizations we have the value of yang. Let’s define it as c2:

(define c2
  (lambda (k)
    (display #\*)
    (c1 k)))

Plug c1 and c2 back into the original program:

(begin
  (display #\@)
  (display #\*)
  (let* ((yin c1)
         (yang c2))
    (yin yang)))

And do a copy propagation:

(begin
  (display #\@)
  (display #\*)
  (c1 c2))

Transformed Program

Now we have our final transformed and optimized program. Let’s put the definitions of c1 and c2 here for an overview:

(define c1
 (lambda (k)
   (display #\@)
   (display #\*)
   (k (lambda (j)
        (display #\*)
        (k j)))))

(define c2
  (lambda (k)
    (display #\*)
    (c1 k)))

(begin
  (display #\@)
  (display #\*)
  (c1 c2))

For simplicity, let’s inline c2 into the main program. Now c2 disappears.

(define c1
 (lambda (k)
   (display #\@)
   (display #\*)
   (k (lambda (j)
        (display #\*)
        (k j)))))

(begin
  (display #\@)
  (display #\*)
  (c1 (lambda (k)
        (display #\*)
        (c1 k))))

Try it. It will still work the same as the original program.

Behavioral Analysis

Now the program doesn’t contain any call/cc’s and is much easier to understand. Let’s try to figure out how it executes:

1. First, @* will be printed, and we will invoke:

(c1 (lambda (k)
      (display #\*)
      (c1 k)))

2. Inside the invocation, @* will be printed by the body of c1, and k is now bound to (lambda (k) (display #\*) (c1 k)). So the program proceed to:

((lambda (k)
   (display #\*)
   (c1 k))
 (lambda (j)
   (display #\*)
   ((lambda (k)
      (display #\*)
      (c1 k)) 
    j)))

It will print a *, and becomes:

(c1
 (lambda (j)
   (display #\*)
   ((lambda (k)
      (display #\*)
      (c1 k)) 
    j)))

Now we have seen @*@**

3. Notice that we are back to a call to c1! This is a good sign of recursion. But this time the argument is different. If we simplify it a little, we get:

(c1
 (lambda (j)
   (display #\*)
   (display #\*)
   (c1 j)))

4. I think you see what’s going on here. This time, c1 is called with

(lambda (j)
   (display #\*)
   (display #\*)
   (c1 j))

which means “When called, display TWO *’s, and then behave like c1 on the argument”. If we go on, the argument to c1 will be longer and longer, and each time with an additional (display #\*). It will print @***, and then @****, and then @*****, and so on.

5. Let’s introspect a bit. Which part of the program is responsible for creating the ever-longer displays, and why? It is this piece from the definition of c1:

(k (lambda (j)
     (display #\*)
     (k j)))

Here k is c1′s argument. Notice that k is always of the form:

(lambda (j)
   (display #\*)
   (display #\*)
   ...
   (c1 j))

When k is applied, it will print the corresonding number of *’s inside it, and then behave like c1. The argument to k is:

(lambda (j)
  (display #\*)
  (k j))

What does this mean? It says: “When called, print a * and then behave like k on the argument.” This is how you get a “new k”, with one more display of *.

This will go on indifinitely. This is why you see the infinite string: @*@**@***@****@*****@****** …

Propositions As Programs

The Curry-Howard correspondence says that propositions are types, and proofs are programs. I had been wondering if there is a simpler way to think about it, so I came up with this:

Propositions are programs; proofs are “abstract interpretation” of them to True.

Here True is a meta-level truth value. A theorem prover is then an “abstract interpreter” which tries to evaluate the proposition to True. This computational process which derives True from the proposition, is called a judgment. Here I may have conflated the concept “abstract interpretation”. I chose to use this term just because those two words “abstract” and “interpretation” conveys the general idea I hope to capture, but the term “abstract interpretation” here may not be what you use to think it is. I could have used “supercompilation”, but I don’t think the word conveys the concept very well.

Any way, this is a special kind of evaluation. It may be partial evaluation, supercompilation, or some sort of static analysis, whatever you call it. It can also take human inputs interactively, as long as it is semantic-preserving — it returns True whenever an actual interpreter would return True. But it is a lot more efficient, because unlike a normal interpreter, it takes shortcuts (e.g. induction hypotheses). If it thus evaluates to True, then the proposition is proved. This is much like fortune-telling. It immediately tells you the result that an actual interpreter would eventually give you. It has a limited number of basic “reduction operations” such as unfolding, induction and rewriting, so we can record reduction sequences as “proofs”, and we can verify them later on.

This seems to match what we have been doing with proof assistants like Coq, and possibly a more natural way of thinking about this kind of theorem proving. I feel that the propositions in Coq are a little far-fetched to be called “types” of the proof terms (note: not to be confused with tactics). For example, we can have a proof term like

fun (n' : nat) (IHn' : n' + 0 = n') => ...

Is n' + 0 = n' the type of IHn'? Well, you can call it whatever you like, but calling it a “type” doesn’t feel natural at all. What we want is just a way to bind an equation to a name, so that we can refer to it. It feels better if we just think of propositions as programs with boolean return types and the proof terms reduction sequences of them into True. If you take a look at the proof terms of Coq, you may find that this is the case. For example, take a look at this simple theorem and the tactics that prove it:

Theorem plus_0_r : forall n:nat, n + 0 = n.
Proof.
  intros n. induction n as [| n'].
  reflexivity.
  simpl. rewrite -> IHn'. reflexivity.  Qed.

They produce the following proof term:

plus_0_r = 
fun n : nat =>
nat_ind (fun n0 : nat => n0 + 0 = n0) eq_refl
  (fun (n' : nat) (IHn' : n' + 0 = n') =>
   eq_ind_r (fun n0 : nat => S n0 = S n') eq_refl IHn') n
     : forall n : nat, n + 0 = n

You may think of this proof term as a program with the theorem as its type, but you can also think of it as a reduction of the program n + 0 = n to True, where n is a natural number. It is saying: “Do an induction where the first branch executes an equivalence judgment, and the other branch does an unfolding, a rewriting using the induction hypothesis, and an equivalence judgment.” I think the second way of thinking of it is much more natural.

This interpretation can also explain the operation of Boyer-Moore style theorem provers (e.g. ACL2), as this is almost exactly how they work. They just don’t normally output a machine-verifiable reduction sequence.

You may have noticed that we have an important difference from the orginal Curry-Howard correspondence here:

Proofs are no longer considered programs.

At least proofs are not object-level programs. They are the “operation histories” in the abstract interpreter, which are at the meta-level. We can give this operation history to an independent verifier, so that it can “replay” the abstract interpretation and verify the correctness of the proof.

Alternatively, if you consider the verifier as an interpreter then proofs are its input programs. In this sense you can also say that proofs are programs for the proof verifier, and both propositions and proofs can be considered programs. But they are programs at two different semantic levels: the proofs are at a higher level than the propositions. This is very different from Curry-Howard.

Thus we have arrived at a unified way of thinking about the two very different styles of theorem proving: Curry-Howard correspondence (as in Coq), and Boyer-Moore (as in ACL2).

On Software Design Patterns

In this post I try to answer the contraversial question

Are software design patterns good or bad?

This used to be a long post, but I have since condensed it into a very short one, thanks to the pictures. One picture is worth ten thousand words. Hopefully you will have understood most of Richard Gabriel’s 239-page book after looking at the following two pictures.

  1. Here are some nice design patterns. You can use them to decorate the floors in your house.

  2. Here is a masterpiece. Can you construct it using the above patterns, or any patterns at all?

I think the answer is already obvious. One cannot produce a masterpiece by just putting patterns together. Artists know that techniques are never as important as vision. That is why they spend their lives studying the nature of things such as the human body. Once they get hold of the essence, they use whatever techniques at hand to express what they want. They can also change the techniques or invent new ones as needed.

The same holds for computer programming. We need to keep in mind that design patterns are means and not ends. True programmers are never constrained by existing patterns. They observe and grasp the essence of things, inventing new techniques and patterns as needed.

Concurrent Stack Does Not Exist

Some weeks ago I read this thought-provoking article by Nir Shavit Data Structures in the Multicore Age. The topic is about efficiency of concurrent data structures. There have been many thinkings going on after the reading, but the most interesting point to me is that the author started trying to use a “concurrent stack”, but ended up happily using a pool.

“In the end, we gave up the stack’s LIFO ordering in the name of performance.”

Now there can be several interesting questions to ask. Can we give up some property of a data structure if it is crucial to the correctness of the program? If we can give up the LIFO ordering, then why did we think we need a stack? But the most interesting question is probably:

Does a “concurrent stack” really exist?

I think the answer is No — a “concurrent stack” hasn’t really existed and will never exist. Here is why:

  1. If the threads are allowed to push and pop concurrently, then the “stack” can’t really maintain a LIFO order, because the order of the “ins” and “outs” is then indeterminate and contingent on the relative execution speeds of the threads.
  2. If the execution of two threads strictly interleave. Each time a “pusher” pushes an element, a “popper” immediately pops it out, then this is a FIFO order, a queue.
  3. If a pusher pushes all the elements before a popper starts to pop all of them out, then this is indeed a LIFO order, a stack. But notice that there is no concurrency in this case — the executions of the threads must be completely sequential, one after another.
  4. If two threads interleave randomly, or there are more than two threads accessing the “stack” at the same time, then nothing can be said about the access order.

From the above, we can see that there is a fundamental conflict between the two notions, “concurrency” and a “stack”.

If a “stack” can be accessed concurrently, then there is no way we can maintain a LIFO order. On the other hand, if we enforce a LIFO order, then the stack cannot be accessed concurrently.

If we have realized that the essence of a stack is a continuation, and a continuation (by itself) is sequential, then it is no surprise we arrive at this conclusion.

Since a LIFO order is essential to a stack, we can’t call the data structure a “stack” if we can’t maintain a LIFO order.

We can’t call a data structure a “stack” just because it has the methods named “push” and “pop” — we have to look at what the methods actually do.

Even if we continue to think that we are using a stack, the threads are in effect just distributing messages, with the operations “push = send” and “pop = receive”. So in essence this data structure is a pool. This exactly justifies the author’s relaxation to a pool, although no actual relaxation happens — the data structure has been a pool from the beginning. It was just disguised as a stack and less efficient.

So we see that the concept of a “concurrent stack” does not really exist. We also see that some data structures we have in the sequential world may not have concurrent counterparts.

How to Reinvent the Y combinator

I always believed that one can never learn the essence of anything without reinventing it. Once you reinvent things, you will never forget them — because otherwise you can just re-reinvent them.

Today I found that I forgot how to derive the defintion of the Y combinator. I learned it several years ago from an online article, but now the search term “Y combinator” gives me all the news about startups (sigh). I tried for two hours, but still couldn’t make the leap from the “poor man’s Y” to “Y”. Finally, I opened my good old friend The Little Schemer. Alas. Chapter 9 tells me exactly how to reinvent Y. Thank you Dan Friedman and Matthias Felleisen.

To prevent myself from forgetting how to derive Y again, I made a slide to record my understanding of it. I hope it can be of help to people (including the future me). So here it is.

A Reformulation of Reducibility

I found the theory of computation books very imprecise about their descriptions of Turing machines and reductions. They usually start with something precise and mathematical, for example they would define a Turing machine as a 7-tuple, but after that, everything about decidability and reduction is described with impenetrable paragraphs in natural languages.

I believe that the use of natural languages leads to most of the confusion in theory of computation because natural languages are inherently imprecise and ambiguous. There are too many ways to say the same thing. For example, you can find these sentences which mean exactly the same in the books:

  • “Run M on w”
  • “Run M on input w”
  • “Simulate M on w”

The pronouns and references are even more confusing. For example:

“Use the description of M and w to construct the TM M1 just described.”

What is “just described”, and do M and w here mean the same things as in the earlier paragraph?

Another serious problem is that natural languages are very weak at representing the notion of “interpretation”, which underlies much of theory of computation. For example, a Turing machine simulates another Turing machine, which again contains and simulates yet another one.

After some thought, I found that we could use something precise such as mathematical notations combined with programming languages to describe the concepts. As an example, I’ll show here how the notion of reduction can be defined precisely as a homomorphism which can be instantiated for reducing one problem to another.

Definition 1 (reduction): A reduction (as in theory of computation) is a homomorphism (as in universal algebra):

Reduce(TM, I) = (TM', I')

satisfying the property

TM @ I = TM' @ I'

where

  • TM = Turing machine which we reduce from
  • TM' = Turing machine which we reduce to
  • I = input string of TM
  • I' = input string of TM’
  • @ = simulation, similar to the Scheme code ((eval TM) I)
  • TM @ I = result from simulating TM on I (accept or reject)
  • TM' @ I' = result from simulating TM’ on I’ (accept or reject)

End of Definition 1.

Notice that the simulation (TM @ I) may go into an infinite loop and never produce any result. Now I show how to use this homomorphism to describe the reduction from ATM ==> HALT, where

  • ATM = the “acceptance problem” (deciding whether a Turing machine M accepts string w)
  • HALT = the “halting problem” (deciding whether a Turing machine M halts on string w)

For convenience, we let

  • DATM = “the decider of ATM”
  • DHALT = “the decider of HALT”

Now the reduction can be fully described by the following homomorphism:

Reduce(DATM, (M,w)) = (DHALT, (M',w))
where
  M' = <if (M @ w) then accept else loop>
satisfying
  DATM @ (M,w) = DHALT @ (M',w)

Yes, that’s an all-inclusive formal proof that HALT is undecidable. It even includes the notion of “reduction” itself.

Let me explain it a bit. The first line says that there exists a function (named Reduce) from the pair (DATM, (M,w)) to another pair (DHALT, (M',w)), where M' = <if (M @ w) then accept else loop> is a description of a Turing machine.

Now let’s look at the last line:

DATM @ (M,w) = DHALT @ (M',w)

It says: if we have a decider for HALT (DHALT), then we can use it to define DATM, thus deciding ATM.

Why this is a valid defintion for DATM? This is because from the definition of M'

<if (M @ w) then accept else loop>

we know that:

  • If (M @ w) accepts, M' accepts, thus DHALT @ (M',w) accepts
  • If (M @ w) rejects, M' loops, thus DHALT @ (M',w) rejects
  • If (M @ w) loops, M' loops, thus DHALT @ (M',w) rejects

Notice from the colored words that DHALT @ (M',w) will accept if and only if M accepts w. Thus this defines a decider for ATM.

So if DHALT exists, then we can have DATM. But this contradicts the fact that DATM cannot exist, so DHALT must not exist.

If you wonder how this proof corresponds to Definition 1, here is some details how it is instantiated:

  • TM = DATM (nonexistent)
  • TM' = DHALT (hypothetical)
  • I = (M,w) where M is the description of a Turing machine which we want to know whether it accepts input w.
  • I' = (M',w) where M' is <if (M @ w) then accept else loop>
  • TM @ I = DATM @ (M,w) (running decider of DATM on input (M,w))
  • TM @ I' = DHALT @ (M',w) (running decider of DHALT on (M',w))

This is a lot more concise, accurate and easier to understand than a paragraph:

F = "On input <M,w>:
  1. Construct the following machine M'
     M' = "On input x:
        1. Run M on x.
        2. If M accepts, accept.
        3. If M rejects, enter a loop."
  2. Output <M',w>."

ADTs and Objects

After reading William Cook’s essay On Understanding Data Abstraction, Revisited, let me try to condense the difference between abstract data types (ADTs) and objects into a few words.

(To avoid ambiguity, I use “instances” to stand for data created by instantiating ADTs)

  • “Instances” created by the same ADT share the same functions (methods). Functions may be parameterized, but the code is the same in each instance. Because of this, they need to share the same representation.
  • “Objects” (as proposed by Cook) don’t necessarily share function code. Each object may have a completely different set of functions (with matching names and types). This is in essence the same as “call-backs” in GUI widgets. Because of this diversity, each object may have completely different representation.

Ironically, it is usually the former case in mainstream object-oriented languages like Java and C++. In Java you can somehow achieve the latter with interfaces and anonymous classes, but it is awkward. JavaScript’s prototype-based system seems to be closer to the essence of the latter, but still not feel natural.

But different from Cook’s view, I think it is better not to consider binary operations like set union as methods. Union may take two sets with different representations and create a new one, thus it can operate by using the “true methods” of both arguments (possibly iterators and membership).

A Bug in GHC’s Type System

Several days ago, I implemented an experimental type inference system with first-class polymorphism. When comparing it with other systems, I found a possible bug in GHC’s type system regarding universal quantification. The phenomemon was confirmed and reproduced by people at #haskell IRC channel for GHC versions above 7.01. The code that causes trouble is:

gen :: [forall a. a -> a]
gen = [id]
test1 = head gen 1

Obviously this program should typecheck, since:

  • id has the type forall a. a -> a.
  • A list gen containing id should have type [forall a. a -> a](as in the annotation).
  • head has the type forall a. [a] -> a.
  • head gen should have the type forall a. a -> a.
  • head gen should be able to be applied to 1.

But GHC rejected this program for a strange reason.

Couldn't match expected type `t1 -> t0'
with actual type `forall a. a -> a'
Expected type: [t1 -> t0]
Actual type: [forall a. a -> a]
In the first argument of `head', namely `gen'
In the expression: head gen 1

On the other hand, it works if (head gen) is bound at let:

test2 = let hg = head gen in hg 1

It doesn’t break the soundness of the type system since it only rejects some correct programs, but this kind of pecularities of type systems can be painful when they add up. I guess this may be caused by the interaction between GHC’s internal type system with the legacy let-polymorphism.

On Point-free Programming

Concatenative programming, or point-free style, is useful sometimes, but has some serious drawbacks similar to the SKI combinators. Applicative programs can be compiled into point-free style easily, but writing and reading them directly in large scale is usually a mental burden.

It only works well with functions with one or two arguments. Concatenation of functions with more than two arguments will not be so convenient. If the receiver has an argument order that is different from the sender’s output order, you will need a non-trivial permutation of argument order.

For example, if the function is defined as:

f :: Int -> String -> Bool
f x y = ...

If you want to use it as the predicate for filtering a list of strings, that’s fine. You just write something like filter (f 2) .... But what if you want to filter a list of integers? Then you will need to swap the order of the first two arguments before you can do the partial application. So you write filter (flip f 2) .... Fine. But what if the function looks like:

g :: Int -> A -> String -> Bool
g x y z = ...

And you want to filter a list of A’s? Which function do you use to switch the argument order, and you expect the reader of the program to learn it?

What about functions with four arguments. Notice that there are 4! = 24 different argument orders. How many order switchers do we need?

In order to prevent this kind of plumbing, we have to take unnecessary care when we decide the order of the parameters. This often makes the function look ugly.

Names are more convenient. Notice that even mathematics uses names for permutations (as in algebra):

(a b c d)
(b a d c)

Concatenative programming is like connecting the components of a circuit. But even electronic engineers don’t do it this way. They use net-lists with names and labels.

Names are essential and useful in most cases. Concatenative programming, although nice when used sparsingly, may not be good to serve as a major way of composition.

At this point, I found this sentence from Tao Te Ching (Chapter 1) especially relevant:

“The nameless is the origin of Heaven and Earth
The named is the mother of myriad things”

The Essence of Register Allocation

As an independent study project, I designed a new method for register allocation. Different from earlier methods, it departs from the graph coloring formulation and is based on a variation of abstract interpretation which I call “model transformer semantics”. I show that register allocation is essentially not a graph coloring problem, but rather similar to a cache replacement or scheduling problem, thus possibly deserves much easier solutions.

I have drafted a paper, but because of other priorities, I don’t have time benchmarking it and submitting to a compiler conference. I have put the full text to arxiv. I welcome any feedback. I gave a talk about it at Indiana University in Nov 2011. Here are the slides [PPT] [PDF].

ydiff: Structural Comparison of Programs

(Click on the above picture to see it in action. See the end of the post for more demos)

Motivation

I have been imagining a world where programs are not represented as text, but as data structures. They will be edited not with text editors, but with structural editors, which create and manipulate the abstract syntax trees (AST) of programs directly. In such a programming environment, the line-by-line diff utilities and version control systems will stop to work, because we no longer have “lines” or “words” in our programs.

ydiff is a proof-of-concept for solving this problem. It is a diff tool (and hopefully evolving into a full-fledged version control system) designed for “structural programming”.

Currently ydiff takes pains to first parse program text into ASTs and do structural comparison on the trees, but hopefully in the future, programs will be stored directly as data structures, so that the parsing step will be skipped, and ydiff will be readily extensible to all programming languages.

The following is the introduction to ydiff and the demos taken from my homepage

Why ydiff?

We very often want to know how our code is modified. This is normally done by using the diff program. While effective for finding differences between close revisions, the result from diff can be very hard to read for revisions over a long time span. The worst case is when the code is massively rearranged or refactored, where diff may consider every line to have been changed.

ydiff can help in this kind of situation. Different from line-based or character-based diff, ydiff is “language-based”. It parses the programs and then performs structural comparison on the parse trees. In addition to the usual functionalities of diff, it has the following properties:

  • Language-awareness. ydiff truly understands language constructs and will not make non-sensical comparisons. For example it will never compare a string “10000″ with an integer 10000 even though they look similar. Also, it tries to find definitions with the same name before it attempts to destruct and compare definitions of different names.
  • Refactor detection. It can find renamed, moved, reordered, wrapped, lifted, combined or fragmented code. Refactored code can be detected however deep they are into the structures.
  • Format insensitivity. The comparison result will not be affected by linebreaks or whitespaces.
  • Comprehensible output. The output of ydiff is designed for human understanding. The interactive UI helps the user navigate and understand changes efficiently.

These properties make ydiff helpful for understanding changes. It may also be possibly used for detecting plagiarism in programming classes or copyright infringement of code.

Supported Languages

Currently ydiff supports the following languages:

  • Scheme
  • Emacs Lisp
  • Python
  • Javascript
  • C++

The supporting part of a language is just a parser for that language, into a unified parse tree data structure. I have designed a parser combinator library to make parser-writing less painful, but they are still quite tricky to write. Hopefully in the future, the programs will be edited by structural editors and stored as data structures, so no parsers will be needed any more.

Demos

Before going further into the design details, you are invited to take a look at the demos. They are interactive HTML files with a nice UI design. The left and right windows are always locked in their relative position. A mouse click on changed, moved or unchanged nodes will highlight the matched nodes and scroll the other window to match. After that, the windows will be locked into their new relative position for browsing.

Okay, here are the demos.

  • Scheme. Comparison of the original miniKanren from Professor Dan Friendman and the version I modified in order to support condc, a “negation operator”. Pay attention to the function unify, whose major part is moved into unify-good.
  • Emacs Lisp. Comparison of two versions (v20 and v22) of Taylor Campbell’s paredit-mode, a structural editing mode of emacs for Lisp programs. (example removed for non-technical reasons)
  • Arbitrary S-expressions. Trying to find a bug in an optimizing pass of my Scheme compiler by comparing the IRs (represented as S-expressions).
  • Python. Comparison of two implementations of a small list library that I wrote, which implements Lisp-style lists. The first implementation uses recursive procedures while the second uses Python’s generator syntax and is iterative. Pay some attention to append, whose code is moved inside another function appendAll.
  • Javascript. Comparison between two major revisions of the UI code of ydiff itself.
  • C++ demo1 and C++ demo2. There are two demos for C++. The first demo compares two versions of the d8 Javascript debugger from the V8 project(v3404 from 2009 and v8424 from 2011). The second demo compares V8′s simulators for two different processors (MIPS and ARM).The d8 demo is especially interesting because by clicking on the lines of the method Shell::Initializein the old version, it can be clearly observed that its body has been distributed into several procedures in the new version:
    Shell::Initialize
    Shell::CreateGlobalTemplate
    Shell::RenewEvaluationContext
    Shell::InstallUtilityScript

    Also the major part of Shell::Main is moved into the helper Shell::RunMain.

Algorithms and Technical Details

There are basically two main ideas inside ydiff: Tree Editing Distance and Substructure Extraction, plus the use of the neat idea of parser combinators (similar to Parsec).

  • Tree Editing Distance. Tree Editing Distance measures difference between two trees. In our case, it is the abstract syntax tree. When we encounter two trees, how do we compare them? If the first nodes of them appear to be the same, then we simply proceed to the next right? But what if the first nodes are different? There are three possibilities:
    1. The programmer inserteda node into the first tree.
    2. The programmer deleteda node from the first tree.
    3. The programmer modified a node in the first tree.

    We have to consider those possibilities for each level of nodes, so this naturally becomes a dynamic programming problem. Normally dynamic programming is done by cleverly find out the order to evaluate the problem space, but ydiff uses another important technique: memoization. There is a two dimensional hash table which stores the “delta” (changes) of two nodes, indexed by the two nodes themselves. The dynamic programming is then done in a recursive way, instead of the usual iterative way. The hash table automatically reduces the cost of recursion to the magnitude of iteration, by avoiding redundant evaluation of the nodes that we already compared.

  • Substructure Extraction. Tree Editing Distance can only find the “edits” — insertions, deletions and modifications, but it cannot find refactorization of code. If a piece of code is lifted out of a scope, the Tree Editing Distance function will produce two big changes: a deletion of a big node (for example, an if-statement), and an insertion of a slightly smaller node (with the “if (condition) { … }” wrap stripped off). Notice that although we only made a small change to the program, the resulting delta can be large. What we really want is that it shows a deletion of the wrap (“if (condition) { … }”), but a movement of the parts inside the wrap. This is called Substructure Extraction.Substructure Extraction works by recursively compare the substructures of two nodes, finding possible matches between one node and the subnodes of another. This substructural comparison is only done on the pairs of the resulting insertions and deletions from the Tree Editing Distance function. This little trick captures almost all possible refactorizations between two programs, however deep they are. This is demonstrated in the following picture, where ydiff successfully detects the movement of the function append into another function appendAll.
  • Parser Combinators. The parsers of ydiff are written with the help of a parser combinator library (in Scheme). In addition to the common features of Parsec, it detects left recursion and reports a “stack trace” of the left recursion, so that the parser writer can find the origin of left-recursion and fix the problem accordingly. This helps a lot when constructing the parsers. For examples, I constructed the JavaScript parser in one day, and the C++ parser in just two days. There are corner cases missing from the C++ parser and especially the macro part, but it parses significant pieces of code correctly (for example Google’s V8 project code).

Get the code

ydiff is an open source project. You can follow its development on github: yinwang0/ydiff or get its source code from there.

Ideas about structural version control

I have thought about extending ydiff to a structural version control system and gave a talk on that. You can find the talk slides from another post

A Minimal Boot Sector Tutorial

The natural first step of building an operating system is to find a way to run programs on “bare hardware”. The task turns out to be quite easy despite its daunting first impression. After researching on the net and trying out various tutorials, I decided that all of them are overly complicated. Most of them assume interfacing with a C-like language and implementing a Unix-like system, but C and Unix are not the only story about operating systems design. A boot sector tutorial should teach nothing more than how to boot a computer.

After learning the good parts from the tutorials and applying my own simplifications, I arrived at my first boot sector. It is very simple and does very little — it just boots the machine and displays a colorful banner — but it illustrates the only things you need to know for booting a computer.

The code is very short — only 21 lines of code excluding comments and blank lines.

org 7C00H                      ; the program will be loaded at 7C00H

start:
  mov eax, string_start
  mov ch, 1                    ; ch contains color of text
  mov ebx, 0B8000H + 718H      ; B8000H is VGA memory
                               ; 718H is offset to approx center

print:
  mov cl, [eax]                ; load char into cl
  mov [ebx], cx                ; store [color:char] from cx into VGA
  add ch, 1                    ; change color to (ch+1) mod 16
  and ch, 0x0F
  add eax, 1                   ; advance string pointer
  add ebx, 2                   ; advance VGA pointer
  cmp eax, string_end          ; until the end of string
  jg stop
  jmp print

stop:
  jmp stop                     ; infinite loop after printing

  string_start db 'My colorful new OS!'
  string_end equ $

  times 510-($-$$) db 0        ; pad remainder of boot sector with 0s
  dw 0xAA55                    ; standard PC boot signature

It is in NASM syntax and needs nasm to be assembled into machine code. After that it can be booted by QEMU, a processor emulator. The only two necessary command lines are (assuming the code is stored in a file named myfirst.asm):

nasm -f bin -o myfirst.bin myfirst.asm
qemu -hda myfirst.bin

Of course, you can also burn the disk image (myfirst.bin) onto a CD and boot a real machine from it.

Sum Types and Union Types

In a new type system I’m designing, I was trying to find a good reason to get rid of sum types (as are commonly used in ML and Haskell). Well, I don’t hate them that much, but for simplicity’s sake, I always try to get rid of things unless there are undeniable reasons that they must exist.

I think sum types have a large overlapping of functionality with product types and cause inelegance and inefficiency, so I hope to replace them with union types (some people call “open unions”). Union types are more orthogonal with respect to product types. I seem to have found a good idea where sum types originated and the reason why we don’t need them in a programming language.

It is quite evident that sum types were orginated from some branch of mathematics. In those mathematical context, there are no nominal (named) product types. All we have are Cartesian products. Cartesian products have no constructors (or type tags) in them. How do we distinguish a student represented as (Height x Weight) with a professor, also represented as (Height x Weight)? We put tags on them before putting them into a disjoint union. This process is called injection. We do this every time when putting something into a disjoint union, and we project the elements out when we need their values.

From a programming point of view, the injections and projections are very inefficient and inconvenient. This is equivalent to constructing new objects whenever we need to put them into a list, and destructing them when we need their components. Then construct new objects (again) when we need to put them into another list, and so on.

In order to avoid these shortcomings, programming language researchers decided to attach type tags to the objects when they are created. The tags tell us what the object is and they stay with the objects throughout their life-time. This results in record types. But somehow the PL researchers seemed to have been influenced by mathematics’ way. They decided that each variant of a union needs to define a new constructor (injection). Thus sum types were born. For example,

data T1 = Foo Int | Bar String

Here Foo and Bar are essentially injections. Why do we need the injections and projections in a language where type tags always tell us what an object is? Another problem is that a constructor can only belong to one sum type, and there is no way we can reuse it in another. For example, there is no way you can define another sum type like:

data T2 = Foo Bool | Baz Float

because Foo is already defined in T1. What we wanted to express is that Foo belongs to the union T1 and it also belongs to the union T2. This is a very reasonable need, but most Hindley-Milner style type systems doesn’t allow this.

If we relax the requirement that each variant of a sum type must define a new constructor, we get union types. They are more general and orthogonal with respect to product types. We can always choose to put another level of type tags upon the variants, therefore we lose nothing if we don’t have sum types.

Boolean Expression Simplification

A friend of mine posed an interesting question to me a few days ago. He was trying to make a search engine for an advertisement system where each advertisement belongs to several categories (or keywords). He would like to implement a search system that can efficiently find advertisements using a simple expression language containing boolean expressions. For example, he would like to have expressions like ((“football” or “basketball”) and “baby”). For efficiency, he would like that the boolean expressions be simplified before they are used to collect streams of data items.

After a while, we decided that the boolean expressions should be converted into a “sum of products” (SoP) form so as to make the search efficient. So I made a prototype of the algorithm which can turn boolean expressions (containing “and”, “or” and “not”) into SoP form.

It turns out to be quite simple if we do it recursively. The basic idea is to:

  1. Push ‘and’ into ‘or’.
  2. Push ‘not’ into ‘and’ and ‘or’
  3. Do (1) and (2) repeatedly until no more simplifications can be made.

It is easy to see that we will have a SoP form after this transformation, because the rules (1) and (2) will be applicable if there are and’s outside of or’s, or there are not’s outside of and’s and not’s. Thus we are assured to have a SoP form when this process terminates.

The rest of the problem is how to represent the final result. I proposed to group and terms so as to reduce nesting. For example, the form (and (and a b) c) is not as good as (and a b c), because the latter contains the implicit information that a, b and c are interchangeable in their order of evaluation, so we may achieve accelerations in the algorithm by searching for keyword which contains fewer items first. This simplification in the result can be achieved with a rewriting procedure which is “context-aware”. We pass the context types into the recursive calls, and the recursion gives us different results in accordance with the context types.

The algorithm for doing these simplifications is prototyped in Scheme and uploaded to my GitHub repository.

On Literate Programming

A friend and I had a discussion about the current trends of programming methodologies. At some point, the topic turned to Literate Programming (LP). I personally have written a non-trivial literate program (in CWEB) a few years back, but like many other people, I did not persevere. As a hindsight, there are some good reasons why people do not adopt LP.

Missing The Big Picture

Although we love books, they may not be the best medium for documenting programs. Programs are much like circuits or cars. They are very complex systems with interconnected parts. The interactions and constraints among the parts can’t really be understood when they are taken apart.

LP systems usually segments the program into small pieces and rearrange them into a book-like order. This is like taking a car apart when teaching car-making. There is a huge loss of structural information in this rearrangement. It may be useful to take some parts out for explanations and experiments, but they should be placed back immediately after examination. The student must have a clear view of the overall structure of the car.

If we take all the parts out and lay them along the road, the student would have trouble figuring out how they would fit and work together. In short, the student see trees and not the forest when reading a book generated by Literate Programs.

Programs Are Not Text

LP systems usually allow programmers to fragment a program into small pieces, reorder the pieces into an order that is “convenient to the human mind”. Later on, a program (weave) generates printable documentation for the program; another program (tangle) assembles the fragments into machine-executable code.

However, these systems have a big problem — treating programs as text. LP systems usually don’t parse the programs. They just segment the code textually. This often messes up the variable scopes and causes subtle bugs. This is much like macros in early Lisps and C, extremely dangerous and error-prone.

There are other kinds of LP systems such as Literate Haskell, CoffeeScript etc. They don’t break up the procedures and don’t have the above “hygiene problem”. But because we already have the freedom to rearrange the order of procedures in any programming language, those systems are nothing more than a fancy comment facility like JavaDoc. In views of authentic literate programmers, those systems miss the essence of LP.

Human Supremacy, Human Languages and Human Cognition

LP systems rely their arguments on a false belief in human supremacy, an overemphasis in human languages, and a misunderstanding of human cognition. Practice has shown that the human brain is incapable of competing with computers in anything that requires rigorous reasoning and perseverance. Natural languages are woefully inaccrurate and confusing for programming. Any programming language that tries to mimic natural languages is doomed to suffer the same “design flaws” of natural languages (for example SQL).

It is also doubtful whether the order set by literate programming is suitable for human cognition at all. Programs should be constructed in an order that matches the nature of the concept it models, and not in the order of a particular person’s way of thinking, nor in the arbitrary order set by a compiler. Once the program structure matches nature, it will immediately appeal to human understanding.

So instead of investing in LP, a probably better direction is to investigate better ways of representing programs that enable them to match the concepts more naturally. Functional Programming and Logic Programming are gradually moving toward this direction.

Painful Code Navigation

Once the code is weaved into a book, it will be very hard to navigate through. Several years ago, I spent some time reading Knuth’s MMIXware, a book generated by a CWEB program. I noticed how much time I spent on finding definitions and cross-references of variables. Although Knuth painstakingly constructed an index for the book, obviously I still had to turn hundreds of pages back and forth, while in IDEs I can jump to the definition of a variable with just one keystroke. One should wonder why we should bother publishing code as books.

From these observations, it is really unclear how literature and books can serve as a good medium for programs. The world has changed and technology advanced, instead of pursuing something static and classic, we may just need to open our minds to new kinds of media.

Layout Syntax Considered Harmful

Although the idea of layout syntax—using whitespace characters to delimit blocks—has been promoted by several languages (notably Python and Haskell), I feel that this kind of syntax brings more trouble than benefits.

It takes just one keystroke to produce a serious bug

Programs in layout syntax are fragile. For example, consider the following Python programs:

# correct definition
def member(x, ls):
    for y in ls:
        if x == y:
            return True
    return False         # correct indentation
# incorrect definition
def member(x, ls):
    for y in ls:
        if x == y:
            return True
        return False     # incorrect indentation

The second definition has been produced from the first by an inadvertent TAB key, which indented the return statement one more level to the right. While the two definitions differ only in one indentation, they produce totally different results. The first definition is correct, while the second has serious bugs:

  1. If ls is non-empty, it will always return False whether x is an element of ls or not.
  2. If ls is empty, it will return None (instead of False) because there is no return statement after the for-loop, a “missing return statement” bug.

Although this is just a minimal example, the bug may take quite some time to show up and be fixed. In order to prevent this kind of bugs from happening, I often find myself moving the cursor up-and-down in a straight line to check the alignment of the statements.

Let’s see why this bug cannot happen in a language which does not use layout syntax. We now invent an alternative syntax for Python, so that the pervious program looks like:

def member(x, ls) {
    for y in ls {
        if x == y {
            return True
        }
    }
    return False
}

Given the correct definition, can you imagine how you could reproduce the bug with just one keystroke? It is almost impossible. To see why:

  1. The return statement can never get into the loop with a change in indentation.
  2. It takes at least two edits and one movement in the editor to move the return statement into the loop. There are two alternatives to choose from:
    • Cut return False. Move the cursor into the closing curly brace of the for-loop. Paste.
    • Delete the closing curly brace of the for-loop. Move the cursor beyond return False. Insert an closing curly brace.

Either way, you must be deliberate and precise in order to reproduce the bug. Otherwise the parser would have complained (for example, if you just delete a closing curly brace).

However, the situation is very different with layout syntax, where one TAB key press produces the same amount of change as the above three operations. The change happens quickly and the program remains grammatically correct, obscuring the presence of a bug.

The situation is a little better for Haskell, because incorrect indentations often cause type errors and the programmer will be alerted, but in either languages, it takes quite some time to fix this kind of bugs.

Unconvincing advantages

It is often claimed that layout syntax has the following advantages over curly braces:

  • Your programs become a lot shorter because less curly braces are used.
  • You write less clutter such as curly braces and semicolons, and that beautifies your code.

I found that neither of the two advantages convincing. For the first part, Python and Haskell programs are indeed several times shorter than equivalent Java or C programs, but this cannot really be creditted to layout syntax.

We need to have some blank lines even in a Python or Haskell program, between definitions and sometimes in the middle of a block. The blank lines naturally denote groups of statements. So if we count the number of additional lines introduced by curly braces, we will find that there aren’t many. Curly braces also naturally denote statement groups, so not only they don’t look bad, they are helpful.

In Python and Haskell, it is the semantic features (pattern matching, first-class functions etc.) that make the programs short, not layout syntax. If we had an alternative syntax of Java which uses layout, then Java programs would still be several times longer than equivalent Scala programs. Java programs are longer not because they use curly braces, but because they don’t have things such as first-class functions, so they have to use some tedious design patterns.

Second, layout syntax does not really save “clutter”. Even in a language with layout syntax, we may still need to write almost the same amount of (if not more) clutter. The following is a random piece of code taken from the current release of GHC. We can still see lots of curly braces and semicolons in it. I guess layout syntax actually caused trouble, so the authors of GHC decided that they will just write curly braces.

tcInstanceMethodBody skol_info tyvars dfun_ev_vars
                     meth_id local_meth_id
		     meth_sig_fn specs
                     (L loc bind)
  = do	{       -- Typecheck the binding, first extending the envt
		-- so that when tcInstSig looks up the local_meth_id to find
		-- its signature, we'll find it in the environment
          let lm_bind = L loc (bind { fun_id = L loc (idName local_meth_id) })
                             -- Substitute the local_meth_name for the binder
			     -- NB: the binding is always a FunBind

	; (ev_binds, (tc_bind, _))
               <- checkConstraints skol_info tyvars dfun_ev_vars $
		  tcExtendIdEnv [local_meth_id] $
	          tcPolyBinds TopLevel meth_sig_fn no_prag_fn
		  	     NonRecursive NonRecursive
		  	     [lm_bind]

        ; let full_bind = AbsBinds { abs_tvs = tyvars, abs_ev_vars = dfun_ev_vars
                                   , abs_exports = [(tyvars, meth_id, local_meth_id, specs)]
                                   , abs_ev_binds = ev_binds
                                   , abs_binds = tc_bind }

        ; return (L loc full_bind) }
  where
    no_prag_fn  _ = []		-- No pragmas for local_meth_id;
    		    		-- they are all for meth_id

Better ways to save clutter

Even if we do hate curly braces, there are better ways to reduce or even completely eliminate them. For a trivial “solution”, we could just use a dim color for curly braces and semicolons in the editor, so that they are less noticeable. For example the above Python program in that curly bracy syntax could look like this in your editor:

def member(x, ls) {
    for y in ls {
        if x == y {
            return True
        }
    }
    return False
}

Better still, we could use a structural editor that lets us manipulate the AST (abstract syntax tree) directly. Those editors could provide several options of denoting blocks. You can switch between colored blocks, curly braces, or nothing at all. You can switch the look of your code at any time, instantly. People have implemented such editors, for example this editor designed by Kirill Osenkov.


Re-indentation hassle

In a language that doesn’t use layout syntax (Java, C, Scheme, …), no re-indentation is really needed when the code changes. The programmer can move a block of code by a simple copy-and-paste and continue solving the real problem. Re-indentation can always be done later and can be done automatically.

But in a language that uses layout syntax, re-indentation is mandatory, and worse, it can only be done manually. Layout syntax completely disables any editor’s auto-indent function. One may think that we might be able to invent a smarter editor that can auto-indent code for those languages. This is simply impossible. This is evident in the analysis of the above example. The two programs differ only in indentation, but they have completely different meanings. Both are grammatically correct programs and the editor has no way to tell which is the one you want unless it can read your mind.

Some people say that because those languages have advanced semantics, programs are so short that we don’t need to re-indent code very often. But experiences prove to me that the need for changing and rewriting code can never be eliminated. Writing code is like writing a book, you can always find pieces that need change or even complete rewrite. Usually changes in the following category will cause re-indentation:

  • Scope changes. There are lots of examples in this category: lifting an internal function to top level or push a top-level function into an internal function, surrounding a block with let-bindings, loops, conditional statements, try-except or lifting a block out of them, factoring out duplicated patterns, lifting “invariant code” out of recursion, … to name a few. These will necessarily change the indentation of a block of code, and each line needs to be re-indented.
  • Align code. For example in Haskell, when we align the arrows (->) of a case expression or the equal signs (=) of a function definition, we will notice that we have to re-indent most of the right-hand-sides, because they often contain multi-line expressions but the editors (for example, Emacs’ align-regexp function) only move the lines that contains the arrows or equal signs.
  • Renaming. We seldom choose the best names on the first shot, and good names make programs self-explanatory, so renaming is a very important and commonplace action. But in the following simple Haskell program, if we change the name from “helloworld” to “hello” and don’t re-indent the rest of the lines, we will get a parse error.
    helloworld z = let x = 1
                       y = 2 in
                     x+y+z

    Because the code becomes the following after the renaming, and the second line will no longer be aligned to “x = …”, and that confuses the parser.

    hello z = let x = 1
                       y = 2 in
                     x+y+z

    A similar thing happens when we lengthen the name to something like “helloworldcup”. Try it yourself. From this example, I hope you see how simple things are made frustratingly complicated by layout syntax. If you haven’t been convinced, try adding more lines to the above let expression.

The interruption from re-indenting code is usually not just one or two seconds, but often tens of seconds, even minutes. The programmer has to put down real problems at hand and turn to the mind-dead re-indenting job. This disrupts the “flow”, which is essential for producing creative and elegant code.

Syntax considered harmful

I believe that syntax, although an important aspect of natural languages, should not play a significant role in programming languages. It has already brought us too much trouble and frustration and wasted too much of our energy. You can read my other blog post detailing the harm of syntax in general and why we should remove the concept of “syntax” as a whole from programming languages.

Syntax has prevented lots of new design possibilities in programming languages. You may have heard language designers say: “Hey this is a nice feature, but the syntax of my language hasn’t any room left for it.” Layout syntax pushes to this direction even more. It forces us to consciously and constantly think about syntax, drawing our attention away from semantics design. It poses certain constraints on how code must be formatted, and makes a language even less extensible. Thus I think layout syntax is the most troublesome type of syntax.

Limitations of the Unix Philosophy and the Ultimate Solution to Parsing

“Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface”. This the widely accepted Unix Philosophy.

There is probably nothing wrong with the first two clauses; those are the basic principles of modular design. But the last clause (about text streams), although being accepted by many people, is not quite right. Here I hope to explore a bit of the reason behind it, and from that I propose a simple and once-and-for-all solution to the parsing problem.

Text is a universal interface, but an inconvenient universal interface

We all have observed how typical Unix programs work together:

  • Each program read text from from some input stream (a “file” in the virtual file system), computes some result, format the result and print the resulting text to the output stream.
  • Programs can be connected using pipes, or they write into and read from files, so that they can communicate.

All seem to work well? But how does the second program extract information from the text stream? Here is exactly where the pain starts. In order to put information into a text stream, the first program has to encode it, and the second program has to decode the received text.

The encoding part is easy, we just decide on some syntax, for example “each line contains comma-separated fields”. But when decoding, we will need something like regular expressions if the encoding of information is simple enough, but we will need a parser if we have to communicate something of a recursive nature, such as program text. For regular expressions, we could use Perl, AWK, or a regexp library in any language, but for the parser, we have no choice but painstakingly writing it.

Even with modern parsing technologies, writing a parser is still way too much pain. They are hard to learn and hard to tweak. The language at hand may have a horribly designed syntax, making it very hard to parse. Ambiguity, left recursion, context sensitiveness… Those problems have never stopped harassing compiler writers.

The need for a standard binary format

Those problems suggests that it is problematic to use text to represent data structures such as abstract syntax trees (ASTs). It is just unreasonably hard to decode the information that we encoded a minute ago. Indeed, text is a universal interface, but it is not the only one. There are many other universal interfaces around. Why should we stick to text?

If we think of it in a type theoretic way, text streams are only one type: string. We have many other types: integers, booleans, records, … Why should we encode all other types into strings? We need a more general and efficient format that can represent all types, instead of converting them into strings.

In fact, text is part of a more general concept which I will call standard binary format in this post. Some people hate the term “binary formats”, but they haven’t realized that all data in computers and networks are in binary formats. The so-called “plain text” is just binary bits encoded in ASCII or Unicode, so they are also “binary formats”.

Just think about this question: If there were no standardization of text (such as ASCII or Unicode) until today, would anybody still use text as an interface? Obviously No. If everybody uses a different encoding of the alphabet, we will have no way to display text, not to mention to use it as an “interface” for other kinds of information.

So the point is not really about “text” or binary, it is all about standardization. If we have a “standard binary format” for all types of data (including text), we would have a much better universal interface. Data structures will be passed around as they are, and there will be no need for complicated decoding procedure. This is a principle originated from the functional programming community: “avoid encoding.”

We have many candidates for standard binary format: XML, JSON, protocol buffers, S-expressions, … I’ll try to analyze their capabilities and limitations later, but let’s assume such a format can be designed and standardized.

Programs should not be encoded as text either

Now suppose that we have a standard binary format for all data structures, we can also use it to represent ASTs. We will need extensions for existing editors or browsers. The editors or browsers will “render” the ASTs into various styles. They could be either text, graphics, sounds or some other sensory signals. The editors will allow the programmer to manipulate the ASTs directly using natural and intuitive commands, and they will save the ASTs in a standard binary format. By doing so, we completely eliminate the need for parsers because compilers can just read in the ASTs directly. We also completely eliminate the concept of “syntax”. Compiler and IDE writers will be living a much happier life and programmer productivity will be greatly enhanced, because we will easily have much better editor support for ALL programming languages.

Don’t get me wrong. I’m not saying that parsing is not an interesting problem. In fact, it is fascinating. Maybe it is the pure interest in the problem itself that is driven language designers to use syntaxes are harder rather than easier to parse, thus creating problems rather than solve problems.

You may think that extending editors and browsers is too much work, but actually it can be quite easy. In fact, a standard binary format for ASTs can be much simpler than formats like JPEG. Every editor and browser should be easily extensible to it. The only issue is how to make people agree on a standard. We may need some sort of committee to do the standardization. But aren’t we doing that for each new language, to make us agree on a syntax? This solution only requires us to do the standardization for ASTs once, for all languages.

This is the ultimate solution to the parsing problem, because it eradicate the root of the problem by changing the way we represent and manipulate programs. It is a trivial solution from a research point of view, but it demonstrates how a hard problem can be solved in a trivial way. However, the implication of this solution is definitely non-trivial. It will open up many surprisingly interesting possibilities.

UPDATE 4/2/2012: In order to demonstrate how our programming environment can change once we have a standard format for ASTs, I developed a “structural diff” tool called ydiff. You can find information about it here.

PySonar: a Static Analyzer for Python

PySonar is a static analyzer for Python. It uses a similar technique (abstract interpretation) as Coverity or CodeSonar, but does it in a much simpler and more principled way. Although it does not gather as much information as them, it produces results that are surprisingly accurate in the realms that it works on. To the best of my knowledge, it is the state-of-the-art of Python static analysis.

Demo

To get a quick feel about what kind of information it provides, here is a demo generated by PySonar demonstrating most of its features.

Features

  1. A static type system.
    It is no surprise that a dynamic language can be statically typed. In fact, all languages can be statically typed. It is just a matter of how accurate the approximations are. PySonar’s type system has a blend of features that match Python’s semantics: structural subtyping (duck typing), record types, union types, parametric types, recursive types. Types are first-class citizens — it can analyze code that operates on types (e.g., take types as input or produce types as output). Because Python doesn’t have type annotations and its flexible semantics precludes unification-based type inference, all types in PySonar are inferred by interprocedual analysis.
  2. Treatment of Python’s dynamism.Static analysis for Python is hard because it has many “dynamic” features. They help to make programs concise, but they also make automated reasoning about Python programs hard. Fortunately, some of these features can be reasonably handled. For example, function or class redefinition can be handled by inferring the effective scope of the old and new definitions. For code that are really undecidable, PySonar uses a universal honest answer: “I don’t know.” Well, not quite so. It attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement) and the condition is undecidable, then PySonar gives it a union type which contains all possible types it can possibly have. By doing that, PySonar reduces false negative rates.
  3. Path-sensitive analysis.The analysis performed by PySonar is path-sensitive, which enhances the accuracy of type inference and reduces false positive rates. We can see this in an example:
    def g(x):
      if (x > 10):
        y = 1
      else:
        y = A()
        ...
        y = "hi"
      return y

    In this function g, because we can’t statically know the input value of x, we must assume that it’s possible to take either branch. In the true branch, y is assigned an int. In the false branch, y is assigned an instance of A, but is later overwritten by a string. At the end of the false branch, y can only have the type string. The output type should be the union of the two branches, so the path-sensitive analysis gives function g the type int -> {int, string}. If the analysis were path-insensitive, g would be typed int -> {int, string, A}. Notice the extra A in the output type. Since it is not possible for g to output a value of type A, some call sites of g may rely on this fact and will not even test whether the output is of type A. At these call sites, a path-insensitive analysis will report a false positive type error, while a path-sensitive analysis will know that there is no problem.

  4. Accurate semantic indexing.
    Generating code indexes for an Eclipse-like code browser was the main purpose of the project (although I discovered other interesting uses later). PySonar parses the code into an AST and performs type inference on it, and at the same time it builds indexes that respects scopes and types. Because it performs inter-procedural analysis, it is able to find the definitions of attributes inside function parameters.The following image shows that it can accurately locate the field x.z which refers to the “z” fields in classes B1 and C1, but not A1.
  5. Semantic error reports.PySonar catches type errors statically. The type system is conservative — it reports possible type errors on possible inputs even if some of them may not actually happen. This is a usual trade-off to ensure soundness type systems, and PySonar is no exception. An interesting fact is that because of Python’s duck typing, most of the time type errors appear as “trying to access undefined field”. It also reports some other errors that can be easily catched, such as “unreachable code”.

Limitations

  1. It does whole-program analysis and has exponential worst-case time complexity. Fortunately, like the Hindley-Milner system, the worst case happens very rarely.
  2. It requires all source code to be available. This could be a problem for closed-source projects.

Availability

There had been two versions of PySonar. Version 1 is open-sourced and merged into Jython. It is available from Jython’s indexer package. Version 1 hasn’t interprocedual flow analysis, so some variables cannot be resolved. Version 2 has all the features described above and lots of clean-ups, but it is not open-sourced.

On Linux Kernel Memory Addressing

I got addicted to this book named Understanding the Linux Kernel by Daniel Bovet and Marco Cesati, which I happened to have picked up from a shelf beside the pool table at Google’s Kirkland office, where I have been working as an intern. As a consequence, I did nothing for my internship work these two days.

I wrote Linux drivers when I was an undergrad and (unfortunately) had to read quite some Linux kernel code in order to figure things out. How time flies. I finally got my breath back and have some time to read more of it. This time I look at the Linux kernel in a very different perspective, as you will see.

Today I just finished reading the second chapter “Memory Addressing” which talks about two different ways of addressing memory: segmentation and paging, and also some details into memory management. I’m going to make some observations that interested me.

The essential difference between segmentation and paging

Segmentation and paging are similar in the sense that they both provide some kind of indirection or mapping from one address to another:

  • Segmentation maps an “offset address” to a “linear address”.
  • Paging maps a “virtual address” to a “physical address”.

Those are what every textbook tells us, but if you ponder a while on them, you will find that they don’t make much sense. You know these fancy names and facts about them, and that most people tell you to use paging, not segmentation, so what? What the heck is the difference between segmentation and paging, and why should segmentation be inferior? If you have the same feeling as I do, this is the section that you may want to read and write comments for.

Segmentation and its historical purpose

On the surface, segmentation maps an “offset address” to a “linear address”. From a history perspective, segmentation had been mainly designed as a workaround for limited register size in early processors. Because early processors (especially the x86 family) have more memory address pins than their data paths, they can’t use the registers alone to address all available memory. They had to logically subdivide the memory into segments, and use segment descriptors to extend the registers’ address range. This idea can be captured in this simple formula:

length(SegmentBase) + length(Offset) = length(LinearAddress)

The segments’ base addresses are usually stored somewhere in the main memory, so they have virtually unlimited size, which compensates the limitations of register addressing.

Of course this mechanism is tedious and inconvenient to use, but it served its purpose in the early days. Unfortunately even in the modern age where data paths are of good size (64-bit), some processors have this mechanism built-in and force the OS writers to use segmentation. In fact, Linux sets all the segment base addresses to zero and limits to all the address space, so it essentially bypassed the use of segmentation.

Paging and its connection to “automatic memory management”

Paging maps a “virtual address” to a “physical address”. The size of a virtual address has no set relation with the size of a physical address, so no formula about the sizes can be provided.

Paging has an interesting connection with “automatic memory management” in high-level languages such as Java (or Scheme, Python, Ruby, …). Basically, mapping a new page in the page table is equivalent to object creation in Java. If you look at this carefully, the Unix systems allocate memory to a process in the size of pages, instead of objects, and the process then use virtual addresses to further subdivide the chuck of memory that is allocated to it.

An intuition can be seen from the following side-by-side comparison (in a mythical machine where a page has only five bytes).

Java:     String s = new String(“hello”);
kernel:  map page frame 0×00000000–0×00000004 to physical page 0x00200cf2–0x00200cf6.

Here the physical page 0x00200cf2–0x00200cf6 corresponds to the storage for the new String object, while the page frame 0×00000000–0×00000004 corresponds to the object name s. The hardware mapping corresponds to variable binding in the Java language. This analogy might require some twist of mind, but I think it pretty much captures the essence of paging.

Segmentation is just paging, with a fixed number of pages

Despite of the history of segmentation, we can’t eliminate it from our choice for a memory management mechanism just by prejudice. Really, paging has won for a reason. We can see it from the observations I just made.

From the analogy I made from paging to “automatic memory management”, we can see that the virtual addresses are actually “holders” for the allocated memory blocks, similar to variable names in Java. Linux programs access the allocated memory by using addresses (and may further subdivide the allocated memory blocks, e.g., using “malloc” in the standard C library). This allocation can be readily extended to additional memory blocks in arbitrary locations. We just need to find free physical pages and make mappings in the page tables. The programs then access the addition pages using the addresses that have already been there.

But this is not the case with segmentation. Yes, the segments provide some kind of abstraction, but we can basically view each segment as one page (with a flexible size though). How many such “pages” can a program have for its data? Just count the number of segment registers (cs, ds, es, ss, … ). In fact, you can’t really concatenate those “pages” and use them all together, because they must be accessed by their names. So in effect a program usually has only one “page” for its dynamic data.

A huge problem comes when we need to extend the allocation. Once allocated to a program, the segment will have fixed base and limit. Imagine how we could extend the allocation if the original space becomes too small after some use? We would have to find a larger block of memory (in another address if the limit can’t be extended any further), copy the data to the newly found block, change the segment descriptor, reload the segment selection registers. This is doomed to inefficiency of course.

Just think about it carefully, you will find some truth in this sentence: Segmentation is just paging, but with a fixed (and very small) number of pages.

Question 1.1

For a deeper understanding about paging, think about swapping. Swapping happens when there is not enough physical memory for new allocations. In that case, the following things happen:

  1. The kernel has to store some pages on disk and load them back into memory later.
  2. When a page is swapped back, it could occupy a different physical page.
  3. The program’s page table has to be updated to map the corresponding process page frame to the physical page where the page is loaded into.

Here comes the question: What is an analogy of this phenomenon in a high-level language program (demonstrating all the three phenomena above)?

Questions 1.2

What property of segmentation prevents us from making a similar analogy—from segmentation to automatic memory management in high-level languages?

Shared Memory Mutliprocessing Is An Illusion

Believe it or not, shared memory multiprocessing systems don’t really exist. This may sound crazy, but I hope you see what I mean.

Uncontrolled access to shared memory contents led to chaos, thus in the past several decades, various synchronization mechanisms have been devised. However, all the mechanisms that we have seen so far have very ad-hoc semantics and are notoriously hard to use. Operating systems books usually present the classical synchronization problems and their standard solutions as they are, but never explain how somebody could ever arrive at the solutions systematically. The secret is that nobody really knows a systematic way. The current state-of-the-art in shared memory programming is that race conditions, deadlocks, … are still hampering even the world’s best programmers.

My understanding is that, if you happen to get it right, those synchronization mechanisms (subconsciously) serve the same and one purpose: to simulate a message passing system. Shared memory segments serve as message buffers, and synchronizations establish communication channels. Reads and writes of memory locations correspond to “receives” and “sends” of messages. For any channel and at any time, information flows in only one direction. This equivalence can be depicted in the above picture.

So semantically speaking, every correct shared memory program is essentially a message passing program. Shared memory multiprocessing is an illusion, hiding the message passing (or transactional) nature of concurrent computation. It is helpful to think of a shared memory system as a particular implementation of a message passing system even if we have to write shared memory programs.