RubySonar: a static analyzer for Ruby

(This is a reposting of a blog post I wrote on Sourcegraph‘s website)

Since the last enhancement to our Python analysis, we have been working on improving the Ruby part of Sourcegraph by building a similar static analysis tool for Ruby. The result is a new open-source project RubySonar. With its help, now we are happy to see significantly improved results of Ruby repos which in general finds over ten times more symbols, provides more examples and more accurate jump-to-definition. It also provides local variable highlighting which was not available in our previous Ruby results. With the improved speed of analysis, we can now process the full Ruby standard library, Ruby on Rails, and Ruby apps such as Homebrew.

RubySonar’s analysis is interprocedural and is sensitive to both data-flow and control-flow, which makes it highly accurate. Although we currently don’t show the types, RubSonar internally uses the same type inference technique of PySonar, and thus can resolve some of the difficult cases that can challenge a good Ruby IDE.

Although Ruby and Python seem to be very similar languages, lots of work has been done to adapt PySonar’s code to Ruby. Still more work has to be done before RubySonar can achieve the same coverage of PySonar2, but so far the Ruby results are a lot more usable. We have been carefully keeping things simple and easy to extend, so we are optimistic about further improvements to the Ruby analysis.

On object-oriented programming

[written at the end of 2013 AD, during the Dark Ages of programming]

The programmer’s world is full of fads and superstitions. Every now and then there will be somebody who come up and announce: “I can save the world!” No matter how bad the ideas are, there will always be followers, and the ideas soon become their religion. They then develop their community or camp, try to let those ideas dominate the world, and try to make the ideas live forever.

Object-oriented programming (OOP) is such a religion. It claimed to be able to save the world from the so-called “software crisis”, but as a hindsight after so many years since it was introduced, not only didn’t OOP save us, it brought us way more confusion and harm than benefits. Unfortunately its dogmas and mispractices have become so wide-spread and deeply intrenched. In this article, I hope to provide my viewpoint into this matter and try to find out the lessons that we can learn.

Like every article on my blog, the opinions are completely personal and not representing my employers or professors.

Is everything an object?

“Everything is an object” is the core dogma of OOP and deemed as the highest standards of OO language design. Now let’s take a careful look to see if it is true, or if it is a good idea to make things that way.

Many people take “everything is an object” for granted because when this sentence is taken literally it matches their everyday experience. Since the word “object” in English basically means “a thing”, how can “everything is an object” be not true? But be careful since the definition of an “object” in OOP has a specific meaning which is very different from its meaning in English.

OOP’s definition of an object is “a combination of data fields and associated procedures known as methods“. Can you really fit everything into this model?

First let’s look at the real world and see if this definition can capture everything. Cars, trees, animals may sometimes be thought of as objects, but what about a change of the objects’ position, its velocity and duration? What methods do they have? Well, you may define classes called Velocity or Time, with methods such as addition, but do velocity and time really contain the things that you call “methods”? They don’t. They are just your imagination. You can add the velocities or time, but how can velocities or time contain the addition procedure? This is like saying that the bullets contain the gun.

So the most you can say is that “everything is an object” is a good way of thinking, but that is not true either. The definition of an object implies that a method can only belong to one object, but most of the time it doesn’t make sense thinking of it as belonging to any object. Say we have the expression 1+2, does the operator ‘+’ belong to 1, or does it belong to 2? You have to make some arbitrary choice. Since you can make a choice, this means that the ‘+’ operator doesn’t really belong to either of them. The operation is inherently outside of the objects.

So thinking of some things as objects may be helpful, but thinking of everything as an object is neither true nor useful. Unfortunately “everything is an object” has been taken as a dogma and the highest standard of OO language design. Some OO languages claim that everything is an object in them. Whenever you notice that something is not an object, somebody will try to make it one. They may succeed in that, but things get very complicated that way, because that’s not how things work.

The idealism of “everything is an object” is similar to “everything is a function” in the functional programming world and “everything is a set” in the math world. Before computer science was conceived there was a thing called the lambda calculus. Some people encoded everything including numbers and their operations, various data structures and control structures, … all in lambdas. One of the encodings of numbers is called the Church numeral. Every programming language researcher has played with them during their education. But unlike “everything is an object”, “everything is a function” has never become a dogma or marketing phrase. Those formulations sometimes provide thought experiments and inspirations to the researchers but nobody really use them for actual computation, because they are inefficient and they are not really how things work. They are just approximations (models) to some essence of computation that we can’t see. If you really use them for practical projects, things become complicated.

Mathematicians have a similar concept: set theory. Some geniuses encoded everything — numbers, operations on numbers, mathematical structures, … all in sets. Everything is just sets containing sets containing sets and so on. What’s the problem? But when they really tried to do their proofs using those sets, the proofs fell under their own weights. They are too complicated. Even with the complexity, set theory is not expressive enough to capture whatever the mathematicians have to say. Many people tried to fix it, but they all failed.

So “everything is an object” is in some sense on the same track of “everything is a function” and “everything is a set”. Good thought exercise, but doesn’t really work well in practice. I don’t think that there is some “one true language”. When compared to the “absolute truth” every theory is wrong, but some theories are more wrong than others. The model of OOP is too far from correct or practical. It’s somewhat like the flat earth theory. Until today some people still believe that the earth is flat and make all kinds of theories to prove it. Some of their arguments look very scientific, but do you believe in their formulas or a picture of the earth from a spaceship? When you get the fundamental things wrong and don’t throw them away, you have to patch them endlessly with even more complicated theories. You will have to make theories that bend the light and gravity in weird ways. And that’s what happened to OOP.

Are functions objects?

From what I know, the original motivation of putting functions inside objects was to support GUI applications. You click on a button and some function (a callback) will be invoked. For the convenience of referring to the button, the callback takes the triggered object as its first argument. Since the callback does nothing more than this, it seems to be convenient to just store it inside the button. And thus we had an “object” which combines the attributes of the button and a method (the callback). Indeed this is a good idea, but this limited usage case can’t really justify a universal notion of “everything is an object”, just like a two-mile walk can’t prove that the earth is flat.

If you really understand what is abstraction, you may have noticed that even the above story contains a subtle mistake: the callback in the button is not really a method. The true purpose of a method is to provide abstraction to the attributes, but the callback’s purpose is not to provide abstraction. It is just a usual function triggered by the button, which happens to take the button as its first argument.

Very few functions should be considered methods of an object. If you look carefully, most of the time the objects just serve as a namespace (or module) in which you can store attributes and functions, but those functions don’t logically belong to the objects. They just take the objects as inputs and produce some output. Only the functions that are most intimately connected to the attributes and provide an abstraction layer to them should be considered methods. Most of those are called “getters” or “setters”. Some others hide implementation details for more complex data structures such as lists, hash tables, sets etc.

In some languages such as Scala or Python, functions are also treated as objects, but they actually just wrapped the functions into an object, give them names such as “apply” or “call“, so that when the objects are “invoked” you know which functions to call. But putting a function into an object doesn’t really mean that functions are also objects, just like inviting friends to your house doesn’t make them your family.

Functions are fundamental constructs. They don’t belong to objects. They describe a change, transition or transformation of objects. They are not objects and can’t be simulated by objects. They are like a base case of an inductive definition. They are where the illusion of “everything is an object” ends.

The cost of excessive abstraction

The major appeal of OOP is abstraction (and thus code reusing and DRY), but actually most of those abstraction facilities are already provided by traditional procedural languages and functional languages. Some of them do it even better than OO languages. OO claims its originality by emphasizing abstraction much more strongly than other languages. The result is that OO programmers usually overdo it. Some of them pursue abstraction and code reusing to the degree as if they are everything about programming.

For the purpose of code reusing, OO encourages a level of abstraction which makes programs hard to understand and hard to analyze. I often see Java programs with multiple levels of inheritance, overloading and design patterns, but actually doing very little. And because there is so much code that doesn’t do useful things, it is really hard to find out which part of the code is doing the thing you want. It is like going through a maze. Another nice word for this is “robustness”. If I have to go into all this trouble to make code reusable or robust, I’d rather just make copies of the code and modify them, but keep each copy simple and easy to understand.

Whenever you criticize Java or C++ for their verbosity, OO proponents will tell you that they are not authentic OO languages. They would ask you to look at Smalltalk. If Smalltalk’s ways are that good, why almost nobody is using Smalltalk now? Because there are real problems in its approach. I think Smalltalk is the origin of over-abstraction and over-complication you find in other OO languages.

The “authentic” OO style of Smalltalk promotes the notion of “extremely late binding”, which basically means that the meaning of the program constructs is determined as late as possible. Late binding gives you a chance to swap out the underlying implementation without forcing the upper levels to change, but this also means that you are no longer sure what a piece of code means. When I look at expressions such as ’1+2′ and ‘if (t) then … else …’ in Java or C++, I at least know for sure that they mean an integer addition and an usual conditional. But I’m no longer sure about this in an “extremely late binding language”, because the meaning of ‘+’ and ‘if” can be redefined. Giving the programmers the power of defining control structures is a bad idea, because soon your language will be abundant of quirky control structures designed by programmers who try to be clever. It will no longer be the language that you used to know.

An example for this feature is Smalltalk’s conditional structure, which looks like this:

result := a > b
    ifTrue:[ 'greater' ]
    ifFalse:[ 'less or equal' ]

You send a message ifTrue: to a Boolean object, passing as an argument a block of code to be executed if and only if the Boolean receiver is true.

First of all, if you really have a well-designed language, you shouldn’t be wanting to define your own control structures. As a seasoned Lisp/Scheme programmer, I have seen many custom-designed control structures (such as the various looping macros) over the years, but none of them turned out to be good ideas. I’d rather write slightly longer and more verbose code in the vanilla language than to learn those weird control structures. Second, if you are really genius enough to have invented another good control structure, the late binding feature of Smalltalk probably won’t provide you the necessary power for defining it. The power of functions as an abstraction tool is limited. It is strictly less powerful than Lisp/Scheme’s macros. Third, this feature of Smalltalk is not really a novel approach, and it has a serious problem. A similar but more beautiful conditional construct had been defined in lambda calculus since before computer science was born:

TRUE = λx.λy.x
FALSE = λx.λy.y
IF = λb.λt.λf.b t f

This is very beautiful and can be done in any functional language, but why none of the functional languages implement conditionals this way? Because when you see an expression IF b t f, you will have no idea whether it is a conditional or not, because IF can be redefined in the program. Also because IF is just a function, it may also accept unexpected values other than TRUE or FALSE. This may happen to make the conditional construct work but cause trouble later on. This is called “unintentional semantics”. This kind of bug can be very hard to track down.

This approach also makes compiler and static analysis hard. When the compiler sees IF b t f, it no longer knows that it is a conditional so it can’t optimize it that way. It has to treat it as a usual function call. Similarly when the type checker sees it, it doesn’t know what type to expect for b, because it may not be a conditional at all. The above argument against the lambda calculus can easily be adapted to Smalltalk.

So abstraction is a powerful weapon when used moderately, but when you do it in excess, it backfires. Not only does it make it hard for humans to understand the code, it makes automated analysis tools and compiler optimizations difficult or impossible to make.

Design patterns, the brain eater

Although OO languages are touted for their abstraction capabilities, they are actually not strong in terms of abstraction and expressiveness. There are certain things that are very easy to do in traditional procedural languages and functional languages, but has been made unnecessarily hard in OO languages. This is why design patterns appeared. Design patterns’ origin was mostly due to the dogma of “everything is an object”, the lack of high-order functions (or the correct implementation of them) and OO’s tendency of mystifying things.

When I first heard about design patterns I was already a PhD student at Cornell doing some PL research. I mostly used Standard ML and Haskell. After hearing my friends’ high opinions of the Design Patterns book (the “GoF” book) I developed curiosity, so I borrowed one from the library. Within a few hours I found a mapping from all the weird names it introduced to the programming techniques I had been using all the time. Some of them are so fundamental and they exist in every high-level language, so they don’t really need names. Most of the advanced ones (such as visitor) are transcriptions of functional programming concepts into a convoluted form in order to get around OO language’s limitations. Later on I found that Peter Norvig gave a talk on design patterns as early as 1998, pointing out that almost all of the design patterns will be “transparent” once you have first-class functions. This confirmed my observations — I don’t need them.

I have to admit that some of the design patterns are cleverly designed and contain some ingenuity. You really need to get to the essence of the OO languages’ internal magics and also understand lots of functional programming techniques in order to create them. But intelligence =/= wisdom. Even if they can achieve what functional languages can do, they are usually a lot more complicated. Choosing the hard ways can’t really prove your genius. When you have first-class functions, things become so much easier and you won’t even notice the design patterns’ existence. Like Peter Norvig said, they will become “transparent”. So what a good language designer should do is to add first-class functions into the language instead of proposing design patterns as workarounds.

Every time I remove a design pattern (some other people wrote), the code becomes simpler and more manageable. I just removed the last visitor pattern from my Java code a few days ago and I felt so relieved. They gave me nothing but extra work when they existed. They hindered my progress. By deeply understanding how OO languages are implemented, you can write more advanced things than those provided by design patterns but without actually using them. I owe these insights to some functional programming people. If you really want to understand the essence of OO design patterns and how NOT to use them, this little book may be a good starting point.

Unfortunately design patterns somehow got really popular in companies, to the degree of unbearable. I saw the GoF book on almost every bookshelf when I interned at Google. Even if you don’t write them yourself, there was almost no way you could avoid other people slipping design patterns into your code. Design patterns’ marketing strategy was much like weight loss products: “It can burn your fat without you doing any work!” They appeal to some new programmers’ hope that they can write programs without understanding the fundamental concepts of computer science. Just by applying several patterns and patching things together, they hope to have a good program. This is too good to be true. You end up doing more work than you hoped to avoid. Design patterns eat programmers’ brains. After using design patterns for some time, they no longer see things or write programs in clear and straightforward ways.

What is an OO language any way?

To this point we haven’t yet talked about what makes a language an “OO language” and what makes it not. Is it an OO language just because I can put both data fields and functions into a record? Or is it an OO language only if it also provides extremely late binding? How about inheritance, overloading, etc etc? Must I have all of them? Any of them?

It turns out that there is no good answer to this question. There really is no such thing as an “object-oriented language”. Objects can be part of a language, but it is just a small part of it. You can’t really say that a language is object-oriented just because it provides objects as a feature. The so-called OO languages are solidly rooted in traditional procedural programming (PP). OOP basically stole everything from PP, renamed the terminologies and acted as if the ideas were its own.

Historically the term OO was mainly used for marketing reasons. It could give a language some advantages of attracting people if you claim it to be an OO language, but now this advantage is diminishing because more and more people have realized the problems of OO’s methodology.

Harm in education and industry

Although OO has lots of problems, it is very successful in marketing and has risen to a dominant position over the years. Under social and market pressure, many colleges started using OO languages such as Java as their introductory language, replacing traditional procedural languages such as Pascal and functional languages such as Scheme. This in a large degree caused the students’ failure to learn the most essential concepts of programming. The only thing that OO emphasizes is code reusing, but how can you teach it to the students who can’t even write usable code, not to mention that code reusing is not really as important as some people believe.

At both Cornell and Indiana, I have been teaching assistant for introductory programming courses in Java. I still remember how confused the students were. Most of them had trouble understanding things such as the meaning of “this”, why everything needs to be put inside classes, why make every field private and use getters, the difference between a method and a static method, etc etc.

There is a good reason that they don’t understand — because OO is not how things work. Most of the time I feel that I was teaching design flaws and dogmas to them. Many of them learned very little in the end. Worse, some of those students really believed in OO. They ended up being proud of writing over-engineered and convoluted code. They no longer see things or write programs in straightforward ways. This is sad. I feel that we are no longer educating students as creative and critical thinkers, but mindless assembly line workers.

In industry, OO hasn’t really proved its effectiveness with evidence. Good systems may be built in a “OO language”, but the code is often written by people who understand the problems of OO and don’t embrace “everything is an object” or “design patterns”. Good programmers usually use workarounds in OO languages and are essentially writing in a traditional procedural style combined with bits from functional programming. So some OO languages and their tools may be pretty widely used, but the OO style doesn’t really have much influence on the advancements of programming as a field.

Final word

So what does this post has to say? A jihad against OO languages? Advocate functional programming? Neither. As I said, there is no such thing as an “OO language”, so where is the war? Every so-called OO language also contains good elements that it borrowed (or stole) from procedural languages or sometimes functional languages, so they are not completely useless.

It is the extra features added by OO in addition to procedural programming that are causing most of the problems. Those extra “true OO techniques” contain way more confusion than real value. In my experience, accepting even one or two of those ideas may put you into a series of troubles and wrong ways of thinking which may take a long time to examine and recover. They are like diseases.

Thus I suggest not to buy OO’s way of thinking and don’t try to exploit its “features”. By eschewing those problematic features you can still produce acceptable programs in an OO language, because you are essentially using it as an non-OO procedural language.

(Chinese translation by ZoomQuiet)

Purely functional languages and monads

In general, functional languages are pleasant to work with because they support first-class functions, which are a very powerful modeling tool. But if you pursue the extreme — to use a purely functional language, you get adverse effects similar to OO design patterns. In a conventional OO language, having to use OO design patterns makes hard things even harder; but in a purely functional language, having to use pure functions to model side-effects makes trivial things hard and hard things impossible. So speaking of the two evils, OOP is the lesser one because at least easy things are still easy in them.

The problem with pure FP is: there exists things that are not pure.

Side-effects are real

Electrical engineers probably have the best understanding of purely functional languages, because electric wires are pure by default (if you are not paranoid enough to count heat as a side-effect). Purely functional languages are the analogous of combinational logic circuits. They are very useful and are an important part of a system, but you can’t use them alone to build up complex systems. This is why people invented flip-flops and sequential logic circuits. If you look at the design of the flip-flops, you will start to appreciate the ingenuity in them. They create memory out of “pure circuits”. And the memory is where the side-effects come from.

So pureness comes for free, but side-effects took efforts to invent. Unfortunately a large portion of the physical capabilities provided by the sequential circuits are disregarded by purely functional languages. Instead, purely functional languages are obsessed in simulating the effects on their own. There is a difference between the physical and the simulated. The simulated can’t be as efficient as the physical.

For a simple example, many efficient data structures rely on mutations to create connections between their components (think about circular data structures and networks). In such structures, side-effects act very much like a physical force to hold the parts together, and update them as needed. To implement this in a purely functional language, you have to use indirect ways. Every time you change it, you need to create a new data structure which shares most of the old data structure. But then you got into the trouble of keeping track of where the “current structure” is. It is moving in the dataflow and you have to pass it on. Had you used side-effects, the structure can stay at the same location, so you won’t need to worry that you will lose track of it. You don’t need to pass it on. This will save lots of administrative code.

Also sharing the old structures induces extra levels of indirection in the data structure, which causes significant overhead. Thus purely functional data structures can’t really match the performance provided by its impure counterpart. Purely functional data structures also cause lots of object creation and thus stress the garbage collector.

Some people say that purely functional data structures will reduce contention when there is lots of concurrency, but notice that if the observers of the old data structure “purely update” it, they will hold a different structure other than it’s current state. After that point, the universe is “forked” and they will live in completely different worlds. You will lose consistency. If you really want all threads to see the same data structure, then the contention is unavoidable, because you need a channel for the information to pass through. Nothing can save you from the contention because there is data dependency.

Everything is persistent in purely functional data structures, but how often do you need the outdated ones? In impure data structures, you still have the choice of making copies when necessary and achieve the same persistence effect, just with better performance.

Monads are design patterns

Purely functional languages also complicates the programs and ensue a huge cognitive cost. Modeling side-effects using pure functions is in the same mentality of wrapping functions inside objects as in OOP. They are both over-engineering and cause unnecessary manual labor.

One of the major “design patterns” in purely functional languages is called “monads”, a highly stylized way of structuring side-effects. If you look deep into them, monads make programs complicated and hard to write, and monad transformers are in essence a hack to get around monads’ limitations — they are not principled ways of composing monads. Representing side-effects using monads is as convoluted as writing interpreters or compilers using visitor patterns. For this matter, I wrote a short article some time ago which not many people have read:

To write programs in a purely functional programming language is much like living in a wired world. In such a world, there are no electromagnetic waves nor sound waves, so you don’t have wifi, cell phones, satellite TV, radios, etc. You don’t even have light or sound. In other words, everything is blind and deaf.

All information must pass through wires or pipes, connected by switch boxes which we call “monads”. You must carefully route all the wires and pipes and connect them correctly before any information processing device, including your eyes and ears, can properly work.

Trivial things in other languages (such as random number generators or circular data structures) become non-trivial in a purely functional language. Easy things often become research problems when you try to write them using monads, so you often see papers titled similar to “A Monadic Approach to a-solved-problem”.

About this I have an interesting story to tell. Once upon a time, my PhD advisor Amr Sabry tried to reimplement Dan Friedman’s miniKanren (a logic programming language) in Haskell, but he couldn’t figure out how to compose the monads. He asked for help from Oleg Kiselyov, arguably the world’s most knowledgeable person about getting around Haskell’s type system. And if you don’t know, Amr Sabry is probably the world’s most knowledgeable person about purely functional programming languages and side-effects. His paper What is a Purely Functional Language is often referred to as the official definition of “pureness”. After solving the problem with Oleg’s help, they coauthored a Functional Pearl paper. Ironically, Dan Friedman, the original author of that piece of code, didn’t have any such trouble writing it in Scheme in the first place. Certainly there is no reason Amr should be able to figure out how to compose the monads. He and Oleg just wrote the Haskell code for fun, but this story tells me something about monads: they make things unnecessarily complicated.

How did Dan Friedman write the code so easily? He just directly passed the states in-and-out and not using any monads, or you can say that he used monads’ essence without actually using them. Following Dan’s style, I rewrote miniKanren, added constraint logic programming and a highly sophisticated negation operator to it. All this is done within three weeks during my first semester as a PhD student, together with Dan’s B521, Amr’s B522, other course loads and teaching duties. I would certainly be bogged down had I used monads, and I don’t see any point of translating it into a monadic style. It is just so much simpler without monads.

(Correction to some historical facts by request from Prof. Friedman: He’d like to give a lot of the credits of the current miniKanren code’s simplicity to Chung-chieh Shan, who at one moment simplified the code to its current style.)

Equational reasoning can’t save the world

On the safety side, purely functional languages haven’t much advantage either. Some people claim that the value of monads is that they explicitly delimit the side-effects and support equational reasoning, but programs are not always as easy as algebra formulas such as a(b+c)=ab+a*c. If all programs are that easy, we would not need monads at all. Monads don’t really make your program easier to reason about.

Pure functions always return the same results when they are given the same input, but every monadic function has an extra “implicit” argument which is different at every call to the same function, so although it is still true that “pure functions always return the same results when they are given the same input”, the problem is, you never get the same input because of that always-changing extra argument! You don’t know what’s inside that argument. That argument is called the “state”.

There is no way you can statically know the values inside state monads, for the same reason that you don’t know the values inside the heap. This is because the state monads are in essence the same as the heap. The heap maps memory addresses to values, and state monads map variables to values. If you have written static analysis, it will be clear that monadic code is in essence putting parts of a static analysis into the user’s code. In other words, every monadic code is reimplementing part of a static analysis. So monadic code is as hard to analyze as if you use side-effects, only that it takes a lot more work to write. You can write horribly side-effective code with monads which nobody can understand and no static analysis tool can help. There is no such thing monads can make easier which can’t be done by static analysis. Static analysis researchers know this very well.

You can write pure functions in any language

Monads are contagious. Once the code gets into monads, it is not easy to get out. Having to explicitly specify side-effects in the types is similar to having to explicitly declare exceptions in Java. You must either “handle” it, or you must declare that you have passed it on. Why should programmers write them while they can be easily inferred by static analysis? Static analysis use what is essentially monads (or even more advanced techniques), and they take the burden of writing monadic code away from programmers.

Of course overusing side-effects will make programs harder to analyze, but you can reduce side-effects and write pure functions even in an impure language. For example, the following C function is a pure function, satisfying every requirement of the definition of “pureness”.

int f(int x) {
    int y = 0;
    int z = 0;
    y = 2 * x;
    z = y + 1;
    return z / 3;
}

Advanced static analysis tools would have no trouble figuring out things about this kind of code. They would know that this function is pure without you writing any annotations, and they can do a lot more for you than this. So pure functions don’t just belong to purely functional languages. You can write pure functions in any language (including assembly), but the point is, you should be allowed to use side-effects too, especially when they make things easier.

Thinking critically about mathematics as a language

Looking back into history, the dogma from mathematics is the driving force behind purely functional languages. Mathematical functions are simple and beautiful, but unfortunately they work well only when the thing you are trying to model is pure by nature. Purely functional language proponents like using buzzwords like “category theory”, and call those who don’t understand it “uninitiated”. I know a considerable amount of category theory. Even the category theorists themselves call it “abstract nonsense”, because it is to a large extent a grotesque way of saying what other mathematicians already know, in much the same sense that GoF design patterns are a grotesque way of saying what most decent programmers already know. So category theory is the analogous of design patterns in mathematics.

If you read Gottlob Frege’s article Function and conceptyou will be surprised that most mathematicians got functions wrong before his writing, and that was just a little more than a hundred years ago. Mathematics have done lots of things wrong with its language. This has been pointed out a long time ago by Gerald Susman in his Structure and Interpretation of Classical Mechanics and recently in his InfoQ talk. There is a lot of truth in his words. There is no reason programming language designers should blindly follow the ways of mathematics, because it is just another quirky language.

What is functional programming, really?

The above is not to disagree with functional programming in general. On the contrary, functional programming in general is highly valued. I just disagree with the dogma of the “purely functional” ones. Impure functional languages such as Scheme and ML haven’t these problems. They have “benign side-effects”. In fact, in Scheme functions are called “procedures”, not “functions”. This is because its designers know that they are not functions in the mathematical sense, and they intended to make them not necessarily pure. The purely functional language community often try to steal the word “functional programming” from traditional functional languages (Lisp, Scheme, ML), as if only purely functional languages deserve the name. This is not fair and this is harmful. We should be able to use the word “functional programming language” to refer to any language with correct implementation of first-class functions.

Don’t fall in love with your model

Everything starts to do harm when they are pursued to the extreme. Purely functional programming tries to fit the world into its model, but the world works in a completely independent way. It is wrong to think of everything as a nail when you have a hammer. Only by observing the reality can we get out of the religions that are limiting us. Don’t fit the world to your model. Fit your model to the world.

psydiff: a structural comparison tool for Python

(click on the above picture for a demo)

Psydiff is a structural comparison tool for Python, written in Python itself. The main algorithm and UI of psydiff is almost the same as ydiff.

If interested, you can see a demo of it here (psydiff comparing itself):

http://www.yinwang.org/resources/pydiff1-pydiff2.html

All the source code can be downloaded from my GitHub repo:

http://github.com/yinwang0/psydiff

It’s still in early stage of development. I appreciate your bug reports, feature requests or contributions.

Special thanks to Ronny Pfannschmidt for code contribution and suggestions.

(Note: This project was named “pydiff”, but I found that the name has already been taken for another project. Thus it is now renamed to “psydiff“, meaning “PYthon Structural diff” or “PSYcho diff” :=)

Null reference may not be a mistake

null

The null pointer is considered to be a “billion-dollar mistake” by Tony Hoare, but was he really saying that null pointer should never be used? After years of using languages both with null pointers (e.g. Java) and without them (e.g. Haskell), I found that null pointers are much easier and more natural to use than its counterparts (e.g. Haskell’s Maybe type). I have been wondering why there is such a notion of “billion-dollar mistake” until I saw the original video where Tony Hoare claims it to be his mistake. In fact, he didn’t really say that null pointer should not be used, so I realized that I made a mistake by taking the words “billion-dollar mistake” literally.

From this video, you can see that introducing null reference is not really a mistake. On the contrary, null references are helpful and sometimes indispensable (consider how much trouble if you may or may not return a string in C++). The mistake really is not in the existence of the null pointers, but in how the type system handles them. Unfortunately most languages (C++, Java) don’t handle them correctly.

Every class type A of Java is in fact a union type {A, null}, because you can use null anywhere an object of class A is expected. This is equivalent to the Maybe type of Haskell (where null in Java corresponds to Nothing of Haskell). So the trouble really is that an annotation like {String, null} should be distinguished from String, so that it will be clear what can possibly end up in the value. Unfortunately most languages don’t provide a convenient union type that you can put String and null together (Typed Racket is an exception). If Java is to have union types, we can say something like:

{String, null} findName1() {
  if (...) {
    return "okay";
  } else {
    return null;
  }
}

This is saying: findName may return a name which is a String, or it may return nothing. In comparison, we can say something slightly different:

String findName2() {
    ...
    return "okay";
}

By distinguishing the return types of findName1() and fineName2(), the type system knows that you should check for null when you have called findName1(), but you don’t need to check for null if you call findName2(). So you have to write something like:

String s = findName1();  
if (s != null) {
  x = s.length();      // use s as a String only after null check
}

But you may write:

String s = findName2();
x = s.length();

For the latter, you don’t have to check whether s is null because we know definitely that findName2() will return a String which is not null.

In fact this approach is hinted by Tony Hoare in the above video at 00:24.00. He said that null should be a class. Indeed, the union type {String, null} certainly thinks of null to be at the same status of String – it is a class.

But we soon realize that it doesn’t really matter whether null is a class or not since the class Null will have only one value — null. So any language with null references should work equally well given a correct type system.

Advanced static analysis tools can already help alleviate this issue by essentially inferring the union types like {String, null} even when the programmers write String instead, although a type annotation system which allows the programmers to specify union types directly certainly makes type checking easier and also makes programs easier to read.

Back to the future of databases

Why do we need databases? What a stupid question. I already heard some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, we don’t have enough memory to hold all the data. But why can’t we just swap out the objects to disk and load them back when needed? The answer is yes we can, but not in Unix, because Unix manages memory as pages, not as objects. There are systems who lived before Unix that manage memory as objects, and perform object-granularity persistence. That is a feature ahead of its time, and is until today far more advanced than the current state-of-the-art. Here are some pictures of such systems:

IBM System/38

system-38

Lisp Machine

lisp-machine

Oberon

Oberon

Those systems don’t really need databases (in its usual sense). Data integration was seamless and transparent to the programmer. You don’t need to know the existence of a “disk”, a “file system”, or a “database”. You can just pretend that you can allocate infinite number of objects and work on them in the most natural way. Unfortunately most of those systems were either terribly expensive or had problems in other aspects of their design. Finally, they seemed to have died out.

Good ideas never die. This is not to say that there is nothing we can learn from their design. On the contrary, their ways are far superior than the current state-of-the-art of Unix-based systems. Unix will never reach this level of elegance and power.

But any how, Unix rules the world. We can live with it, but it is just mediocre. Please don’t believe everything that the modern Operating Systems books tell you. Sometimes you have to look further into the past for the future. As Einstein said, “Nothing is more needed to overcome the modernist’s snobbishness.”

Unix used to be free, but you get what you pay for. Although there is a thing called “virtual memory”, your programs can’t just allocate objects and then operate on them without any knowledge about the “file system”. Nothing works out-of-the-box in Unix. In fact it is far from that. Unix and its “philosophy” is a constant source of trouble. It is more like a “non-operating system” than an “operating system”. It leaves too much work for you to do, and leaves more than enough rope to hang yourself.

Unix builds its reputation and authority by blaming the users. If you don’t know how to use me, you are an idiot! This is the same trick that the weavers played on the emperor: If you can’t see the clothes, you are either stupid or incompetent. What a powerful way to cover the ass of any crap!

Unix’s incapability is why people “invented” databases. The combination “Unix + databases” is supposed to be a cheap replacement for those advanced systems where programs don’t need to know the existence of such a second-level data storage layer. But because of some irreparable design issues, Unix still can’t achieve that even with the help of databases. The databases, until today, still relies on Unix’s low-tech mechanisms such as memory mapped files to store data, which causes complications and performance issues.

However, databases were somehow considered a big thing, and people who made it became the richest men in the world. Consequently, you have to take database classes if you want a computer science degree. So here is an ultimate cheat sheet for those who really want to know what a database is. You will not need to sit through a semester’s course if you remember the few things that I put below. Trust me, many students got A+’s because I told them this ;-)

Every “row” in a database “table” is a data structure, much like a “struct” in C, or a “class” in Java. A table is then an array (or list) of such data structures. The “keys” of a database table are in essence “persistent memory addresses”. When serializing an object, we can’t just put the memory address of an object onto disk, because the address may not be the same when the object is reloaded into memory. This is why we need “keys”. In a sense, “keys” are a more general notion than “addresses” — addresses are just keys that are integers.

There is a white lie in the above paragraph – I didn’t mention that there is some redundancy in a database table in comparison to a serialized data structure. Some data is duplicated across multiple rows because a RDBMS table row has fixed width, so it can’t store variable length data such as arrays. What can you do then? By recognizing that the table is the only thing that can “grow” in a relational database, an obvious solution is to turn the array 90 degrees clockwise, and make each element a row in another table! But how do you connect from where the array is originated from? You add the key of the object to each row of this new “array table”. See how the key is duplicated? This is why people “invented” column-based databases (such as Vertica, HBase etc) for “compressing” these keys. But what they achieved was essentially making the tables slightly closer to the format of serialized data structures.

You create the problem, and then you solve it. And you call this two inventions.

Keys are persistent pointers. Whenever you need to dereference a pointer, you do a “join” in the database, so “join” is equivalent to “following pointers”.

A database “schema” is in essence a “structure type”, like the struct definition in C. For example, the schema created by the following SQL statement

CREATE TABLE Students ( sid CHAR(20),
                        name CHAR(20),
                        login CHAR(20),
                        age INTEGER,
                        gpa REAL )

is equivalent to the C struct

struct student {
  char* sid;
  char* name;
  char* login;
  int age;
  double gpa;
}

(Note that I use a SQL declaration here just because I don’t want to draw a picture of the schema. This equivalence of a relational schema with a structure type has nothing to do with SQL.)

That’s almost the whole story. You have addresses, pointers, dereference operation, structure types, arrays/lists of structures, so now you can implement things like linked lists, graphs etc. With them, you can implement some complicated algorithms such as A* search in a database. You just need to take a data structure class, and then translate what you learned there into a database language like SQL.

But SQL is a crappy language. It wasn’t designed for programmers. It was designed for manual input by human operators (usually non-technical people like accountants). You type in a “query”, and the computer prints out a “response”. That is why it is called a “query language”. This language does its job for human operators, but it was then abused beyond its capabilities. It was interfaced with computer programs to write serious programs. I doubt if those people knew what they were doing, but it just happened, like many other silly things. There are just so many things you can’t express in that language. The result is a dumb and fragile system held together by band-aids. You have to be very careful otherwise you lose blood.

If you really want to learn SQL, here is the cheat sheet for it:

The query

SELECT Book.title
 FROM Book
 WHERE price > 100

is equivalent to the Lisp expression

(map (lambda (b) b.title)
     (filter (lambda (p) (> p 100)) Book)

<

p>This program is then sent to the “database engine” for execution. That is, we move the program to the data, instead of loading the data to the program. And that’s also the principle behind MapReduce. Have you noticed how easy this can be done with Lisp? You just send the code to the interpreters running on remote machines!

The problem with SQL is that you need yet another layer of language before programs can operate the database. SQL is a weak and quirky language. It is not Turing-complete and at some places it doesn’t even “compose”. You need to combine it with a decent language before you can write serious programs. Your programs send SQL commands to the database to store and load data, just like a human operator would do. This is a very low-tech way of data integration. It is error-prone, inefficient and subject to security risks such as “SQL injection”.

Indeed, there is a “good component” in SQL, because it has some “relational programming” features. However, the other word for “relational programming” is “logic programming”, where languages like Prolog and Datalog excel. They are both more expressive and more elegant than SQL. Considering that Prolog and Datalog appeared much earlier than SQL (1972, 1977 v.s. 1986), I would say that SQL is a step backwards.

Maybe you have seen, for some weird reasons we are still in the Dark Ages of computer programming. We are not supposed to be here since better designed systems already existed. It would be foolish to dismiss them as failures. They are just ahead of their times. By looking to the past, we see a way back to the future.

On pureness

It is usually a pleasure to use a functional programming language, but it is doubtful whether we should care about the “pureness” of the language. To write programs in a completely pure functional programming language (e.g. Haskell) is much like living in a wired world.

There are no electromagnetic waves nor sound waves in such a world, so you don’t have wifi, cell phones, satellite televisions, radios, etc. You don’t even have light or sound. In other words, everything is blind and deaf.

All information must pass through wires or pipes, connected by switch boxes which we call “monads”. You must carefully route all the wires and pipes and connect them correctly before any information processing device, including your eyes and ears, can properly work.

So I think a language will need parts of it not to be pure, in order to naturally represent the effects such as electromagnetic waves. Normally those effects are called “benign effects“.

Undecidability proof of the halting problem without diagonalization

As a TA for a graduate level Theory of Computation course, I don’t understand why we have to teach the diagonalization proof of the undecidability of the halting problem. It is much easier if we just use a normal argument about functions in a programming language. Basically, functions correspond to Turing machines in a theory of computation context. In order to improve the situation, I came up with a simpler formulation based on the view of “Turing machines as functions”.

Suppose that we have this function Accept(M,w) which takes two arguments, a function M and its input w, and can tell you whether or not M(w) will output True. Basically, suppose that Accept is the hypothetical solver for the halting problem. To connect to usual theory of computing terminology, the fuction M corresponds to a Turing machine, True corresponds to reaching an accept state, and False corresponds to reject state. Note that although the argument M may go into infinite loops, never will Accept.  Accept always returns True or False in a finite amount of time.  Accept determines this without actually running M because M may go into an infinite loop and never return. This way of finding properties of programs without running them is usually called static analysis. It has the same essence as fortune-telling and all other sciences.

Now we come to our simple contradiction: does the following expression (call it F) return True or False?

Accept(λm.not(Accept(m,m)), λm.not(Accept(m,m)))

It turns out that this question cannot be answered at all! If F returns True, then when we actually apply the function λm.not(Accept(m,m)) to the argument λm.not(Accept(m,m)), it should return True, right? But when we apply it, we get

not(Accept(λm.not(Accept(m,m)), λm.not(Accept(m,m))))

Notice that it is exactly the negation of the original expression F. This means that F should return False (because the “actual run” returns False). On the other hand, if F returns False, then when we actually apply the function on its argument, it will return True. Thus the contradiction. QED.

Some may argue that I’m implicitly using “diagonalization” here. Well, you can say that they are equivalent in effect, but the equivalence doesn’t mean that “diagonalization” is a good word for describing what I’m doing. Actually I’m just making a loop in a “circuit”. I don’t see the diagonal, then why call it that way? The diagonal argument and the “loop argument” may have the same essence, but they are very different in their comprehensibility. I feel that the loop way is a much easier to understand than diagonalization.

An interesting final question may be, what is the connection between the above expressions with the Y combinator?

λf.(λx.f(x x)) (λx.f(x x))

Understanding the yin-yang puzzle

I have a friend who is a very fast learner. I introduced him to the Scheme programming language a few days ago and he soon digged into this yin-yang puzzle:

(let* ((yin
        ((lambda (cc) (display #\@) cc) (call/cc (lambda (c) c))))
       (yang
        ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
  (yin yang))

This program prints out the infinite string @@@@****@*****@****** …

Why does this happen? This took me almost a full afternoon to figure out. It is always good to have friends who feed you with questions and challenges. I’m going to document my finding here. I may want to make some animated slides (as is always a good idea) later, but for now a short note may suffice.

The Plan

You will probably never understand the program if you examine the contents of the stack. You need their high level semantic meaning to think clearly. The higher level meaning of the stack is the continuation. So instead of frightening you with the details of the stack, I’m going to use functions to show the continuations explicitly.

Also I will use a little bit of compiler optimizations along the way. The optimizations will simplify the program and show its deeper meanings.

Here is an overall plan of this article:

  1. Find the functional representations of the two continuations. Name them c1 and c1.
  2. Plug in c1 and c2 into the places of the two call/cc’s.
  3. Simplify the program using compiler optimizations.
  4. Perform behavioral analysis on the simplified program.

Transform and Remove Call/cc

So here we go. Let’s repeat the original program here:

(let* ((yin
        ((lambda (cc) (display #\@) cc) (call/cc (lambda (c) c))))   ; B1
       (yang
        ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))  ; B2
  (yin yang))

Note that there are two binding clauses, B1 and B2, each prints a char and binds a continuation to a variable.

B1 binds yin to the “current continuation” of the first call/cc. Let’s call this continuation c1. Its function representation is:

(lambda (k)
  (let* ((yin
          ((lambda (cc) (display #\@) cc) k))           
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))          
    (yin yang)))

We can simplify it using some compiler optimization tricks. I will go very slowly here just in case you are not familiar with compiler optimizations.

First, since Scheme is a strict language, k will be a pure value (with no side-effects), so we can lift the side-effect (display #\@) out:

(lambda (k)
  (display #\@)
  (let* ((yin
          ((lambda (cc) cc) k))           
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))          
    (yin yang)))

Now we can “partial evaluate” ((lambda (cc) cc) k) to k:

(lambda (k)
  (display #\@)
  (let* ((yin k)           
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))          
    (yin yang)))

Then we can copy propagate the value of yin, k, to the body of let*, (yin yang). And now we have only one binding left:

(lambda (k)
  (display #\@)
  (let ((yang
         ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
                                         ^^^^^^^^^^^^^^^^^^^^^^^^
    (k yang)))

Now we try to figure out the underlined continuation. It should be:

(lambda (j)
  (let ((yang
         ((lambda (cc) (display #\*) cc) j)))
    (k yang)))

You see why? Since the “current continuation” means “what is going to happen with this value”, it doesn’t include the computation before it, namely (display #\@). Now we do a similar optimization on this continuation: lifting out the side-effect (display #*), do some partial evaluation and copy propagation:

The result of this inner continuation is:

(lambda (j)
  (display #\*)
  (k j))

Now we can plug it back:

(lambda (k)
  (display #\@)
  (let ((yang
         ((lambda (cc) (display #\*) cc) 
          (lambda (j)
            (display #\*)
            (k j)))))
    (k yang)))

Do a similar sequence of optimizations: lifting out the first (display #*), partial evaluation, copy propagation. And we have the final result for the value of yin. Let’s name it c1 with a definition:

(define c1
 (lambda (k)
   (display #\@)
   (display #\*)
   (k (lambda (j)
        (display #\*)
        (k j)))))

Now, the original program would look like the following. The program would have by now printed a @ before binding yin to c1, so we make a sequence and include the display term in the front of it.

(begin
  (display #\@)
  (let* ((yin c1)
         (yang
          ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
    (yin yang)))

Try it. It will behave the same as the original program. Now we do another copy propagation to get rid of the first binding:

(begin
  (display #\@)
  (let ((yang
         ((lambda (cc) (display #\*) cc) (call/cc (lambda (c) c)))))
    (c1 yang)))

Since the call/cc here will not cause any side-effects (why?), we can lift (display #*) out:

(begin
  (display #\@)
  (display #\*)
  (let ((yang
         ((lambda (cc) cc) (call/cc (lambda (c) c)))))
                           ^^^^^^^^^^^^^^^^^^^^^^^^
    (c1 yang)))

Now we just have to find what this underlined continuation is, and it will become the value of yang. It is:

(lambda (k)
  (let ((yang
         ((lambda (cc) cc) k)))
    (c1 yang)))

After our routine sequence of optimizations we have the value of yang. Let’s define it as c2:

(define c2
  (lambda (k)
    (display #\*)
    (c1 k)))

Plug c1 and c2 back into the original program:

(begin
  (display #\@)
  (display #\*)
  (let* ((yin c1)
         (yang c2))
    (yin yang)))

And do a copy propagation:

(begin
  (display #\@)
  (display #\*)
  (c1 c2))

Transformed Program

Now we have our final transformed and optimized program. Let’s put the definitions of c1 and c2 here for an overview:

(define c1
 (lambda (k)
   (display #\@)
   (display #\*)
   (k (lambda (j)
        (display #\*)
        (k j)))))

(define c2
  (lambda (k)
    (display #\*)
    (c1 k)))

(begin
  (display #\@)
  (display #\*)
  (c1 c2))

For simplicity, let’s inline c2 into the main program. Now c2 disappears.

(define c1
 (lambda (k)
   (display #\@)
   (display #\*)
   (k (lambda (j)
        (display #\*)
        (k j)))))

(begin
  (display #\@)
  (display #\*)
  (c1 (lambda (k)
        (display #\*)
        (c1 k))))

Try it. It will still work the same as the original program.

Behavioral Analysis

Now the program doesn’t contain any call/cc’s and is much easier to understand. Let’s try to figure out how it executes:

  1. First, @* will be printed, and we will invoke:
(c1 (lambda (k)
      (display #\*)
      (c1 k)))
  1. Inside the invocation, @* will be printed by the body of c1, and k is now bound to (lambda (k) (display #*) (c1 k)). So the program proceed to:
((lambda (k)
   (display #\*)
   (c1 k))
 (lambda (j)
   (display #\*)
   ((lambda (k)
      (display #\*)
      (c1 k)) 
    j)))

It will print a *, and becomes:

(c1
 (lambda (j)
   (display #\*)
   ((lambda (k)
      (display #\*)
      (c1 k)) 
    j)))

Now we have seen @*@**

  1. Notice that we are back to a call to c1! This is a good sign of recursion. But this time the argument is different. If we simplify it a little, we get:
(c1
 (lambda (j)
   (display #\*)
   (display #\*)
   (c1 j)))
  1. I think you see what’s going on here. This time, c1 is called with
(lambda (j)
   (display #\*)
   (display #\*)
   (c1 j))

which means “When called, display TWO *’s, and then behave like c1 on the argument”. If we go on, the argument to c1 will be longer and longer, and each time with an additional (display #*). It will print @***, and then @****, and then @*****, and so on.

  1. Let’s introspect a bit. Which part of the program is responsible for creating the ever-longer displays, and why? It is this piece from the definition of c1:
(k (lambda (j)
     (display #\*)
     (k j)))

Here k is c1′s argument. Notice that k is always of the form:

(lambda (j)
   (display #\*)
   (display #\*)
   ...
   (c1 j))

When k is applied, it will print the corresonding number of *’s inside it, and then behave like c1. The argument to k is:

(lambda (j)
  (display #\*)
  (k j))

What does this mean? It says: “When called, print a * and then behave like k on the argument.” This is how you get a “new k”, with one more display of *.

This will go on indifinitely. This is why you see the infinite string: @@@@****@*****@****** …

On software design patterns

In this post I try to answer the contraversial question

Are software design patterns good or bad?

This used to be a long post, but I have since condensed it into a very short one, thanks to the pictures. One picture is worth ten thousand words. Hopefully you will have understood most of Richard Gabriel’s 239-page book after looking at the following two pictures.

  1. Here are some nice design patterns. You can use them to decorate the floors in your house.

  2. Here is a masterpiece. Can you construct it using the above patterns, or any patterns at all?

I think the answer is already obvious. One cannot produce a masterpiece by just putting patterns together. Artists know that techniques are never as important as vision. That is why they spend their lives studying the nature of things such as the human body. Once they get hold of the essence, they use whatever techniques at hand to express what they want. They can also change the techniques or invent new ones as needed.

The same holds for computer programming. We need to keep in mind that design patterns are means and not ends. True programmers are never constrained by existing patterns. They observe and grasp the essence of things, inventing new techniques and patterns as needed.

How to reinvent the Y combinator

I always believed that one can never learn the essence of anything without reinventing it. Once you reinvent things, you will never forget them — because otherwise you can just re-reinvent them.

Today I found that I forgot how to derive the defintion of the Y combinator. I learned it several years ago from an online article, but now the search “Y combinator” only brings me news about start-ups (sigh). I tried for two hours, but still couldn’t make the leap from the “poor man’s Y” to “Y”. Finally, I opened my good old friend The Little Schemer. Alas. Chapter 9 tells me exactly how to reinvent Y. Thank you Dan Friedman and Matthias Felleisen!

To prevent myself from forgetting how to derive Y again, I made a slide to record my understanding of it. I hope it can be of help to people (including the future me). So here it is.

Exercise: The Y combinator derived from this tutorial only works for direct recursion, try to derive the Y combinator for mutual recursive functions, for example the following two functions even and odd.

(define even
  (lambda (x)
    (cond
     [(zero? x) #t]
     [(= 1 x) #f]
     [else (odd (sub1 x))])))

(define odd
  (lambda (x)
    (cond
     [(zero? x) #f]
     [(= 1 x) #t]
     [else (even (sub1 x))])))

A reformulation of reducibility

I found the theory of computation books very imprecise about their descriptions of Turing machines and reductions. They usually start with something precise and mathematical, for example they would define a Turing machine as a 7-tuple, but after that, everything about decidability and reduction is described with impenetrable paragraphs in natural languages.

I believe that the use of natural languages leads to most of the confusion in theory of computation because natural languages are inherently imprecise and ambiguous. There are too many ways to say the same thing. For example, you can find these sentences which mean exactly the same in the books:

  • “Run M on w”
  • “Run M on input w”
  • “Simulate M on w”

The pronouns and references are even more confusing. For example:

“Use the description of M and w to construct the TM M1 just described.”

What is “just described”, and do M and w here mean the same things as in the earlier paragraph?

Another serious problem is that natural languages are very weak at representing the notion of “interpretation”, which underlies much of theory of computation. For example, a Turing machine simulates another Turing machine, which again contains and simulates yet another one.

After some thought, I found that we could use something precise such as mathematical notations combined with programming languages to describe the concepts. As an example, I’ll show here how the notion of reduction can be defined precisely as a homomorphism which can be instantiated for reducing one problem to another.

Definition 1 (reduction): A reduction (as in theory of computation) is a homomorphism (as in universal algebra):

Reduce(TM, I) = (TM', I')

satisfying the property

TM @ I = TM' @ I'

where

  • TM = Turing machine which we reduce from
  • TM' = Turing machine which we reduce to
  • I = input string of TM
  • I' = input string of TM’
  • @ = simulation, similar to the Scheme code ((eval TM) I)
  • TM @ I = result from simulating TM on I (accept or reject)
  • TM' @ I' = result from simulating TM’ on I’ (accept or reject)

End of Definition 1.

Notice that the simulation (TM @ I) may go into an infinite loop and never produce any result. Now I show how to use this homomorphism to describe the reduction from ATM ==> HALT, where

  • ATM = the “acceptance problem” (deciding whether a Turing machine M accepts string w)
  • HALT = the “halting problem” (deciding whether a Turing machine M halts on string w)

For convenience, we let

  • DATM = “the decider of ATM”
  • DHALT = “the decider of HALT”

Now the reduction can be fully described by the following homomorphism:

Reduce(DATM, (M,w)) = (DHALT, (M',w))
where
  M' = <if (M @ w) then accept else loop>
satisfying
  DATM @ (M,w) = DHALT @ (M',w)

Yes, that’s an all-inclusive formal proof that HALT is undecidable. It even includes the notion of “reduction” itself.

Let me explain it a bit. The first line says that there exists a function (named Reduce) from the pair (DATM, (M,w)) to another pair (DHALT, (M',w)), where M' = <if (M @ w) then accept else loop> is a description of a Turing machine.

Now let’s look at the last line:

DATM @ (M,w) = DHALT @ (M',w)

It says: if we have a decider for HALT (DHALT), then we can use it to define DATM, thus deciding ATM.

Why this is a valid defintion for DATM? This is because from the definition of M'

<if (M @ w) then accept else loop>

we know that:

  • If (M @ w) accepts, M' accepts, thus DHALT @ (M',w) accepts
  • If (M @ w) rejects, M' loops, thus DHALT @ (M',w) rejects
  • If (M @ w) loops, M' loops, thus DHALT @ (M',w) rejects

Notice from the colored words that DHALT @ (M',w) will accept if and only if M accepts w. Thus this defines a decider for ATM.

So if DHALT exists, then we can have DATM. But this contradicts the fact that DATM cannot exist, so DHALT must not exist.

If you wonder how this proof corresponds to Definition 1, here is some details how it is instantiated:

  • TM = DATM (nonexistent)
  • TM' = DHALT (hypothetical)
  • I = (M,w) where M is the description of a Turing machine which we want to know whether it accepts input w.
  • I' = (M',w) where M' is <if (M @ w) then accept else loop>
  • TM @ I = DATM @ (M,w) (running decider of DATM on input (M,w))
  • TM @ I' = DHALT @ (M',w) (running decider of DHALT on (M',w))

This is a lot more concise, accurate and easier to understand than a paragraph:

F = "On input <M,w>:
  1. Construct the following machine M'
     M' = "On input x:
        1. Run M on x.
        2. If M accepts, accept.
        3. If M rejects, enter a loop."
  2. Output <M',w>."

A bug in GHC’s type system

Several days ago, I implemented an experimental type inference system with first-class polymorphism. When comparing it with other systems, I found a possible bug in GHC’s type system regarding universal quantification. The phenomemon was confirmed and reproduced by people at #haskell IRC channel for GHC versions above 7.01. The code that causes trouble is:

gen :: [forall a. a -> a]
gen = [id]
test1 = head gen 1

Obviously this program should typecheck, since:

  • id has the type forall a. a -> a.
  • A list gen containing id should have type [forall a. a -> a](as in the annotation).
  • head has the type forall a. [a] -> a.
  • head gen should have the type forall a. a -> a.
  • head gen should be able to be applied to 1.

But GHC rejected this program for a strange reason.

Couldn't match expected type `t1 -> t0'
with actual type `forall a. a -> a'
Expected type: [t1 -> t0]
Actual type: [forall a. a -> a]
In the first argument of `head', namely `gen'
In the expression: head gen 1

On the other hand, it works if (head gen) is bound at let:

test2 = let hg = head gen in hg 1

It doesn’t break the soundness of the type system since it only rejects some correct programs, but this kind of pecularities of type systems can be painful when they add up. I guess this may be caused by the interaction between GHC’s internal type system with the legacy let-polymorphism.

On point-free programming

Concatenative programming, or point-free style, is useful sometimes, but has some serious drawbacks similar to the SKI combinators. Applicative programs can be compiled into point-free style easily, but writing and reading them directly in large scale is usually a mental burden.

It only works well with functions with one or two arguments. Concatenation of functions with more than two arguments will not be so convenient. If the receiver has an argument order that is different from the sender’s output order, you will need a non-trivial permutation of argument order.

For example, if the function is defined as:

f :: Int -> String -> Bool
f x y = ...

If you want to use it as the predicate for filtering a list of strings, that’s fine. You just write something like filter (f 2) .... But what if you want to filter a list of integers? Then you will need to swap the order of the first two arguments before you can do the partial application. So you write filter (flip f 2) .... Fine. But what if the function looks like:

g :: Int -> A -> String -> Bool
g x y z = ...

And you want to filter a list of A’s? Which function do you use to switch the argument order, and you expect the reader of the program to learn it?

What about functions with four arguments. Notice that there are 4! = 24 different argument orders. How many order switchers do we need?

In order to prevent this kind of plumbing, we have to take unnecessary care when we decide the order of the parameters. This often makes the function look ugly.

Names are more convenient. Notice that even mathematics uses names for permutations (as in algebra):

(a b c d)
(b a d c)

Concatenative programming is like connecting the components of a circuit. But even electronic engineers don’t do it this way. They use net-lists with names and labels.

Names are essential and useful in most cases. Concatenative programming, although nice when used sparsingly, may not be good to serve as a major way of composition.

At this point, I found this sentence from Tao Te Ching (Chapter 1) especially relevant:

“The nameless is the origin of Heaven and Earth
The named is the mother of myriad things”

On literate programming

A friend and I had a discussion about the current trends of programming methodologies. At some point, the topic turned to Literate Programming (LP). I personally have written a non-trivial literate program (in CWEB) a few years back, but like many other people, I did not persevere. As a hindsight, there are some good reasons why people do not adopt LP.

Missing the big picture

Although we love books, they may not be the best medium for documenting programs. Programs are much like circuits or cars. They are very complex systems with interconnected parts. The interactions and constraints among the parts can’t really be understood when they are taken apart.

LP systems usually segments the program into small pieces and rearrange them into a book-like order. This is like taking a car apart when teaching car-making. There is a huge loss of structural information in this rearrangement. It may be useful to take some parts out for explanations and experiments, but they should be placed back immediately after examination. The student must have a clear view of the overall structure of the car.

If we take all the parts out and lay them along the road, the student would have trouble figuring out how they would fit and work together. In short, the student see trees and not the forest when reading a book generated by Literate Programs.

Programs are not text

LP systems usually allow programmers to fragment a program into small pieces, reorder the pieces into an order that is “convenient to the human mind”. Later on, a program (weave) generates printable documentation for the program; another program (tangle) assembles the fragments into machine-executable code.

However, these systems have a big problem — treating programs as text. LP systems usually don’t parse the programs. They just segment the code textually. This often messes up the variable scopes and causes subtle bugs. This is much like macros in early Lisps and C, extremely dangerous and error-prone.

There are other kinds of LP systems such as Literate Haskell, CoffeeScript etc. They don’t break up the procedures and don’t have the above “hygiene problem”. But because we already have the freedom to rearrange the order of procedures in any programming language, those systems are nothing more than a fancy comment facility like JavaDoc. In views of authentic literate programmers, those systems miss the essence of LP.

Human supremacy, human languages and human cognition

LP systems rely their arguments on a false belief in human supremacy, an overemphasis in human languages, and a misunderstanding of human cognition. Practice has shown that the human brain is incapable of competing with computers in anything that requires rigorous reasoning and perseverance. Natural languages are woefully inaccrurate and confusing for programming. Any programming language that tries to mimic natural languages is doomed to suffer the same “design flaws” of natural languages (for example SQL).

It is also doubtful whether the order set by literate programming is suitable for human cognition at all. Programs should be constructed in an order that matches the nature of the concept it models, and not in the order of a particular person’s way of thinking, nor in the arbitrary order set by a compiler. Once the program structure matches nature, it will immediately appeal to human understanding.

So instead of investing in LP, a probably better direction is to investigate better ways of representing programs that enable them to match the concepts more naturally. Functional Programming and Logic Programming are gradually moving toward this direction.

Painful code navigation

Once the code is weaved into a book, it will be very hard to navigate through. Several years ago, I spent some time reading Knuth’s MMIXware, a book generated from a CWEB program. I noticed how much time I spent on finding definitions and cross-references of variables. Although Knuth painstakingly constructed an index for the book, obviously I still had to turn hundreds of pages back and forth, while in IDEs I can jump to the definition of a variable with just one keystroke. One should wonder why we bother publishing code as books at all.

From these observations, it is really unclear how literature and books can serve as a good medium for programs. The world has changed and technology advanced, instead of pursuing something static and classic, we may just need to open our minds to new kinds of media.

Indentation-based syntax considered harmful

Although the idea of layout syntax—using whitespace characters to delimit blocks—has been promoted by several languages (notably Python and Haskell), I feel that this kind of syntax brings more trouble than benefits.

It takes just one keystroke to produce a serious bug

In most languages a program’s format is determined by its meaning, but in a language with layout syntax, its meaning is determined by the format. This makes programs in layout syntax fragile because a tiny change in format could result in a serious error. For example, consider these two Python programs:

# correct definition
def member(x, ls):
    for y in ls:
        if x == y:
            return True
    return False         # correct indentation
# incorrect definition
def member(x, ls):
    for y in ls:
        if x == y:
            return True
        return False     # incorrect indentation

The second definition has been “derived” from the first by an inadvertent TAB key, which indented the return statement one more level to the right. While the two definitions differ only in one indentation, they produce totally different results. The first definition is correct, while the second has serious bugs:

  1. If ls is non-empty, it will always return False whether x is an element of ls or not.
  2. If ls is empty, it will return None (instead of False) because there is no return statement after the for-loop, a “missing return statement” bug.

Although this is just a minimal example, the bug may take quite some time to show up and be fixed. In order to prevent this kind of bugs from happening, I often find myself moving the cursor up-and-down in a straight line to check the alignment of the statements.

Let’s see why this bug cannot happen in a language which does not use layout syntax. We now invent an alternative syntax for Python, so that the pervious program looks like:

def member(x, ls) {
    for y in ls {
        if x == y {
            return True
        }
    }
    return False
}

Given the correct definition, can you imagine how you could reproduce the bug with just one keystroke? It is almost impossible. To see why:

  1. The return statement can never get into the loop with a change in indentation.
  2. It takes at least two edits and one movement in the editor to move the return statement into the loop. There are two alternatives to choose from:
    • Cut return False. Move the cursor into the closing curly brace of the for-loop. Paste.
    • Delete the closing curly brace of the for-loop. Move the cursor beyond return False. Insert an closing curly brace.

Either way, you must be deliberate and precise in order to reproduce the bug. Otherwise the parser would have complained (for example, if you just delete a closing curly brace).

However, the situation is very different with layout syntax, where one TAB key press produces the same amount of change as the above three operations. The change happens quickly and the program remains grammatically correct, obscuring the presence of a bug.

The situation is a little better for Haskell, because incorrect indentations often cause type errors and the programmer will be alerted, but in either languages, it takes quite some time to fix this kind of bugs.

Unconvincing advantages

It is often claimed that layout syntax has the following advantages over curly braces:

  • Your programs become a lot shorter because less curly braces are used.
  • You write less clutter such as curly braces and semicolons, and that beautifies your code.

I found that neither of the two advantages convincing. For the first part, Python and Haskell programs are indeed several times shorter than equivalent Java or C programs, but this cannot really be creditted to layout syntax.

We need to have some blank lines even in a Python or Haskell program, between definitions and sometimes in the middle of a block. The blank lines naturally denote groups of statements. So if we count the number of additional lines introduced by curly braces, we will find that there aren’t many. Curly braces also naturally denote statement groups, so not only they don’t look bad, they are helpful.

In Python and Haskell, it is the semantic features (pattern matching, first-class functions etc.) that make the programs short, not layout syntax. If we had an alternative syntax of Java which uses layout, then Java programs would still be several times longer than equivalent Scala programs. Java programs are longer not because they use curly braces, but because they don’t have things such as first-class functions, so they have to use some tedious design patterns.

Second, layout syntax does not really save “clutter”. Even in a language with layout syntax, we may still need to write almost the same amount of (if not more) clutter. The following is a random piece of code taken from the current release of GHC. We can still see lots of curly braces and semicolons in it. I guess layout syntax actually caused trouble, so the authors of GHC decided that they will just write curly braces.

tcInstanceMethodBody skol_info tyvars dfun_ev_vars
                     meth_id local_meth_id
             meth_sig_fn specs
                     (L loc bind)
  = do  {       -- Typecheck the binding, first extending the envt
        -- so that when tcInstSig looks up the local_meth_id to find
        -- its signature, we'll find it in the environment
          let lm_bind = L loc (bind { fun_id = L loc (idName local_meth_id) })
                             -- Substitute the local_meth_name for the binder
                 -- NB: the binding is always a FunBind

    ; (ev_binds, (tc_bind, _))
               <- checkConstraints skol_info tyvars dfun_ev_vars $
          tcExtendIdEnv [local_meth_id] $
              tcPolyBinds TopLevel meth_sig_fn no_prag_fn
                 NonRecursive NonRecursive
                 [lm_bind]

        ; let full_bind = AbsBinds { abs_tvs = tyvars, abs_ev_vars = dfun_ev_vars
                                   , abs_exports = [(tyvars, meth_id, local_meth_id, specs)]
                                   , abs_ev_binds = ev_binds
                                   , abs_binds = tc_bind }

        ; return (L loc full_bind) }
  where
    no_prag_fn  _ = []      -- No pragmas for local_meth_id;
                        -- they are all for meth_id

Better ways to save clutter

Even if we do hate curly braces, there are better ways to reduce or even completely eliminate them. For a trivial “solution”, we could just use a dim color for curly braces and semicolons in the editor, so that they are less noticeable. For example the above Python program in that curly bracy syntax could look like this in your editor:

def member(x, ls) {
    for y in ls {
        if x == y {
            return True
        }
    }
    return False
}

Better still, we could use a structural editor that lets us manipulate the AST (abstract syntax tree) directly. Those editors could provide several options of denoting blocks. You can switch between colored blocks, curly braces, or nothing at all. You can switch the look of your code at any time, instantly. People have implemented such editors, for example this editor designed by Kirill Osenkov.


Re-indentation hassle

In a language that doesn’t use layout syntax (Java, C, Scheme, …), no re-indentation is really needed when the code changes. The programmer can move a block of code by a simple copy-and-paste and continue solving the real problem. Re-indentation can always be done later and can be done automatically.

But in a language that uses layout syntax, re-indentation is mandatory, and worse, it can only be done manually. Layout syntax completely disables any editor’s auto-indent function. One may think that we might be able to invent a smarter editor that can auto-indent code for those languages. This is simply impossible. This is evident in the analysis of the above example. The two programs differ only in indentation, but they have completely different meanings. Both are grammatically correct programs and the editor has no way to tell which is the one you want unless it can read your mind.

Some people say that because those languages have advanced semantics, programs are so short that we don’t need to re-indent code very often. But experiences prove to me that the need for changing and rewriting code can never be eliminated. Writing code is like writing a book, you can always find pieces that need change or even complete rewrite. Usually changes in the following category will cause re-indentation:

  • Scope changes. There are lots of examples in this category: lifting an internal function to top level or push a top-level function into an internal function, surrounding a block with let-bindings, loops, conditional statements, try-except or lifting a block out of them, factoring out duplicated patterns, lifting “invariant code” out of recursion, … to name a few. These will necessarily change the indentation of a block of code, and each line needs to be re-indented.
  • Align code. For example in Haskell, when we align the arrows (->) of a case expression or the equal signs (=) of a function definition, we will notice that we have to re-indent most of the right-hand-sides, because they often contain multi-line expressions but the editors (for example, Emacs’ align-regexp function) only move the lines that contains the arrows or equal signs.
  • Renaming. We seldom choose the best names on the first shot, and good names make programs self-explanatory, so renaming is a very important and commonplace action. But in the following simple Haskell program, if we change the name from “helloworld” to “hello” and don’t re-indent the rest of the lines, we will get a parse error.
    helloworld z = let x = 1
                       y = 2 in
                     x+y+z

    Because the code becomes the following after the renaming, and the second line will no longer be aligned to “x = …”, and that confuses the parser.

    hello z = let x = 1
                       y = 2 in
                     x+y+z

    A similar thing happens when we lengthen the name to something like “helloworldcup”. Try it yourself. From this example, I hope you see how simple things are made frustratingly complicated by layout syntax. If you haven’t been convinced, try adding more lines to the above let expression.

The interruption from re-indenting code is usually not just one or two seconds, but often tens of seconds, even minutes. The programmer has to put down real problems at hand and turn to the mind-dead re-indenting job. This disrupts the “flow”, which is essential for producing creative and elegant code.

Syntax considered harmful

I believe that syntax, although an important aspect of natural languages, should not play a significant role in programming languages. It has already brought us too much trouble and frustration and wasted too much of our energy. You can read my other blog post detailing the harm of syntax in general and why we should remove the concept of “syntax” as a whole from programming languages.

Syntax has prevented lots of new design possibilities in programming languages. You may have heard language designers say: “Hey this is a nice feature, but the syntax of my language hasn’t any room left for it.” Layout syntax pushes to this direction even more. It forces us to consciously and constantly think about syntax, drawing our attention away from semantics design. It poses certain constraints on how code must be formatted, and makes a language even less extensible. Thus I think layout syntax is the most troublesome type of syntax.

PySonar: a type inferencer and indexer for Python

pysonar2

PySonar is a type inferencer and indexer for Python. Developed during two summer internships (2009, 2010) at Google, it went through several different designs, until it incorporates a powerful type system and a control-flow-aware interprocedural analysis.

Compared to style-checking tools or IDEs, PySonar analyzes programs in deeper ways and produces more accurate results. One superficial phenomenon of this difference is that it resolves more names than IDEs (the current resolution rate is about 97% for Python’s standard library).

Demos

To get a quick feel about what PySonar can do, here is a sample analysis result for a small fraction of Python’s standard library.

What’s in there

  1. A powerful type system. In addition to the usual types you can find in programming languages, PySonar’s type system has union types and intersection types – two of the most powerful elements I have found during my research. They are rarely found in programming languages. I know of only two languages with statically checked union types: Typed Racket and Ceylon. Different from these languages, PySonar can work without any type annotations. It infers all the types by doing interprocedural analysis. Thus the type system of PySonar is actually more powerful than these languages, but because of the lack of the help from type annotations, it has certain limitations. Unlike the types in programming languages, the inferred types can’t be used to guarantee the complete absence of type errors. Also the types only capture how the functions are currently used, they may change as you make more calls to them, so the types’ use as specifications is limited. Despite of these limitations, the inferred types are still very informative. They help to produce more accurate code indexes.
  2. Control-flow aware interprocedural analysis. Because Python has very dynamic and polymorphic semantics and doesn’t contain type annotations, an intra-procedural type inference systems such as the Hindley-Milner system will not work. I actually had implemented a HM-like system in the first version of PySonar, but it didn’t work very well. As a consequence, all types are inferred by an interprocedural analysis which follows the control-flow and some other aspects of the semantics.
  3. Handling of Python’s dynamism. Static analysis for Python is hard because it has many dynamic features. They help make programs concise and flexible, but they also make automated reasoning about Python programs hard. Some of these features can be reasonably handled but some others not. For example, function or class redefinition can be handled by inferring the effective scope of the old and new definitions. PySonar also treats types as first-class entities and keep track of them (as shown in the following picture). For code that are really undecidable, PySonar attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement), PySonar gives it a union type which contains all possible types it can possibly have.
  4. Medium path-sensitivity. The analysis performed by PySonar is path-sensitive to some degree, which enhances the accuracy of type inference. We can see this in an example:
    def g(x):
      if (x > 10):
        y = 1
      else:
        y = A()
        ...
        y = "hi"
      return y

    In this function g, because we can’t statically know the input value of x, we must assume that it’s possible to take either branch. In the true branch, y is assigned an int. In the false branch, y is assigned an instance of A, but is later overwritten by a string. At the end of the false branch, y can only have the type string. The output type should be the union of the two branches, so it gives function g the type int -> {int, string}. If the analysis were path-insensitive, g would be typed int -> {int, string, A}. Notice the superfluous A in the output type.

    PySonar’s path sensitivity is limited. Completely path-sensitive analysis will not merge the types even after the two branches merge, thus not creating the union types. Fully path-sensitive analysis is more expensive and harder to compute although it is more accurate.

  5. High accuracy semantic indexing
    PySonar originated as part of Google’s Grok project, an Eclipse-like code browser used internally at Google. Because of that, generating semantic indexes was the main purpose of PySonar although it may be put to other uses. PySonar parses the code into an AST and performs type inference on it, and at the same time it builds indexes that respects scopes and types. Because it performs interprocedural analysis, it is often able to find the definitions of attributes inside function parameters. This works across functions, classes and modules. The following image shows that it can accurately locate the field x.z which refers to the “z” fields in classes B1 and C1, but not A1.

Rough Edges

Because of Python’s complexity, there may be some subtle language constructs implemented incorrectly and produce weird results. If you see those, or see names that you hope to be resolved but not, please file a bug on my GitHub repository.

Availability

The code is opensource from my GitHub repository.

Users

I normally don’t keep track of users of my code unless they contact me, but it will keep me motivated improving it if you send me a thank-you letter or buy me a coffee. I may provide you additional help if you have trouble with the code.

Here are the only users I know of:

  • Google. Google uses PySonar 1.0 to index millions of lines of Python code, serving internal code search and analysis services such as Grok and Code Search
  • SourceGraph. SourceGraph is a semantic code search engine. They use PySonar to index hundreds of thousands of opensource Python repositories. They started using PySonar 1.0 as the Python analysis for their site. I recently joined them and finished integrating PySonar 2.0