RubySonar: a static analyzer for Ruby

(This is a reposting of a blog post I wrote on Sourcegraph’s website)

Since the last enhancement to our Python analysis, we have been working on improving the Ruby part of Sourcegraph by building a similar static analysis tool for Ruby. The result is a new open-source project, RubySonar. With its help, we are now happy to see significantly improved results for Ruby repos: in general we find over ten times more symbols, provide more usage examples, and offer more accurate jump-to-definition. It also provides local variable highlighting, which was not available in our previous Ruby results. With the improved speed of analysis, we can now process the full Ruby standard library, Ruby on Rails, and Ruby apps such as Homebrew.

RubySonar’s analysis is interprocedural and is sensitive to both data flow and control flow, which makes it highly accurate. Although we currently don’t show the types, RubySonar internally uses the same type inference technique as PySonar, and thus can resolve some of the difficult cases that can challenge a good Ruby IDE.

Although Ruby and Python seem to be very similar languages, a lot of work was needed to adapt PySonar’s code to Ruby. More work remains before RubySonar can achieve the same coverage as PySonar2, but so far the Ruby results are already a lot more usable. We have been careful to keep things simple and easy to extend, so we are optimistic about further improvements to the Ruby analysis.

On pureness

It is usually a pleasure to use a functional programming language, but it is doubtful whether we should care about the “pureness” of the language. To write programs in a completely pure functional programming language (e.g. Haskell) is much like living in a wired world.

There are no electromagnetic waves nor sound waves in such a world, so you don’t have wifi, cell phones, satellite televisions, radios, etc. You don’t even have light or sound. In other words, everything is blind and deaf.

All information must pass through wires or pipes, connected by switch boxes which we call “monads”. You must carefully route all the wires and pipes and connect them correctly before any information processing device, including your eyes and ears, can properly work.

So I think a language needs parts of it that are not pure, in order to naturally represent effects such as electromagnetic waves. Such effects are normally called “benign effects”.

Undecidability proof of the halting problem without diagonalization

As a TA for a graduate-level Theory of Computation course, I don’t understand why we have to teach the diagonalization proof of the undecidability of the halting problem. It is much easier if we just use an ordinary argument about functions in a programming language, since functions correspond to Turing machines in a theory of computation context. In order to improve the situation, I came up with a simpler formulation based on the view of “Turing machines as functions”.

Suppose that we have a function Accept(M,w) which takes two arguments, a function M and its input w, and can tell you whether or not M(w) will output True. Basically, suppose that Accept is the hypothetical solver for the halting problem. To connect to the usual theory of computation terminology, the function M corresponds to a Turing machine, True corresponds to reaching an accept state, and False corresponds to reaching a reject state. Note that although the argument M may go into an infinite loop, Accept never will: Accept always returns True or False in a finite amount of time, and it must determine this without actually running M, because M may go into an infinite loop and never return. This way of finding properties of programs without running them is usually called static analysis. It has the same essence as fortune-telling and all other sciences.

Now we come to our simple contradiction: does the following expression (call it F) return True or False?

Accept(λm.not(Accept(m,m)), λm.not(Accept(m,m)))

It turns out that this question cannot be answered at all! If F returns True, then when we actually apply the function λm.not(Accept(m,m)) to the argument λm.not(Accept(m,m)), it should return True, right? But when we apply it, we get

not(Accept(λm.not(Accept(m,m)), λm.not(Accept(m,m))))

Notice that it is exactly the negation of the original expression F. This means that F should return False (because the “actual run” returns False). On the other hand, if F returns False, then when we actually apply the function to its argument, it will return True. Thus the contradiction. QED.
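
For readers who prefer code to λ-notation, here is the same argument as a minimal Python sketch. The function accept is hypothetical (the whole point is that it cannot be implemented); the rest runs as written:

# A sketch of the argument.  `accept` is the hypothetical halting-problem
# solver: it must always halt, returning True iff M(w) would return True.
def accept(M, w):
    raise NotImplementedError("assumed to exist, for contradiction")

# The paradoxical function D negates whatever `accept` predicts about
# running a function on itself.
D = lambda m: not accept(m, m)

# F = accept(D, D).  If accept returns True, then D(D) actually computes
# not(accept(D, D)) = False, and vice versa, so `accept` cannot answer
# correctly either way.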

Some may argue that I’m implicitly using “diagonalization” here. Well, you can say that the two arguments are equivalent in effect, but that doesn’t mean “diagonalization” is a good word for describing what I’m doing. Actually I’m just making a loop in a “circuit”. I don’t see a diagonal, so why call it that? The diagonal argument and the “loop argument” may have the same essence, but they differ greatly in comprehensibility. I find the loop way much easier to understand than diagonalization.

An interesting final question: what is the connection between the above expressions and the Y combinator?

λf.(λx.f(x x)) (λx.f(x x))

A mind map guide for TeXmacs

TeXmacs was my best-kept secret for homework assignments and papers with lots of mathematical formulas. It is a state-of-the-art typesetting system with unparalleled beauty. Many people think that it is just a “frontend” of TeX/LaTeX (like LyX), but they are very wrong, because TeXmacs is fundamentally superior to TeX. TeXmacs is fully independent software, with the same or even better typographical quality than TeX, and a much more intuitive and beautifully designed interface.

The editor environment is just amazing. It is 100% WYSIWYG. Yep, 100%. That means the screen shows exactly what you would print out on paper. I don’t know of any other word processor (including MS Word) that can do this today. If you haven’t used TeXmacs before, it will refresh your idea of what the word WYSIWYG really means.

I know exactly what you would argue now, TeXers, but listen—with all the beauty, you don’t lose detailed control of document structure. You will be surprised to find out that it is also WYTIWYG (What You Think Is What You Get), and it has exploited this notion to the extreme. You will sense a “human centered” design in it.

So, ready for a try? Just download it (free) and dive in. Over the years, I have accumulated lots of useful knowledge regarding its use and have collected them into a mind map. Click on the following image and you can see all my best tricks with this software. Have fun!

On software design patterns

In this post I try to answer the controversial question:

Are software design patterns good or bad?

This used to be a long post, but I have since condensed it into a very short one, thanks to the pictures. One picture is worth ten thousand words. Hopefully you will understand most of Richard Gabriel’s 239-page book after looking at the following two pictures.

  1. Here are some nice design patterns. You can use them to decorate the floors in your house.

  2. Here is a masterpiece. Can you construct it using the above patterns, or any patterns at all?

I think the answer is already obvious. One cannot produce a masterpiece by just putting patterns together. Artists know that techniques are never as important as vision. That is why they spend their lives studying the nature of things such as the human body. Once they grasp the essence, they use whatever techniques are at hand to express what they want. They can also change the techniques or invent new ones as needed.

The same holds for computer programming. We need to keep in mind that design patterns are means and not ends. True programmers are never constrained by existing patterns. They observe and grasp the essence of things, inventing new techniques and patterns as needed.

Concurrent stack does not exist

Some weeks ago I read the thought-provoking article by Nir Shavit, Data Structures in the Multicore Age. The topic is the efficiency of concurrent data structures. The reading provoked many thoughts, but the most interesting point to me is that the author started out trying to use a “concurrent stack”, but ended up happily using a pool.

“In the end, we gave up the stack’s LIFO ordering in the name of performance.”

Now there are several interesting questions to ask. Can we give up a property of a data structure if it is crucial to the correctness of the program? If we can give up the LIFO ordering, then why did we think we needed a stack? But the most interesting question is probably:

Does a “concurrent stack” really exist?

I think the answer is No. A “concurrent stack” has never really existed and never will. Here is why:

  1. If the threads are allowed to push and pop concurrently, then the “stack” can’t really maintain a LIFO order, because the order of the “ins” and “outs” is indeterminate and contingent on the relative execution speeds of the threads (see the sketch after this list).
  2. If the executions of two threads strictly interleave, so that each time a “pusher” pushes an element a “popper” immediately pops it out, then the order is FIFO, a queue.
  3. If a pusher pushes all the elements before a popper starts to pop all of them out, then this is indeed a LIFO order, a stack. But notice that there is no concurrency in this case — the executions of the threads must be completely sequential, one after another.
  4. If two threads interleave randomly, or there are more than two threads accessing the “stack” at the same time, then nothing can be said about the access order.
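
To make point 1 concrete, here is a minimal Python sketch (the names are mine, not from Shavit’s article). Two threads push into a locked “stack” concurrently; how their pushes interleave, and hence what order pops would observe, depends on the scheduler and may differ from run to run:

import threading

stack = []
lock = threading.Lock()

def pusher(items):
    for x in items:
        with lock:
            stack.append(x)  # "push"

t1 = threading.Thread(target=pusher, args=([1, 2, 3],))
t2 = threading.Thread(target=pusher, args=(["a", "b", "c"],))
t1.start(); t2.start()
t1.join(); t2.join()

# The interleaving of the two sequences in `stack` is indeterminate,
# so there is no well-defined LIFO order with respect to both pushers.
print(stack)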

From the above, we can see that there is a fundamental conflict between the two notions, “concurrency” and a “stack”.

If a “stack” can be accessed concurrently, then there is no way we can maintain a LIFO order. On the other hand, if we enforce a LIFO order, then the stack cannot be accessed concurrently.

If we realize that the essence of a stack is a continuation, and that a continuation (by itself) is sequential, then it is no surprise that we arrive at this conclusion.

Since a LIFO order is essential to a stack, we can’t call the data structure a “stack” if we can’t maintain a LIFO order.

We can’t call a data structure a “stack” just because it has the methods named “push” and “pop” — we have to look at what the methods actually do.

Even if we continue to think that we are using a stack, the threads are in effect just distributing messages, with the operations “push = send” and “pop = receive”. So in essence this data structure is a pool. This exactly justifies the author’s relaxation to a pool, although no actual relaxation happened: the data structure was a pool from the beginning. It was just disguised as a stack, and a less efficient one.

So we see that the concept of a “concurrent stack” does not really exist. We also see that some data structures we have in the sequential world may not have concurrent counterparts.

How to reinvent the Y combinator

I always believed that one can never learn the essence of anything without reinventing it. Once you reinvent things, you will never forget them — because otherwise you can just re-reinvent them.

Today I found that I had forgotten how to derive the definition of the Y combinator. I learned it several years ago from an online article, but now the search “Y combinator” only brings me news about startups (sigh). I tried for two hours, but still couldn’t make the leap from the “poor man’s Y” to “Y”. Finally, I opened my good old friend The Little Schemer. Alas. Chapter 9 tells me exactly how to reinvent Y. Thank you Dan Friedman and Matthias Felleisen!

To prevent myself from forgetting how to derive Y again, I made a slide to record my understanding of it. I hope it can be of help to people (including the future me). So here it is.
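
For reference, here is where the derivation ends up, written as a minimal sketch in Python rather than the book’s Scheme: the applicative-order Y, with the eta-expansion that keeps it from looping forever under eager evaluation.

# The applicative-order Y combinator.  The inner `lambda v: ...` delays
# x(x), which would otherwise recurse infinitely under eager evaluation.
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Usage: factorial defined without referring to its own name.
fact = Y(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
print(fact(5))  # => 120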

Exercise: The Y combinator derived in this tutorial only works for direct recursion. Try to derive a Y combinator for mutually recursive functions, for example the following two functions even and odd.

(define even
  (lambda (x)
    (cond
     [(zero? x) #t]
     [(= 1 x) #f]
     [else (odd (sub1 x))])))

(define odd
  (lambda (x)
    (cond
     [(zero? x) #f]
     [(= 1 x) #t]
     [else (even (sub1 x))])))

A reformulation of reducibility

I find the theory of computation books very imprecise in their descriptions of Turing machines and reductions. They usually start with something precise and mathematical, for example defining a Turing machine as a 7-tuple, but after that, everything about decidability and reduction is described in impenetrable paragraphs of natural language.

I believe that the use of natural language leads to most of the confusion in theory of computation, because natural languages are inherently imprecise and ambiguous. There are too many ways to say the same thing. For example, you can find these sentences, which mean exactly the same thing, in the books:

  • “Run M on w”
  • “Run M on input w”
  • “Simulate M on w”

The pronouns and references are even more confusing. For example:

“Use the description of M and w to construct the TM M1 just described.”

What is “just described”, and do M and w here mean the same things as in the earlier paragraph?

Another serious problem is that natural languages are very weak at representing the notion of “interpretation”, which underlies much of theory of computation. For example, a Turing machine simulates another Turing machine, which again contains and simulates yet another one.

After some thought, I found that we could use something precise, such as mathematical notation combined with programming languages, to describe the concepts. As an example, I’ll show here how the notion of reduction can be defined precisely as a homomorphism, which can then be instantiated for reducing one problem to another.

Definition 1 (reduction): A reduction (as in theory of computation) is a homomorphism (as in universal algebra):

Reduce(TM, I) = (TM', I')

satisfying the property

TM @ I = TM' @ I'

where

  • TM = Turing machine which we reduce from
  • TM' = Turing machine which we reduce to
  • I = input string of TM
  • I' = input string of TM'
  • @ = simulation, similar to the Scheme code ((eval TM) I)
  • TM @ I = result from simulating TM on I (accept or reject)
  • TM' @ I' = result from simulating TM' on I' (accept or reject)

End of Definition 1.

Notice that the simulation (TM @ I) may go into an infinite loop and never produce any result. Now I show how to use this homomorphism to describe the reduction from ATM ==> HALT, where

  • ATM = the “acceptance problem” (deciding whether a Turing machine M accepts string w)
  • HALT = the “halting problem” (deciding whether a Turing machine M halts on string w)

For convenience, we let

  • DATM = “the decider of ATM”
  • DHALT = “the decider of HALT”

Now the reduction can be fully described by the following homomorphism:

Reduce(DATM, (M,w)) = (DHALT, (M',w))
where
  M' = <if (M @ w) then accept else loop>
satisfying
  DATM @ (M,w) = DHALT @ (M',w)

Yes, that’s an all-inclusive formal proof that HALT is undecidable. It even includes the notion of “reduction” itself.

Let me explain it a bit. The first line says that there exists a function (named Reduce) from the pair (DATM, (M,w)) to another pair (DHALT, (M',w)), where M' = <if (M @ w) then accept else loop> is a description of a Turing machine.

Now let’s look at the last line:

DATM @ (M,w) = DHALT @ (M',w)

It says: if we have a decider for HALT (DHALT), then we can use it to define DATM, thus deciding ATM.

Why is this a valid definition of DATM? Because from the definition of M'

<if (M @ w) then accept else loop>

we know that:

  • If (M @ w) accepts, M' accepts, thus DHALT @ (M',w) accepts
  • If (M @ w) rejects, M' loops, thus DHALT @ (M',w) rejects
  • If (M @ w) loops, M' loops, thus DHALT @ (M',w) rejects

Notice from the three cases above that DHALT @ (M',w) accepts if and only if M accepts w. Thus this defines a decider for ATM.

So if DHALT exists, then we can have DATM. But this contradicts the fact that DATM cannot exist, so DHALT must not exist.
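
The whole argument can also be written as a minimal Python sketch, modeling machines and deciders as functions, with accept/reject as True/False. Here d_halt is the hypothetical decider of HALT; everything else is the construction described above:

# Hypothetical decider of HALT: would return True iff M halts on w,
# always in finite time.  It cannot actually exist.
def d_halt(M, w):
    raise NotImplementedError("assumed to exist, for contradiction")

# Given d_halt, we could decide ATM, contradicting its undecidability.
def d_atm(M, w):
    # M' = <if (M @ w) then accept else loop>
    def M_prime(x):
        if M(w):          # M accepts w: accept
            return True
        while True:       # M rejects w (and if M loops, so does M'): loop
            pass
    # M' halts on w  iff  M accepts w, so asking d_halt about M'
    # answers the acceptance question.
    return d_halt(M_prime, w)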

If you wonder how this proof corresponds to Definition 1, here are the details of how it is instantiated:

  • TM = DATM (nonexistent)
  • TM' = DHALT (hypothetical)
  • I = (M,w) where M is the description of a Turing machine which we want to know whether it accepts input w.
  • I' = (M',w) where M' is <if (M @ w) then accept else loop>
  • TM @ I = DATM @ (M,w) (running the decider DATM on input (M,w))
  • TM' @ I' = DHALT @ (M',w) (running the decider DHALT on input (M',w))

This is a lot more concise and accurate, and easier to understand, than a paragraph like the following:

F = "On input <M,w>:
  1. Construct the following machine M'
     M' = "On input x:
        1. Run M on x.
        2. If M accepts, accept.
        3. If M rejects, enter a loop."
  2. Output <M',w>."

On point-free programming

Concatenative programming, or point-free style, is sometimes useful, but it has serious drawbacks similar to those of the SKI combinators. Applicative programs can be compiled into point-free style easily, but writing and reading point-free code directly at large scale is usually a mental burden.

It only works well with functions of one or two arguments. Concatenating functions with more than two arguments is not so convenient. If the receiver expects an argument order different from the sender’s output order, you need a non-trivial permutation of arguments.

For example, if the function is defined as:

f :: Int -> String -> Bool
f x y = ...

If you want to use it as the predicate for filtering a list of strings, that’s fine. You just write something like filter (f 2) .... But what if you want to filter a list of integers? Then you will need to swap the order of the first two arguments before you can do the partial application, so you write filter (flip f "foo") .... Fine. But what if the function looks like:

g :: Int -> A -> String -> Bool
g x y z = ...

And you want to filter a list of A’s? Which function do you use to switch the argument order, and do you expect the reader of the program to know it?

What about functions with four arguments? Notice that there are 4! = 24 different argument orders. How many order switchers do we need?

In order to avoid this kind of plumbing, we have to take unnecessary care when deciding the order of the parameters. This often makes the functions look ugly.

Names are more convenient. Notice that even mathematics uses names for permutations (as in algebra):

(a b c d)
(b a d c)

Concatenative programming is like connecting the components of a circuit. But even electronic engineers don’t do it this way. They use net-lists with names and labels.

Names are essential and useful in most cases. Concatenative programming, although nice when used sparingly, may not be good as a major way of composition.

At this point, I find these lines from the Tao Te Ching (Chapter 1) especially relevant:

“The nameless is the origin of Heaven and Earth
The named is the mother of myriad things”

On literate programming

A friend and I had a discussion about current trends in programming methodologies. At some point, the topic turned to Literate Programming (LP). I personally wrote a non-trivial literate program (in CWEB) a few years back, but like many other people, I did not persevere. In hindsight, there are some good reasons why people do not adopt LP.

Missing the big picture

Although we love books, they may not be the best medium for documenting programs. Programs are much like circuits or cars. They are very complex systems with interconnected parts. The interactions and constraints among the parts can’t really be understood when they are taken apart.

LP systems usually segment the program into small pieces and rearrange them into a book-like order. This is like taking a car apart to teach car-making. There is a huge loss of structural information in the rearrangement. It may be useful to take some parts out for explanation and experiment, but they should be put back immediately after examination, and the student must keep a clear view of the overall structure of the car.

If we take all the parts out and lay them along the road, the student will have trouble figuring out how they fit and work together. In short, the student sees the trees but not the forest when reading a book generated from a literate program.

Programs are not text

LP systems usually allow programmers to fragment a program into small pieces and reorder the pieces into an order that is “convenient to the human mind”. Later on, one program (weave) generates printable documentation for the program; another (tangle) assembles the fragments into machine-executable code.

However, these systems have a big problem: they treat programs as text. LP systems usually don’t parse the programs; they just splice the code fragments textually. This often messes up variable scopes and causes subtle bugs. It is much like macros in early Lisps and C: extremely dangerous and error-prone.
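
As an illustration, consider a fragment that gets spliced into code that already uses one of its variable names. The chunk notation below is hypothetical (noweb-like), but the clobbering it causes is real:

# Suppose a literate chunk is defined once and reused by textual splicing:
#
#   <<keep-first-n>>=
#   tmp = xs[:n]
#   xs = xs[n:] + tmp
#
# Tangling pastes the chunk verbatim, ignoring the scopes at the splice
# site.  If the enclosing code already uses the name `tmp`, it is
# silently clobbered:

tmp = "an important value"   # the enclosing code's own `tmp`
n, xs = 2, [1, 2, 3, 4]

tmp = xs[:n]                 # pasted chunk begins: `tmp` is overwritten
xs = xs[n:] + tmp            # pasted chunk ends

print(tmp)                   # [1, 2] -- the original value is gone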

There are other kinds of LP systems, such as Literate Haskell and CoffeeScript. They don’t break up procedures, so they don’t have the above “hygiene problem”. But because we already have the freedom to arrange the order of procedures in any programming language, those systems are little more than a fancy comment facility like JavaDoc. In the view of authentic literate programmers, those systems miss the essence of LP.

Human supremacy, human languages and human cognition

LP bases its arguments on a false belief in human supremacy, an overemphasis on human languages, and a misunderstanding of human cognition. Practice has shown that the human brain is incapable of competing with computers in anything that requires rigorous reasoning and perseverance. Natural languages are woefully inaccurate and confusing for programming. Any programming language that tries to mimic natural languages is doomed to suffer the same “design flaws” of natural languages (SQL, for example).

It is also doubtful whether the order set by literate programming suits human cognition at all. Programs should be constructed in an order that matches the nature of the concepts they model, not in the order of a particular person’s way of thinking, nor in an arbitrary order set by a compiler. Once the program structure matches nature, it will immediately appeal to human understanding.

So instead of investing in LP, a probably better direction is to investigate ways of representing programs that let them match the concepts more naturally. Functional programming and logic programming are gradually moving in this direction.

Painful code navigation

Once the code is woven into a book, it is very hard to navigate. Several years ago, I spent some time reading Knuth’s MMIXware, a book generated from a CWEB program. I noticed how much time I spent finding the definitions and cross-references of variables. Although Knuth painstakingly constructed an index for the book, I still had to turn hundreds of pages back and forth, while in an IDE I can jump to the definition of a variable with just one keystroke. One has to wonder why we bother publishing code as books at all.

From these observations, it is really unclear how literature and books can serve as a good medium for programs. The world has changed and technology has advanced; instead of pursuing something static and classic, we may just need to open our minds to new kinds of media.

Indentation-based syntax considered harmful

Although the idea of layout syntax—using whitespace characters to delimit blocks—has been promoted by several languages (notably Python and Haskell), I feel that this kind of syntax brings more trouble than benefits.

It takes just one keystroke to produce a serious bug

In most languages a program’s format is determined by its meaning, but in a language with layout syntax, its meaning is determined by the format. This makes programs in layout syntax fragile because a tiny change in format could result in a serious error. For example, consider these two Python programs:

# correct definition
def member(x, ls):
    for y in ls:
        if x == y:
            return True
    return False         # correct indentation
# incorrect definition
def member(x, ls):
    for y in ls:
        if x == y:
            return True
        return False     # incorrect indentation

The second definition was “derived” from the first by an inadvertent TAB key press, which indented the return statement one more level to the right. Although the two definitions differ only in the indentation of one line, they produce totally different results. The first definition is correct, while the second has serious bugs:

  1. If ls is non-empty, it will always return False whether x is an element of ls or not.
  2. If ls is empty, it will return None (instead of False) because there is no return statement after the for-loop, a “missing return statement” bug.

Although this is just a minimal example, such a bug may take quite some time to show up and be fixed. In order to prevent this kind of bug, I often find myself moving the cursor up and down in a straight line to check the alignment of statements.

Let’s see why this bug cannot happen in a language that does not use layout syntax. Suppose we invent an alternative syntax for Python, so that the previous program looks like:

def member(x, ls) {
    for y in ls {
        if x == y {
            return True
        }
    }
    return False
}

Given the correct definition, can you imagine how you could reproduce the bug with just one keystroke? It is almost impossible. To see why:

  1. The return statement can never get into the loop with a change in indentation.
  2. It takes at least two edits and one movement in the editor to move the return statement into the loop. There are two alternatives to choose from:
    • Cut return False. Move the cursor into the closing curly brace of the for-loop. Paste.
    • Delete the closing curly brace of the for-loop. Move the cursor beyond return False. Insert a closing curly brace.

Either way, you must be deliberate and precise in order to reproduce the bug. Otherwise the parser will complain (for example, if you just delete a closing curly brace).

However, the situation is very different with layout syntax, where one TAB key press produces the same amount of change as the above three operations. The change happens quickly and the program remains grammatically correct, obscuring the presence of a bug.

The situation is a little better in Haskell, because incorrect indentation often causes a type error that alerts the programmer, but in either language, it takes quite some time to fix this kind of bug.

Unconvincing advantages

It is often claimed that layout syntax has the following advantages over curly braces:

  • Your programs become a lot shorter because fewer curly braces are used.
  • You write less clutter such as curly braces and semicolons, and that beautifies your code.

I find neither of the two advantages convincing. For the first, Python and Haskell programs are indeed several times shorter than equivalent Java or C programs, but this cannot really be credited to layout syntax.

We need some blank lines even in a Python or Haskell program, between definitions and sometimes in the middle of a block. Blank lines naturally denote groups of statements. So if we count the number of additional lines introduced by curly braces, we will find that there aren’t many. Curly braces also naturally denote statement groups, so not only do they not look bad, they are helpful.

In Python and Haskell, it is the semantic features (pattern matching, first-class functions, etc.) that make the programs short, not layout syntax. If we had an alternative syntax for Java that used layout, Java programs would still be several times longer than equivalent Scala programs. Java programs are longer not because they use curly braces, but because they lack things such as first-class functions, so they have to use tedious design patterns.

Second, layout syntax does not really save “clutter”. Even in a language with layout syntax, we may still need to write almost the same amount of clutter, if not more. The following is a random piece of code taken from the current release of GHC. We can still see lots of curly braces and semicolons in it. I guess layout syntax actually caused trouble here, so the authors of GHC decided to just write the curly braces.

tcInstanceMethodBody skol_info tyvars dfun_ev_vars
                     meth_id local_meth_id
             meth_sig_fn specs
                     (L loc bind)
  = do  {       -- Typecheck the binding, first extending the envt
        -- so that when tcInstSig looks up the local_meth_id to find
        -- its signature, we'll find it in the environment
          let lm_bind = L loc (bind { fun_id = L loc (idName local_meth_id) })
                             -- Substitute the local_meth_name for the binder
                 -- NB: the binding is always a FunBind

    ; (ev_binds, (tc_bind, _))
               <- checkConstraints skol_info tyvars dfun_ev_vars $
          tcExtendIdEnv [local_meth_id] $
              tcPolyBinds TopLevel meth_sig_fn no_prag_fn
                 NonRecursive NonRecursive
                 [lm_bind]

        ; let full_bind = AbsBinds { abs_tvs = tyvars, abs_ev_vars = dfun_ev_vars
                                   , abs_exports = [(tyvars, meth_id, local_meth_id, specs)]
                                   , abs_ev_binds = ev_binds
                                   , abs_binds = tc_bind }

        ; return (L loc full_bind) }
  where
    no_prag_fn  _ = []      -- No pragmas for local_meth_id;
                        -- they are all for meth_id

Better ways to save clutter

Even if we do hate curly braces, there are better ways to reduce or even completely eliminate them. As a trivial “solution”, we could display curly braces and semicolons in a dim color in the editor, so that they are less noticeable. For example, the above Python program in the curly-brace syntax could look like this in your editor:

def member(x, ls) {
    for y in ls {
        if x == y {
            return True
        }
    }
    return False
}

Better still, we could use a structural editor that lets us manipulate the AST (abstract syntax tree) directly. Those editors could provide several options of denoting blocks. You can switch between colored blocks, curly braces, or nothing at all. You can switch the look of your code at any time, instantly. People have implemented such editors, for example this editor designed by Kirill Osenkov.


Re-indentation hassle

In a language that doesn’t use layout syntax (Java, C, Scheme, …), no re-indentation is really needed when the code changes. The programmer can move a block of code with a simple copy-and-paste and continue solving the real problem. Re-indentation can always be done later, and it can be done automatically.

But in a language that uses layout syntax, re-indentation is mandatory, and worse, it can only be done manually. Layout syntax completely disables an editor’s auto-indent function. One may think that we could invent a smarter editor that auto-indents code for those languages, but that is simply impossible, as the above example shows: the two programs differ only in indentation, yet they have completely different meanings. Both are grammatically correct, and the editor has no way to tell which one you want unless it can read your mind.

Some people say that because those languages have advanced semantics, programs are so short that we don’t need to re-indent code very often. But experience tells me that the need to change and rewrite code can never be eliminated. Writing code is like writing a book; you can always find pieces that need changes or even complete rewrites. Changes in the following categories will usually cause re-indentation:

  • Scope changes. There are lots of examples in this category: lifting an internal function to top level or pushing a top-level function into an internal function; surrounding a block with let-bindings, loops, conditional statements, or try-except, or lifting a block out of them; factoring out duplicated patterns; lifting “invariant code” out of a recursion; … to name a few. These necessarily change the indentation of a block of code, and each line has to be re-indented.
  • Aligning code. For example in Haskell, when we align the arrows (->) of a case expression or the equal signs (=) of a function definition, we notice that we have to re-indent most of the right-hand sides, because they often contain multi-line expressions, but the editors (for example, Emacs’ align-regexp function) only move the lines that contain the arrows or equal signs.
  • Renaming. We seldom choose the best names on the first try, and good names make programs self-explanatory, so renaming is a very important and commonplace action. But in the following simple Haskell program, if we change the name “helloworld” to “hello” and don’t re-indent the rest of the lines, we will get a parse error.
    helloworld z = let x = 1
                       y = 2 in
                     x+y+z

    This is because the code becomes the following after the renaming: the second line is no longer aligned to “x = …”, and that confuses the parser.

    hello z = let x = 1
                       y = 2 in
                     x+y+z

    A similar thing happens when we lengthen the name to something like “helloworldcup”. Try it yourself. From this example, I hope you see how simple things are made frustratingly complicated by layout syntax. If you are not yet convinced, try adding more lines to the above let expression.

The interruption from re-indenting code is usually not just one or two seconds, but often tens of seconds, even minutes. The programmer has to put down the real problem at hand and turn to the mind-dead job of re-indenting. This disrupts the “flow”, which is essential for producing creative and elegant code.

Syntax considered harmful

I believe that syntax, although an important aspect of natural languages, should not play a significant role in programming languages. It has already brought us too much trouble and frustration and wasted too much of our energy. You can read my other blog post detailing the harm of syntax in general and why we should remove the concept of “syntax” as a whole from programming languages.

Syntax has blocked lots of new design possibilities in programming languages. You may have heard language designers say: “Hey, this is a nice feature, but my language’s syntax has no room left for it.” Layout syntax pushes even further in this direction. It forces us to consciously and constantly think about syntax, drawing our attention away from semantic design. It imposes constraints on how code must be formatted, and makes a language even less extensible. Thus I think layout syntax is the most troublesome kind of syntax.

PySonar: a type inferencer and indexer for Python


PySonar is a type inferencer and indexer for Python. Developed during two summer internships (2009, 2010) at Google, it went through several different designs, finally incorporating a powerful type system and a control-flow-aware interprocedural analysis.

Compared to style-checking tools and IDEs, PySonar analyzes programs more deeply and produces more accurate results. One superficial sign of this difference is that it resolves more names than IDEs do (the current resolution rate is about 97% for Python’s standard library).

Demos

To get a quick feel for what PySonar can do, here is a sample analysis result for a small fraction of Python’s standard library.

What’s in there

  1. A powerful type system. In addition to the usual types you can find in programming languages, PySonar’s type system has union types and intersection types – two of the most powerful elements I have found during my research. They are rarely found in programming languages. I know of only two languages with statically checked union types: Typed Racket and Ceylon. Unlike these languages, PySonar works without any type annotations; it infers all the types by doing interprocedural analysis. This makes PySonar’s type system in a sense more powerful than theirs, but without the help of type annotations it has certain limitations. Unlike the types in programming languages, the inferred types can’t be used to guarantee the complete absence of type errors. Also, the types only capture how the functions are currently used; they may change as you make more calls to them, so their use as specifications is limited. Despite these limitations, the inferred types are still very informative, and they help to produce more accurate code indexes.
  2. Control-flow-aware interprocedural analysis. Because Python has very dynamic and polymorphic semantics and no type annotations, an intraprocedural type inference system such as Hindley-Milner will not work. I had actually implemented an HM-like system in the first version of PySonar, but it didn’t work very well. As a consequence, all types are inferred by an interprocedural analysis which follows the control flow and some other aspects of the semantics.
  3. Handling of Python’s dynamism. Static analysis for Python is hard because the language has many dynamic features. They help make programs concise and flexible, but they also make automated reasoning about Python programs hard. Some of these features can be reasonably handled, but some cannot. For example, function or class redefinition can be handled by inferring the effective scopes of the old and new definitions. PySonar also treats types as first-class entities and keeps track of them (as shown in the following picture). For code that is really undecidable, PySonar attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement), PySonar gives it a union type which contains all the types it can possibly have (see the sketch after this list).
  4. Medium path-sensitivity. The analysis performed by PySonar is path-sensitive to some degree, which enhances the accuracy of type inference. We can see this in an example:
    def g(x):
      if (x > 10):
        y = 1
      else:
        y = A()
        ...
        y = "hi"
      return y

    In the function g, because we can’t statically know the input value of x, we must assume that either branch may be taken. In the true branch, y is assigned an int. In the false branch, y is assigned an instance of A but is later overwritten by a string, so at the end of the false branch y can only have the type string. The output type is the union of the results of the two branches, which gives g the type int -> {int, string}. If the analysis were path-insensitive, g would be typed int -> {int, string, A}. Notice the superfluous A in the output type.

    PySonar’s path sensitivity is limited. A completely path-sensitive analysis would not merge the types even after the two branches merge, and thus would not create the union types. Fully path-sensitive analysis is more accurate, but also more expensive and harder to compute.

  5. High-accuracy semantic indexing
    PySonar originated as part of Google’s Grok project, an Eclipse-like code browser used internally at Google. Because of that, generating semantic indexes was the main purpose of PySonar, although it may be put to other uses. PySonar parses the code into an AST and performs type inference on it, and at the same time builds indexes that respect scopes and types. Because it performs interprocedural analysis, it is often able to find the definitions of attributes inside function parameters. This works across functions, classes and modules. The following image shows that it can accurately locate the field x.z, which refers to the “z” fields in classes B1 and C1, but not A1.
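
As a small illustration of the “conditionally defined” case in point 3, here is a sketch (the code and the type in the comment are mine, showing what a PySonar-style analysis would report):

import random

if random.random() > 0.5:
    def size(x):
        return len(x)       # this definition returns int
else:
    def size(x):
        return "unknown"    # this definition returns str

# The analysis can't know which branch runs, so `size` gets a union
# return type, roughly: ? -> {int, string}.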

Rough Edges

Because of Python’s complexity, some subtle language constructs may be implemented incorrectly and produce weird results. If you see such results, or see names that you expect to be resolved but are not, please file a bug on my GitHub repository.

Availability

The code is open source, available from my GitHub repository.

Users

I normally don’t keep track of users of my code unless they contact me, but it will keep me motivated to improve it if you send me a thank-you letter or buy me a coffee. I may also provide additional help if you have trouble with the code.

Here are the only users I know of:

  • Google. Google uses PySonar 1.0 to index millions of lines of Python code, serving internal code search and analysis services such as Grok and Code Search.
  • Sourcegraph. Sourcegraph is a semantic code search engine. They use PySonar to index hundreds of thousands of open-source Python repositories. They started using PySonar 1.0 as the Python analysis for their site; I recently joined them and finished integrating PySonar 2.0.