RubySonar: a type inferencer and indexer for Ruby

I have built a similar static analysis tool for Ruby. The result is a new open-source project RubySonar. RubySonar can now process the full Ruby standard library, Ruby on Rails, and Ruby apps such as Homebrew.

RubySonar’s analysis is inter-procedural and is sensitive to both data-flow and control-flow, which makes it highly accurate. RubSonar uses the same type inference technique of PySonar, and thus can resolve some of the difficult cases that can challenge a good Ruby IDE.

Comments Off on RubySonar: a type inferencer and indexer for Ruby

Posted by on February 3, 2014 in programming languages, static analysis


Undecidability proof of the halting problem using lambda calculus

As a teaching assistant for a graduate level “Theory of Computation” course, I don’t understand why we use Turing machines as a model for all the computability and complexity theorems. It is much easier if we just use arguments on functions in a usual programming language. One good choice of such a language is the lambda calculus.

Basically, a function in lambda calculus is equivalent to a Turing machine. The “halting” of a Turing machine is equivalent to the corresponding lambda calculus term reducing to its “normal form”. From this viewpoint, I came up with the following undecidability proof of the halting problem using the lambda calculus.

Proof. Suppose that we have a function Halting(M,w), a hypothetical solver for the halting problem. Halting takes two arguments, a function M and its input w, and it tells you whether or not M(w) will output True. Note that although M may go into infinite loops, never will our magical solver HaltingHalting always returns True or False within finite amount of time.

Now we show a simple contradiction, which proves that the magical solver Halting cannot really exist. This question is: Does the following expression E returns True or False?

E = Halting(λm.not(Halting(m,m)), λm.not(Halting(m,m)))

It turns out that this question cannot be answered. If E returns True, then we apply the function λm.not(Halting(m,m)) to its argument λm.not(Halting(m,m)), and we get

not(Halting(λm.not(Halting(m,m)), λm.not(Halting(m,m))))

Alas, this is exactly the negation of the original expression E, which means E should be False. This is a contradiction (or call it a “paradox” if you like), which shows that the halting problem solver Halting cannot exist, which means that the halting problem cannot be solved.


To connect back to usual computing theory terminology (as in Sipser’s classic textbook), the functions Halting and M corresponds to two Turing machines, one of which taking the other as an argument. True corresponds to a Turing machine reaching its accept state, and False corresponds to reaching the reject state.

I’m surprised how much shorter this proof is as compared to [Church 1935, 1936] and [Turing 1936]. Actually what I did in the proof was just making a loop in a circuit. Almost every logical paradox has a loop involved–pretty much the same kind of loop, the Möbius strip. Every logic question expects the answer to be one of two sides–either True or False, but the Möbius strip has only one side!

If you hope to learn more on related stuff, a nice book by Neil Jones shows that you can prove every theorem in theory of computation (including computability and complexity) using a simple programming language. It’s not really necessary to use Turing machines for theory of computation. Turing machines make things awkward and complicated.

Now, a fun exercise question may be, what is the connection between the above expressions E with the Y combinator?

λf.(λx.f(x x)) (λx.f(x x))

Comments Off on Undecidability proof of the halting problem using lambda calculus

Posted by on October 25, 2012 in lambda calculus, programming languages, proofs, theory of computation


How to reinvent the Y combinator

I don’t believe in the doctrine “Don’t reinvent the wheels”. I believe that one can never learn the essence of anything without reinventing it (intentionally or not). Once you reinvent a thing, you can never forget it — because otherwise you can just reinvent it again.

Today I found that I forgot how to derive the definition of the Y combinator. I learned it several years ago from an online article, but now the search term “Y combinator” only brings news about startups (sigh). I attempted it for two hours, but still couldn’t make the leap from the “poor man’s Y” to “Y”. Finally, I opened my good old friend The Little Schemer. Alas. Chapter 9 tells me exactly how to reinvent Y. Thank you Dan Friedman and Matthias Felleisen!

To prevent myself from forgetting how to derive Y again, I made a slide to record my understanding of it. It is slightly different from the ways of the recent version of The Little Schemer, but I think it’s easier to understand. I hope it can help some people (including the future me). So here it is.

Exercise: The Y combinator derived from this tutorial only works for direct recursion, try to derive the Y combinator for mutual recursive functions, for example the following two functions even and odd.

(define even
  (lambda (x)
     [(zero? x) #t]
     [(= 1 x) #f]
     [else (odd (sub1 x))])))

(define odd
  (lambda (x)
     [(zero? x) #f]
     [(= 1 x) #t]
     [else (even (sub1 x))])))
Comments Off on How to reinvent the Y combinator

Posted by on April 9, 2012 in lambda calculus, programming languages, semantics, tutorials


ydiff: a structural program comparison tool

(Click on the above picture to see it in action. See the end of the post for more demos)


I have been imagining a world where programs are not represented as text, but as data structures. They will be edited not with text editors, but with structural editors, which create and manipulate the abstract syntax trees (AST) directly. In such a programming environment, the line-by-line diff utilities and version control tools will stop working, because there are no longer “lines” or “words” in programs.

ydiff is a proof-of-concept for handling this situation. It is a diff tool designed for “structural programming”.

Currently ydiff takes pains to parse program text into ASTs, but hopefully in the future programs will be stored directly as data structures so that the parsing step can be skipped. This will enable this kind of tool to extend to all programming languages effortlessly.


  • Language-aware. ydiff parses programs, understands basic language constructs and will not make non-sensical comparisons. For example it will not compare a string “10000” with an integer 10000 even though they look very similar. Also, it tries to match functions with the same name before it attempts to destruct and compare functions of different names.
  • Format insensitive. The comparison result will not be affected by different number of white spaces, line breaks or indentation. For example, ydiff will not produce a large diff just because you surrounded a block of code with if (condition) {…}.
  • Moved code detection. ydiff can find refactored code — renamed, moved, reordered, wrapped, lifted, combined or fragmented code. Refactored code can be detected however deep they are into the structures.
  • Human-friendly output. The output of ydiff is designed for human understanding. The interactive UI helps the user navigate and understand changes efficiently.

These properties make ydiff helpful for understanding changes. It may also be possibly used for detecting plagiarism in programming classes or copyright infringement of code. For large-scale use cases you may be more interested in MOSS, but ydiff is fundamentally more accurate because it parses programs.


Here are some interactive HTML demos with a pretty nice UI design. The left and right windows are always locked in their relative position. A mouse click on changed, moved or unchanged nodes will highlight the matched nodes and scroll the other window to match. After that, the windows will be locked into their new relative position for browsing.

Okay, here are the demos.

  • Scheme Demo1. ydiff’s algorithm diffing itself (between the first version on GitHub and the latest version).
  • Scheme Demo 2. Comparison of the original miniKanren from Professor Dan Friendman and the version I modified in order to support condc, a “negation operator”. Pay attention to the function unify, whose major part is moved into unify-good.
  • Emacs Lisp. Comparison of two versions (v20 and v23) of Taylor Campbell’s paredit-mode, a structural editing mode of emacs for Lisp programs.
  • Clojure. Compare the first commit of Typed Clojure with its recent version.
  • Arbitrary S-expressions. Trying to find a bug in an optimizing pass of my Scheme compiler by comparing the IRs (represented as S-expressions).
  • Python. ydiff has a completely separate implementation in Python (named “PyDiff”), which can diff two Python programs. This is a comparison of two implementations of a small list library that I wrote, which implements Lisp-style lists. The first implementation uses recursive procedures while the second uses Python’s generator syntax and is iterative. Pay some attention to append, whose code is moved inside another function appendAll.
  • Javascript. Comparison between two major revisions of the UI code of ydiff itself.
  • C++ demo1 and C++ demo2. There are two demos for C++. The first demo compares two versions of the d8 Javascript debugger from the V8 project(v3404 from 2009 and v8424 from 2011). The second demo compares V8’s simulators for two different processors (MIPS and ARM).The d8 demo is especially interesting because by clicking on the lines of the method Shell::Initializein the old version, it can be clearly observed that its body has been distributed into several procedures in the new version:

    Also the major part of Shell::Main is moved into the helper Shell::RunMain.

Get the code

ydiff is an open source project. You can follow its development on github: yinwang0/ydiff or get its source code from there.

Comments Off on ydiff: a structural program comparison tool

Posted by on January 3, 2012 in algorithms, programming languages, tools


A minimal boot sector tutorial

The natural first step of building an operating system is to find a way to run programs on “bare hardware”. The task turns out to be quite easy despite its daunting first impression. After researching on the net and trying out various tutorials, I found all of them overly complicated. Most of them assume interfacing with a C-like language and implementing a Unix-like system, but C and Unix are not the only story about operating systems design. A boot sector tutorial should teach nothing more than how to boot a computer.

After learning the good parts from the tutorials and applying my own simplifications, I arrived at my first boot sector. It is very simple and does very little — it just boots the machine and displays a colorful banner — but it illustrates the only things you need to know for booting a computer.

The code is very short — only 21 lines of code excluding comments and blank lines.

org 7C00H                      ; the program will be loaded at 7C00H

  mov eax, string_start
  mov ch, 1                    ; ch contains color of text
  mov ebx, 0B8000H + 718H      ; B8000H is VGA memory
                               ; 718H is offset to approx center

  mov cl, [eax]                ; load char into cl
  mov [ebx], cx                ; store [color:char] from cx into VGA
  add ch, 1                    ; change color to (ch+1) mod 16
  and ch, 0x0F
  add eax, 1                   ; advance string pointer
  add ebx, 2                   ; advance VGA pointer
  cmp eax, string_end          ; until the end of string
  jg stop
  jmp print

  jmp stop                     ; infinite loop after printing

  string_start db 'My colorful new OS!'
  string_end equ $

  times 510-($-$$) db 0        ; pad remainder of boot sector with 0s
  dw 0xAA55                    ; standard PC boot signature

It is in NASM syntax and needs nasm to be assembled into machine code. After that it can be booted by QEMU, a processor emulator. The only two necessary command lines are (assuming the code is stored in a file named myfirst.asm):

nasm -f bin -o myfirst.bin myfirst.asm
qemu -hda myfirst.bin

Of course, you can also burn the disk image (myfirst.bin) onto a CD and boot a real machine from it.

Comments Off on A minimal boot sector tutorial

Posted by on September 4, 2011 in architecture, operating systems, programming languages, tutorials


PySonar: a type inferencer and indexer for Python


PySonar is a type inferencer and indexer for Python. It includes a powerful type system and a sophisticated inter-procedural analysis. Compared to style-checking tools or IDEs, PySonar analyzes programs in deeper ways and produces more accurate results. PySonar resolves more names than typical IDEs. The current resolution rate is about 97% for Python’s standard library.


To get a quick feel about what PySonar can do, here is a sample analysis result for a small fraction of Python’s standard library.

What’s in there

  1. A powerful type system. In addition to the usual types you can find in programming languages, PySonar’s type system has union types and intersection types — two of the most powerful elements I have found during my PL research. They are rarely found in programming languages. I know of only two languages with statically checked union types: Typed Racket and Ceylon. Different from these languages, PySonar can work without any type annotations. It infers all the types by doing inter-procedural analysis.
  2. Control-flow aware interprocedural analysis. Because Python has very dynamic and polymorphic semantics and doesn’t contain type annotations, a modular type inference system such as the Hindley-Milner system will not work. I actually implemented a HM-like system in the first version of PySonar, but it didn’t work well. As a consequence, all types are inferred by an inter-procedural analysis which follows the control-flow and some other aspects of the semantics.
  3. Handling of Python’s dynamism. Static analysis for Python is hard because it has many dynamic features. They help make programs concise and flexible, but they also make automated reasoning about Python programs hard. Some of these features can be reasonably handled but some others not. For code that are undecidable, PySonar attempts to report all known possibilities. For example, it can infer union types which contains all possible types it can possibly have:
  4. High accuracy semantic indexing
    PySonar can build code indexes that respects scopes and types. Because it performs inter-procedural analysis, it is often able to find the definitions of attributes inside function parameters. This works across functions, classes and modules. The following image shows that it can accurately locate the field x.z which refers to the “z” fields in classes B1 and C1, but not A1.


The code is open source from my GitHub repository.


Here are some of PySonar’s users:

  • Google. Google uses PySonar 1.0 to index millions of lines of Python code, serving internal code search and analysis services such as Grok and Code Search
  • SourceGraph. SourceGraph is a semantic code search engine. They use PySonar to index hundreds of thousands of opensource Python repositories. They started using PySonar 1.0 as the Python analysis for their site. I recently joined them and finished integrating PySonar 2.0
Comments Off on PySonar: a type inferencer and indexer for Python

Posted by on September 12, 2010 in algorithms, programming languages, semantics, static analysis, types


Tags: , , , ,