ydiff: a structural program comparison tool

(Click on the above picture to see it in action. See the end of the post for more demos)

Motivation

I have been imagining a world where programs are not represented as text, but as data structures. They will be edited not with text editors, but with structural editors, which create and manipulate the abstract syntax trees (AST) directly. In such a programming environment, the line-by-line diff utilities and version control tools will stop working, because there are no longer “lines” or “words” in programs.

ydiff is a proof-of-concept for handling this situation. It is a diff tool designed for “structural programming”.

Currently ydiff takes pains to parse program text into ASTs, but hopefully in the future programs will be stored directly as data structures so that the parsing step can be skipped. This will enable this kind of tool to extend to all programming languages effortlessly.

Features

  • Language-aware. ydiff parses programs, understands basic language constructs and will not make non-sensical comparisons. For example it will not compare a string “10000” with an integer 10000 even though they look very similar. Also, it tries to match functions with the same name before it attempts to destruct and compare functions of different names.
  • Format insensitive. The comparison result will not be affected by different number of white spaces, line breaks or indentation. For example, ydiff will not produce a large diff just because you surrounded a block of code with if (condition) {…}.
  • Moved code detection. ydiff can find refactored code — renamed, moved, reordered, wrapped, lifted, combined or fragmented code. Refactored code can be detected however deep they are into the structures.
  • Human-friendly output. The output of ydiff is designed for human understanding. The interactive UI helps the user navigate and understand changes efficiently.

These properties make ydiff helpful for understanding changes. It may also be possibly used for detecting plagiarism in programming classes or copyright infringement of code. For large-scale use cases you may be more interested in MOSS, but ydiff is fundamentally more accurate because it parses programs.

Supported Languages

Currently ydiff can parse the following languages:

  • Lisp family languages (Scheme, Emacs Lisp, Common Lisp, Racket, Clojure)
  • Python (forked into its own project)
  • Javascript (experimental, deprecated)
  • C++ (experimental, deprecated)

The supporting part of a language is just a parser for that language. It produces a unified parse tree data structure. I have designed a parser combinator library to make parser-writing less painful, but they are still quite tricky to write. Hopefully in the future, the programs will be edited by structural editors and stored as data structures, so no parsers will be needed any more.

(Update: because of their unnecessary complexity, I discontinued the work on parsers for languages like JavaScript and C++)

Demos

Before digging further into the design details, you are invited to take a look at the demos. They are interactive HTML files with a pretty nice UI design. The left and right windows are always locked in their relative position. A mouse click on changed, moved or unchanged nodes will highlight the matched nodes and scroll the other window to match. After that, the windows will be locked into their new relative position for browsing.

Okay, here are the demos.

  • Scheme Demo1. ydiff’s algorithm diffing itself (between the first version on GitHub and the latest version).
  • Scheme Demo 2. Comparison of the original miniKanren from Professor Dan Friendman and the version I modified in order to support condc, a “negation operator”. Pay attention to the function unify, whose major part is moved into unify-good.
  • Emacs Lisp. Comparison of two versions (v20 and v23) of Taylor Campbell’s paredit-mode, a structural editing mode of emacs for Lisp programs.
  • Clojure. Compare the first commit of Typed Clojure with its recent version.
  • Arbitrary S-expressions. Trying to find a bug in an optimizing pass of my Scheme compiler by comparing the IRs (represented as S-expressions).
  • Python. ydiff has a completely separate implementation in Python (named “PyDiff”), which can diff two Python programs. This is a comparison of two implementations of a small list library that I wrote, which implements Lisp-style lists. The first implementation uses recursive procedures while the second uses Python’s generator syntax and is iterative. Pay some attention to append, whose code is moved inside another function appendAll.
  • Javascript. Comparison between two major revisions of the UI code of ydiff itself.
  • C++ demo1 and C++ demo2. There are two demos for C++. The first demo compares two versions of the d8 Javascript debugger from the V8 project(v3404 from 2009 and v8424 from 2011). The second demo compares V8’s simulators for two different processors (MIPS and ARM).The d8 demo is especially interesting because by clicking on the lines of the method Shell::Initializein the old version, it can be clearly observed that its body has been distributed into several procedures in the new version:
    Shell::Initialize
    Shell::CreateGlobalTemplate
    Shell::RenewEvaluationContext
    Shell::InstallUtilityScript

    Also the major part of Shell::Main is moved into the helper Shell::RunMain.

Algorithms and Technical Details

There are basically two main ideas inside ydiff: Tree Editing Distance and Substructure Extraction, plus the use of the neat idea of parser combinators (similar to Parsec).

  • Tree Editing Distance. Tree Editing Distance measures difference between two trees. In our case, it is the abstract syntax tree. When we encounter two trees, how do we compare them? If the first nodes of them appear to be the same, then we simply proceed to the next right? But what if the first nodes are different? There are three possibilities:
    1. The programmer inserted a node into the first tree.
    2. The programmer deleted a node from the first tree.
    3. The programmer modified a node in the first tree.

    We have to consider those possibilities for each level of nodes, so this naturally becomes a dynamic programming problem. Normally dynamic programming is done by cleverly find out the order to evaluate the problem space, but ydiff uses another important technique: memoization. There is a two dimensional hash table which stores the “delta” (changes) of two nodes, indexed by the two nodes themselves. The dynamic programming is then done in a recursive way, instead of the usual iterative way. The hash table automatically reduces the cost of recursion to the magnitude of iteration, by avoiding redundant evaluation of the nodes that we already compared.

  • Substructure Extraction. Tree Editing Distance can only find the “edits” — insertions, deletions and modifications, but it cannot find refactorization of code. If a piece of code is lifted out of a scope, the Tree Editing Distance function will produce two big changes: a deletion of a big node (for example, an if-statement), and an insertion of a slightly smaller node (with the “if (condition) { … }” wrap stripped off). Notice that although we only made a small change to the program, the resulting delta can be large. What we really want is that it shows a deletion of the wrap (“if (condition) { … }”), but a movement of the parts inside the wrap. This is called Substructure Extraction.Substructure Extraction works by recursively compare the substructures of two nodes, finding possible matches between one node and the subnodes of another. This substructural comparison is only done on the pairs of the resulting insertions and deletions from the Tree Editing Distance function. This little trick captures almost all possible refactorizations between two programs, however deep they are. This is demonstrated in the following picture, where ydiff successfully detects the movement of the function append into another function appendAll.
  • Parser Combinators. The parsers of ydiff are written with the help of a parser combinator library (in Scheme). In addition to the common features of Parsec, it detects left recursion and reports a “stack trace” of the left recursion, so that the parser writer can find the origin of left-recursion and fix the problem accordingly. This helps a lot when constructing the parsers. For examples, I constructed the JavaScript parser in one day, and the C++ parser in just two days. There are corner cases missing from the C++ parser and especially the macro part, but it parses significant pieces of code correctly (for example some code from Google’s V8 project code).

Get the code

ydiff is an open source project. You can follow its development on github: yinwang0/ydiff or get its source code from there.

About these ads

4 Comments

  1. Andreas

    Wow! That’s a step into the right direction, IMHO.

    Greetings,
    Andreas

  2. this is great work

  3. Can it analyze unified diff as well?

    • I’m not sure if I understand your question. ydiff is independent of the existing ways of diff programs and formats. It doesn’t work on text lines at all. It processes well structured programs and parses them.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: