PySonar: a type inferencer and indexer for Python

pysonar2

PySonar is a type inferencer and indexer for Python. Developed during two summer internships (2009, 2010) at Google, it went through several different designs, until it incorporates a powerful type system and a control-flow-aware interprocedural analysis.

Compared to style-checking tools or IDEs, PySonar analyzes programs in deeper ways and produces more accurate results. One superficial phenomenon of this difference is that it resolves more names than IDEs (the current resolution rate is about 97% for Python’s standard library).

Demos

To get a quick feel about what PySonar can do, here is a sample analysis result for a small fraction of Python’s standard library.

What’s in there

  1. A powerful type system. In addition to the usual types you can find in programming languages, PySonar’s type system has union types and intersection types – two of the most powerful elements I have found during my research. They are rarely found in programming languages. I know of only two languages with statically checked union types: Typed Racket and Ceylon. Different from these languages, PySonar can work without any type annotations. It infers all the types by doing interprocedural analysis. Thus the type system of PySonar is actually more powerful than these languages, but because of the lack of the help from type annotations, it has certain limitations. Unlike the types in programming languages, the inferred types can’t be used to guarantee the complete absence of type errors. Also the types only capture how the functions are currently used, they may change as you make more calls to them, so the types’ use as specifications is limited. Despite of these limitations, the inferred types are still very informative. They help to produce more accurate code indexes.
  2. Control-flow aware interprocedural analysis. Because Python has very dynamic and polymorphic semantics and doesn’t contain type annotations, an intra-procedural type inference systems such as the Hindley-Milner system will not work. I actually had implemented a HM-like system in the first version of PySonar, but it didn’t work very well. As a consequence, all types are inferred by an interprocedural analysis which follows the control-flow and some other aspects of the semantics.
  3. Handling of Python’s dynamism. Static analysis for Python is hard because it has many dynamic features. They help make programs concise and flexible, but they also make automated reasoning about Python programs hard. Some of these features can be reasonably handled but some others not. For example, function or class redefinition can be handled by inferring the effective scope of the old and new definitions. PySonar also treats types as first-class entities and keep track of them (as shown in the following picture). For code that are really undecidable, PySonar attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement), PySonar gives it a union type which contains all possible types it can possibly have.
  4. Medium path-sensitivity. The analysis performed by PySonar is path-sensitive to some degree, which enhances the accuracy of type inference. We can see this in an example:
    def g(x):
      if (x > 10):
        y = 1
      else:
        y = A()
        ...
        y = "hi"
      return y

    In this function g, because we can’t statically know the input value of x, we must assume that it’s possible to take either branch. In the true branch, y is assigned an int. In the false branch, y is assigned an instance of A, but is later overwritten by a string. At the end of the false branch, y can only have the type string. The output type should be the union of the two branches, so it gives function g the type int -> {int, string}. If the analysis were path-insensitive, g would be typed int -> {int, string, A}. Notice the superfluous A in the output type.

    PySonar’s path sensitivity is limited. Completely path-sensitive analysis will not merge the types even after the two branches merge, thus not creating the union types. Fully path-sensitive analysis is more expensive and harder to compute although it is more accurate.

  5. High accuracy semantic indexing
    PySonar originated as part of Google’s Grok project, an Eclipse-like code browser used internally at Google. Because of that, generating semantic indexes was the main purpose of PySonar although it may be put to other uses. PySonar parses the code into an AST and performs type inference on it, and at the same time it builds indexes that respects scopes and types. Because it performs interprocedural analysis, it is often able to find the definitions of attributes inside function parameters. This works across functions, classes and modules. The following image shows that it can accurately locate the field x.z which refers to the “z” fields in classes B1 and C1, but not A1.

Rough Edges

Because of Python’s complexity, there may be some subtle language constructs implemented incorrectly and produce weird results. If you see those, or see names that you hope to be resolved but not, please file a bug on my GitHub repository.

Availability

The code is opensource from my GitHub repository.

Users

I normally don’t keep track of users of my code unless they contact me, but it will keep me motivated improving it if you send me a thank-you letter or buy me a coffee. I may provide you additional help if you have trouble with the code.

Here are the only users I know of:

  • Google. Google uses PySonar 1.0 to index millions of lines of Python code, serving internal code search and analysis services such as Grok and Code Search
  • SourceGraph. SourceGraph is a semantic code search engine. They use PySonar to index hundreds of thousands of opensource Python repositories. They started using PySonar 1.0 as the Python analysis for their site. I recently joined them and finished integrating PySonar 2.0
About these ads

24 Comments

  1. Oh my goodness. That’s going to be useful. Eagerly awaiting the open-source release!

  2. duchao

    gotya.So now you are in google,right?

    • No. I only worked as an intern for Google this summer.

  3. 你复活了!!!

  4. Alex Lamaison

    I’m impressed! Is the source code available? I’m a PhD student working on Refactoring for Python and I would love to compare and contrast our techniques.

    Yours,

    Alex

  5. jeff

    I am also interested in PySonar — I’m inheriting a python project and being able to run static analysis on it would be great!

  6. Soonho Kong

    Hello Yin, thank you for this posting. I found that the second link is not working now. Can you fix that? Thanks again.

    • Those two links are for identical files. The second link is for people who hasn’t access to Dropbox ;-) So if you have access to the first link, you don’t need to visit the second.

  7. Llywelyn

    Do you have any idea when (if) the source code might be made available?

    • I have no idea whether it will be open-sourced this time, but the first version of the analyzer (written in 2009) was merged into Jython and is available from jython.com. But it doesn’t infer types for function parameters and has certain ugliness. The newer one had it almost completely rewritten and simplified, and can infer most of the types. I have no idea whether it will be open-sourced or not. I’m considering writing an open-source version in Python. The general structure of the thing is the same though. You can check it out through svn if you don’t mind look at the first version. It was in the package called “indexer” there. The command line is:

      svn co https://jython.svn.sourceforge.net/svnroot/jython/trunk jython

  8. Yin,
    I am working in the distributed computing lab at Evergreen (blogs.evergreen.edu/oslab) and I’m interested in experimenting with indexing and refactoring huge sets of code. I have been studying your indexer and really like it (I find its type inferencing to be very fascinating). I think I would like to use your python version of the indexer rather than the one inside Jython. I will be experimenting with it over the coming days and will see what I can do.

    It seems like I could perhaps combine its inferencing/indexing with your ydiff’s code comparison along with other tools to analyze complexity, dependencies, etc to produce some kind of organized output that gives a nice overview of lots of code (maybe in xml or something, that could then be read by web browsers and perhaps editors). Let me know if you have any advice or thoughts for this.
    Jay

    • Jay,

      It is my pleasure that you like my projects. The Python version of PySonar still has lots of work to be done, although the parts which I consider the most important are already there (primitives, function and union types, calls, sequencing, branches, termination check, etc.). One important thing missing is classes (product types). The scoping of classes and methods caused some major headache to me when I did the Jython version, but I finally have them pretty nicely implemented. If you would like to add classes to it, feel free to consult the Jython version. I will be adding it any way, but I’m spending most of my time on another project.

      The blog comments are of limited functionality. Please feel free to send me email: yinwang0@gmail.com to continue the discussion.

  9. Languages that make types truly first class allow programmers to lie about the types of their variables:

    foo(Type T, T X) { … }


    foo(T, X); // no guarantee that the type of X is equal to the value of T; and it is halting problem hard to prove that the type of X should be the value of T

    • What you are describing here is a parametric polymorphic type. When foo is applied as foo(T, X), we will know the actual type T, thus we will be able to check the type of X.

  10. Yin, could you share us the feeling of being the internship in Google?
    Working in Google, or doing research in university, which one may be more suitable to you ?

    • As far as I can tell, working in a company (e.g. Google) is very different from doing research. The ways of research are also very different from university to university. So it depends on which company/university you are at. You are welcome to send me email for more details (yinwang0@gmail.com).

  11. wangyi

    How long do you manage to finish such a complex work? I heard about abstract interpretation when I take the course of static analysis. I always wondering wouldn’t it be a good idea if such interpreter can be embedded into the concrete(real) interpreter so that it can utilize some information from it. And I see so called “concrete-symbolic” testing seems to implement such an idea. But I think it must be very tough to implement that… How long take you to learn all these stuff?

  12. I’ve submit this web page to HackNews, should let more people know about PySonar :)

  13. szaialt

    Can you help me?

    java -jar pysonar2-master/target/pysonar-2.0-SNAPSHOT.jar
    Exception in thread “main” java.lang.UnsupportedClassVersionError: org/yinwang/pysonar/demos/Demo : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    Could not find the main class: org.yinwang.pysonar.demos.Demo. Program will exit.

    • Yin Wang

      Can you post this to the github issue and let me know your JRE version, and the rest of the output since the command line?

      Thanks.

      • szaialt

        How could I post this to the github?

        java -version
        java version “1.6.0_30″
        OpenJDK Runtime Environment (IcedTea6 1.13.1) (6b30-1.13.1-1ubuntu2~0.12.04.1)
        OpenJDK Client VM (build 23.25-b01, mixed mode, sharing)

    • Yin Wang

      It’s now clear already. You need Java 7.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: