PySonar is a static analyzer for Python. It uses a similar technique (abstract interpretation) as Coverity or CodeSonar, but does it in a much simpler and more principled way. Although it does not gather as much information as them, it produces results that are surprisingly accurate in the realms that it works on. To the best of my knowledge, it is the state-of-the-art of Python static analysis.
Demo
To get a quick feel about what kind of information it provides, here is a demo generated by PySonar demonstrating most of its features.
Features
- A static type system.
It is no surprise that a dynamic language can be statically typed. In fact, all languages can be statically typed. It is just a matter of how accurate the approximations are. PySonar’s type system has a blend of features that match Python’s semantics: structural subtyping (duck typing), record types, union types, parametric types, recursive types. Types are first-class citizens — it can analyze code that operates on types (e.g., take types as input or produce types as output). Because Python doesn’t have type annotations and its flexible semantics precludes unification-based type inference, all types in PySonar are inferred by interprocedual analysis.
- Treatment of Python’s dynamism.Static analysis for Python is hard because it has many “dynamic” features. They help to make programs concise, but they also make automated reasoning about Python programs hard. Fortunately, some of these features can be reasonably handled. For example, function or class redefinition can be handled by inferring the effective scope of the old and new definitions. For code that are really undecidable, PySonar uses a universal honest answer: “I don’t know.” Well, not quite so. It attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement) and the condition is undecidable, then PySonar gives it a union type which contains all possible types it can possibly have. By doing that, PySonar reduces false negative rates.

- Path-sensitive analysis.The analysis performed by PySonar is path-sensitive, which enhances the accuracy of type inference and reduces false positive rates. We can see this in an example:
def g(x): if (x > 10): y = 1 else: y = A() ... y = "hi" return yIn this function
g, because we can’t statically know the input value ofx, we must assume that it’s possible to take either branch. In the true branch,yis assigned anint. In the false branch,yis assigned an instance ofA, but is later overwritten by astring. At the end of the false branch,ycan only have the typestring. The output type should be the union of the two branches, so the path-sensitive analysis gives functiongthe typeint -> {int, string}.If the analysis were path-insensitive,gwould be typedint -> {int, string, A}. Notice the extraAin the output type. Since it is not possible forgto output a value of typeA, some call sites ofgmay rely on this fact and will not even test whether the output is of typeA. At these call sites, a path-insensitive analysis will report a false positive type error, while a path-sensitive analysis will know that there is no problem. - Accurate semantic indexing.
Generating code indexes for an Eclipse-like code browser was the main purpose of the project (although I discovered other interesting uses later). PySonar parses the code into an AST and performs type inference on it, and at the same time it builds indexes that respects scopes and types. Because it performs inter-procedural analysis, it is able to find the definitions of attributes inside function parameters.The following image shows that it can accurately locate the fieldx.zwhich refers to the “z” fields in classesB1andC1, but notA1.
- Semantic error reports.PySonar catches type errors statically. The type system is conservative — it reports possible type errors on possible inputs even if some of them may not actually happen. This is a usual trade-off to ensure soundness type systems, and PySonar is no exception. An interesting fact is that because of Python’s duck typing, most of the time type errors appear as “trying to access undefined field”. It also reports some other errors that can be easily catched, such as “unreachable code”.

Limitations
- It does whole-program analysis and has exponential worst-case time complexity. Fortunately, like the Hindley-Milner system, the worst case happens very rarely.
- It requires all source code to be available. This could be a problem for closed-source projects.
Availability
There had been two versions of PySonar. Version 1 is open-sourced and merged into Jython. It is available from Jython’s indexer package. Version 1 hasn’t interprocedual flow analysis, so some variables cannot be resolved. Version 2 has all the features described above and lots of clean-ups, but it is not open-sourced.

Oh my goodness. That’s going to be useful. Eagerly awaiting the open-source release!
gotya.So now you are in google,right?
No. I only worked as an intern for Google this summer.
你复活了!!!
I’m impressed! Is the source code available? I’m a PhD student working on Refactoring for Python and I would love to compare and contrast our techniques.
Yours,
Alex
I am also interested in PySonar — I’m inheriting a python project and being able to run static analysis on it would be great!
Hello Yin, thank you for this posting. I found that the second link is not working now. Can you fix that? Thanks again.
Those two links are for identical files. The second link is for people who hasn’t access to Dropbox ;-) So if you have access to the first link, you don’t need to visit the second.
Do you have any idea when (if) the source code might be made available?
I have no idea whether it will be open-sourced this time, but the first version of the analyzer (written in 2009) was merged into Jython and is available from jython.com. But it doesn’t infer types for function parameters and has certain ugliness. The newer one had it almost completely rewritten and simplified, and can infer most of the types. I have no idea whether it will be open-sourced or not. I’m considering writing an open-source version in Python. The general structure of the thing is the same though. You can check it out through svn if you don’t mind look at the first version. It was in the package called “indexer” there. The command line is:
svn co https://jython.svn.sourceforge.net/svnroot/jython/trunk jython
Yin,
I am working in the distributed computing lab at Evergreen (blogs.evergreen.edu/oslab) and I’m interested in experimenting with indexing and refactoring huge sets of code. I have been studying your indexer and really like it (I find its type inferencing to be very fascinating). I think I would like to use your python version of the indexer rather than the one inside Jython. I will be experimenting with it over the coming days and will see what I can do.
It seems like I could perhaps combine its inferencing/indexing with your ydiff’s code comparison along with other tools to analyze complexity, dependencies, etc to produce some kind of organized output that gives a nice overview of lots of code (maybe in xml or something, that could then be read by web browsers and perhaps editors). Let me know if you have any advice or thoughts for this.
Jay
Jay,
It is my pleasure that you like my projects. The Python version of PySonar still has lots of work to be done, although the parts which I consider the most important are already there (primitives, function and union types, calls, sequencing, branches, termination check, etc.). One important thing missing is classes (product types). The scoping of classes and methods caused some major headache to me when I did the Jython version, but I finally have them pretty nicely implemented. If you would like to add classes to it, feel free to consult the Jython version. I will be adding it any way, but I’m spending most of my time on another project.
The blog comments are of limited functionality. Please feel free to send me email: yinwang0@gmail.com to continue the discussion.
Languages that make types truly first class allow programmers to lie about the types of their variables:
foo(Type T, T X) { … }
…
foo(T, X); // no guarantee that the type of X is equal to the value of T; and it is halting problem hard to prove that the type of X should be the value of T
What you are describing here is a parametric polymorphic type. When foo is applied as foo(T, X), we will know the actual type T, thus we will be able to check the type of X.
Yin, could you share us the feeling of being the internship in Google?
Working in Google, or doing research in university, which one may be more suitable to you ?
As far as I can tell, working in a company (e.g. Google) is very different from doing research. The ways of research are also very different from university to university. So it depends on which company/university you are at. You are welcome to send me email for more details (yinwang0@gmail.com).
How long do you manage to finish such a complex work? I heard about abstract interpretation when I take the course of static analysis. I always wondering wouldn’t it be a good idea if such interpreter can be embedded into the concrete(real) interpreter so that it can utilize some information from it. And I see so called “concrete-symbolic” testing seems to implement such an idea. But I think it must be very tough to implement that… How long take you to learn all these stuff?