
Category Archives: programming languages

RubySonar: a type inferencer and indexer for Ruby

Following PySonar, my type inferencer and indexer for Python, I have built a similar static analysis tool for Ruby. The result is a new open-source project, RubySonar. RubySonar can now process the full Ruby standard library, Ruby on Rails, and Ruby apps such as Homebrew.

RubySonar’s analysis is inter-procedural and is sensitive to both data-flow and control-flow, which makes it highly accurate. RubySonar uses the same type inference technique as PySonar, and can thus resolve some of the difficult cases that challenge even a good Ruby IDE.


Posted on February 3, 2014 in programming languages, static analysis

 

Programs may not be proofs

(photo: Haskell Curry)

Dear Haskell Curry and W.A. Howard,

Like many others, I have a high appreciation of the beauty in your works, but I think your discoveries about the Curry-Howard correspondence have been taken too far by the programming languages research community. Following several of your examples showing the similarity between types and logic theorems, some researchers have come to believe that all programs are just proofs and their types the corresponding theorems. In their own words: “Types are theorems. Programs are proofs.”

I think it’s just the other way around: “Proofs are programs, but programs may not be proofs”. All proofs are programs, but only some programs are proofs. In mathematical terms, programs are a strict superset of proofs. Many programs are not proofs simply because proving things is not why they were made; they exist to serve other purposes. Some of them (such as operating systems) may not even terminate, thus failing the most important criterion of being a proof, yet they are perfectly legitimate and important programs. Programs are more physical than math and logic. They are real things, like flowers, trees, animals (or their excrement), cars, paintings, buildings, mountains, the earth and the sun. Their existence is beyond reason. Calling all of them proofs is a bit far-fetched, in my opinion :-)

Sometimes we can derive theorems from programs and say something about their properties, but far too often we can’t. This is similar to the fact that we can sometimes derive mathematical theorems from the real world, but we don’t always succeed. When we fail to predict the future, we can still live it. Similarly, when math and logic fail to predict the behavior of programs, the programs may still run without problems. It would be nice if more programs were designed in such a way that we could predict how they will behave, but there is a limit to how far we can see into their future. And we should be thankful for this limit: if all the future could be predicted, programs might not be worth writing, and life might not be worth living.

 

Posted on January 17, 2014 in logics, philosophy, programming languages, proofs

 

Purely functional languages and monads

In general, functional languages are pleasant to work with because they support first-class functions, which are a very powerful modeling tool. But if you pursue the extreme and use a purely functional language, you get adverse effects similar to those of OO design patterns. In a conventional OO language, having to use OO design patterns makes hard things even harder; in a purely functional language, having to use pure functions to model side-effects makes trivial things hard and hard things impossible. So of the two evils, OOP is the lesser one, because at least easy things are still easy in it.

The problem with pure FP is: there exist things that are not pure.

Side-effects are real

Electrical engineers probably have the best understanding of purely functional languages, because electric wires are pure by default (if you are not paranoid enough to count heat as a side-effect). Purely functional languages are analogous to combinational logic circuits. They are very useful and an important part of a system, but you can’t build complex systems from them alone. This is why people invented flip-flops and sequential logic circuits. If you look at the design of a flip-flop, you will start to appreciate its ingenuity: it creates memory out of “pure circuits”. And that memory is where the side-effects come from.

So pureness comes for free, but side-effects took effort to invent. Unfortunately, purely functional languages disregard a large portion of the physical capabilities provided by sequential circuits. Instead, they are obsessed with simulating the effects on their own. There is a difference between the physical and the simulated: the simulated can’t be as efficient as the physical.

For a simple example, many efficient data structures rely on mutation to create connections between their components (think of circular data structures and networks). In such structures, side-effects act much like a physical force that holds the parts together and updates them as needed. To implement this in a purely functional language, you have to use indirect means: every time you change the structure, you create a new one that shares most of the old. But then you get into the trouble of keeping track of where the “current” structure is. It moves through the dataflow, and you have to pass it along. Had you used side-effects, the structure could stay in one place, so you would never lose track of it, and you would not need to pass it on. This saves lots of administrative code.
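To make this concrete, here is a minimal Haskell sketch (my own illustration, with made-up names, not from any particular library) of that bookkeeping: each “update” returns a new map, and the caller must remember to thread the newest version through by hand.

import qualified Data.Map as Map

type Accounts = Map.Map String Int

-- A pure "update": builds a new map that shares most of the old one.
deposit :: String -> Int -> Accounts -> Accounts
deposit name amount accts = Map.insertWith (+) name amount accts

main :: IO ()
main = do
  let a0 = Map.fromList [("alice", 100)]
      a1 = deposit "alice" 50 a0  -- a0 is unchanged; the "current" state is now a1
      a2 = deposit "bob" 30 a1    -- ...and now a2; the caller does the bookkeeping
  print (Map.toList a2)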

Sharing the old structures also induces extra levels of indirection in the data structure, which causes significant overhead. Thus purely functional data structures can’t really match the performance of their impure counterparts. They also create lots of objects, which stresses the garbage collector.

Some people say that purely functional data structures reduce contention under heavy concurrency, but notice that if the observers of the old data structure “purely update” it, they will hold a structure different from its current state. From that point on, the universe is “forked”, and they live in completely different worlds: you lose consistency. If you really want all threads to see the same data structure, then the contention is unavoidable, because you need a channel for the information to pass through. Nothing can save you from the contention, because there is a data dependency.

Everything is persistent in purely functional data structures, but how often do you need the outdated versions? With impure data structures, you still have the choice of making copies when necessary, achieving the same persistence, just with better performance.

Monads are design patterns

Purely functional languages also complicate programs and impose a huge cognitive cost. Modeling side-effects with pure functions comes from the same mentality as wrapping functions inside objects in OOP. Both are over-engineering, and both cause unnecessary manual labor.

One of the major “design patterns” in purely functional languages is the “monad”, a highly stylized way of structuring side-effects. If you look deep into them, monads make programs complicated and hard to write, and monad transformers are in essence a hack to get around monads’ limitations; they are not a principled way of composing monads. Representing side-effects with monads is as convoluted as writing interpreters or compilers with visitor patterns. On this matter, I wrote a short article some time ago which not many people have read:

To write programs in a purely functional programming language is much like living in a wired world. In such a world, there are no electromagnetic waves nor sound waves, so you don’t have wifi, cell phones, satellite TV, radios, etc. You don’t even have light or sound. In other words, everything is blind and deaf.

All information must pass through wires or pipes, connected by switch boxes which we call “monads”. You must carefully route all the wires and pipes and connect them correctly before any information processing device, including your eyes and ears, can properly work.

Trivial things in other languages (such as random number generators or circular data structures) become non-trivial in a purely functional language. Easy things often become research problems when you try to write them using monads, so you often see papers with titles like “A Monadic Approach to (some already-solved problem)”.
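For instance, here is a sketch of a random number generator in pure style (a toy linear congruential generator of my own; the constants are the classic C-library ones, nothing prescribed): the seed cannot hide in a mutable cell, so every caller must accept and return it explicitly.

-- Seed in, (value, new seed) out: the seed is threaded by hand.
nextRand :: Int -> (Int, Int)
nextRand s = let s' = (1103515245 * s + 12345) `mod` 2147483648
             in (s', s')

threeRands :: Int -> ([Int], Int)
threeRands s0 =
  let (a, s1) = nextRand s0  -- plumbing that a mutable seed
      (b, s2) = nextRand s1  -- would have handled silently
      (c, s3) = nextRand s2
  in ([a, b, c], s3)

main :: IO ()
main = print (fst (threeRands 42))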

About this I have an interesting story to tell. Once upon a time, my PhD advisor Amr Sabry tried to reimplement Dan Friedman’s miniKanren (a logic programming language) in Haskell, but he couldn’t figure out how to compose the monads. He asked for help from Oleg Kiselyov, arguably the world’s most knowledgeable person about getting around Haskell’s type system. And in case you don’t know, Amr Sabry is probably the world’s most knowledgeable person about purely functional languages and side-effects; his paper What is a Purely Functional Language? is often cited as the official definition of “pureness”. After solving the problem with Oleg’s help, they coauthored a Functional Pearl paper. Ironically, Dan Friedman, the original author of that piece of code, had no such trouble writing it in Scheme in the first place. Certainly Amr was under no obligation to figure out how to compose the monads; he and Oleg wrote the Haskell code just for fun. But this story tells me something about monads: they make things unnecessarily complicated.

How did Dan Friedman write the code so easily? He just passed the states directly in and out, without using any monads; or you could say that he used the essence of monads without actually using them. Following Dan’s style, I rewrote miniKanren and added constraint logic programming and a highly sophisticated negation operator to it. All this was done within three weeks during my first semester as a PhD student, alongside Dan’s B521, Amr’s B522, other course loads and teaching duties. I would certainly have been bogged down had I used monads, and I see no point in translating the code into a monadic style. It is just so much simpler without monads.

(A correction of the historical record, at Prof. Friedman’s request: he would like to give much of the credit for the current miniKanren code’s simplicity to Chung-chieh Shan, who at one point simplified the code into its current style.)

Equational reasoning can’t save the world

On the safety side, purely functional languages don’t have much of an advantage either. Some people claim that the value of monads is that they explicitly delimit side-effects and support equational reasoning, but programs are not always as simple as algebraic formulas such as a*(b+c) = a*b + a*c. If all programs were that simple, we would not need monads at all. Monads don’t really make your programs easier to reason about.

Pure functions always return the same results when given the same input, but every monadic function has an extra “implicit” argument which is different at every call to the same function. So although it is still true that “pure functions always return the same results when given the same input”, the problem is that you never get the same input, because of that ever-changing extra argument! You don’t know what’s inside that argument. That argument is called the “state”.
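This “extra argument” is not a metaphor. The standard state monad is, underneath, literally a function that takes the state as a parameter. Here is the usual few-line construction, as a sketch (equivalent in spirit to the library version, not copied from it):

-- A state-monadic computation, desugared: a function s -> (a, s).
newtype State s a = State { runState :: s -> (a, s) }

instance Functor (State s) where
  fmap f (State g) = State (\s -> let (a, s') = g s in (f a, s'))

instance Applicative (State s) where
  pure a = State (\s -> (a, s))
  State f <*> State g = State (\s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s''))

instance Monad (State s) where
  State g >>= k = State (\s -> let (a, s') = g s in runState (k a) s')

-- "The same input" never recurs: the hidden n changes on every call.
tick :: State Int Int
tick = State (\n -> (n, n + 1))

main :: IO ()
main = print (runState (tick >> tick) 0)  -- prints (1,2)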

There is no way to statically know the values inside a state monad, for the same reason that you don’t know the values inside the heap: the state monad is in essence the same thing as the heap. The heap maps memory addresses to values; a state monad maps variables to values. If you have ever written a static analysis, it will be clear to you that monadic code in essence moves part of a static analysis into the user’s code; every piece of monadic code reimplements part of a static analysis. So monadic code is just as hard to analyze as code with side-effects, only it takes a lot more work to write. You can write horribly side-effective code with monads which nobody can understand and no static analysis tool can help with. There is nothing monads can make easier that can’t be done by static analysis. Static analysis researchers know this very well.

You can write pure functions in any language

Monads are contagious. Once code gets into a monad, it is not easy to get out. Having to specify side-effects explicitly in the types is similar to having to declare checked exceptions in Java: you must either “handle” an effect or declare that you pass it on. Why should programmers write all this when it can easily be inferred by static analysis? Static analyses use what are essentially monads (or even more advanced techniques), and they take the burden of writing monadic code off the programmer.

Of course, overusing side-effects makes programs harder to analyze, but you can reduce side-effects and write pure functions even in an impure language. For example, the following C function is a pure function, satisfying every requirement of the definition of “pureness”.

/* Pure despite the local mutation: the result depends only on x,
   and the writes to y and z are invisible to the caller. */
int f(int x) {
    int y = 0;
    int z = 0;
    y = 2 * x;
    z = y + 1;
    return z / 3;
}

Advanced static analysis tools would have no trouble figuring out things about this kind of code. They would know that this function is pure without your writing any annotations, and they can do a lot more for you than that. So pure functions don’t belong only to purely functional languages. You can write pure functions in any language (including assembly), but the point is that you should be allowed to use side-effects too, especially when they make things easier.

Thinking critically about mathematics as a language

Looking back into history, dogma from mathematics has been the driving force behind purely functional languages. Mathematical functions are simple and beautiful, but unfortunately they work well only when the thing you are trying to model is pure by nature. Proponents of purely functional languages like using buzzwords such as “category theory”, and call those who don’t understand it “uninitiated”. I know a considerable amount of category theory. Even category theorists themselves call it “abstract nonsense”, because it is to a large extent a grotesque way of saying what other mathematicians already know, in much the same sense that GoF design patterns are a grotesque way of saying what most decent programmers already know. Category theory is the analog of design patterns in mathematics.

If you read Gottlob Frege’s article Function and Concept, you will be surprised that most mathematicians got functions wrong before his writing, and that was just a little over a hundred years ago. Mathematics has done lots of things wrong with its language. This was pointed out long ago by Gerald Sussman in his Structure and Interpretation of Classical Mechanics and more recently in his InfoQ talk. There is a lot of truth in his words. There is no reason programming language designers should blindly follow the ways of mathematics, because it is just another quirky language.

What is functional programming, really?

The above is not a disagreement with functional programming in general. On the contrary, I value functional programming highly. I just disagree with the dogma of the “purely functional” kind. Impure functional languages such as Scheme and ML don’t have these problems; they have “benign side-effects”. In fact, in Scheme functions are called “procedures”, not “functions”, because its designers knew that they are not functions in the mathematical sense, and they intended them to be not necessarily pure. The purely functional language community often tries to steal the term “functional programming” from traditional functional languages (Lisp, Scheme, ML), as if only purely functional languages deserved the name. This is unfair and harmful. We should be able to use the term “functional programming language” for any language with a correct implementation of first-class functions.

Don’t fall in love with your model

Everything starts to do harm when pursued to the extreme. Purely functional programming tries to fit the world into its model, but the world works in a completely independent way. It is wrong to treat everything as a nail just because you have a hammer. Only by observing reality can we get out of the religions that limit us. Don’t fit the world to your model. Fit your model to the world.

 

Posted on November 16, 2013 in functional programming, programming languages

 

psydiff: a structural comparison tool for Python


Psydiff is a structural comparison tool for Python, written in Python itself. The main algorithm and UI of psydiff are almost the same as those of ydiff.

If interested, you can see a demo of it here (psydiff comparing itself):

http://www.yinwang.org/resources/pydiff1-pydiff2.html

All the source code can be downloaded from my GitHub repo:

http://github.com/yinwang0/psydiff

It’s still at an early stage of development. I appreciate your bug reports, feature requests and contributions.

 

Posted on July 6, 2013 in programming languages, software, tools

 

Null reference may not be a mistake


The null pointer is considered a “billion-dollar mistake”. I had been wondering why there is such a notion, until I saw the video in which Tony Hoare claims it as his mistake. In fact, he didn’t really say that null pointers should not be used.

From this video, you can see that introducing the null reference was not really a mistake. On the contrary, null references are helpful and sometimes indispensable. The mistake is not the existence of null pointers, but how the type system treats them. Unfortunately, most languages (C++, Java, C#, …) don’t treat them correctly.

Every class type A in Java is in fact a union type {A, null}, because you can use null wherever an A object is expected. {A, null} is almost equivalent to the Maybe type of Haskell, where null corresponds to Haskell’s Nothing. So the real trouble is that an annotation like {String, null} should be distinguished from String, so that it is clear whether null can possibly be a value.
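Here is the correspondence in Haskell, as a sketch (find1 is a hypothetical function, mirroring the Java example below):

-- Maybe String plays the role of {String, null}.
find1 :: Bool -> Maybe String
find1 ok = if ok then Just "okay" else Nothing

useIt :: Bool -> Int
useIt ok =
  case find1 ok of       -- the compiler forces us to handle Nothing;
    Just s  -> length s  -- there is no way to forget the "null check"
    Nothing -> 0

main :: IO ()
main = print (useIt True)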

Unfortunately, most languages don’t provide a convenient union type that lets you put String and null together (Typed Racket is an exception). If Java had union types, we could write something like:

{String, null} find1() {
  if (...) {
    return "okay";
  } else {
    return null;
  }
}

This says: find1 may return a name, which is a String, or it may return nothing. Because of the union type {String, null}, the type system knows that you should check for null after calling find1(), so it will force you to write a null check:

String s = find1();
if (s != null) {
  x = s.length();
}

In comparison, if we define a slightly different function find2, with a different return type:

String find2() {
    ...
    return "okay";
}

From the return type we know that find2 will never return null, so the type checker can let you use the String without checking:

String s = find2();
x = s.length();

 

Posted on June 3, 2013 in programming languages, types

 

Back to the future of databases

Why do we need databases? “What a stupid question,” I can already hear some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, that we don’t have enough memory to hold all the data. But why can’t we just swap the objects out to disk and load them back when needed? The answer is that we can, but not in Unix, because Unix manages memory as pages, not as objects. There were systems before Unix that managed memory as objects and performed object-granularity persistence. That is a feature ahead of its time, and it remains far more advanced than the current state of the art. Here are some such systems:

IBM System/38

Lisp Machine

Oberon

Those systems don’t really need databases (in the usual sense). Data integration was seamless and transparent to the programmer: you didn’t need to know of the existence of a “disk”, a “file system”, or a “database”. You could just pretend to allocate an infinite number of objects and work on them in the most natural way. Unfortunately, most of those systems were either terribly expensive or had problems in other aspects of their design, and in the end they seem to have died out.

Good ideas never die. Nobody uses those systems today, but that is not to say there is nothing to learn from their design. On the contrary, some of their ways are far superior to the current state of the art of Unix-based systems. Unix will never reach that level of elegance and power.

But anyhow, Unix rules the world. We can live with it, but it is just mediocre. Please don’t believe everything the modern operating systems books tell you. Sometimes you have to look into the past to find the future. As Einstein said, “Nothing is more needed to overcome the modernist’s snobbishness.”

Unix used to be free, but you get what you pay for. Although there is a thing called “virtual memory”, your programs can’t just allocate objects and operate on them without any knowledge of the “file system”. Nothing works out of the box in Unix; in fact it is far from that. Unix and its “philosophy” are a constant source of trouble. It is more like a “non-operating system” than an “operating system”: it leaves too much work for you to do, and leaves more than enough rope to hang yourself.

Unix builds its reputation and authority by blaming its users: if you don’t know how to use me, you are an idiot! This is the same trick the weavers played on the emperor: if you can’t see the clothes, you are either stupid or incompetent. What a powerful way to cover up any crap!

Unix’s incapability is why people “invented” databases. The combination “Unix + databases” is supposed to be a cheap replacement for those advanced systems, in which programs did not need to know of the existence of such a second-level data storage layer. But because of some irreparable design issues, Unix still can’t achieve that even with the help of databases. Databases, to this day, still rely on Unix’s low-tech mechanisms, such as memory-mapped files, to store data, which causes complications and performance problems.

However, databases were somehow considered a big thing, and the people who made them became some of the richest men in the world. Consequently, you have to take database classes if you want a computer science degree. So here is an ultimate cheat sheet for those who really want to know what a database is. You will not need to sit through a semester’s course if you remember the few things I put below. Trust me, many students got A+’s because I told them this ;-)

Every “row” in a database “table” is a data structure, much like a “struct” in C or a “class” in Java. A table is then an array (or list) of such data structures. The “keys” of a database table are in essence “persistent memory addresses”. When serializing an object, we can’t just put its memory address on disk, because the address may not be the same when the object is reloaded into memory. This is why we need “keys”. In a sense, “keys” are a more general notion than “addresses”: addresses are just keys that happen to be integers.

There is a white lie in the above paragraph: I didn’t mention that a database table contains some redundancy in comparison to a serialized data structure. Some data is duplicated across multiple rows, because an RDBMS table row has a fixed width, so it can’t store variable-length data such as arrays. What can you do then? Recognizing that the table is the only thing that can “grow” in a relational database, an obvious solution is to turn the array 90 degrees clockwise and make each element a row in another table. But how do you connect back to the object the array came from? You add the key of that object to each row of this new “array table”. See how the key is duplicated? This is why people “invented” column-based databases (such as Vertica, HBase, etc.) for “compressing” these keys. But what they achieved was essentially making the tables slightly closer to the format of serialized data structures.
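A tiny sketch of my own (made-up names, nothing standard) of this 90-degree turn: a nested array flattened into rows, each row carrying the owner’s key.

-- A value with a nested array, and its flattened "array table" form.
data Student = Student { key :: Int, courses :: [String] }

flatten :: Student -> [(Int, String)]
flatten s = [ (key s, c) | c <- courses s ]

main :: IO ()
main = print (flatten (Student 1 ["math", "cs"]))
-- [(1,"math"),(1,"cs")]: the key 1 is duplicated across the rows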

You create the problem, and then you solve it. And you call these two inventions.

Keys are persistent pointers. Whenever you need to dereference a pointer, you do a “join” in the database; thus “join” is equivalent to “following pointers”.
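As a sketch of this view (a toy example with made-up names, not real database code), model each table as a map from keys to rows; a “join” is then literally a pointer dereference done with lookup:

import qualified Data.Map as Map

type Key = Int

data Student = Student { sname :: String, advisorKey :: Key }
data Advisor = Advisor { aname :: String }

students :: Map.Map Key Student
students = Map.fromList [(1, Student "kim" 10)]

advisors :: Map.Map Key Advisor
advisors = Map.fromList [(10, Advisor "lee")]

-- "SELECT aname FROM Students JOIN Advisors ..." is just two lookups:
advisorOf :: Key -> Maybe String
advisorOf k = do
  s <- Map.lookup k students               -- follow the first persistent pointer
  a <- Map.lookup (advisorKey s) advisors  -- the "join": follow the key
  return (aname a)

main :: IO ()
main = print (advisorOf 1)  -- Just "lee"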

A database “schema” is in essence a “structure type”, like a struct definition in C. For example, the schema created by the following SQL statement

CREATE TABLE Students ( sid CHAR(20),
                        name CHAR(20),
                        login CHAR(20),
                        age INTEGER,
                        gpa REAL )

is equivalent to the C struct

struct student {
  char* sid;
  char* name;
  char* login;
  int age;
  double gpa;
};

(Note that I use a SQL declaration here just because I don’t want to draw a picture of the schema. This equivalence of a relational schema with a structure type has nothing to do with SQL.)

That’s almost the whole story. You have addresses, pointers, a dereference operation, structure types, and arrays/lists of structures, so now you can implement things like linked lists, graphs, etc. With those, you can implement complicated algorithms such as A* search in a database. You just need to take a data structures class and then translate what you learned there into a database language like SQL.

But SQL is a crappy language. It wasn’t designed for programmers; it was designed for manual input by human operators (usually non-technical people such as accountants). You type in a “query”, and the computer prints out a “response”. That is why it is called a “query language”. This language does its job for human operators, but it was then abused far beyond its capabilities: it was interfaced with computer programs to write serious software. I doubt those people knew what they were doing, but it just happened, like many other silly things. There are just so many things you can’t express in that language. The result is a dumb and fragile system held together by band-aids. You have to be very careful, or you lose blood.

If you really want to learn SQL, here is the cheat sheet for it:

The query

SELECT Book.title
 FROM Book
 WHERE price > 100

is equivalent to the Lisp expression

(map (lambda (b) b.title)
     (filter (lambda (b) (> b.price 100)) Book))
This program is then sent to the “database engine” for execution. That is, we move the program to the data, instead of loading the data into the program. That is also the principle behind MapReduce. Have you noticed how easily this can be done with Lisp? You just send the code to the interpreters running on the remote machines!

The problem with SQL is that you need yet another layer of language before programs can operate on the database. SQL is a weak and quirky language. It is not Turing-complete, and in some places it doesn’t even “compose”. You need to combine it with a decent language before you can write serious programs. Your programs send SQL commands to the database to store and load data, just as a human operator would. This is a very low-tech way of data integration: error-prone, inefficient, and subject to security risks such as “SQL injection”.

Indeed, there is a “good component” in SQL, because it has some “relational programming” features. However, another name for “relational programming” is “logic programming”, where languages like Prolog and Datalog excel. They are both more expressive and more elegant than SQL. Considering that Prolog and Datalog appeared much earlier than SQL (1972 and 1977 vs. 1986), I would say that SQL is a step backwards.

Maybe you have seen it by now: for some weird reason, we are still in the Dark Ages of computer programming. We are not supposed to be here, since better-designed systems already existed. It would be foolish to dismiss them as failures; they were just ahead of their time. By looking to the past, we see a way back to the future.

 

Posted on April 5, 2013 in data structures, programming languages

 

On pureness

It is usually a pleasure to use a functional programming language, but it is doubtful whether we should care about the “pureness” of the language. To write programs in a completely pure functional programming language (e.g. Haskell) is much like living in a wired world.

There are no electromagnetic waves nor sound waves in such a world, so you don’t have wifi, cell phones, satellite televisions, radios, etc. You don’t even have light or sound. In other words, everything is blind and deaf.

All information must pass through wires or pipes, connected by switch boxes which we call “monads”. You must carefully route all the wires and pipes and connect them correctly before any information processing device, including your eyes and ears, can properly work.

So I think a language needs parts that are not pure, in order to naturally represent effects such as electromagnetic waves. Normally those effects are called “benign effects”.

 

Posted on March 31, 2013 in opinions, philosophy, programming languages, semantics