Monthly Archives: April 2013

How to reduce the contagious power of ‘const’

For the purposes of optimization and (possibly?) safety, the C++ language asks its users to manually put ‘const’ annotations on variables that they know will or should never be modified. While I admit that this annotation can be useful, I often see it misused, and the misuse often has a contagious effect in which unnecessarily many variables and parameters end up annotated as ‘const’.

In this post, I reason philosophically about the true meaning of the ‘const’ annotation, and propose some tentative principles regarding the use of it. By following them, I hope to reduce the contagious power of ‘const’ to the minimum.

First, the rule for ‘const’ annotations for variables:

Rule 1. Declare a variable as ‘const’ if you want it to be inlined, or you don’t want it to be modified.

By this principle, these examples are legitimate:

const int max = 1000;
const char* name = "permanent name";

These declarations can be global variables as well as local variables. Declaring them as ‘const’ opens up inlining opportunities for them, and also prevents unwanted modifications.
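
To make the inlining point a bit more concrete, here is a small sketch (the names max, buffer and remaining are mine, not from the post): a const int with a constant initializer counts as a compile-time constant in C++, which is exactly what lets the compiler fold it into the code that uses it.

const int max = 1000;

int buffer[max];        // legal: 'max' is a compile-time constant here

int remaining(int used) {
  return max - used;    // the compiler is free to inline the value 1000
}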

Now we just have one other rule. This one concerns the ‘const’ annotations on functions:

Rule 2. As a function, be self-disciplined about what you take, but be relaxed on what you give.

This means: put ‘const’ on a parameter whenever you are sure that you will never modify the input argument, but never put ‘const’ on the return type unless the compiler asks you to.

The reason behind this may take a little bit of thinking. By putting ‘const’ on the input parameter, the function is basically declaring: “Whoever passes me this argument, I promise not to modify it.” “I’m side-effect free on this parameter.” “I’m pure on this parameter.”

This is a good thing, because it gives the callers of the function more freedom in what they can give you. If you don’t declare your “pureness” (on the parameter), some callers may not be able to pass you certain data, because they don’t want their data to be accidentally modified by you. They need their original values intact, because they are going to use them after you return.

So in general, this is a good thing to do in C++:

  void foo(const char* in);
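
Here is a minimal sketch of the freedom this gives to callers (mutate and greeting are made-up names; foo is the declaration above): a caller holding read-only data can pass it to the const-taking function, but not to one that demands a plain char*.

  #include <cstdio>

  void foo(const char* in) {       // promises not to modify *in
    std::printf("%s\n", in);
  }

  void mutate(char* in) {          // makes no such promise
    in[0] = 'X';
  }

  int main() {
    const char* greeting = "hello";  // the caller's read-only data
    foo(greeting);                   // fine: foo is "pure" on this parameter
    // mutate(greeting);             // error: discards the const qualifier
    return 0;
  }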

But there is a catch here: whether you can put ‘const’ on a parameter depends on the function body and all the function calls inside it. Whenever you put ‘const’ on a parameter type, you have to make sure that any function you call does not modify this parameter either. That is to say, constness is a transitive property. For example, something like this will not work:

  void foo(const char* in) {
    bar(in);    // error: cannot pass a 'const char*' argument to a 'char*' parameter
  }

  void bar(char* in) { ... }

That is because bar may modify its parameter, which is also foo’s parameter.

Does this mean that you should declare bar as void bar(const char* in)? It all depends on whether bar actually modifies its argument or not. If it does, then there is no way you can use ‘const’ there, and consequently you cannot declare foo as taking a const char* either. The type of foo should then be “void foo(char* in)”, without the ‘const’.

There is no way you should use ‘const’ in foo then, because the helper bar modifies the data. You have to say this honestly: “I may modify the contents of in, because one of the functions I call modifies it.”

This is where it can get painful in C++, because the other functions you call may be written by other people who don’t follow the same principles. They may not have ‘const’ on their parameter types even though they never modify the parameter in their function bodies. This is not their fault, because adding those ‘const’ annotations is so boring. Some automatic inference could certainly help in these cases, but unfortunately C++ provides no inference for ‘const’.
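
As a hedged illustration of that pain (legacy_length is a made-up stand-in for such a third-party helper, not a real API): if you know for sure that the helper never writes through the pointer, a narrow const_cast at the call site lets you keep foo’s own ‘const’ promise.

  #include <cstring>

  // Hypothetical third-party helper: takes char* although it never modifies the data.
  std::size_t legacy_length(char* s) {
    return std::strlen(s);
  }

  void foo(const char* in) {
    // Safe only because we know legacy_length does not write through 's'.
    std::size_t n = legacy_length(const_cast<char*>(in));
    (void)n;
  }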

On the other hand, why should you NOT put ‘const’ on the return type? This is because if you do, then you are basically announcing: “Whoever calls me cannot modify the value that I return.”

This is like saying in your will: “I will give my house to my son, but he is not allowed to modify it.” Well, the law may give you the right to specify this, but what good does it do for your son? You are limiting the happiness you can provide. Moreover, can this kind of restriction really be enforced? Who has the time or the right to keep an eye on your son and restrict his actions on the house? This just cannot work.

Coming back to the context of programming, what’s the point of not allowing the callers to modify what you GIVE them? By returning, you are giving up control of the return value to the caller. It should be up to the caller, and not you, to decide whether to modify the return value or not. Even if you put ‘const’ annotations in the return type, they are usually ignored by the callers. They just use a const_cast whenever the ‘const’ gets in their way. This makes the ‘const’ annotation on return types virtually useless.

If you put ‘const’ on the return type and the callers respect it, then it will contaminate the whole data path wherever the return value goes. This is nasty and unnecessary.

So in general, this is not a good thing to do:

  const char* bar();

This is because your caller would have to write something like this:

  void baz() {
    const char* ret = bar();
    bizaar(ret);
  }

And the receiver of the return value (bizaar) must be defined as something like:

  void bizaar(const char* in) {
    bizaar2(in);
  }

And now bizaar2 needs to be declared as:

  void bizaar2(const char* in);

And so on… Do you see how the contagion started? It started because bar() returned a const value, which is unnecessarily restrictive. So in general, it is not a good idea to return ‘const’.

I’ll leave this as a thought exercise: when can we reasonably use ‘const’ on the return value?

 

Posted on April 8, 2013 in programming, types

 

Back to the future of databases

Why do we need databases? What a stupid question, I can already hear some people say. But it is a legitimate question, and here is an answer that not many people know.

First of all, why can’t we just write programs that operate on objects? The answer is, obviously, that we don’t have enough memory to hold all the data. But why can’t we just swap the objects out to disk and load them back when needed? The answer is yes, we can, but not in Unix, because Unix manages memory as pages, not as objects. There are systems that existed before Unix which managed memory as objects and performed object-granularity persistence. That was a feature ahead of its time, and it remains far more advanced than the current state of the art. Here are some pictures of such systems:

[Image: IBM System/38]

[Image: Lisp Machine]

[Image: Oberon]

Those systems don’t really need databases (in the usual sense). Data integration is seamless and transparent to the programmer. You don’t need to know about the existence of a “disk”, a “file system”, or a “database”. You can just pretend that you can allocate an infinite number of objects and work on them in the most natural way. Unfortunately, most of those systems were either terribly expensive or had problems in other aspects of their design. Eventually, they seem to have died out.

Good ideas never die. Nobody uses those systems today, but this is not to say that there is nothing we can learn from their design. On the contrary, some of their ways are far superior to the current state of the art of Unix-based systems. Unix will never reach that level of elegance and power.

But anyhow, Unix rules the world. We can live with it, but it is just mediocre. Please don’t believe everything that the modern Operating Systems books tell you. Sometimes you have to look further into the past for the future. As Einstein said, “Nothing is more needed to overcome the modernist’s snobbishness.”

Unix used to be free, but you get what you pay for. Although there is a thing called “virtual memory”, your programs can’t just allocate objects and then operate on them without any knowledge of the “file system”. Nothing works out of the box in Unix. In fact, it is far from that. Unix and its “philosophy” are a constant source of trouble. It is more like a “non-operating system” than an “operating system”. It leaves too much work for you to do, and more than enough rope to hang yourself.

Unix builds its reputation and authority by blaming the users. If you don’t know how to use me, you are an idiot! This is the same trick that the weavers played on the emperor: If you can’t see the clothes, you are either stupid or incompetent. What a powerful way to cover the ass of any crap!

Unix’s incapability is why people “invented” databases. The combination “Unix + databases” is supposed to be a cheap replacement for those advanced systems, in which programs don’t need to know about the existence of such a second-level data storage layer. But because of some irreparable design issues, Unix still can’t achieve that even with the help of databases. Databases, to this day, still rely on Unix’s low-tech mechanisms such as memory-mapped files to store data, which causes complications and performance issues.

However, databases were somehow considered a big thing, and the people who made them became some of the richest men in the world. Consequently, you have to take database classes if you want a computer science degree. So here is an ultimate cheat sheet for those who really want to know what a database is. You will not need to sit through a semester’s course if you remember the few things that I put below. Trust me, many students got A+’s because I told them this ;-)

Every “row” in a database “table” is a data structure, much like a “struct” in C or a “class” in Java. A table is then an array (or list) of such data structures. The “keys” of a database table are in essence “persistent memory addresses”. When serializing an object, we can’t just put its memory address onto disk, because the address may not be the same when the object is reloaded into memory. This is why we need “keys”. In a sense, “keys” are a more general notion than “addresses” — addresses are just keys that happen to be integers.

There is a white lie in the above paragraph: I didn’t mention that there is some redundancy in a database table compared to a serialized data structure. Some data is duplicated across multiple rows, because an RDBMS table row has a fixed width, so it can’t store variable-length data such as arrays. What can you do then? Recognizing that the table is the only thing that can “grow” in a relational database, an obvious solution is to turn the array 90 degrees clockwise and make each element a row in another table! But how do you connect it back to the object the array came from? You add the key of the object to each row of this new “array table”. See how the key is duplicated? This is why people “invented” column-based databases (such as Vertica, HBase etc.) for “compressing” these keys. But what they achieved was essentially making the tables slightly closer to the format of serialized data structures.
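
Here is a rough sketch of that 90-degree turn in C++ terms (the Student, StudentRow and ScoreRow types are invented for illustration): the object that owns a variable-length array becomes two fixed-width “tables”, and the parent’s key is duplicated into every row of the child table.

#include <string>
#include <vector>

// The in-memory object: the array of scores lives inside it.
struct Student {
  int id;                     // becomes the "key" when persisted
  std::string name;
  std::vector<int> scores;    // variable-length data
};

// The relational encoding: fixed-width rows only, so the array is turned
// into a second table, with the parent's key repeated in every row.
struct StudentRow { int id; char name[20]; };
struct ScoreRow   { int student_id; int score; };   // the duplicated key

int main() {
  std::vector<StudentRow> students;   // the "Students" table
  std::vector<ScoreRow>   scores;     // the "array table"

  StudentRow alice = {1, "Alice"};
  students.push_back(alice);
  scores.push_back({1, 90});          // key 1 duplicated once per element
  scores.push_back({1, 85});
}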

You create the problem, and then you solve it. And you call these two inventions.

Keys are persistent pointers. Whenever you need to dereference a pointer, you do a “join” in the database, so “join” is equivalent to “following pointers”.
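
A small sketch of that equivalence (the Course and Student types and the sample data are made up): following a pointer in memory becomes looking up a row by its key, which is exactly what a “join” does.

#include <cstdio>
#include <vector>

struct Course  { int id; const char* title; };
struct Student { int id; const char* name; int course_id; };  // a persistent pointer

// "Dereferencing" the persistent pointer: find the row whose key matches.
const Course* find_course(const std::vector<Course>& courses, int key) {
  for (const Course& c : courses)
    if (c.id == key) return &c;
  return nullptr;
}

int main() {
  std::vector<Course>  courses  = {{10, "Databases"}, {11, "Compilers"}};
  std::vector<Student> students = {{1, "Alice", 11}};

  // SELECT Course.title FROM Student JOIN Course ON Student.course_id = Course.id
  for (const Student& s : students) {
    const Course* c = find_course(courses, s.course_id);
    if (c) std::printf("%s takes %s\n", s.name, c->title);
  }
}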

A database “schema” is in essence a “structure type”, like the struct definition in C. For example, the schema created by the following SQL statement

CREATE TABLE Students ( sid CHAR(20),
                        name CHAR(20),
                        login CHAR(20),
                        age INTEGER,
                        gpa REAL )

is equivalent to the C struct

struct student {
  char* sid;
  char* name;
  char* login;
  int age;
  double gpa;
};

(Note that I use a SQL declaration here just because I don’t want to draw a picture of the schema. This equivalence of a relational schema with a structure type has nothing to do with SQL.)

That’s almost the whole story. You have addresses, pointers, a dereference operation, structure types, and arrays/lists of structures, so now you can implement things like linked lists, graphs, etc. With them, you can implement complicated algorithms such as A* search inside a database. You just need to take a data structures class, and then translate what you learned there into a database language like SQL.

But SQL is a crappy language. It wasn’t designed for programmers. It was designed for manual input by human operators (usually non-technical people like accountants). You type in a “query”, and the computer prints out a “response”. That is why it is called a “query language”. The language does its job for human operators, but it was then abused far beyond its capabilities: it was wired into computer programs as the way to write serious software. I doubt whether those people knew what they were doing, but it just happened, like many other silly things. There are just so many things you can’t express in that language. The result is a dumb and fragile system held together by band-aids. You have to be very careful, otherwise you lose blood.

If you really want to learn SQL, here is the cheat sheet for it:

The query

SELECT Book.title
 FROM Book
 WHERE price > 100

is equivalent to the Lisp expression

(map (lambda (b) b.title)
     (filter (lambda (b) (> b.price 100)) Book))

This program is then sent to the “database engine” for execution. That is, we move the program to the data, instead of loading the data into the program. That is also the principle behind MapReduce. Have you noticed how easily this can be done with Lisp? You just send the code to the interpreters running on the remote machines!
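
For comparison, here is roughly the same map/filter pipeline written out in C++ (the Book struct and the sample data are invented for illustration); the point is only that the “query” is ordinary filtering and mapping over a collection.

#include <cstdio>
#include <string>
#include <vector>

struct Book { std::string title; double price; };

int main() {
  std::vector<Book> books = {{"TAPL", 120.0}, {"K&R C", 45.0}};

  // SELECT Book.title FROM Book WHERE price > 100
  std::vector<std::string> titles;
  for (const Book& b : books)        // the "filter" part
    if (b.price > 100)
      titles.push_back(b.title);     // the "map" part

  for (const std::string& t : titles)
    std::printf("%s\n", t.c_str());
}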

The problem with SQL is that you need yet another layer of language before programs can operate on the database. SQL is a weak and quirky language. It is not Turing-complete, and in some places it doesn’t even “compose”. You need to combine it with a decent language before you can write serious programs. Your programs send SQL commands to the database to store and load data, just like a human operator would do. This is a very low-tech way of doing data integration. It is error-prone, inefficient and subject to security risks such as “SQL injection”.

Indeed, there is a “good component” in SQL, because it has some “relational programming” features. However, another name for “relational programming” is “logic programming”, where languages like Prolog and Datalog excel. They are both more expressive and more elegant than SQL. Considering that Prolog and Datalog appeared much earlier than SQL (1972 and 1977 vs. 1986), I would say that SQL is a step backwards.

Maybe you have seen it by now: for some weird reason, we are still in the Dark Ages of computer programming. We are not supposed to be here, since better-designed systems already existed. It would be foolish to dismiss them as failures. They were just ahead of their time. By looking to the past, we see a way back to the future.

 

Posted on April 5, 2013 in data structures, programming languages

 

Why is indexing faster than binary search

We all know that indexing into an array takes only O(1) time, while searching for a number in a sorted array takes O(n) time with linear search, and O(log n) time with binary search. But why is indexing so fast? What is the secret sauce?

The reason is really about the nature of indexing — how it is implemented in a circuit. In order to illustrate this, let me show you an “addressing circuit”.

[Figure: addressing circuit]

Here, A and B are the two address lines; together they encode the two-bit indices 00, 01, 10 and 11. The outputs Z, Y, X and W are the selectors of the items in the array. Notice that an output selector is enabled only when both of the input lines of its AND gate are “1”.

Now, ignore the input B and just look at A. See how its signal flows through the direct wires and the inverters. We can make the following observations:

  • When A is “1”, the AND gates of X and W receive a “1” on one of their input ports, while the AND gates of Z and Y receive a “0” on one of their input ports.
  • On the other hand, when A is “0”, the AND gates of X and W receive a “0” on one of their input ports, while the AND gates of Z and Y receive a “1” on one of their input ports.

From the above, I hope you have seen that the value of A partially selects half of the AND gates — it is either the set {X, W} or {Z, Y}. By “partially select”, I mean they are not fully selected, because we haven’t taken B into account. At this point, I hope you have noticed that A is in fact doing one step of a “binary search”.

Now we do a similar thing, but this time focus on B and ignore A. You should see a similar thing: depending on the value of B, either we partially select {Y, W}, or we partially select {Z, X}. So we can also think of B as doing one step of a “binary search”.

Now, we see that A and B each perform a step of a binary search, and it is interesting to see that B’s selection will cut A’s selection evenly, whether A is 0 or 1. We can say the same thing the other way around: A’s selection will cut B’s selection evenly, whether B is 0 or 1.

Also notice that the selections by A and B can happen at the same time. That means, when they work simultaneously, it takes just O(1) time to do a binary search through an array of length 4. If we generalize this circuit to N bits of input, then within O(1) time we can do a binary search through an array of length 2^N.
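
To make the parallelism concrete, here is a small C++ simulation of such an N-bit decoder (a sketch of the logic only, not of real hardware timing): every output gate ANDs together all the address lines, each taken either directly or through an inverter, and exactly one output ends up enabled.

#include <cstdio>
#include <vector>

// Output k is an AND gate over all n_bits address lines: line b is taken
// directly if bit b of k is 1, or through an inverter if it is 0. In hardware
// all 2^n_bits gates evaluate their inputs in parallel.
std::vector<int> decode(unsigned address, unsigned n_bits) {
  std::vector<int> outputs(1u << n_bits, 1);
  for (unsigned k = 0; k < outputs.size(); ++k) {
    for (unsigned b = 0; b < n_bits; ++b) {
      int line     = (address >> b) & 1;   // the value on address line b
      int expected = (k >> b) & 1;         // direct wire (1) vs. inverter (0)
      outputs[k] &= (line == expected);    // one input of gate k's AND
    }
  }
  return outputs;
}

int main() {
  // A two-bit address equal to 2 (binary 10): only the third selector fires.
  std::vector<int> sel = decode(2, 2);
  for (unsigned k = 0; k < sel.size(); ++k)
    std::printf("selector %u = %d\n", k, sel[k]);
}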

This explains why indexing into an array is faster than binary search: it is essentially a parallel binary search in which the (log n) steps happen at the same time.

 

Posted on April 2, 2013 in algorithms, architecture, concurrency