Deep learning, computer vision and autonomous cars are today’s hottest technologies. For several years, deep learning researchers have frequently been in the news, declaring that they have achieved “human-level” or even “superhuman” computer vision: they made computers recognize objects better than humans.
As a consequence, they made people believe that it is now prime time for developing autonomous cars and other intelligent machines. Companies are rushing into the autonomous car space, each claiming that they will achieve fully autonomous cars within a few years.
After studying deep learning technologies and frameworks carefully, what I have learned personally is that not only are autonomous cars infeasible using today’s technologies, but there is a lot of falsehood in the computer vision (deep learning) field itself.
Simply put, there is no real “human-level computer vision”. What the field has been advertising as “human-level” or “superhuman” is simply not true. Their claims are based on unrealistic and wrong standards. I consider today’s so-called “human-level vision” the emperor’s new clothes. This phenomenon is outrageous.
To be straightforward, let me start with the fishiest part, the top-5 accuracy.
Beware of “top-5 accuracy”
If you look at recent computer vision papers, you will find that their claimed “human-level accuracy” has been based on the so-called “top-5 accuracy”. What is top-5? In plain words, it means that for each image to be recognized, the subject (machine or human) is given five chances to get the correct answer.
You can find top-5’s official definition on ILSVRC’s website. ILSVRC stands for ImageNet Large Scale Visual Recognition Challenge, currently the most recognized competition in the computer vision field. All recent accuracy numbers come from their data set and competition rules.
The first deep learning model claiming to achieve human-level performance was ResNet, developed by Microsoft Research. Soon afterwards, a few other “superhuman” vision models were developed, based on similar methodology and the same measuring standard, top-5.
ResNet is claimed to have an error rate of 4.49% (thus an accuracy of 95.51%). Its authors also claim that the human-level error rate is 5.1%, thus ResNet is superhuman. What most people do not notice is that all these fantastic numbers use top-5, often implicitly.
Very few people question how the error rates were measured. They never ask what “top-5” means. Often they are not even told about top-5 at all. Most outsiders take for granted that the tests were run with the simplest method they have in mind: count the number of correct answers, then divide that by the total number of tests.
Surprisingly, that is not the case. The accuracy is measured with top-5. That is, for each test image, you are given five chances, not just one, to get the correct answer.
What is wrong with top-5 then? Let me illustrate with an example. If I give you a picture of a car, with top-5 rules you can arrive at your answer this way:
- Is it an orange?
- No? Then I guess it is a coffee mug.
- No again? Well, a horse?
- No? It looks like a mobile phone.
- What?? Okay, final guess, it is a car!
Yes, it is a car. You are correct. Congratulations!
If you get the right answer within five chances, it counts as a correct recognition. Given hundreds of pictures, we can then calculate your error rate. For ResNet, this error rate was 4.49%. But if you give ResNet only one chance to recognize each image, the error rate was 19.38%. The latter is called the “top-1 error rate”. Notice the big difference.
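To make the counting rule concrete, here is a minimal Python sketch of how a top-k error rate could be computed from ranked guesses. It follows the rule as described in this post, not any official ILSVRC evaluation code. The function name `top_k_error` and the toy guesses and labels are my own invention for illustration.

```python
from typing import Sequence


def top_k_error(ranked_guesses: Sequence[Sequence[str]],
                true_labels: Sequence[str],
                k: int) -> float:
    """Fraction of images whose true label is NOT among the first k guesses."""
    if len(ranked_guesses) != len(true_labels):
        raise ValueError("need one guess list per label")
    if not true_labels:
        raise ValueError("need at least one image")
    misses = sum(
        1 for guesses, label in zip(ranked_guesses, true_labels)
        if label not in guesses[:k]
    )
    return misses / len(true_labels)


# Hypothetical toy data: each row holds one image's guesses, best guess first.
guesses = [
    ["orange", "coffee mug", "horse", "mobile phone", "car"],  # true label: car
    ["cat", "dog", "fox", "wolf", "lynx"],                     # true label: cat
    ["truck", "bus", "car", "van", "tram"],                    # true label: bus
    ["apple", "pear", "peach", "plum", "fig"],                 # true label: kiwi
]
labels = ["car", "cat", "bus", "kiwi"]

top1 = top_k_error(guesses, labels, k=1)
top5 = top_k_error(guesses, labels, k=5)
print(f"top-1 error: {top1:.2f}, top-5 error: {top5:.2f}")
# prints "top-1 error: 0.75, top-5 error: 0.25"
```

On this made-up data the same set of guesses scores a 75% error under top-1 and a 25% error under top-5. The car guessed last still counts as a hit.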
Even an ordinary person can see how ridiculous the top-5 standard is, but it is now widely used as the accuracy standard in computer vision. It is used by ResNet and all other recent deep learning vision models that claim to perform at a superhuman level. Whenever they compare with humans, they use top-5. I have never seen a top-1 comparison with humans.
Computer vision people have been eager to publicize top-5 accuracy numbers. They create news, give talks, do interviews, and claim “superhuman” performance and “90+% accuracy”, often without even mentioning the word “top-5”. With top-5, computer vision appears to have improved dramatically over recent years.
Based on top-5 numbers, lots of people have taken for granted that computers have surpassed human vision and keep getting better. If computers can recognize objects more accurately than humans, we can then rely on them to make judgements when driving cars. Fewer accidents, safer roads, no more tedious driving jobs. Wouldn’t that be fantastic?
If we could make it happen, that would be nice. Unfortunately it is not going to happen, because computers have not really surpassed human recognition capabilities, not even close. Why? Because top-5 is wrong.
Top-5 accuracy is wrong
Something felt wrong when I first saw top-5. I couldn’t believe that anybody would measure accuracy this way. “What? Five chances?” I asked around among people I know, some of them computer vision experts working at top-notch companies investing heavily in AI. They gave me various reasons why top-5 has been used, but I was never convinced. I was pretty sure that they were just repeating what they had been told. They didn’t think critically. It didn’t take me long to convince myself that top-5 accuracy is wrong, ridiculously wrong.
Top-5 is a very fuzzy way of measuring accuracy. It blurs the difference between good and bad recognizers. Let me make an analogy. Suppose you are a professor making final exam rules, and for each question you give the students five chances to get the correct answer. What will happen? You will have trouble telling good students from bad ones. You are simply giving the bad students chances to appear better, much better than they really are.
Good students (humans) need just one chance to arrive at the correct answer. They don’t need five. They never asked for them. They look at the thing. If they know what it is, then that’s it. If you give them five chances, they only use one. They waste the other four chances because you forced them to play by top-5 rules. Their top-5 accuracy is the same as their top-1 accuracy.
On the other hand, the bad students (neural nets) can’t get the correct answer on the first try, so you give them five chances. They use all five. Thus their top-5 accuracy is much higher than their top-1 accuracy. With top-5, neural nets appear to have high accuracy, some even better than humans. But if you test them with top-1, they could be much inferior.
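The exam analogy can be sketched in a few lines of Python. Everything here is hypothetical: the two “students”, their guesses and the labels are invented toy data, and `top_k_accuracy` is just an illustrative helper. It shows only how the two scoring rules can rank the same pair of recognizers differently, and says nothing about any real system.

```python
def top_k_accuracy(ranked_guesses, true_labels, k):
    """Fraction of items whose true label is among the first k guesses."""
    hits = sum(label in guesses[:k]
               for guesses, label in zip(ranked_guesses, true_labels))
    return hits / len(true_labels)


labels = ["car", "cat", "bus", "dog"]

# Hypothetical "good student": commits to a single answer per image.
single_guess = [["car"], ["cat"], ["bus"], ["fox"]]

# Hypothetical "bad student": usually wrong first, right somewhere in five.
five_guesses = [
    ["orange", "mug", "horse", "phone", "car"],
    ["dog", "cat", "fox", "wolf", "lynx"],
    ["bus", "truck", "van", "tram", "car"],
    ["cat", "fox", "dog", "wolf", "lynx"],
]

for name, guesses in [("single guess", single_guess),
                      ("five guesses", five_guesses)]:
    top1 = top_k_accuracy(guesses, labels, 1)
    top5 = top_k_accuracy(guesses, labels, 5)
    print(f"{name}: top-1 = {top1:.2f}, top-5 = {top5:.2f}")
# prints "single guess: top-1 = 0.75, top-5 = 0.75"
# prints "five guesses: top-1 = 0.25, top-5 = 1.00"
```

In this toy case the single-guess student scores 0.75 under both rules, while the five-guess student goes from 0.25 under top-1 to a perfect 1.00 under top-5, overtaking the student who was right far more often on the first try.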
Should you give everybody five chances? Usually no. The real world is not a research competition like ILSVRC. It is a cruel game of life. Many real-world situations give you only one chance to recognize an object, and that one chance often means life or death. Eating, walking across the street, driving… You are often not allowed to make even one mistake in your recognition. You are not allowed to be fuzzy.
When you drive, can you afford to recognize a roadblock as “a whiteboard, or a cheesecake, or cherry jam, or…”? Having five options usually means that you don’t know anything at all.
Have you realized how ridiculous and wrong top-5 is? To me, this is a scam. Every student with basic math or computer science training should have noticed that it is wrong, but nobody questioned it. Computer vision researchers happily adopted it. Computer vision as a field is cheating and brainwashing people all over the world into believing that it has achieved “human-level recognition”, while it never actually did. The field just manipulated the accuracy standards to make the numbers look good.
You may have heard of reasons why they need to use top-5, but after some thought, you will find those are just excuses. Whatever problems top-5 was meant to solve, they had better alternatives. Even an undergraduate student can find better ways, but the advanced researchers, some with decades of experience, all chose top-5. The real purpose of top-5 is to make the accuracy numbers look good, and only by using top-5 can they claim “human-level performance”.
Human-level vision has a long way to go
How far are machines from achieving true human-level vision? Very far. If you observe your own vision system a little carefully, you may realize that human vision is fundamentally different from neural nets. I’d like to talk about that in a following post.
Has anybody questioned how the human error rate (5.1%, as often appears in publications) was measured? Who were the human test subjects, how many of them were tested, and how was the result measured? I have found no reasonable scientific description of this aspect either.
What about how representative the test data is? All the images in ImageNet are clear photos taken in good lighting conditions. How will the machine behave under low-light, reflective, blurry or occluded conditions? Unknown. Does image classification (telling the names of objects) represent all of “human vision”? Is it enough to drive a car if you just know the names of everything? …
Too many unanswered questions.
Wrong accuracy standard, questionable data. See something, say nothing. This is why I consider “human-level computer vision” the modern emperor’s new clothes. The computer vision field has made some good progress, but saying “human-level” or “superhuman” is cheating.
The whole field of autonomous cars stands on shaky ground, and is thus pretty much doomed. We can still do something useful with neural nets, but not in mission-critical applications that require true human-level vision.
I have a lot more to say on this topic, but I want to keep this post short and concise. It’s time for everybody to do their own research.
(Disclaimer: I’m writing this article out of my own conscience. This article does not in any way represent the views of my employer, Intel Corporation, or anybody else working at Intel. The views, the writing and the research into the falsehood of the computer vision field are all my own personal activity and have not been supported by Intel Corporation in any way. None of the people I directly or indirectly mentioned in this article work at Intel Corporation.)