High school math in navigation application deployment

Saša Bistrović

If you have been following entries at https://blog.mireo.hr/, you had an opportunity to get a grasp on one of Mireo’s core business divisions. We would like to introduce some aspects of the other major business division, the Navigation application development.

Although some, or perhaps even many, of the functional requirements for navigation software are similar to the functional requirements already mentioned in previous blogs, one fundamental difference between navigation software and other software we make is the hardware it is going to be run upon.

The Fleet tracking solutions, including space-time as their base component, are run on some “mainframe” servers, while the UI of the solution is presented to users in a web browser of their choice. So with these types of solutions, we have powerful resources at our disposal: we can choose the types of servers we want, the minimum processor speed, the amount of RAM, and the size and type of permanent storage we are going to use. In addition, we have designed our solutions to be scalable, so we can always add a new server to support a specific deployment of our solution.

On the other hand, with navigation software, things could not be more different. First of all, we have no influence over the devices (hardware) which will be the platform running our navigation software: at best, we know which operating system is going to be used on a device.

Secondly, these devices are usually less computationally powerful than a server we would use for our non-navigation solutions. And what devices are we talking about?

Well, in the beginning, those were mostly PNDs (personal navigation devices), typically having a single-core 400-800 MHz processor, 64 to 128 MB of RAM and usually no restriction on permanent storage: you could even insert a 2 GB SD card to hold your map data. In those days, all these devices were running various versions of the Windows CE operating system. Even though Microsoft was no longer supporting this particular OS even back then, device manufacturers loved it. For a long time, our navigation software had to run using no more than 32 MB of RAM, regardless of the maps it used (e.g. maps of the whole of Europe, which took 2 GB of storage in compressed form).

Some of these devices could only run navigation and nothing else. Some did have some sort of music application (an MP3 player) as well, but usually once you started one of the applications, that was it: you had to stop it in order to start the other one.

In those same days, there were also some in-car devices running on Windows CE. They ran a radio application and some music/video application as well, and you could switch among the running applications. Still, in terms of computational power, they had pretty much the same hardware as PNDs did.

Over the course of time, the share of those in-car devices was rising, and they eventually evolved into Infotainment systems. The hardware was also progressing, not only in processor power or the amount of memory: hardware components such as graphics chips were being added as well.

As the era of smartphones began, the PNDs gradually went extinct. The era of smartphones also introduced new operating systems into the arena, namely iOS and Android: as the PND era neared its end, even some PNDs were being made on the Android platform. The share of the old Windows CE platform in Infotainment systems was dropping, and the share of Android and various embedded Linux platforms was rising. So was their computational power, making our life a bit easier.

But there is also a third aspect of the hardware, or perhaps we should better say of the devices, that can impact our navigation software and that simply is not an issue with server solutions. Calling it quality is perhaps not the best possible term, but the process of producing a server and installing an operating system on it is not quite the same as the process of producing devices that will be used as Infotainment systems or PNDs. The latter group of devices can differ significantly in the hardware components they have, along with the drivers used on a particular device. Even the components of the OS installed on two similar types of devices need not be exactly the same, and all of the above can have a significant impact on the behavior of the navigation application. And it is never a pleasant one.

So, to summarize the last aspect: if your navigation software runs nice and smooth on device type A, and somebody brings device type B that runs the same operating system, it would be naive to believe that it will run equally smoothly. No sir.

Over the course of time, our navigation applications have been deployed on hundreds, maybe even thousands of different types of devices. How does that work? Well, sometimes our partners procure some devices, install our navigation on them, test how it runs, everything runs fine, and they simply sell them and buy licenses from Mireo. We never even see (or know of) any such device.

Sometimes, there is a problem and our partners send us a sample device so we can install the navigation ourselves and look at the problem. And sometimes, there are requirements for minor changes in the application behavior for a specific device, so we get a sample device and can immediately test these specific modifications on the target hardware. In either case, we investigate the specific case, fix any specific problem and deliver a working solution.

And then sometimes, after all of this, after we have thoroughly tested the application on a specific device, we are contacted with something like “application sometimes crashes when launched”. What goes unsaid is “you guys have a nasty bug, so please fix it”. That is when the nightmare begins.

So we know that this same navigation software has been tested on hundreds of different device types (running the same OS), that WE have thoroughly tested it on this same particular device type, and never ran into any problem. Could it really be a bug in the navigation software?

Well, it always can be. But end-user application bugs usually have a distinct quality of reproducibility: when you repeat a certain scenario of end-user actions, your bug surfaces. So there always is a possibility that some very unlikely unfolding of user actions, in some specific set of circumstances (specific geographical position for example) leads to a particular situation not perfectly handled by the application, but it is reproducible. Sometimes, it is quite difficult to reproduce it, but it is nevertheless reproducible.

A statement like “sometimes crashes during boot”, with “sometimes” being the key word here, indicates that the “bug” lacks this quality of reproducibility. Is it still possible that it actually is a bug in the navigation itself? Well, yes, it is still possible that we have not completely narrowed down the reproducibility pattern, but there is also another possibility here: the problem is not in the navigation software itself, but in its interaction with the rest of the system, meaning the operating system. But as a certain fictional character in a similar situation would say, you have to be realistic ‘bout these things. And the reality is that it is our application which is crashing.

So, our application sometimes crashes, and all we have is a trace file saying that there was an attempt to malloc 10 345 548 768 bytes, which naturally failed (this particular device has 256 MB of RAM, and navigation is generously allowed to use 92). Well guys, you tried to allocate 10 GB of data and now you want to blame someone else?

We try to reproduce it, and fail. We communicate this, and are told that this will not happen unless the video application is playing a movie or something at the time of the navigation launch.

So OK, we put a video clip downloaded from the internet on a memory stick, plug it into the device, the video playback application starts playing it, and then we start our navigation.

And nothing happens. Zip. Nada. We try and try, but nothing.

So again, we communicate this back. Days have already passed, and now we get a video in which the incident has been recorded: we can see a movie playing in the background, we can see somebody pressing a button on the device to launch navigation, we can see it being launched, and we can see the system dialog saying that our application has caused a violation at such and such address, and that it will be shut down.

Can we reproduce it? No. So we ask for exactly the same movie they were running during their tests. And they send it.

Can we reproduce it? Not after 16 hours of trying. Tomorrow? Another 16 hours and nothing. We contact the customer, and they say it is hard to reproduce, it does take days to reproduce.

So we finally reproduce it. After so many days. Well, at least we reproduced it. And it really was during application startup: we tried letting the application run for hours, even overnight, and that never caused any problem.

We cannot connect to the device to debug, and besides, we can hardly reproduce the problem, so the next move is to place some trace statements along the application startup to see how far we get before the application crashes. And now we try to reproduce it again: well, this time at least it doesn’t take days, and seeing where it happened does narrow down the possibilities.
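The trace idea can be sketched roughly as follows. This is only an illustration in Python, not the actual navigation code: the function names `trace`, `last_checkpoint` and `run_startup`, the step names and the trace file layout are all made up for this example. The one point that matters is that every checkpoint is forced out to storage before the next step runs, so the last line in the file tells you how far the startup got before the crash.

```python
import os

def trace(path, message):
    """Append one checkpoint line and force it to storage before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write it to the storage device

def last_checkpoint(path):
    """Return the last checkpoint written before the process stopped, or None."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None

def run_startup(steps, trace_path):
    """Run (name, callable) startup steps, tracing each one before it runs."""
    for name, step in steps:
        trace(trace_path, "entering " + name)
        step()
    trace(trace_path, "startup finished")
```

After a crash, `last_checkpoint` names the step that was running. Splitting that step into smaller traced steps and repeating is how each iteration narrows the suspect code.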

After a few iterations, a relatively small piece of our startup code is left to examine, so we examine it line by line. And there is absolutely nothing wrong there. After all, if there was something wrong, wouldn’t it manifest on all those other devices as well?

What is it? What does this code do? Well, it prepares bitmap images of tilted circles (i.e. ellipses) to be drawn behind the vehicle when a navigation route is active. How does it do that? It does some line scanning, and it allocates some memory for those scan lines; the number of bytes it needs is calculated as the radius (100) multiplied by the absolute value of the cosine of an angle. So we print those numbers to the trace file. Now we can see that even when the application doesn’t crash, it sometimes allocates something like 100 000 bytes for a scan line, which is obviously wrong, since the absolute value of a cosine can never exceed 1, but not large enough to cause the allocation to fail (and consequently crash the whole application).
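To make the arithmetic concrete, here is a minimal sketch in Python (not the original code). It assumes that the 100 in the text is the radius used as the multiplier; the names `RADIUS` and `scanline_bytes` and the injectable `cos_fn` parameter are invented for this illustration. With a correct cosine the result can never exceed 100, so a simple range check is enough to catch a misbehaving cosine before it turns into an absurd allocation.

```python
import math

RADIUS = 100  # assumed: the radius mentioned in the text

def scanline_bytes(angle_rad, cos_fn=math.cos):
    """Bytes for one scan line: radius * |cos(angle)|, with a sanity check."""
    c = abs(cos_fn(angle_rad))
    # |cos| can never exceed 1; `not (c <= 1.0)` also catches NaN.
    if not (c <= 1.0):
        raise ValueError(f"cosine out of range: |cos({angle_rad})| = {c}")
    return int(RADIUS * c)
```

Without the check, a cosine that wrongly returns 1000 would give 100 * 1000 = 100 000 bytes, the kind of number seen in the trace file.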

So how can we have a bug there? We browse our code once more to make sure and, well, we do it all right.

Is it possible that cosine is not working correctly?

As I propose this for the first time, the faces of the rest of the team quite clearly reveal they might be thinking along the lines of our fictional character, but times are desperate enough to seriously consider it: one of us is already on another continent as part of a joint task force to solve the problem. The joint is under extreme strain, as we can imagine.

So we make a simple application that does nothing much but furiously calculate the cosine of a randomly generated angle: if the result is outside the range [-1, 1], we pop up a message box stating the angle and the result of calling the cosine for this angle.
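Such a test application could look roughly like the sketch below. It is written in Python purely for illustration and prints instead of popping up a message box; the function name `find_bad_cosine`, the angle range and the iteration count are assumptions, not details of the original program. On healthy hardware it should never find anything, which is exactly the point.

```python
import math
import random

def find_bad_cosine(cos_fn=math.cos, iterations=1_000_000, seed=None):
    """Evaluate cos_fn on random angles; return (angle, result) for the
    first result outside [-1, 1], or None if every result is in range."""
    rng = random.Random(seed)
    for _ in range(iterations):
        angle = rng.uniform(-1000.0, 1000.0)
        result = cos_fn(angle)
        # Written as a negated range check so that NaN is reported too.
        if not (-1.0 <= result <= 1.0):
            return angle, result
    return None

if __name__ == "__main__":
    bad = find_bad_cosine()
    if bad is None:
        print("all results within [-1, 1]")
    else:
        print(f"cos({bad[0]}) returned {bad[1]}")
```

The `cos_fn` parameter exists only so the check can be exercised with a deliberately broken cosine; the real test simply called the platform's cosine in a tight loop.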

First, we run it without the movie in the background, and it runs forever. Then we start the movie, and while the movie is running we start this test application, and BINGO. It may take 15 seconds, it may take several minutes (of intense cosine evaluation), but sooner or later, we get an “unreasonable” result.

From this point on, things are suddenly feather-light. We send the complete solution to the manufacturer, and very soon an explanation is communicated back: the device has a floating point coprocessor (since video software uses FFT heavily), but its driver does not function properly when more than one application wants the cosine of an angle simultaneously. So they reconfigured the OS so that only the movie application uses the coprocessor; all other applications do not use it for floating point operations. That was the last of the issue on this device.

Any morals to this story? This particular experience was not the only one of its kind, but perhaps the hardest and the weirdest. Sometimes you have to find the root cause of the problem, even when you are not the one who can provide the solution for it. That is difficult, but when you roll up your sleeves and dedicate yourself to it, you can do it.

If you have not recognized the fictional character but do want to meet him, start with The Blade Itself, the first book in The First Law trilogy by Joe Abercrombie, but do not stop there.