The Road to Fatdog64 800 Alpha

Today, the first of Fatdog64 800 series is released to the wild - the Alpha release. Despite being labelled "alpha", this release has been tested for a few months and has been used day-to-day in real production machines (in varying degrees) for about two months by all of us, in the team.

It has not been a smooth ride all along. Living in the "bleeding edge" means that you really need to prepare to bleed (that's why we don't update the base on every release - it would be downright impossible). Latest packages don't always build, and when they do, they don't always run, and they do, they don't always run stably, and when they do, they don't always work, and when they do, they don't always work correctly, and when they do, they don't always provide good performance, and so on, and so on - you get the picture.

But we have finally arrived. It may not be perfect, and it never will, but for us - it is good enough for day-to-day usage; so the decision to release.

This blog post documents a few of the stumbling blocks that we passed through on our way to the Alpha release. By sharing the knowledge, I hope that others on the same journey can avoid them.

Warning: the information that comes after this is going to be very technical. Don't worry if you don't understand it - just use whatever that you do.




The particular bug that took as considerable time (weeks) to solve was this: it's a bug that causes the desktop to unpredictable, involutary exit to the console.

And it turns out, this problem has __multiple__ underlying causes. Each cause ends up with the X server exiting (sometimes gracefully and sometimes not) so that we're dropped back the console.

Sometimes there crash messages in Xorg.0.log, sometimes don't. Sometimes the exit is immediate (X goes down and we're back in the console), sometimes it's gradual (applications start to fail one by one, before X itself finally gives up its ghost and dies).




Bug #1: fontconfig doesn't like its cache to be tampered with.

This bug happens only when running with the RAM layer (savefile=ram:device:xxx, in Puppy's parlance this is pupmode=13). When anything that uses fontconfig starts (e.g X, gtk2, etc), fontconfig will be initialised and its first action is to scan all the font directories and makes its cache.

When we run using the RAM layer, these caches are stored in the RAM layer, and eventually will be "merged-down" to the actual savelayer (copying the files from the RAM layer to the savelayer, and then removing the copy on the RAM layer, and then refreshing the aufs layered filesystem so that the copy on the RAM layer gets re-surfaced on the root of the filesystem).

This has worked for the longest time, but we found out that in Fatdog 800 this isn't the case. Even the process of copying the files from RAM layer to savelayer (not even deleting them) triggered a cascade of failure in fontconfig, which eventually resulted a crash in all higher-level libraries and applications that uses this.

Fixes #1: We still don't really know what changed - this could be a change in the kernel, fontconfig itself, glibc, or others (remember, in 800, with a new base, **all things were updated** so we can't easily isolate one component from another), but once we know what triggered the collapse, we worked around it by making sure that fontconfig caches are not touched during the merging process.




Bug #2: Xorg server crash on radeon-based systems when DRI3 and glamor is enabled (the default settings).

This has been a long running bug due to Mesa (open-source OpenGL 3D library) changing its infrastructure to support newer, more powerful radeon cards, changing both the acceleration API (DRI2 to DRI3), acceleration methods (EXA to glamor), memory management (GEM, GBM, etc).

But the bug isn't in Mesa alone. Eventually Mesa needs to interface with the video driver, so co-operation with xf86-video-ati driver is needed.

And then there is the kernel too, the radeon DRM driver from the kernel.

There are multiple components and every component is a potential source of failure (which in this case they all contributed to the problem one way or another).

To make it worse, the problem is completely unpredictable. The system can run hours without a glitch before the desktop crashed, or it can crash in next hour after booting. It brought a completely un-reliable experience.

There is no good solution for this because every component keeps changing, so updating one component could very well break another. All we can do is watch bugzillas, forums, mailing lists, and listen to possible solutions.

Fixes #2: I think in this case, we got lucky. Most of the bugs were fixed in mesa 18.2.3, and the final bug was fixed in xf86-video-ati git-master, one commit after 18.1.0. We're going to stick with this combination for a while!




Bug #3: After an unpredictable amount of time, Xorg server will crash (due to failed assert), giving up messages similar to this:



[xcb] Unknown sequence number while processing queue
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
pidgin: xcb_io.c:259: poll_for_event: Assertion `!xcb_xlib_threads_sequence_lost' failed.


There is one unanswered report here: https://bugzilla.redhat.com/show_bug.cgi?id=1582478

This initially happened on ROX-Filer when it was worked heavily, so naturally we thought the problem was with ROX-Filer. But then it started to happen elsewhere (other GTK applications), so it could be anywhere in gtk2, glib, glibc, aufs kernel module, or even the kernel itself.

We scoured for solutions for similar problems, and we got some solution like this: https://forum.freecadweb.org/viewtopic.php?t=28054

But they don't work and they don't make sense. FreeCAD is a 3D-heavy application, so disabling DRI3 and (in some others links, disabling 3D hardware-acceleration) doesn't make sense for applications like ROX-Filer which is purely 2D and doesn't make use of any acceleration at all.

This was especially difficult to pinpoint because it happened randomly; and it was one of the thing I would have filed into "unsolved" files, were it not for SFR. SFR found a way to reproduce this problem reliably (by clicking the "refresh" button on ROX-Filer a few hundred times - and he even provided a automation script so we don't have to buy a new mouse after every experiment ).

Once we can reproduce this, re-building the libraries with debug symbols and running them under "gdb" quickly pointed out that the problem is within libxcb - a library at the bottom of the Xorg stack.

Bisecting on libxcb, we found that the problem is caused by a particular code commit that tries to "fix" another bug when dealing with Vulkan drivers:

https://cgit.freedesktop.org/xcb/libxcb/commit/?id=fad81b63422105f9345215ab2716c4b804ec7986

But the way it was fix is, in my opinion, is incorrect (it was reading stuff when it shouldn't - this kind of thing should be protected with a mutex or we'll end up with a race).

Fixes #3: So we reverted this commit and poof! - the problem disappears. I tested clicking the button to about 16,000 times and no more crash.

This was the last bug we squashed.




So there. One symptom, three underlying problems, that can be triggered at different semi-random times, resulting in a totally different error messages and behaviour, confusing all of us.

We falsely declared victory after the first and the second were squashed - only to be humiliated when the crash happened again, in a slightly different way. By the time we squashed the last bug, we were wary enough __not__ to declare it fixed until a few days later and it was finally confirmed that we've finally made it through.

All in all, we spent more than a month to solve all of them. Now that we've past through them, I hope others can avoid the same mistake.

Meanwhile, enjoy Fatdog64 800 Alpha!

Forum announcement here: http://www.murga-linux.com/puppy/viewtopic.php?t=114719

Release Notes here: http://distro.ibiblio.org/fatdog/web/800a.html


Posted on 16 Nov 2018, 19:57 - Categories: Fatdog64 Linux
Edit - Delete


Comments:

Posted on 8 Feb 2019, 12:43 by jamesbond
"Fixes #1"
We have since found out the cause of problem #1. It was because when we copied the RAM layer to the save layer, we used rsync --in-place --no-whole-file, which was supposed to help it work faster on slow USB flashdrive (when that flashdrive is used as a save layer).

Unfortunately these two options causes rsync to write to the files multiple times in the process of re-arrangement makes the files looked corrupted, which confuses fontconfig and crashed it.

We found this out because there were other programs that started to act out funny too; so in the end we decided to remove this "optimisation". If you need faster operation, don't buy the bottom of the stock, get something better. (as Barry found out here and here). They aren't much more expensive as they used to.
Delete



Add Comment

Title
Author
 
Content
Show Smilies
Security Code 1658093
Mascot of Fatdog64
Password (to protect your identity)