segmentation violation on unix port #97

Description

@uraich
Contributor

When trying to use the SDL driver on the unix port I get a segmentation violation:
import SDL
SDL.init()
crashes.

Activity

embeddedt

embeddedt commented on Sep 22, 2020

@embeddedt
Member

Do you have an Nvidia graphics card in your computer? The Unix port has issues with Nvidia cards which we haven't been able to track down. See #46.

amirgon

amirgon commented on Sep 22, 2020

@amirgon
Collaborator

Hi @uraich !
Since we are unable to reproduce your problem on our side, we would need your help debugging this.
Could you provide the stack trace of the crash? You can obtain it by running Micropython under gdb. Something like gdb --args micropython ...

uraich

uraich commented on Sep 22, 2020

@uraich
ContributorAuthor

I think embeddedt gave the answer: I do have an NVidia graphics card

amirgon

amirgon commented on Sep 22, 2020

@amirgon
Collaborator

> I think embeddedt gave the answer: I do have an NVidia graphics card

@embeddedt Do you have an NVidia graphics card? Would you consider diving into this once again?

I can suggest the following:

  • Run it with valgrind, perhaps there is some memory corruption
  • Try to obtain the sources or at least the debug symbols of libnvidia-glcore and get a more meaningful stack trace than this one
  • Try to ask on nvidia forums, or contact nvidia support
  • Open a ticket on nvidia issue tracker
  • Just as a test, we can try changing the SDL driver back to using an SDL thread instead of the Micropython thread. I believe the original problem was related to callbacks (we want to run Micropython callbacks on the Micropython thread), so if this is the issue we can still use an SDL thread but be careful to trigger callbacks from the Micropython thread.
uraich

uraich commented on Sep 22, 2020

@uraich
ContributorAuthor

This is what I see when I run lv_micropython in gdb
image

embeddedt

embeddedt commented on Sep 22, 2020

@embeddedt
Member

Is this the same sequence of steps you ran to get it to segfault? It doesn't appear to have crashed yet.

uraich

uraich commented on Sep 22, 2020

@uraich
ContributorAuthor

Yes, the same sequence. Without gdb I see this:
image

amirgon

amirgon commented on Sep 22, 2020

@amirgon
Collaborator

@uraich SIGUSR1 is used internally in lv_micropython and should be ignored.
Please run in gdb (before run):

handle SIGUSR1 nostop noprint pass
uraich

uraich commented on Sep 22, 2020

@uraich
ContributorAuthor

Correct! So it is in the nvidia-glcore library
image

embeddedt

embeddedt commented on Sep 24, 2020

@embeddedt
Member

@amirgon Yes; I have an Nvidia card.

I've been reading some documents about SDL, and it appears that in order to be compliant with its requirements, we need to ensure that all SDL rendering is handled on our initial main thread. It appears that calling SDL functions from other threads is known to cause issues.

Is SDL always invoked from a specific thread, or can it be invoked by any thread depending on what MicroPython is doing?

amirgon

amirgon commented on Sep 24, 2020

@amirgon
Collaborator

> we need to ensure that all SDL rendering is handled on our initial main thread

@embeddedt Do you mean, from the same thread that initialized SDL?

> Is SDL always invoked from a specific thread, or can it be invoked by any thread depending on what MicroPython is doing?

I think that SDL is initialized and rendered from the same thread all the time.

Here is how it works:

  • mp_init_SDL is called from Micropython main thread when we call SDL.init() from Micropython.
    It calls monitor_init and initializes SDL.

monitor_init(args[ARG_w].u_int, args[ARG_h].u_int);

  • mp_init_SDL creates a new thread tick_thread, but this thread does not do the rendering directly. It only schedules a call to Micropython:

STATIC int tick_thread(void * data)
{
    (void)data;
    while(monitor_active()) {
        SDL_Delay(LV_TICK_RATE); /*Sleep for LV_TICK_RATE milliseconds*/
        lv_tick_inc(LV_TICK_RATE); /*Tell LittlevGL that LV_TICK_RATE milliseconds have elapsed*/
        mp_sched_schedule((mp_obj_t)&mp_lv_task_handler_obj, mp_const_none);
        pthread_kill(mp_thread, SIGUSR1); // interrupt REPL blocking input. See handle_sigusr1
    }
    return 0;
}

  • When Micropython is ready it performs scheduled tasks and calls mp_lv_task_handler which performs LVGL and SDL rendering:

STATIC mp_obj_t mp_lv_task_handler(mp_obj_t arg)
{
    if (monitor_active()) monitor_sdl_refr_core();
    lv_task_handler();
    return mp_const_none;
}

There is an open question here.

When Micropython performs scheduled tasks, does it always do so from the same thread?
I think it does, but to make sure, it's worth printing the thread ID in mp_lv_task_handler.

Looking at the Micropython code, it's not entirely clear.
mp_handle_pending is the function in Micropython that runs scheduled tasks, but it is called in different places, specifically by MP_HAL_RETRY_SYSCALL, which itself is also called in different places.
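
To illustrate the thread-ID check suggested above, here is a minimal sketch in desktop Python (CPython, not MicroPython) that mimics the pattern: a background tick thread only schedules a handler, and the thread that would have called SDL.init() is the one that runs it. The queue, TICK_RATE_S, and all function names are hypothetical stand-ins, not lv_micropython code.

```python
import queue
import threading
import time

TICK_RATE_S = 0.005  # hypothetical tick period, stands in for LV_TICK_RATE

scheduled = queue.Queue()   # stands in for MicroPython's scheduler queue
handler_thread_ids = []     # thread IDs observed inside the handler


def task_handler():
    """Stand-in for mp_lv_task_handler: record which thread runs it."""
    handler_thread_ids.append(threading.get_ident())


def tick_thread(stop_event):
    """Stand-in for tick_thread: never renders, only schedules the handler."""
    while not stop_event.is_set():
        time.sleep(TICK_RATE_S)
        scheduled.put(task_handler)


def run_main_loop(n_tasks):
    """Drain scheduled tasks on the calling thread, as a scheduler would."""
    stop_event = threading.Event()
    worker = threading.Thread(target=tick_thread, args=(stop_event,))
    worker.start()
    try:
        for _ in range(n_tasks):
            scheduled.get(timeout=1.0)()  # the handler runs on the caller's thread
    finally:
        stop_event.set()
        worker.join()


init_thread_id = threading.get_ident()  # thread that would call SDL.init()
run_main_loop(5)
same_thread = all(tid == init_thread_id for tid in handler_thread_ids)
print("handler always ran on the init thread:", same_thread)
```

In the real driver, the equivalent check would be to print the thread ID inside mp_lv_task_handler and compare it with the one seen in mp_init_SDL.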

embeddedt

embeddedt commented on Sep 24, 2020

@embeddedt
Member

> Do you mean, from the same thread that initialized SDL?

Unfortunately, it's even stricter than that. It looks like SDL operations always need to be done on the initial main thread (i.e. the one which main(argc, argv) runs in at the start of the program). Doing them on a single thread consistently isn't enough.

Is the "Micropython main thread" the same thread as main(argc, argv), or does MicroPython spawn its own thread initially and use that for the rest of the program's lifetime?
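
As a rough illustration of the difference between "one consistent thread" and "the initial main thread", the sketch below assumes Linux and CPython 3.8 or later, where the initial thread's kernel thread ID equals the process ID. The helper names are hypothetical, and this is not something the MicroPython unix port provides.

```python
import os
import threading


def on_initial_thread():
    """True if the caller is the thread the process started in (Linux only)."""
    # On Linux the thread group leader's kernel thread ID equals the PID.
    return threading.get_native_id() == os.getpid()


def check_from_worker():
    """Run the same check from a freshly spawned thread."""
    result = []
    worker = threading.Thread(target=lambda: result.append(on_initial_thread()))
    worker.start()
    worker.join()
    return result[0]


print("caller on initial thread:", on_initial_thread())
print("worker on initial thread:", check_from_worker())
```

A worker thread can be perfectly consistent about doing all the rendering and still fail this check, which is the stricter requirement described here.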

amirgon

amirgon commented on Sep 24, 2020

@amirgon
Collaborator

> Is the "Micropython main thread" the same thread as main(argc, argv), or does MicroPython spawn its own thread initially and use that for the rest of the program's lifetime?

Looking at main.c I don't see any explicit creation of a new thread.
Also, in the stack trace above it's clear that the call to SDL refresh is made from the same thread in which main was invoked.

But to make sure, I suggest printing thread-id and checking if it's the same even when the problem happens.

amirgon

amirgon commented on Sep 24, 2020

@amirgon
Collaborator

Another idea -
@uraich - Could you try running it with gdb again until it crashes, and show the stack trace of all threads?
We would be able to tell if there are other threads in the process and what they are doing.

gdb command:

thread apply all bt
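
For anyone who wants to collect this in one go, here is a small Python sketch that builds a gdb command line combining the SIGUSR1 handling mentioned earlier with the backtrace command above. -batch, -ex and --args are standard gdb options; the ./micropython path is an assumption about where the binary lives.

```python
import shlex


def gdb_backtrace_argv(program, program_args=()):
    """Build argv for a gdb run that dumps all thread backtraces after a crash."""
    return [
        "gdb", "-batch",
        "-ex", "handle SIGUSR1 nostop noprint pass",  # SIGUSR1 is used internally
        "-ex", "run",
        "-ex", "thread apply all bt",                 # runs once the program stops
        "--args", program, *program_args,
    ]


argv = gdb_backtrace_argv("./micropython")  # hypothetical binary path
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv) from a terminal; the REPL should still read from the keyboard, so SDL.init() can be typed in as before.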
embeddedt

embeddedt commented on Sep 24, 2020

@embeddedt
Member

No time to debug this right now, but assuming that Thread 1 is the main thread, it looks like we aren't violating any SDL requirements.

Thread 2 (Thread 0x7fffeffed700 (LWP 8085)):
#0  0x00007ffff7bc7c70 in __GI___nanosleep (
    requested_time=requested_time@entry=0x7fffeffece60, 
    remaining=remaining@entry=0x7fffeffece50)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007ffff71afad5 in SDL_Delay_REAL (ms=<optimized out>)
    at /tmp/SDL2-2.0.10/src/timer/unix/SDL_systimer.c:211
#2  0x0000555555653129 in ?? ()
#3  0x00007ffff71134ac in SDL_RunThread (data=0x555555d1d7e0)
    at /tmp/SDL2-2.0.10/src/thread/SDL_thread.c:283
#4  0x00007ffff71aa0a9 in RunThread (data=<optimized out>)
    at /tmp/SDL2-2.0.10/src/thread/pthread/SDL_systhread.c:79
#5  0x00007ffff7bbd6db in start_thread (arg=0x7fffeffed700)
    at pthread_create.c:463
#6  0x00007ffff6dc1a3f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7faf740 (LWP 8081)):
#0  0x00007ffff1fe1447 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#1  0x00007ffff1fe1ac3 in ?? ()
   from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#2  0x00007ffff1fb9c9e in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#3  0x00007ffff1fc7e9b in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#4  0x00007ffff1fd10dc in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
#5  0x00007ffff70f1d40 in GL_RunCommandQueue (renderer=0x555555c80d40, cmd=0x555555d1a550, vertices=0x555555d1a590, vertsize=<optimized out>) at /tmp/SDL2-2.0.10/src/render/opengl/SDL_render_gl.c:1270
#6  0x00007ffff70e9e11 in FlushRenderCommands (renderer=0x555555c80d40) at /tmp/SDL2-2.0.10/src/render/SDL_render.c:218
#7  SDL_RenderPresent_REAL (renderer=0x555555c80d40) at /tmp/SDL2-2.0.10/src/render/SDL_render.c:3130
#8  0x0000555555652f37 in ?? ()
#9  0x0000555555653169 in ?? ()
#10 0x00005555555b7954 in ?? ()
#11 0x00005555555b8d7a in ?? ()
#12 0x00005555555b8e90 in ?? ()
#13 0x00005555555d4d51 in ?? ()
#14 0x000055555565399a in ?? ()
#15 0x00005555555d4935 in ?? ()
#16 0x00007ffff6cc1b97 in __libc_start_main (main=0x5555555a537b <main>, argc=1, argv=0x7fffffffdbe8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdbd8)
    at ../csu/libc-start.c:310
#17 0x00005555555a53da in ?? ()

29 remaining items

embeddedt

embeddedt commented on Dec 24, 2020

@embeddedt
Member

Unfortunately that was the flag I used! I will have to keep thinking about it - I have never seen a problem like this on a host system, only on embedded systems where things get corrupted easily.

amirgon

amirgon commented on Dec 24, 2020

@amirgon
Collaborator

That's a true mystery!
What could make two applications consistently behave so differently, assuming they run the same code, are compiled with the same flags, and load the same libraries...

Sounds like a good question for Stack Overflow.

embeddedt

embeddedt commented on Dec 25, 2020

@embeddedt
Member

Indeed. I even tried dumping the ELF files to see whether the sections are any different, but they both have the same sections.

Merry Christmas!

amirgon

amirgon commented on Dec 25, 2020

@amirgon
Collaborator

One more thing you could try is to run it with a debugger from the beginning and trace everything that's being called. There could be some code that automatically runs before main, such as a C constructor, a signal handler, or some library initialization code.
Then you could compare the traces between your standalone application and Micropython.

Merry Christmas to you too!

X-Ryl669

X-Ryl669 commented on Feb 2, 2022

@X-Ryl669

There's a bunch of code that's executed in functions marked as __attribute__((constructor)) in shared libraries. So it's very difficult to figure out the reason for the difference by looking only at the file that contains main.

For example, I had an issue like this once, and it was due to Nvidia's OpenGL library taking the address of every function in the underlying system OpenGL library that it didn't implement itself. Higher-level code (GLFW IIRC) was swapping some OpenGL functions with its own, so when Nvidia's code called what it thought was the system's function, the wrong function was called. At that time, I solved the issue by changing the library loading order so that the higher-level code was loaded first.

A good test would be to remove libraries one by one until you find the culprit. You can try LD_BIND_NOW='' ./test to force lazy loading of the libraries that can be loaded this way. Or you can list all libraries with ldd and then objdump them all to find all the symbols in DL_INIT sections. Then place GDB breakpoints on them (warning, there are many of them), and launch your crashing application.
Each function will be hit in a specific order.
You can exit a constructor function without executing it by setting the $pc register to the ret instruction (or by unwinding the stack if you stopped after the stack setup). You can also use gdb's return command here to skip executing the function, since most constructor functions return void.
You may then be able to pinpoint which function is causing the crash (if the crash does not happen after you've skipped function XXX, then you'll have to look at what function XXX does, if you have the source code for it).
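
As a starting point for the "list all libraries with ldd" step, here is a hedged Python sketch that extracts the resolved library paths from ldd output so they can be inspected one by one. The function name is hypothetical, and SAMPLE is a made-up illustration of the usual ldd line format, not output from the crashing binary.

```python
def libraries_from_ldd(ldd_output):
    """Return the resolved library paths listed in ldd-style output."""
    paths = []
    for line in ldd_output.splitlines():
        _name, arrow, rest = line.strip().partition(" => ")
        if not arrow:
            continue  # e.g. the vdso or the dynamic loader itself
        target = rest.split(" (")[0].strip()
        if target.startswith("/"):  # skips entries such as "not found"
            paths.append(target)
    return paths


# Illustrative input only; real addresses and paths will differ.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffc0b5f2000)
\tlibSDL2-2.0.so.0 => /usr/lib/x86_64-linux-gnu/libSDL2-2.0.so.0 (0x00007f0000000000)
\tlibmissing.so => not found
\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000200000)
"""

for path in libraries_from_ldd(SAMPLE):
    print(path)
```

In practice the input would come from something like subprocess.run(["ldd", binary], capture_output=True, text=True).stdout, and each returned path could then be handed to objdump as described above.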

embeddedt

embeddedt commented on Feb 2, 2022

@embeddedt
Member

Thanks, @X-Ryl669, for this information. It was a very helpful explanation and makes a lot of sense. I can see why the proprietary Nvidia driver is frowned upon by Linux users, as this function-swapping approach sounds quite fragile.

In the meantime, while playing around with various environment variables to try and get to the root of the issue, I have just found a workaround that is probably good enough for the time being: launching MicroPython with __GLX_VENDOR_LIBRARY_NAME=mesa LIBGL_ALWAYS_SOFTWARE=1 will skip the Nvidia implementation entirely and use software rendering, thus avoiding the crash. On my i3-4150, I don't see any noticeable performance loss in advanced_demo.py compared to when I tested with the Nouveau driver.
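
For convenience, the workaround can be wrapped in a small launcher. This is only a sketch in desktop Python: the two environment variables are the ones named above, while the function names and the ./micropython path are assumptions.

```python
import os
import subprocess


def software_gl_env(base_env=None):
    """Copy an environment and add the variables that select Mesa software rendering."""
    env = dict(os.environ if base_env is None else base_env)
    env["__GLX_VENDOR_LIBRARY_NAME"] = "mesa"  # skip the Nvidia GLX implementation
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"         # ask Mesa for software rendering
    return env


def launch(binary="./micropython", args=()):
    """Run the (hypothetical) binary path with the workaround applied."""
    return subprocess.run([binary, *args], env=software_gl_env())
```

Calling launch(args=["advanced_demo.py"]) would then be equivalent to prefixing the command with the two variables in a shell.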

X-Ryl669

X-Ryl669 commented on Feb 3, 2022

@X-Ryl669

For the few things that SDL is doing with OpenGL, it's clear there is no benefit from an advanced Linux driver; a basic software driver will work too.

embeddedt

embeddedt commented on Jun 14, 2022

@embeddedt
Member

I've reopened this for now since #215 didn't actually fix it, but does it need to stay open or should we just close it since there haven't been any further reports/issues?

amirgon

amirgon commented on Jun 14, 2022

@amirgon
Collaborator

I vote for keeping this open, as a reminder that this is not resolved yet.
It's easier to forget about closed issues.

I'm actually not sure why it was automatically closed with #215, I probably did something wrong.

embeddedt

embeddedt commented on Jun 15, 2022

@embeddedt
Member

The PR description had the phrase "fix #97" in it. That will make GitHub close the issue if it's merged. Unfortunately it's not smart enough to check the wording around that to see if it's a question or a statement. 😆

          segmentation violation on unix port · Issue #97 · lvgl/lv_binding_micropython