Repeatable Simulator Freeze and/or Crash Only with Release WASM

turbofandude · March 1, 2024, 11:13pm

Version: SU15 - 1.37.5.0*

Frequency: Consistently

Severity: Blocker

Context: TFDi Design MD-11 (development version)

Bug description:
Recently, we discovered that our WASM causes the simulator to completely freeze upon loading. Commenting out or reordering code causes arbitrary access violations in the WASM at otherwise working locations. In the case of the freeze, the loading screen finishes (usually) and you’ll get the first second or two of simulator visuals, then it’s a complete freeze. Sound continues, but there is no interaction or redrawing.

This only occurs with a Release mode WASM. WASM Debug Mode in the simulator itself makes no difference. The exact same code works as expected in Debug mode, and works correctly in both Release and Debug when compiled in x64 (for our standalone environment/P3D).

Some of the recent changes included large data allocations and AI traffic data acquisition, both of which I have tried removing to no avail. Since this problem began, I have been 100% unable to use a Release-built WASM without disabling enormous (and previously working) sections of code (reducing the aircraft to an unflyably-simplified state).

I have tried with SU14 and SU15 beta and the results are identical. My SDK is up to date and I am compiling from Windows 11 and Visual Studio 2022. I have also had another developer on a different machine test and he experienced the same results.

My current suspicion is that some runtime optimizations in release mode are generating incorrect instructions or silently changing the expected behavior. I have not ruled out that our code may be at fault, but even disassembly analysis has revealed strange recursive functions we didn’t write or weird iterator behavior.

Repro steps: Load the aircraft with the attached release WASM. Observe that it does not work when the same code in Debug does.

Attachments: I have sent our installer and an activation key to the PrivateContent group. Please take care to install the package via the installer first, then apply the Release WASM I sent. The installed version does not include the afflicted WASM but has the rest of the required files.

Private Attachments: tfdi_design_md11_package.zip (send to group)

turbofandude · March 4, 2024, 10:25pm

We continued debugging and finally managed to isolate which code is causing the freeze. Coincidentally, it’s the same code that the debugger previously blamed for an access violation. We were able to determine with a log that “EventCreate 2d” was executed hundreds of thousands of times when the events container size was 0.

The code is as follows:

for (std::map<int, std::weak_ptr<SharedEvent::Allocator>>::iterator it = events.begin(); it != events.end(); it++)
{
	Logger::D(L"EventCreate 2d");

	std::shared_ptr<SharedEvent::Allocator> itobj = it->second.lock();
	std::wstring olname = itobj->name;
	std::transform(olname.begin(), olname.end(), olname.begin(), ::towlower);

	if (lname == olname && evtID == itobj->eventID && flag == itobj->evtFlag)
	{
		Logger::D(L"EventCreate 2e");
		obj = itobj;
		break;
	}
}

Logger::D(L"EventCreate 2f");

Section 2e was never hit (as expected) and neither was 2f (or beyond). What’s curious about this is the .size() method of the events container returned 0 - the code is aware it’s empty. Despite this, the iterator never reports being at the end.

We also then tested by adding a manual exclusion when the size is 0 to see if maybe it just needed to have at least one element to work. In that case, it just looped infinitely after adding the first element (with size() == 1 instead).

We’ve been experiencing random freezes with recent released builds that we can’t recreate in our test environment, either - I’m beginning to wonder if there is a bug with iterators specifically.

I look forward to your response.

Edit: Attached WASM with this specific freeze.
md11host.zip (3.9 MB)

Edit 2: I rewrote the loop in question using a for(int i = 0; i < events.size(); i++) then used std::advance to access it and unsurprisingly, it worked. The initialization routine continued until it encountered the next iterator loop where it then hung forever again. I wanted to rule out memory corruption on our part by trying to get this problem to occur in another space in memory. I feel that the problem is becoming clear here.
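
The index-based rewrite described in Edit 2 might look roughly like the sketch below. `Allocator` is a hypothetical stand-in for `SharedEvent::Allocator`, and `find_event` is an illustrative name, not the project's actual function. Unlike the loop above, the sketch also checks the result of `lock()`, since an expired `weak_ptr` yields a null `shared_ptr`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cwctype>
#include <iterator>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for SharedEvent::Allocator (only the fields the loop reads).
struct Allocator
{
	std::wstring name;
	int eventID = 0;
	int evtFlag = 0;
};

using EventMap = std::map<int, std::weak_ptr<Allocator>>;

// Illustrative helper: searches by index instead of comparing against end().
// lname is expected to be lower-case already, as in the original loop.
std::shared_ptr<Allocator> find_event(const EventMap& events, const std::wstring& lname, int evtID, int flag)
{
	for (std::size_t i = 0; i < events.size(); i++)
	{
		EventMap::const_iterator it = events.begin();
		std::advance(it, static_cast<std::ptrdiff_t>(i)); // linear walk on a std::map

		std::shared_ptr<Allocator> itobj = it->second.lock();
		if (!itobj)
			continue; // expired weak_ptr: lock() returned null

		std::wstring olname = itobj->name;
		std::transform(olname.begin(), olname.end(), olname.begin(),
			[](wchar_t c) { return static_cast<wchar_t>(std::towlower(c)); });

		if (lname == olname && evtID == itobj->eventID && flag == itobj->evtFlag)
			return itobj;
	}

	return nullptr; // no match
}
```

Re-walking the map from `begin()` on every pass makes the search quadratic, so this is only reasonable as a diagnostic workaround on small containers.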

FlyingRaccoon · March 5, 2024, 10:01am

Hello @turbofandude

We were able to replicate this issue and it was added to our backlog.

Regards,
Sylvain

turbofandude · March 5, 2024, 6:00pm

Thank you for confirming it, this puts my mind at ease a bit. I recognize there are other issues in the list, of course, but given that this is preventing us from releasing builds, is there a timeline we can expect to see the fix? Thanks in advance.

turbofandude · March 8, 2024, 11:54pm

@FlyingRaccoon Just wanted to see if there was an ETA or a workaround we could use in the meantime. Thank you!

FlyingRaccoon · March 11, 2024, 12:31pm

Hello @turbofandude

No news at the moment, I’ll update this topic as soon as I have more information.

Regards,
Sylvain

turbofandude · March 19, 2024, 9:19pm

Just as an update to this thread, we’ve sent over full source of the project to aid in debugging this.

turbofandude · April 22, 2024, 6:40pm

Just wanted to follow up and report that we’re now seeing this manifested in other areas than just iterators. A recent build in Release mode is showing altitude restrictions in our FMC a factor of 10 larger than they exist in the database (FL500 instead of FL50). Debug mode reports the restrictions properly.

FlyingRaccoon · May 15, 2024, 11:54am

Hello @turbofandude

The problem you originally reported was caused by a stack overflow. You used more than the 65536 available bytes.
We were not handling this correctly and this could lead to infinite loading.

This was addressed in Beta 1.37.15.0: a Wasm module stack overflow will now trigger an access violation exception and flag the module as dirty.
You will get a 0xc0000005 exception in either a call to _innative_internal_env_chkstk or an access to a statically allocated element.
It is then up to you to ensure you stay under the stack size limit.

Regards,
Sylvain

turbofandude · May 18, 2024, 12:17am

I suppose that is an improvement. This still does not explain why debug C++ to WASM compilations do not exhibit the same behavior, but at least I know where to begin should the issue re-occur.

tracernz · May 18, 2024, 4:58am

The optimiser may use additional stack e.g. to unroll loops.

FlyingRaccoon · May 21, 2024, 2:56pm

Wasm memory management is slightly different in Debug.
A larger data section is allocated in Debug, so it is likely that the stack overflow was not overwriting data in Debug, whereas it was in Release.

Regards,
Sylvain

turbofandude · May 29, 2024, 1:08am

I just wanted to follow up again. With the latest SDK, it crashes during load in both Debug and Release with no real evidence of why. Where the crash happens leads me to believe it’s the stack issue again. Unless you changed how memory is allocated specifically in debug mode in the latest SDK, this seems odd.

Additionally, when I compile and run it in x64 with a 32KB stack size, I am able to open and load it without issue.

Just to recap, since I initially opened this complaint:

I sent full source to our entire aircraft
I was told the issue had been identified
I waited more than a month
I was then told that my code is the problem after all because I exhausted the 65K stack memory, which is about 6% of the 1MB stack memory by default on Windows
I am now having a hard time detecting where this issue is coming from as it will not crash in a regular x64 environment. Changing it in C# (the language the application we test externally with is in) is not as straight-forward.

So, I leave off with these two questions. Is there a reason we cannot simply increase the stack size slightly to make more complex programming a bit easier?

Second, what do you recommend for us to do? Testing for this limit is hard in Windows/x64 as it doesn’t exhibit this behavior and changing it in a C# application (which is where we test) is not an officially supported option. I have tried measuring the change in local function addresses during initialization, etc. but again, without testing every single function, it’ll be hard to tell as a properly functioning runtime will reclaim stack memory.

You’ve seen our code - although we did have some statically allocated memory (that we’ve since moved), it does not seem excessive. Can you offer us any guidance on where you saw our WASM allocating the most memory?

EPellissier · May 29, 2024, 5:16pm

When crashing with the Debug version of your WASM module, do you get information on where the issue may come from when looking at the call stack?

I thought my team had tested your WASM module in both Release & Debug version while tracking the initial issue (I’ll check with them) so I am surprised you are still seeing crashes: I guess this new report means we need to check again with the source code we have - but our results may not be meaningful if you have changed your code since you sent it.

Anyway I’ll try to get back to you on this topic when we have run another test pass.

To answer your questions:

There may be a way to increase the stack size by specifying it as an additional option to clang when compiling to WASM - we haven’t had time to ensure this wouldn’t break anything which is why we haven’t talked about it yet. Plus, we need to set a limit to this anyway.
“What do you recommend for us to do?” - since your module crashes in Debug you are supposed to have all the required information to understand the issue (call stack, symbols…). I know that even this information is not always sufficient to understand what’s going on - especially if the issue is on our side - so please let us know what kind of information may help.

Best regards,

Eric / Asobo

turbofandude · May 31, 2024, 10:17pm

The issue I’m seeing is that the crash, although traceable, happens in otherwise working code. A way to identify where (or, at least, when) memory allocation is happening would help (i.e. if I could query available stack space after some function or initialization).
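
One rough way to approximate "how much stack has this call chain used" is to record the address of a local variable at a known entry point and compare it with the address of a local deeper in the call chain. The sketch below is an illustration under assumptions, not an SDK facility: `StackProbe` and `deep_call` are hypothetical names, the figure only covers locals that actually live in stack memory, and an optimizer may inline or merge frames.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: remembers one stack address and reports the distance to later ones.
class StackProbe
{
public:
	// Call near the top of the call chain you want to measure.
	void mark_base()
	{
		volatile char marker = 0;
		base_ = reinterpret_cast<std::uintptr_t>(&marker);
	}

	// Approximate number of bytes between the marked base and the current frame.
	std::size_t bytes_from_base() const
	{
		volatile char marker = 0;
		const std::uintptr_t here = reinterpret_cast<std::uintptr_t>(&marker);
		// Absolute difference, so the stack growth direction does not matter.
		return static_cast<std::size_t>(here > base_ ? here - base_ : base_ - here);
	}

private:
	std::uintptr_t base_ = 0;
};

// Example callee with a large local buffer, so the effect is visible.
// Returns the distance measured at the deepest level of the recursion.
std::size_t deep_call(const StackProbe& probe, int depth)
{
	volatile char buffer[4096];
	buffer[0] = 1;
	const std::size_t used = (depth > 0) ? deep_call(probe, depth - 1)
	                                     : probe.bytes_from_base();
	buffer[4095] = 1; // touched after the call so the frame must stay alive
	return used;
}
```

Logging `bytes_from_base()` at the start of suspect functions would show which call paths approach the 65536-byte limit, though the numbers should be treated as estimates.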

Ideally, a higher stack size limit would resolve the issue. The current SDK is working fine for me and our customers, so for now, I plan to continue using it until I have a need to update.

That said, I would like to permanently solve it. We are nearing the end of the first cycle of development on this project, so time is limited. When things slow down a bit, I will dive deeper into this again and try to figure out what’s happening.

In the meantime, any further advice or findings you come across would be appreciated. Thank you for being responsive about this.

Arzop · June 3, 2024, 7:58am

Hi,

Everything should be in place for you to be able to track and solve the problem. Previously, Debug builds had a very (very) big stack size (for Edit & Continue purposes), but a guard page has since been added to detect stack overflows in a Debug build. Increasing the stack size is not a magic solution that fixes everything, and ideally it shouldn’t be changed.

Can you provide us with your new package sources so we can track and find the real problem?

Best Regards
Maxime / Asobo

Painless8118 · June 4, 2024, 6:31am

Just to add some supporting information, PMDG is seeing a new exception (and failure-to-run) with SDK 0.24.2 that was not present with 0.23.1. With SDK 0.24.2 our aircraft loads dead 100% of the time; reverting to and rebuilding with SDK 0.23.1 restores operation.

We were able to capture the problem. It seems that the simulator is chucking an exception on a std::vector constructor. The std::vector is a member variable of our EICAS class.

What’s interesting is that our EICAS (like our other systems) is stored on the heap, not the stack. And we really have no clue why a vector constructor would throw in practice regardless.

We can’t help but wonder, is the new stack protection logic or the memory shrinkage logic affecting us here? It doesn’t seem to make sense, but it’s the only thing we can point to.

Regards,
Chris P.
PMDG

EPellissier · June 4, 2024, 6:46am

Hi Chris (@Painless8118),

Would you be able to share (through private content) your EICAS class (.h + .cpp) so that we can have a look?
I suppose the module was built in Debug configuration and the WASM Debug Mode (Options menu) was activated in the sim?
I agree with you that our latest changes shouldn’t affect the case you describe so we’ll have to investigate.

Best regards,

Eric / Asobo

esoriaAsobo · June 4, 2024, 12:50pm

Hello,

After investigation, it is indeed a stack overflow. The objects pushed into the vector are constructed on the stack before being copied into the vector, and for some reason clang decides not to free the stack between pushes. A solution is to use emplace and emplace_back to construct the object directly in the vector, which reduces stack usage.
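
The difference between the two approaches can be sketched as follows. `Page` is a hypothetical element type, not the actual EICAS code. `push_back(Page(i))` first builds a temporary, typically in the caller's stack frame, and then moves or copies it, whereas `emplace_back(i)` forwards its arguments and constructs the element in the vector's heap storage.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical element type, large enough that temporaries matter against a 65536-byte stack.
struct Page
{
	explicit Page(int page_id) : id(page_id) {}

	int id;
	std::array<char, 8192> pixels{};
};

// Each Page(i) temporary lives in this function's stack frame before it is moved into the vector.
std::vector<Page> build_with_push_back(int count)
{
	std::vector<Page> pages;
	pages.reserve(static_cast<std::size_t>(count));
	for (int i = 0; i < count; i++)
		pages.push_back(Page(i));
	return pages;
}

// The constructor argument is forwarded and the Page is built in place in the vector's storage.
std::vector<Page> build_with_emplace_back(int count)
{
	std::vector<Page> pages;
	pages.reserve(static_cast<std::size_t>(count));
	for (int i = 0; i < count; i++)
		pages.emplace_back(i);
	return pages;
}
```

Both functions produce the same vector; only the amount of stack used while building it differs.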

In addition, here are two useful options to help with stack overflows:
/clang:-fstack-usage (to put in “Configuration Properties / C/C++ / Command Line” in Additional Options): it generates a .su file for each generated .obj, placed in the same folder. A .su file lists, for each function in the .obj, the stack usage in bytes. This figure does not take into account calls to other functions, only the stack used by local variables. Keep in mind that the default stack size is 65536 bytes, and that the stack used at runtime depends on the call stack and will probably be higher than the maximum stack usage reported in the .su file.

-zstack-size=n (to put in “Configuration Properties / Linker / Command Line” in Additional Options): this option sets the size of the stack to n bytes. n must be a multiple of 16. Setting it below 65536 may not be a good idea, and using a multiple of 65536 should prevent some problems. Do not set an unnecessarily high value: the memory allocated to the stack cannot be used for anything else, and we may also add a limit in the future.

Best Regards
Etienne / Asobo