Replying to Marc and Nala,

It doesn't seem appropriate to talk about C/C++ here, so I will skip that discussion, except to say: it's all about performance. Flexibility is great, but if an API cannot be implemented so that it's fast, then it's not a good API. It's very hard to determine whether an API is fast without measuring it. Thinking about it is often misleading; reading the implementation easily leads to false conclusions. Sad but true.

And so, an anecdote. Nala notes: "guile parameters are built on top of guile fluids". That is what the docs say, but there may be some deep implementation issues. A few years ago, I performance-tuned a medium-sized guile app (https://github.com/MOZI-AI/annotation-scheme/issues/98) and noticed that code mixing parameters with call/cc was eating approximately half the CPU time. For a job that took hours to run, "half" is a big deal. I vaguely recall that parameter lookups were taking tens of thousands of CPU cycles, or more.

Fluids/parameters need to be extremely fast: dozens of cycles, not tens of thousands.  Recall how this concept works in C/C++:
-- A global variable is stored in the data (or BSS) segment, and a lookup of its current value is a handful of CPU cycles: it's just a load from a fixed address.
-- A per-thread (thread-local) variable is stored in the thread control block (TCB), which is reached through a dedicated thread-pointer register (fs or gs on x86-64). So, a few cycles to find the TCB, a few more to compute the offset and maybe do a pointer chase.

Any sort of solution for per-thread storage in scheme, whether fluids or parameters, needs to be no more complex than the above. The scheme equivalent of the TCB for the currently running thread needs to be instantly available, not discovered by traversing an a-list or hash table. The location of the parameterized value should be no more than a couple of pointer-chases away, and dereferencing it should not require locks or atomics. It needs to be fast.

It needs to be fast to avoid the conclusions of the earlier-mentioned "Curse of Lisp" essay: most scheme programmers are going to be smart enough to cook up their own home-grown, thread-safe parameter object, but their home-grown thing will almost surely have mediocre performance. If "E" is going to be an effective interface to the OS, it needs to be fast. If you can't beat someone's roll-your-own system that they cooked up in an afternoon, what's the point?

Conclusion: srfi-226 should almost surely come with a performance-measurement tool suite that can spit out hard numbers for parameter-object lookups per microsecond while running 12 or 24 threads. If implementations cannot get these numbers into the many-dozens-per-microsecond range, then ... something is misconceived in the API.

My apologies, I cannot make any specific, explicit recommendations beyond the above.

-- Linas.

On Fri, Mar 18, 2022 at 4:51 PM Marc Nieper-Wißkirchen <xxxxxx@gmail.com> wrote:
Am Fr., 18. März 2022 um 19:22 Uhr schrieb Linas Vepstas
<xxxxxx@gmail.com>:
>
> Nala,
>
> On Fri, Mar 18, 2022 at 12:52 PM Nala Ginrut <xxxxxx@gmail.com> wrote:
>>
>> Please also consider the Scheme for the embedded system.
>> The pthread is widely used in RTOS which could be the low-level implementation of the API,
>
>
> After a quick skim, srfi-18 appears to be almost exactly a one-to-one mapping into the pthreads API.

It was also influenced in parts by Java and the Windows API, I think,
but Marc Feeley will definitely know better.

> The raw thread-specific-set! and thread-specific in srfi-18 might be "foundational" but are unusable without additional work: at a minimum, the specific entries need to be alists or hash tables or something -- and, as mentioned earlier, the guile concept of "fluids" seems (to me) to be the superior abstraction, the abstraction that is needed.

Please add your thoughts about it to the SRFI 226 mailing list. As far
as Guile fluids are concerned, from a quick look at them, aren't they
mostly what parameter objects are in R7RS?

>
>>
>> and green-thread is not a good choice for compact embedded systems.
>
>
> My memory of green threads is very dim; as I recall, they were ugly hacks to work around the lack of thread support in the base OS, but otherwise offered zero advantages and zillions of disadvantages.   I can't imagine why anyone would want green threads in this day and age, but perhaps I am criminally ignorant on the topic.

For SRFI 18/226, it doesn't matter whether green threads or native
threads underlie the implementation. There can even be a mixture like
mapping N Scheme threads to one OS thread.

>>> Actual threading performance depends strongly on proprietary (undocumented) parts of the CPU implementation.  For example, locks are commonly implemented on cache lines, either in L1 or L2 or L3. Older AMD CPUs seem to have only one lock for every 6 CPUs (I think that means the lock hardware is in the L3 cache? I dunno) and so it is very easy to stall with locked cache-line contention. The very newest AMD CPUs seem to have one lock per CPU (so I guess they moved the lock hardware to the L1 cache??) and so are more easily parallelized under heavy lock workloads.  Old PowerPCs had one lock per L1 cache, if I recall correctly.  So servers work better than consumer hardware.
>>>
>>> To be clear: mutexes per se are not the problem; atomic ops are.  For example, in C++, the reference counts on shared pointers use atomic ops, so if your C++ code uses lots of shared pointers, you will be pounding the heck out of the CPU lock hardware, and all of the CPUs are going to be snooping on the bus, checkpointing and invalidating and rolling back like crazy, hitting a very hard brick wall on some CPU designs.  I have no clue how much srfi-18 or fibers depend on atomic ops, but these are real issues that hurt real-world parallelizability.  Avoid splurging with atomic ops.

If you use std::shared_ptr, many uses of std::move are your friend or
you are probably doing it wrong. :)

>>> As to hash-tables:  lock-free hash tables are problematic. Facebook has the open-source "folly" C/C++ implementation for lockless hash tables. Intel has one too, but the documentation for the intel code is... well I could not figure out what intel was doing.  There's some cutting-edge research coming out of Israel on this, but I did not see any usable open-source implementations.
>>>
>>> In my application, the lock-less hash tables offered only minor gains; my bottleneck was in the atomics/shared-pointers. YMMV.

Unfortunately, we don't have all the freedom of C/C++. While it is
expected that a C program will crash when a hash table is modified
concurrently, most Scheme systems are expected to handle errors
gracefully and not crash. We may need a few good ideas to minimize the
amount of locking needed.

--
You received this message because you are subscribed to the Google Groups "scheme-reports-wg2" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xxxxxx@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scheme-reports-wg2/CAEYrNrTBaY%3Dm%2BuRSmir9e%3Dr2MENDVE5RELmAPQKNqWro9H-zqQ%40mail.gmail.com.


--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.