Message boards :
Number crunching :
Better sleep on Windows - new round
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
Opening this thread so non-Lunatics members can join the discussion. Tests done so far: http://lunatics.kwsn.info/index.php/topic,1812.msg61015.html#msg61015

What is strange regarding Sleep(0) vs SwitchToThread (STT) behavior: r3500: class SleepQuantum: total=2.8579862, N=3, <>=0.95266207, min=0.93661302 max=0.97626472

SETI apps news. We're not gonna fight them. We're gonna transcend them.
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
LoL, and while re-checking for a typo it seems I got the answer: STT doesn't switch context if there is no other active thread on that particular CPU. Mike had an idle core, so there were no other active threads, no time slice to give up, and so on... Uh...

EDIT: if so, these results can provide an estimate of the context-switching overhead! 0.0017884213 - 0.001445154 ~ 0.00034 (ms) = 34us - the approximate cost of a single context switch on Mike's host :)

@Mike - could you repeat now with ALL 8 cores busy, please?
Shaggie76 (Joined: 9 Oct 09, Posts: 282, Credit: 271,858,118, RAC: 196)
If all processes have the same priority, Sleep(0) and STT should behave very similarly. My assumption was that you'd have your GPU tasks running at high priority and you wanted to let them yield the rest of their quantum to a low-priority task so it could crunch for a bit between polls. I was not aware of the constraint that it only gives its quantum to threads running on the same processor. I find the definition a bit imprecise -- same logical processor or physical processor? Is it only when process affinity masks are used that it can't run an arbitrary thread?

I checked our engine's code history, and as recently as 2014 there was a fix where a Sleep(0) was replaced with SwitchToThread() to fix a deadlock. The case was busy priority inversion: a low-priority thread was holding a spin-lock, and a high-priority thread started spinning on the lock; Sleep(0) wouldn't yield to the low-priority thread, so the lock was never released. This was exposed on a soak machine that was definitely a single physical processor with no thread-affinity-mask monkey business (I'm guessing it was a dual-core machine, though; certainly not single-core). So in this case STT yielded to a lower-priority thread where Sleep(0) did not (as per the docs), but it's not clear whether the thread that yielded was "on the same (processor|logical core)."

Regardless: if you're trying to share the core the GPU task is running on, presumably there's a suspended low-priority task waiting to run "on the same processor". You could always count the number of times STT fails to see if it's actually yielding.

I mentioned this before and should mention it again: if STT fails you probably want to call _mm_pause() a bunch of times (or YieldProcessor() if you prefer -- same thing). I.e., if you're going to spin, you should do it in a way that is nice for hyper-threading.
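The priority-inversion deadlock described above is easy to reproduce with any spin lock that never yields. Below is a minimal, hedged sketch in portable C++ rather than the Win32 calls under discussion: std::this_thread::yield() stands in for SwitchToThread(), and on Windows the inner spin would also execute _mm_pause(). The class and function names are illustrative, not from any SETI build.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Minimal test-and-test-and-set spin lock (portable sketch).
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Try to grab the lock with an atomic exchange.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Spin on a plain load first (cheap, cache-friendly)...
            int spins = 0;
            while (locked_.load(std::memory_order_relaxed)) {
                if (++spins > 1000) {
                    // ...then yield so a lower-priority lock holder can run.
                    // This is the step where a Sleep(0) that refuses to
                    // yield to lower-priority threads can deadlock.
                    std::this_thread::yield();
                    spins = 0;
                }
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Two threads hammer a shared counter under the lock; returns the
// final count so correctness is easy to check.
int demo_spinlock(int iters_per_thread) {
    SpinLock lock;
    int counter = 0;
    auto worker = [&] {
        for (int i = 0; i < iters_per_thread; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return counter;
}
```

The yield after a bounded spin is the important part: a pure busy spin at high priority can starve the very thread it is waiting on.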
Without these nice pause instructions, even if I reserve a core for my GPU tasks, if they're spinning on a busy-wait they'll starve the task running on the other half of the core. This was particularly important for us back in the Xbox 360/PS3 days. In fact, if you aren't going to sleep at all and you just want to spin on OpenCL polling, you should seriously consider injecting a few hundred cycles of _mm_pause() before you poll again.

I was thinking about the bizarre case of Sleep(1) actually improving throughput marginally: I wonder if smashing into the kernel in a tight loop might be unhealthy (maybe starving the GPU driver a bit?). Busy-waiting is one thing, but a round trip into the kernel with a protection escalation and back again might be more costly than necessary. It's interesting to measure the CPU time of the GPU process, but what's missing is the throughput of the adjacent CPU tasks.

Finally: I wouldn't stress about using timeBeginPeriod(TIMECAPS.wPeriodMin) -- we've been doing that for over a decade and nobody has ever complained.
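The "inject a few hundred cycles of _mm_pause() before you poll again" advice can be sketched like this. Hedged: cpu_relax() and poll_with_backoff() are illustrative names; the x86 pause intrinsic is guarded so the sketch still compiles on other architectures, and on Windows/MSVC it would be _mm_pause() (aka YieldProcessor()).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Emit the x86 spin-wait hint if available; otherwise do nothing.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

// Busy-poll a flag, but burn a couple hundred pause hints between
// polls so the sibling hyper-thread keeps its execution resources.
// Returns the number of polls taken, just for demonstration.
long poll_with_backoff(std::atomic<bool>& done) {
    long polls = 0;
    while (!done.load(std::memory_order_acquire)) {
        ++polls;
        for (int i = 0; i < 200; ++i)
            cpu_relax();
    }
    return polls;
}

// Usage: one thread finishes "work" after ~1 ms, another polls for it.
bool demo_poll() {
    std::atomic<bool> done{false};
    std::thread setter([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        done.store(true, std::memory_order_release);
    });
    long polls = poll_with_backoff(done);
    setter.join();
    return polls >= 0 && done.load();
}
```

Unlike Sleep(1), this never enters the kernel, so it suits waits far shorter than a scheduler tick.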
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
It seems we have quite different use cases, hence the different outcomes. As I understand it, you have a multithreaded app. The differences that implies:

1) To make use of idle CPU cycles, you don't need to switch context. So you always treat context switching as an "absolute evil", pure overhead. In the current SETI case we have two different processes: the GPU-driving process and a separate CPU-only process. Hence, to make use of idle CPU cycles I inevitably have to switch to another process, that is, to switch context. This diminishes the advantage of STT over Sleep.

2) In the current build there are regions a few hundred us long where the CPU could be idle, but the NV OpenCL runtime uses it for polling and keeps it busy. I'm looking for a way to sleep less than 1 ms in these areas. Unfortunately, STT can't be used for this. As the previous test showed, on a fully loaded CPU (that is, when the GPU-driving process and a CPU-only process share the same CPU) STT yields to the lower-priority CPU-only process (but so do Sleep(0) and Sleep(1)). But it yields the remainder of the time slice. Since a time slice is ~10 ms, the average remaining slice will be ~5 ms (and indeed I see a ~4 ms average sleep quantum in this case). Obviously, that's far too long for a ~300 us GPU kernel - GPU performance would drop a lot. On the other hand, Sleep(1) plus increased multimedia-timer precision gives a stable 1 ms-only yield to the other process (much better!). Of course, even that is too long for a 300 us kernel, so I use such a sleep only where kernel sequences a few ms long are possible.

All this leaves STT, just like Sleep(0), without a usable application in the current app. I will try to use _mm_pause in those places where a sleep of less than 1 ms would be required. That will be the topic of the next experiment.
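The ~1 ms vs ~4 ms quanta above come from measuring how long a requested sleep actually takes. A portable sketch of that kind of measurement (avg_sleep_ms() is an illustrative name, not Raistmer's actual SleepQuantum class; std::this_thread::sleep_for stands in for Sleep(1)):

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Ask for a 1 ms sleep 'reps' times and return the average time
// actually slept, in milliseconds. On Windows the answer depends on
// the multimedia timer period: with the default scheduler tick a 1 ms
// request can stretch to ~15.6 ms, while after
// timeBeginPeriod(wPeriodMin) it comes back close to 1 ms.
double avg_sleep_ms(int reps) {
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int i = 0; i < reps; ++i) {
        auto t0 = clock::now();
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        auto t1 = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / reps;
}
```

A sleep always lasts at least the requested duration; what varies by platform and timer resolution is how much longer.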
Stephen "Heretic" (Joined: 20 Sep 12, Posts: 5557, Credit: 192,787,363, RAC: 628)
> LoL, and while re-looking for typo seems I got an answer: STT doesn't switch context if no active thread on that particular CPU. Mike had idle core, so no other active threads, not slice give up and so on... Uh...

Sorry to be pedantic, but I think that is 340 ns.
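Stephen's correction checks out against the figures earlier in the thread: 0.0017884213 ms minus 0.001445154 ms is about 0.00034 ms, which is roughly 340 ns, not 34 us. A one-liner to verify:

```cpp
#include <cassert>

// Difference between the Sleep(0) and SwitchToThread per-call averages
// reported earlier in the thread, converted from milliseconds to
// nanoseconds (1 ms = 1e6 ns).
double context_switch_ns() {
    const double sleep0_ms = 0.0017884213;
    const double stt_ms    = 0.001445154;
    return (sleep0_ms - stt_ms) * 1.0e6;
}
```

The result is about 343 ns per call on Mike's host.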
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
Indeed. On my i3470 I measured ~3 us - still an order of magnitude off, though in the other direction. Not sure how precise such calculations are at all.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.