Better sleep on Windows - new round

Raistmer (Volunteer developer, Volunteer tester)
Message 1812167 - Posted: 24 Aug 2016, 12:20:43 UTC
Last modified: 24 Aug 2016, 12:23:12 UTC

To enable discussion for non-Lunatics members.

Here are the tests done so far:
http://lunatics.kwsn.info/index.php/topic,1812.msg61015.html#msg61015

What is strange regarding Sleep(0) vs SwitchToThread (STT) behavior:

r3500:class SleepQuantum: total=2.8579862, N=3, <>=0.95266207, min=0.93661302 max=0.97626472
Sleep0: class SleepQuantum: total=4.8358912, N=2704, <>=0.0017884213, min=0.00054984231 max=0.4228799
Sleep1: class SleepQuantum: total=2148.8459, N=1791, <>=1.1998023, min=0.86739361 max=3.0483601
STT: class SleepQuantum: total=3.9076965, N=2704, <>=0.001445154, min=0.0004952898 max=0.0027276319

Yep, 7 cores were in use.


That shows the need for a fixed-length sleep when the CPU is underloaded.
The GPU app has higher priority, so if some free CPU resource is available, it will be scheduled for execution there.
What is strange is that there is no difference between STT and Sleep(0) behavior. From what I read on the main forums, Sleep(0) should return to the same process immediately, i.e. just spin with the CPU fully busy, while STT should always give up the CPU slice. So its SleepQuantum counter should have a bigger mean value (it is hard to imagine that in the vast majority of the 2704 occurrences the process was exactly at the end of its current time slice). Nevertheless, one can see VERY close mean times (<>) for Sleep(0) and STT. Strange. If so, I don't see any advantage of STT at all :-\
[NB: the Windows time slice is ~10-15 ms and the STT mean is 0.0014 ms]
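
For reference, a minimal sketch (my own illustration, not necessarily the actual instrumentation in the build) of how such per-call times can be gathered: wrap each Sleep(0)/SwitchToThread() call in QueryPerformanceCounter and accumulate total/count/min/max in ms, like the SleepQuantum lines above.

// Sketch only (assumption: not the app's real instrumentation).
// Times a single yield call and accumulates stats in milliseconds.
#include <windows.h>
#include <cfloat>
#include <cstdio>

struct SleepQuantumStats {
    double total = 0.0;             // total ms spent in the yield call
    unsigned n = 0;                 // number of calls
    double mn = DBL_MAX, mx = 0.0;  // min / max single-call time, ms

    void add(double ms) {
        total += ms; ++n;
        if (ms < mn) mn = ms;
        if (ms > mx) mx = ms;
    }
    void print(const char* tag) const {
        std::printf("%s: total=%g, N=%u, <>=%g, min=%g max=%g\n",
                    tag, total, n, n ? total / n : 0.0, mn, mx);
    }
};

template <typename YieldFn>
void timed_yield(YieldFn yield, SleepQuantumStats& s) {
    static LARGE_INTEGER freq = [] { LARGE_INTEGER f; QueryPerformanceFrequency(&f); return f; }();
    LARGE_INTEGER t0, t1;
    QueryPerformanceCounter(&t0);
    yield();                        // e.g. [] { Sleep(0); } or [] { SwitchToThread(); }
    QueryPerformanceCounter(&t1);
    s.add(1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart));
}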


Raistmer (Volunteer developer, Volunteer tester)
Message 1812171 - Posted: 24 Aug 2016, 12:29:00 UTC - in response to Message 1812167.  
Last modified: 24 Aug 2016, 12:34:28 UTC

LoL, and while re-checking for the typo it seems I got the answer: STT doesn't switch context if there is no active thread on that particular CPU. Mike had an idle core, so no other active threads, no time slice given up, and so on... Uh...

EDIT: if so, these results can provide an estimate of context-switching overhead!

0.0017884213-0.001445154 ~ 0.00034(ms)=34us - approx cost of single context switch for Mike's host :)

@Mike - could you repeat with ALL 8 cores busy now, please?

Shaggie76
Message 1812189 - Posted: 24 Aug 2016, 13:53:28 UTC

If all processes have the same priority, Sleep(0) and STT should behave very similarly. My assumption was that you'd have your GPU tasks running at high priority and you wanted to let them yield the rest of their quantum to a low-priority task to crunch for a bit between polling.

I was not aware of the constraint that it only gives its quantum to threads running on the same processor. I find the definition a bit imprecise -- same logical processor or physical processor? Is it only when process affinity masks are used that it can't run an arbitrary thread?

I checked our engine's code history and as recently as 2014 there was a fix where a Sleep(0) was replaced with STT() to fix a deadlock. The case was a busy-wait priority inversion: a low-priority thread was holding a spin-lock and a high-priority thread started spinning on the lock; Sleep(0) wouldn't yield to the low-priority thread and so the lock would never be released. This was exposed on a soak machine that was definitely a single physical processor with no thread-affinity-mask monkey business (I'm guessing it was a dual-core machine though; certainly not a single-core). So in this case STT yielded to a lower-priority thread where Sleep(0) did not (as per the docs), but it's not clear whether the thread it yielded to was "on the same (processor|logical core)."

Regardless: if you're trying to share the core the GPU task is running on presumably there's a suspended low-priority task waiting to run "on the same processor". You could always count the number of times STT fails to see if it's actually yielding.
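
(SwitchToThread() returns nonzero when it actually switched to another ready thread, so counting the failures is trivial; a tiny sketch, names are placeholders:)

// Sketch: count how often SwitchToThread() actually yields vs. returns immediately.
#include <windows.h>

static LONG stt_yielded = 0;   // execution was switched to another ready thread
static LONG stt_noop = 0;      // no other ready thread available; returned at once

inline void counted_switch_to_thread() {
    if (SwitchToThread())
        InterlockedIncrement(&stt_yielded);
    else
        InterlockedIncrement(&stt_noop);
}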

I mentioned this before and should mention it again: if STT fails you probably want to call _mm_pause() a bunch of times (or _YieldProcessor() if you prefer -- same thing). I.e., if you're going to spin you should do it in a way that is nice for hyper-threading. Without these nice pause instructions, even if I reserve a core for my GPU tasks, if they're spinning on a busy wait they'll starve the task running on the other half of the core. This was particularly important for us back in the Xbox360/PS3 days.

In fact, if you aren't going to sleep at all and you just want to spin on OpenCL polling you should seriously consider injecting a few hundred cycles of _mm_pause() before you poll again.
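
Something like this (just a sketch; the poll callback and iteration count are placeholders, not your actual code):

// Sketch: hyper-threading-friendly busy wait between OpenCL status polls.
#include <immintrin.h>   // _mm_pause

template <typename PollFn>
void spin_wait_polite(PollFn poll_done) {
    while (!poll_done()) {
        // a batch of pause instructions before polling again, so the
        // sibling hardware thread on the same core isn't starved
        for (int i = 0; i < 64; ++i)
            _mm_pause();
    }
}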

I was thinking about the bizarre case of Sleep(1) actually improving throughput marginally: I wonder if smashing into the kernel in a tight loop might be unhealthy (maybe starving the GPU driver a bit?). Busy waiting is one thing, but a trip into the kernel with a protection escalation and then back again might be more costly than necessary.

It's interesting to measure the CPU time of the GPU process but what's missing is the throughput of the adjacent CPU tasks.

Finally: I wouldn't stress about using timeBeginPeriod(TIMECAPS.wPeriodMin) -- we've been doing that for over a decade and nobody has ever complained.
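
A minimal sketch of that pattern (function names are mine, not from any particular codebase):

// Sketch: raise the system timer resolution to its minimum supported period
// so Sleep(1) actually wakes after ~1 ms. Link with winmm.lib.
#include <windows.h>
#include <mmsystem.h>   // timeGetDevCaps / timeBeginPeriod / timeEndPeriod

static UINT g_timer_period = 0;

void raise_timer_resolution() {
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == TIMERR_NOERROR) {
        g_timer_period = tc.wPeriodMin;   // usually 1 ms
        timeBeginPeriod(g_timer_period);
    }
}

void restore_timer_resolution() {
    if (g_timer_period)
        timeEndPeriod(g_timer_period);    // pair every Begin with an End on shutdown
}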

Raistmer (Volunteer developer, Volunteer tester)
Message 1812196 - Posted: 24 Aug 2016, 15:19:09 UTC - in response to Message 1812189.  

It seems we have quite different use cases, hence different outcomes.
As I understood it, you have a multithreaded app.
The differences this implies:
1) To make use of idle CPU cycles, there is no need to switch context.
So you always consider context switching an "absolute evil" and pure overhead.
In the current SETI case we have 2 different processes: a GPU-driving process and another, CPU-only process.
Hence, to make use of idle CPU cycles I inevitably have to switch to another process, that is, to switch context. This diminishes the advantage of STT versus Sleep.
2) In the current build there are spans a few hundred µs long where the CPU could be idle, but the NV OpenCL runtime uses it for polling and keeps it busy. I'm looking for a way to sleep for less than 1 ms in these areas.
Unfortunately, STT can't be used for this task.
As the previous test showed, on a fully loaded CPU (that is, when the GPU-driving process and a CPU-only process share the same CPU) STT does yield to the lower-priority CPU-only process (but Sleep(0) and Sleep(1) do too). However, it yields the remaining time slice. Because a time slice is ~10 ms, the average remaining time slice will be ~5 ms (and indeed I see a ~4 ms average sleep quantum in this case). Obviously, that's too much for a ~300 µs GPU kernel - GPU performance would drop a lot.
On the other side, Sleep(1) plus increased multimedia timer precision allows a stable yield of only 1 ms (much better!) to another process. Of course, that is still too much for a 300 µs kernel, so I use such a sleep only where kernel/kernel sequences of a few ms are possible.
All this leaves STT, just like Sleep(0), without any useful application in the current app.

I will try to use _mm_pause in those places where a sleep of less than 1 ms would be required. That will be the topic of the next experiment. Roughly what I have in mind is sketched below.
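
(Just a sketch; the thresholds and names are placeholders, not the actual app code: spin with _mm_pause for the sub-millisecond gaps, keep Sleep(1) only where kernel sequences of a few ms are expected.)

// Sketch: tiered wait for GPU completion polling (placeholder thresholds).
// - expected wait well under 1 ms -> pause-spin, stay on the CPU
// - expected wait of several ms   -> Sleep(1) with raised timer resolution
#include <windows.h>
#include <immintrin.h>

template <typename PollFn>
void wait_for_gpu(PollFn poll_done, double expected_ms) {
    if (expected_ms < 1.0) {
        while (!poll_done())
            for (int i = 0; i < 64; ++i)
                _mm_pause();          // cheap, HT-friendly spin
    } else {
        while (!poll_done())
            Sleep(1);                 // ~1 ms with timeBeginPeriod(1) in effect
    }
}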

Stephen "Heretic" (Volunteer tester)
Message 1814494 - Posted: 2 Sep 2016, 1:12:53 UTC - in response to Message 1812171.  

LoL, and while re-checking for the typo it seems I got the answer: STT doesn't switch context if there is no active thread on that particular CPU. Mike had an idle core, so no other active threads, no time slice given up, and so on... Uh...

EDIT: if so, these results can provide an estimate of context-switching overhead!

0.0017884213-0.001445154 ~ 0.00034(ms)=34us - approx cost of single context switch for Mike's host :)

@Mike - could you repeat with ALL 8 cores busy now, please?



. . Sorry to be pedantic, but I think that is 340 ns.

.

Raistmer (Volunteer developer, Volunteer tester)
Message 1814574 - Posted: 2 Sep 2016, 7:46:08 UTC - in response to Message 1814494.  
Last modified: 2 Sep 2016, 7:49:23 UTC


. . Sorry to be pedantic, but I think that is 340 ns.
.

Indeed.

On my i3470 I got ~3 µs - still an order of magnitude off, though in the other direction. Not sure how precise such calculations are at all.
