GUPPI Rescheduler for Linux and Windows - Move GUPPI work to CPU and non-GUPPI to GPU

Message boards : Number crunching : GUPPI Rescheduler for Linux and Windows - Move GUPPI work to CPU and non-GUPPI to GPU

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1810917 - Posted: 21 Aug 2016, 4:51:42 UTC - in response to Message 1810914.  

run 0.5 ...with no issues

As I read the code - there are at least 2 bugs:
- AstroPulse work will not be identified (because "ap" is searched for in the wrong place/file), so the previous bug is not fixed
- it may enter an infinite loop (hang) if the string "ap" is found (despite being searched for in the wrong place)
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1810917
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1810919 - Posted: 21 Aug 2016, 5:08:20 UTC - in response to Message 1810917.  

run 0.5 ...with no issues

As I read the code - there are at least 2 bugs:
- AstroPulse work will not be identified (because "ap" is searched for in the wrong place/file), so the previous bug is not fixed
- it may enter an infinite loop (hang) if the string "ap" is found (despite being searched for in the wrong place)

Haven't experienced either, but I did read your notes and thanks for the heads-up!
ID: 1810919
Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1810926 - Posted: 21 Aug 2016, 5:20:49 UTC

Sigh... version 0.51 uploaded... please update and my apologies.

Thank you BilBg for catching my bonehead errors. I shouldn't code on a Saturday night even when sober. :^p

Or maybe at any other time... lol. Maybe I can finally sleep now. Happily my response time is now much faster.
ID: 1810926
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1810929 - Posted: 21 Aug 2016, 5:33:41 UTC - in response to Message 1810926.  

Sigh... version 0.51 uploaded... please update and my apologies.

Thank you BilBg for catching my bonehead errors. I shouldn't code on a Saturday night even when sober. :^p

Or maybe at any other time... lol. Maybe I can finally sleep now. Happily my response time is now much faster.

Strictly cosmetic, and maybe I'm missing something, but remember the (*ux) LF vs (Dos/Win) CR/LF bit in your readmes. Seems to me I had an old ADDLF program around here that added the missing bit, but it's been 20 years since I ever thought about it.
Off to update the .exe x 5 :) and thanks again ...
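
For anyone who wants to fix the readme line endings by hand, the conversion could be sketched in Python roughly as follows. This is only an illustration: the function name to_crlf is made up here, and it assumes the file is small enough to handle in memory.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert Unix (LF) line endings to DOS/Windows (CR/LF).

    Existing CR/LF pairs are normalised to LF first so they are not doubled.
    """
    unix = data.replace(b"\r\n", b"\n")
    return unix.replace(b"\n", b"\r\n")


if __name__ == "__main__":
    sample = b"line one\nline two\r\nline three\n"
    print(to_crlf(sample))
```

Working on bytes rather than text avoids any guessing about the file's encoding.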
ID: 1810929
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1810950 - Posted: 21 Aug 2016, 7:03:59 UTC - in response to Message 1810926.  
Last modified: 21 Aug 2016, 7:17:16 UTC

 
Just out of curiosity:

I searched my client_state.xml (279 698 bytes) to estimate the probability of the two bugs combined to hang the (now old v0.5) program:
find /c "ap" client_state.xml - gives count 321
find /c "ap_" client_state.xml - gives count 10

If randomly "poking" in client_state.xml
321 / 279698 = 0.0011 = 0.11 % to find "ap" and hang (1 per 1000 runs) - "ap" is found in many words inside client_state.xml : apic swap app api
(Yes, it was not supposed to poke client_state.xml for the short "ap" string)

10 / 279698 = 0.000036 = 0.0036 % to find "ap_" - if using ap_ : ~30 times less chance for the bug to show itself (would have been 1 per 30000 runs)
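
The estimate above can be reproduced with a few lines of Python. This is a rough sketch using the counts and file size quoted in this post; the function name is hypothetical, and it treats every byte position as equally likely to be "poked", which is the same simplification as above.

```python
def hit_probability(matches: int, file_size: int) -> float:
    """Chance that a random position in the file lands on a match."""
    return matches / file_size


FILE_SIZE = 279_698   # bytes in client_state.xml, from the post
COUNT_AP = 321        # result of: find /c "ap" client_state.xml
COUNT_AP_ = 10        # result of: find /c "ap_" client_state.xml

p_ap = hit_probability(COUNT_AP, FILE_SIZE)
p_ap_ = hit_probability(COUNT_AP_, FILE_SIZE)

print(f'"ap":  {p_ap:.4%}  (about 1 in {round(1 / p_ap)})')
print(f'"ap_": {p_ap_:.4%}  (about 1 in {round(1 / p_ap_)})')
print(f"ratio: {p_ap / p_ap_:.1f}x")
```

Note that find /c counts lines containing the string rather than individual occurrences, so the figures are approximate either way.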
 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1810950
I3APR
Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1811852 - Posted: 23 Aug 2016, 13:48:30 UTC

Hello, a couple of days ago I started GUPPIRescheduler and it did a fine job swapping 36 WUs between CPU<->GPU.

Today I did it again :

1) Stopped BOINC, and waited about 1 minute
2) Backed up client_state.xml
3) Started GUPPIRescheduler, here's the output :

C:\ProgramData\BOINC>GUPPIRescheduler.exe
Mr. Kevvy's GUPPI Rescheduler v0.4 - (c)2016 Kevin Dorner

Reading configuration files...
Found sched_request GPU platform=windows_intelx86 app_version=1 version_num=812
CPU platform=windows_intelx86 app_version=3 version_num=800 and GPU plan_class=opencl_nvidia_SoG
Searching for and moving work units in client state...
Writing updated configuration client_state.xml...
Done: 57 non-GUPPI workunits moved to GPU and 57 GUPPI workunits moved to CPU.


4) Started Boinc again, after about 10/15 seconds
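
Step 2 above (backing up client_state.xml) is easy to script. A minimal sketch in Python, assuming the BOINC data directory is passed in by the caller; the function name and the timestamped backup naming scheme are made up for illustration and are not part of GUPPIRescheduler.

```python
import shutil
import time
from pathlib import Path


def backup_client_state(data_dir: str) -> Path:
    """Copy client_state.xml to a timestamped backup in the same directory."""
    src = Path(data_dir) / "client_state.xml"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"client_state.{stamp}.bak.xml")
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    return dst
```

With the directory shown in the console output above, the call would be backup_client_state(r"C:\ProgramData\BOINC"), made only after BOINC has fully stopped.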

Now all I've got is the popup window with "Communicating with Boinc Client. Please wait.."

Task manager shows some activity (16 tasks), while Afterburner shows that only two of the five GPUs are working; it has been in this state for 10 minutes!

UPDATE : I clicked on "cancel" and I have lost connection with the project.
This is a disaster, since this critter was working with a RAC of about 73,000!

I have now tried to shut down BOINC and return to the saved client state but some process is preventing it.
I'm rebooting the server.
Hope I'm not saying Bye Bye to WOW... :-(

A
ID: 1811852
I3APR
Joined: 23 Apr 16
Posts: 99
Credit: 70,717,488
RAC: 0
Italy
Message 1811861 - Posted: 23 Aug 2016, 14:08:37 UTC

Ok, after reboot, it seems that BOINC is now working, but I've lost about 5 WUs (self-aborted) and had 14 WUs crash:



I don't know...last time it worked well, and btw I'm grateful to Mr. Kevvy and everyone else helping with side scripts/programs to help us, but let me say I've lost about 1000/1500 credits, plus had all the GPU jobs reset to zero, so I'm a bit reluctant to run it again in the future... :-(

A.
ID: 1811861
Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1811865 - Posted: 23 Aug 2016, 14:24:19 UTC - in response to Message 1811861.  
Last modified: 23 Aug 2016, 14:25:33 UTC

Hey A!

In Mr Kevvy's thread, please see his last post from 2 days ago:
https://setiathome.berkeley.edu/forum_thread.php?id=79954&postid=1810926

He writes at the end:
Happily my response time is now much faster.

so you might want to post a link to this thread in his thread (since he probably is "Subscribed" to his own thread and could get an auto-email sent when there is a new post...if he set his forum preferences that way).

Also, you could try sending him a private message (PM) in case he doesn't get an immediate email notification
(or even worse: he might not even be "subscribed" to his own thread ...cuz it's a forum bug, since you don't get automatically subscribed to your own thread!)

Hope that helps a bit,
RobG
ID: 1811865
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1811871 - Posted: 23 Aug 2016, 14:41:41 UTC - in response to Message 1811861.  

Works perfectly here. I put it in a scheduled task that runs automatically every 6 hours on 3 different hosts, and no errors have been reported. Something else must be happening on your side.
ID: 1811871
Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1811878 - Posted: 23 Aug 2016, 15:04:17 UTC - in response to Message 1811852.  
Last modified: 23 Aug 2016, 15:10:00 UTC

Mr. Kevvy's GUPPI Rescheduler v0.4


Current one is v0.51 and any work unit loss should be resolved. Download.

If BOINC Manager hangs and doesn't display anything when it's started, it means that it wasn't quit for long enough when GUPPIRescheduler was run and files were in use. Quit it, making sure to check the box to quit running apps, wait at least ten seconds (on Linux this is when the command prompt cursor stops flashing... a handy timer), then run GUPPIRescheduler again (as it did nothing the last time with the files open) and relaunch BOINC.
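
The "quit, wait, then run" advice above amounts to polling until the client has really exited. A hedged Python sketch: is_running is a caller-supplied check (hypothetical, for example a process-list query), and the timeout and polling values are arbitrary.

```python
import time
from typing import Callable


def wait_until_stopped(is_running: Callable[[], bool],
                       timeout: float = 60.0,
                       poll: float = 1.0) -> bool:
    """Poll is_running() until it returns False or the timeout expires.

    Returns True if the process stopped in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A wrapper script would only launch the rescheduler when this returns True, and report the problem otherwise instead of touching any files.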

Edit: I would also suggest to keep any discussion of it in its own thread to keep the board tidier without numerous threads about the same app.
ID: 1811878
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1811920 - Posted: 23 Aug 2016, 22:26:42 UTC

Mr. Kevvy,
I thought I recalled earlier that there was a potential issue with APs, but iirc 0.51 solved that, correct? Reason I'm asking is that I'm doing some development on Stubbles' script, and have tentatively added a flag to not invoke your rescheduler if there are APs present. I have tested here quite a bit, and not seen any issue with this.
One thing I'm also spending a good bit of time on is ensuring, via tasklist queries, that I know what is and isn't running when. Hopefully, that will prevent issues like mentioned here earlier.
By default, I am not going to shut down BOINCTasks, though again I've added a command line option to do that if the user wishes. Do you see any reason BOINCTasks needs to be down while you're running?
Finally, not sure if this is intentional, but the "Y" to proceed in 0.51 is case sensitive. Not sure if you intended that. That got me a couple times, when I wasn't looking closely, thought it had run and it hadn't due to my "y" instead of "Y" response. Dunno if you're willing to Case that?
Thanks, !!
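
A tasklist check like the one described above might look as follows in Python. This is a sketch, not the actual batch script: it assumes Windows, parses the CSV output of tasklist /FO CSV /NH, and the helper names are invented. boinc.exe is the usual client image name, but adjust it for your setup.

```python
import csv
import io
import subprocess


def images_in(tasklist_csv: str) -> set:
    """Extract lower-cased image names from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return {row[0].lower() for row in rows if row}


def is_running(image_name: str) -> bool:
    """Return True if a process with this image name is listed (Windows only)."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return image_name.lower() in images_in(out)
```

Keeping the parsing separate from the tasklist call makes it possible to test the logic on saved output without a running BOINC client.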
ID: 1811920
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1811943 - Posted: 24 Aug 2016, 0:08:43 UTC - in response to Message 1811920.  

I've run the 0.51 rescheduler now several times on all three crunchers that have had AP work on board and haven't ghosted any tasks. Looks like the bug is squashed to me.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1811943
Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1811954 - Posted: 24 Aug 2016, 0:36:02 UTC - in response to Message 1811920.  
Last modified: 24 Aug 2016, 0:36:59 UTC

Finally, not sure if this is intentional, but the "Y" to proceed in 0.51 is case sensitive. Not sure if you intended that. That got me a couple times, when I wasn't looking closely, thought it had run and it hadn't due to my "y" instead of "Y" response. Dunno if you're willing to Case that?
Thanks, !!


Being a newly-"Minted" Linux junkie I'm used to excessive case sensitivity and wondered if anyone would be thrown by that in Windows. I'll add that into the next version... assuming there is one! :^)

I've run the 0.51 rescheduler now several times on all three crunchers that have had AP work on board and haven't ghosted any tasks. Looks like the bug is squashed to me.


Excellent... I can sleep better now. ;^) Thank you for your feedback as it was quite helpful.
ID: 1811954
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1811999 - Posted: 24 Aug 2016, 2:49:34 UTC - in response to Message 1811943.  

I've run the 0.51 rescheduler now several times on all three crunchers that have had AP work on board and haven't ghosted any tasks. Looks like the bug is squashed to me.

Sorry, Mr. Kevvy, I should have provided feedback as well. Have been running on all 5 of my crunchers since you released 0.51. Plenty of home runs, no hits and no errors:) Across the 5 boxes, RAC up 6k since launch.
That's what prompted me to get interested in working with Stubbles to further develop his batch file to manage this.
Again, thanks!
ID: 1811999
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1812000 - Posted: 24 Aug 2016, 2:55:13 UTC - in response to Message 1811954.  

Being a newly-"Minted" Linux junkie I'm used to excessive case sensitivity and wondered if anyone would be thrown by that in Windows. I'll add that into the next version... assuming there is one! :^)

Old SCO/VXWorks guy here, though I haven't looked at it in 20+years. No worries about the case itself, problem is that any input other than "Y" results in program termination that could be missed when you're not looking closely. Worth it to strip the case just to eliminate the ambiguity and have it be more clear as to result either way, for my .02 worth:)
Thanks again ...
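
Stripping the case of the answer and reporting the outcome either way, as suggested above, could be sketched like this in Python. The thread does not show the rescheduler's source, so this is illustrative only and the function names are made up.

```python
def is_yes(answer: str) -> bool:
    """Accept 'y', 'Y', 'yes', 'YES', ... with surrounding whitespace ignored."""
    return answer.strip().lower() in {"y", "yes"}


def confirm(prompt: str = "OK to proceed (Y to continue)? ") -> bool:
    """Ask the user and say explicitly what was decided."""
    ok = is_yes(input(prompt))
    print("Proceeding." if ok else "Aborted: nothing changed.")
    return ok
```

Printing a message on both branches addresses the real complaint: a silent exit on "y" is easy to mistake for a successful run.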
ID: 1812000
Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1812003 - Posted: 24 Aug 2016, 3:03:27 UTC
Last modified: 24 Aug 2016, 3:03:57 UTC

You're welcome! Thanks for the nice feedback. :^)

And now on to the not-so-good: I did eliminate the possible endless loop BilBg pointed out, but with plenty of "ap" work in the cache when I ran it recently, it hung on the first line. I looked over the source and I am not sure why... it should break out either to terminate or continue. However this was a one-off: I launched BOINC and retried a few minutes later and it worked (moved all of one work unit... bloody GUPPIs. :^p)

So if anyone else has this, break out of it with Ctrl+c and please let me know if a retry doesn't fix it a bit later. I'll be having a look at this. Right now unfortunately it's bedtime!
ID: 1812003
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1812007 - Posted: 24 Aug 2016, 3:19:21 UTC - in response to Message 1812003.  

So if anyone else has this, break out of it with Ctrl+c and please let me know if a retry doesn't fix it a bit later. I'll be having a look at this. Right now unfortunately it's bedtime!

I'll beat on it and see if I can duplicate. Anything weird, I'll drop a note here. l8r
ID: 1812007
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1812038 - Posted: 24 Aug 2016, 5:54:29 UTC - in response to Message 1809907.  

Hm...
http://stackoverflow.com/questions/1383943/switchtothread-vs-sleep1
In general, Sleep(0) will be much more likely to yield a timeslice, and will ALWAYS yield to the OS, even if there are no other threads waiting. This is why adding a Sleep(0) in a loop will take the processor usage from 100% (per core) to near 0% in many cases. SwitchToThread will not, unless another thread is waiting for a time slice.


Sounds just diametrically different... definitely worth trying to see for yourself.


My experience has been that Sleep(0) won't yield to lower-priority threads but STT will; we had priority-inversion bugs as a result of using Sleep(0).

I'd try for(;;) { if(!SwitchToThread()) { CallMMPauseForAWhile(); } PollCUDA(); }

Hi, Shaggie76. Could you look at that thread please: http://lunatics.kwsn.info/index.php/topic,1812.msg61053.html#msg61053 - do you have any explanation of those results regarding Sleep(0) and STT behavior?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1812038
Luigi R.
Joined: 26 Nov 13
Posts: 10
Credit: 1,608,382
RAC: 0
Italy
Message 1812101 - Posted: 24 Aug 2016, 9:27:58 UTC
Last modified: 24 Aug 2016, 9:29:02 UTC

Hello,
this is my configuration for SETI.

app_info.xml
<app_info>
	<app>
		<name>setiathome_v8</name>
	</app>
	<file_info>
		<name>MBv8_8.05r3345_avx_linux64</name>
		<executable/>
	</file_info>
	<app_version>
		<app_name>setiathome_v8</app_name>
		<version_num>804</version_num>
		<platform>x86_64-pc-linux-gnu</platform>
		<cmdline></cmdline>
		<file_ref>
			<file_name>MBv8_8.05r3345_avx_linux64</file_name>
			<main_program/>
		</file_ref>
	</app_version>
	<app_version>
		<app_name>setiathome_v8</app_name>
		<version_num>805</version_num>
		<platform>x86_64-pc-linux-gnu</platform>
		<cmdline></cmdline>
		<file_ref>
			<file_name>MBv8_8.05r3345_avx_linux64</file_name>
			<main_program/>
		</file_ref>
	</app_version>
	<app>
		<name>setiathome_v8</name>
	</app>
	<file_info>
		<name>setiathome_8.01_x86_64-pc-linux-gnu__cuda60</name>
		<executable/>
	</file_info>
	<app_version>
		<app_name>setiathome_v8</app_name>
		<version_num>801</version_num>
		<platform>x86_64-pc-linux-gnu</platform>
		<coproc>
			<type>NVIDIA</type>
			<count>0.5</count>
		</coproc>
		<plan_class>cuda60</plan_class>
		<avg_ncpus>0.05</avg_ncpus>
		<max_ncpus>0.2</max_ncpus>
		<cmdline></cmdline>
		<file_ref>
			<file_name>setiathome_8.01_x86_64-pc-linux-gnu__cuda60</file_name>
			<main_program/>
		</file_ref>
	</app_version>
</app_info>


./GUPPIRescheduler output
Mr. Kevvy's GUPPI Rescheduler v0.51 - (c)2016 Kevin Dorner

Reading configuration files...
Found sched_request GPU platform=x86_64-pc-linux-gnu app_version=1 version_num=801
CPU platform=x86_64-pc-linux-gnu app_version=0 version_num=801 and GPU plan_class=cuda60
Searching for and moving workunits in client state...
No non-GUPPI workunits are assigned to CPU to move to GPU; no changes made.


Should app version be the same for CPU and GPU? Should I run cuda50?
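
To see at a glance which version numbers and plan classes an app_info.xml declares, something like the following Python sketch can help. It only reads the XML; treating entries without a coproc element as CPU versions is an assumption made here, not something confirmed about how GUPPIRescheduler decides.

```python
import xml.etree.ElementTree as ET


def list_app_versions(xml_text: str):
    """Return (version_num, plan_class, kind) for each <app_version>."""
    root = ET.fromstring(xml_text)
    versions = []
    for av in root.iter("app_version"):
        num = av.findtext("version_num", default="?")
        plan = av.findtext("plan_class", default="")
        # Assumption: a <coproc> child marks a GPU app version.
        kind = "GPU" if av.find("coproc") is not None else "CPU"
        versions.append((num, plan, kind))
    return versions
```

Under that assumption, the configuration posted above declares 804 and 805 as CPU versions and 801 (cuda60) as the GPU version, whereas the rescheduler output reports CPU version_num=801, which may be worth a closer look.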
ID: 1812101
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1812104 - Posted: 24 Aug 2016, 9:36:08 UTC - in response to Message 1812003.  
Last modified: 24 Aug 2016, 9:47:53 UTC

And now on to the not-so-good: I did eliminate the possible endless loop BilBg pointed out, but with plenty of "ap" work in the cache when I ran it recently, it hung on the first line. I looked over the source and I am not sure why... it should break out either to terminate or continue. However this was a one-off: I launched BOINC and retried a few minutes later and it worked (moved all of one work unit... bloody GUPPIs. :^p)

So if anyone else has this, break out of it with Ctrl+c and please let me know if a retry doesn't fix it a bit later. I'll be having a look at this. Right now unfortunately it's bedtime!

Now that the AP ghods have rained upon me, I get an error, as follows:

Mr. Kevvy's GUPPI Rescheduler v0.51 - (c)2016 Kevin Dorner

Reading configuration files...
This machine appears attached to SETI@Home Beta but warning overriden.
Error: could not determine CPU version_num from client_state. Nothing changed.


Removed the -b option, with same result:

Mr. Kevvy's GUPPI Rescheduler v0.51 - (c)2016 Kevin Dorner

Reading configuration files...
Warning: This machine appears to be attached to SETI@Home Beta.
If there are active Beta workunits, this program may reassign them improperly,
as they look like SETI@Home work. OK to proceed (Y to continue)?Y
Error: could not determine CPU version_num from client_state. Nothing changed.


No hang, so ctl-C not needed to get out of this ... Cold start on the box didn't change this either.
Something not happy in AP check world :)

Anything I can look at or send your way to help swat this?
Later, Jim ...
ID: 1812104
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.