GUPPI Rescheduler for Linux and Windows - Move GUPPI work to CPU and non-GUPPI to GPU
Author | Message |
---|---|
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
run 0.5 ...with no issues As I read the code, there are at least 2 bugs: (1) AstroPulse work will not be identified, because "ap" is searched for in the wrong place/file (so the previous bug is not fixed); (2) the program may enter an infinite loop (hang) if the string "ap" is found (despite being searched for in the wrong place). - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
run 0.5 ...with no issues Haven't experienced either, but I did read your notes and thanks for the heads-up! |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
Sigh... version 0.51 uploaded... please update and my apologies. Thank you BilBg for catching my bonehead errors. I shouldn't code on a Saturday night even when sober. :^p Or maybe at any other time... lol. Maybe I can finally sleep now. Happily my response time is now much faster. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Sigh... version 0.51 uploaded... please update and my apologies. Strictly cosmetic, and maybe I'm missing something, but remember the (*ux) LF vs (Dos/Win) CR/LF bit in your readme-s. Seems to me I had an old ADDLF program around here that added the missing bit, but it's been 20 years since I last thought about it. Off to update the .exe x 5 :) and thanks again ... |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Just out of curiosity: I searched my client_state.xml (279,698 bytes) to estimate the probability of the two bugs combining to hang the (now old v0.5) program: find /c "ap" client_state.xml gives a count of 321; find /c "ap_" client_state.xml gives a count of 10. If randomly "poking" in client_state.xml: 321 / 279698 = 0.0011 = 0.11 % chance to find "ap" and hang (about 1 per 1000 runs). "ap" is found in many words inside client_state.xml: apic, swap, app, api. (Yes, it was not supposed to poke client_state.xml for the short "ap" string.) 10 / 279698 = 0.000036 = 0.0036 % chance to find "ap_". So if using "ap_": ~30 times less chance for the bug to show itself (would have been about 1 per 30000 runs). - ALF - "Find out what you don't do well ..... then don't do it!" :) |
I3APR Send message Joined: 23 Apr 16 Posts: 99 Credit: 70,717,488 RAC: 0 |
Hello, a couple of days ago I started GUPPIRescheduler and it did a fine job swapping 36 WUs between CPU<->GPU. Today I did it again: 1) Stopped BOINC, and waited about 1 minute 2) Backed up client_state.xml 3) Started GUPPIRescheduler, here's the output: C:\ProgramData\BOINC>GUPPIRescheduler.exe Mr. Kevvy's GUPPI Rescheduler v0.4 - (c)2016 Kevin Dorner Reading configuration files... Found sched_request GPU platform=windows_intelx86 app_version=1 version_num=812 CPU platform=windows_intelx86 app_version=3 version_num=800 and GPU plan_class=opencl_nvidia_SoG Searching for and moving work units in client state... Writing updated configuration client_state.xml... Done: 57 non-GUPPI workunits moved to GPU and 57 GUPPI workunits moved to CPU. 4) Started BOINC again, after about 10/15 seconds. Now all I've got is the popup window with "Communicating with Boinc Client. Please wait.." Task Manager shows some activity (16 tasks), while Afterburner shows that only two GPUs out of five are working: it has been in this state for 10 minutes!! UPDATE: I clicked on "cancel" and I have lost the connection with the project. This is a disaster, since this critter was working with a RAC of about 73.000! I have now tried to shut down BOINC and return to the saved client state, but some process is preventing it. I'm rebooting the server. Hope I'm not saying Bye Bye to WOW... :-( A |
I3APR Send message Joined: 23 Apr 16 Posts: 99 Credit: 70,717,488 RAC: 0 |
Ok, after reboot, it seems that now BOINC is working, but I've lost about 5 WUs ( self-aborted) and had 14 WUs crashed : I don't know...last time it worked well, and btw I'm grateful to Mr. Kevvy and everyone else helping with side scripts/programs to help us, but let me say I've lost about 1000/1500 credits, plus had all the GPU jobs reset to zero, so I'm a bit reluctant to run it again in the future... :-( A. |
Stubbles Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0 |
Hey A! In Mr Kevvy's thread, please see his last post from 2 days ago: https://setiathome.berkeley.edu/forum_thread.php?id=79954&postid=1810926 He writes at the end: Happily my response time is now much faster. so you might want to post a link to this thread in his thread (since he probably is "Subscribed" to his own thread and could get an auto-email sent when there is a new post...if he set his forum preferences that way). Also, you could try sending him a private message (PM) in case he doesn't get an immediate email notification (or even worse: he might not even be "subscribed" to his own thread ...cuz it's a forum bug, since you don't get automatically subscribed to your own thread!) Hope that helps a bit, RobG |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Works perfectly here. I put it in a scheduled task and it runs automatically every 6 hours on 3 different hosts, with no errors reported. Something else must be happening on your side. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
Mr. Kevvy's GUPPI Rescheduler v0.4 Current one is v0.51 and any work unit loss should be resolved. Download. If BOINC Manager hangs and doesn't display anything when it's started, it means that BOINC wasn't quit for long enough before GUPPIRescheduler was run, so its files were still in use. Quit it, making sure to check the box to quit running apps, wait at least ten seconds (on Linux this is when the command prompt cursor stops flashing... a handy timer), run GUPPIRescheduler again (as it did nothing the last time with the files open), and relaunch BOINC. Edit: I would also suggest keeping any discussion of it in its own thread, to keep the board tidier without numerous threads about the same app. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Mr. Kevvy, I thought I recalled earlier that there was a potential issue with APs, but iirc 0.51 solved that, correct? Reason I'm asking is that I'm doing some development on Stubbles' script, and have tentatively added a flag to not invoke your rescheduler if there are APs present. I have tested here quite a bit, and not seen any issue with this. One thing I'm also spending a good bit of time on is ensuring, via tasklist queries, that I know what is and isn't running when. Hopefully, that will prevent issues like those mentioned here earlier. By default, I am not going to shut down BOINCTasks, though again I've added a command line option to do that if the user wishes. Do you see any reason BOINCTasks needs to be down while you're running? Finally, not sure if this is intentional, but the "Y" to proceed in 0.51 is case sensitive. That got me a couple of times: when I wasn't looking closely, I thought it had run and it hadn't, due to my "y" instead of "Y" response. Dunno if you're willing to Case that? Thanks!! |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I've run the 0.51 rescheduler now several times on all three crunchers that have had AP work on board and haven't ghosted any tasks. Looks like the bug is squashed to me. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
Finally, not sure if this is intentional, but the "Y" to proceed in 0.51 is case sensitive. Not sure if you intended that. That got me a couple times, when I wasn't looking closely, thought it had run and it hadn't due to my "y" instead of "Y" response. Dunno if you're willing to Case that? Being a newly-"Minted" Linux junkie I'm used to excessive case sensitivity and wondered if anyone would be thrown by that in Windows. I'll add that into the next version... assuming there is one! :^) I've run the 0.51 rescheduler now several times on all three crunchers that have had AP work on board and haven't ghosted any tasks. Looks like the bug is squashed to me. Excellent... I can sleep better now. ;^) Thank you for your feedback as it was quite helpful. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
I've run the 0.51 rescheduler now several times on all three crunchers that have had AP work on board and haven't ghosted any tasks. Looks like the bug is squashed to me. Sorry, Mr. Kevvy, I should have provided feedback as well. Have been running on all 5 of my crunchers since you released 0.51. Plenty of home runs, no hits and no errors:) Across the 5 boxes, RAC up 6k since launch. That's what prompted me to get interested in working with Stubbles to further develop his batch file to manage this. Again, thanks! |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Being a newly-"Minted" Linux junkie I'm used to excessive case sensitivity and wondered if anyone would be thrown by that in Windows. I'll add that into the next version... assuming there is one! :^) Old SCO/VXWorks guy here, though I haven't looked at it in 20+years. No worries about the case itself, problem is that any input other than "Y" results in program termination that could be missed when you're not looking closely. Worth it to strip the case just to eliminate the ambiguity and have it be more clear as to result either way, for my .02 worth:) Thanks again ... |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
You're welcome! Thanks for the nice feedback. :^) And now on to the not-so-good: I did eliminate the possible endless loop bilbg pointed out, but with plenty of "ap" work in the cache when I ran it recently, it hung on the first line. I looked over the source and I am not sure why... it should break out either to terminate or continue. However this was a one-off: I launched BOINC and retried a few minutes later and it worked (moved all of one work unit... bloody GUPPIs. :^p) So if anyone else has this, break out of it with Ctrl+c and please let me know if a retry doesn't fix it a bit later. I'll be having a look at this. Right now unfortunately it's bedtime! |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
So if anyone else has this, break out of it with Ctrl+c and please let me know if a retry doesn't fix it a bit later. I'll be having a look at this. Right now unfortunately it's bedtime! I'll beat on it and see if I can duplicate. Anything weird, I'll drop a note here. l8r |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Hm... Hi, Shaggie76. Could you look at that thread please: http://lunatics.kwsn.info/index.php/topic,1812.msg61053.html#msg61053 - do you have any explanation of those results regarding Sleep(0) and STT behavior? SETI apps news We're not gonna fight them. We're gonna transcend them. |
Luigi R. Send message Joined: 26 Nov 13 Posts: 10 Credit: 1,608,382 RAC: 0 |
Hello, this is my configuration for SETI. app_info.xml <app_info> <app> <name>setiathome_v8</name> </app> <file_info> <name>MBv8_8.05r3345_avx_linux64</name> <executable/> </file_info> <app_version> <app_name>setiathome_v8</app_name> <version_num>804</version_num> <platform>x86_64-pc-linux-gnu</platform> <cmdline></cmdline> <file_ref> <file_name>MBv8_8.05r3345_avx_linux64</file_name> <main_program/> </file_ref> </app_version> <app_version> <app_name>setiathome_v8</app_name> <version_num>805</version_num> <platform>x86_64-pc-linux-gnu</platform> <cmdline></cmdline> <file_ref> <file_name>MBv8_8.05r3345_avx_linux64</file_name> <main_program/> </file_ref> </app_version> <app> <name>setiathome_v8</name> </app> <file_info> <name>setiathome_8.01_x86_64-pc-linux-gnu__cuda60</name> <executable/> </file_info> <app_version> <app_name>setiathome_v8</app_name> <version_num>801</version_num> <platform>x86_64-pc-linux-gnu</platform> <coproc> <type>NVIDIA</type> <count>0.5</count> </coproc> <plan_class>cuda60</plan_class> <avg_ncpus>0.05</avg_ncpus> <max_ncpus>0.2</max_ncpus> <cmdline></cmdline> <file_ref> <file_name>setiathome_8.01_x86_64-pc-linux-gnu__cuda60</file_name> <main_program/> </file_ref> </app_version> </app_info> ./GUPPIRescheduler output Mr. Kevvy's GUPPI Rescheduler v0.51 - (c)2016 Kevin Dorner Reading configuration files... Found sched_request GPU platform=x86_64-pc-linux-gnu app_version=1 version_num=801 CPU platform=x86_64-pc-linux-gnu app_version=0 version_num=801 and GPU plan_class=cuda60 Searching for and moving workunits in client state... No non-GUPPI workunits are assigned to CPU to move to GPU; no changes made. Should app version be the same for CPU and GPU? Should I run cuda50? |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
And now on to the not-so-good: I did eliminate the possible endless loop bilbg pointed out, but with plenty of "ap" work in the cache when I ran it recently, it hung on the first line. I looked over the source and I am not sure why... it should break out either to terminate or continue. However this was a one-off: I launched BOINC and retried a few minutes later and it worked (moved all of one work unit... bloody GUPPIs. :^p) Now that the AP ghods have rained upon me, I get an error, as follows: Mr. Kevvy's GUPPI Rescheduler v0.51 - (c)2016 Kevin Dorner Removed the -b option, with same result: Mr. Kevvy's GUPPI Rescheduler v0.51 - (c)2016 Kevin Dorner No hang, so ctl-C not needed to get out of this ... Cold start on the box didn't change this either. Something not happy in AP check world :) Anything I can look at or send your way to help swat this? Later, Jim ... |
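
BilBg's back-of-the-envelope estimate earlier in the thread (how likely a stray match on "ap" is compared with "ap_") can be reproduced with a short Python sketch. This is not taken from GUPPIRescheduler's source: the function names `count_matching_lines` and `hit_probability` are made up for illustration, and the "random poke" model is simply BilBg's rough assumption. It counts matching lines rather than individual occurrences, because that is what Windows `find /c` reports. The numbers are the ones quoted in the thread.

```python
def count_matching_lines(text: str, needle: str) -> int:
    """Count lines that contain needle, mirroring what `find /c` reports."""
    return sum(1 for line in text.splitlines() if needle in line)


def hit_probability(matching_lines: int, file_size_bytes: int) -> float:
    """Rough model from the thread: matches divided by file size in bytes.

    This treats a 'random poke' into the file as landing on a match with
    probability matches / size. It is an estimate, not an exact figure.
    """
    return matching_lines / file_size_bytes


if __name__ == "__main__":
    # Figures quoted in the thread for one particular client_state.xml.
    size = 279_698
    p_ap = hit_probability(321, size)            # lines containing "ap"
    p_ap_underscore = hit_probability(10, size)  # lines containing "ap_"

    print(f'"ap":  {p_ap:.4%}')
    print(f'"ap_": {p_ap_underscore:.4%}')
    print(f"ratio: {p_ap / p_ap_underscore:.1f}x")
```

To check your own file, read client_state.xml into a string and pass it to `count_matching_lines` with each search string. The gap between the two counts is the point of the estimate: a short needle like "ap" also matches unrelated words such as app, api and swap.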
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.