GUPPI Rescheduler for Linux and Windows - Move GUPPI work to CPU and non-GUPPI to GPU

Author	Message
Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1810129 - Posted: 18 Aug 2016, 13:16:34 UTC - in response to Message 1810125. MB work (for me at least) seems to have completely dried up, so if you're getting errors using this app, it may mean there's simply nothing to move right now. Agreed that Arecibo work is not currently being split, but we've had no announcement to indicate whether that's permanent or temporary. And if temporary, how long for - they might restart the splitters as soon as they get into the lab, in a couple of hours' time. The problem we had last night was MB tasks being given a <version_num> and <plan_class> combination, both of which were only valid for Astropulse work - so of course there was no matching setiathome_v8 application available to process them. That could be overcome by checking the <app_name> when choosing a version_num and plan_class to be used for substitution in the rescheduled task. ID: 1810129 ·
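
A minimal sketch of the <app_name> check suggested above, not the rescheduler's actual code. The names VersionInfo, Device, VersionTable and pick_version are hypothetical, and the table contents are whatever the caller has collected from its own app versions. The point is only that the lookup is keyed on the application name as well as the target device, so an Astropulse version_num/plan_class pair can never be handed to a setiathome_v8 task.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical record of the pair substituted into a rescheduled task.
struct VersionInfo {
    int version_num;
    std::string plan_class;  // left empty for a CPU app version
};

enum class Device { CPU, GPU };

// Keyed on (app_name, device) rather than on device alone.
using VersionTable = std::map<std::pair<std::string, Device>, VersionInfo>;

// Returns the entry for this application and device, or nullptr when the
// host has no matching app version (in which case the task should be left
// where it is rather than rewritten).
const VersionInfo* pick_version(const VersionTable& table,
                                const std::string& app_name,
                                Device target) {
    const auto it = table.find(std::make_pair(app_name, target));
    if (it == table.end()) {
        return nullptr;
    }
    return &it->second;
}
```

A caller would look up the task's own <app_name> before touching it, and skip the move whenever pick_version returns nullptr.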

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80	Message 1810131 - Posted: 18 Aug 2016, 13:25:44 UTC Last modified: 18 Aug 2016, 13:26:09 UTC Does it really matter? Eric mentioned at beta a month ago that we will run out of Arecibo work. Under scientific conditions, what are a few months? With each crime and every kindness we birth our future. ID: 1810131 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1810133 - Posted: 18 Aug 2016, 13:45:25 UTC - in response to Message 1810131. Does it really matter? Eric mentioned at beta a month ago that we will run out of Arecibo work. Under scientific conditions, what are a few months? It mattered to the people who had a substantial proportion of their caches discarded, and were initially mis-directed into suspecting it was an app_info problem. Eric also said that he was looking into splitting Astropulse workunits from GBT data, so it would seem to be wise to prepare for that eventuality while it's still fresh in our minds. ID: 1810133 ·

Stubbles Volunteer tester Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0	Message 1810154 - Posted: 18 Aug 2016, 15:30:12 UTC - in response to Message 1810131. Last modified: 18 Aug 2016, 15:32:21 UTC Does it really matter? Eric mentioned at beta a month ago that we will run out of Arecibo work. Under scientific conditions, what are a few months?
My perspective is: as soon as there are two sources of data (from different telescopes), there will be a need for a DeviceQueueOptimizer (DQO) such as Mr Kevvy's GuppiRescheduler. (This wouldn't apply if the hardware were all the same, of course. But from what I have heard, few telescopes are anywhere close to being the same.)
If that assumption is correct, my vision is:
- Continue building a simple DQO platform to be integrated next year into a S@h manager (as a sidekick to the BoincManager).
- The S@h manager would also capture run times of tasks and the config being used (such as the command line, etc.). It would also make it easier for novices to make changes to certain options, such as app_config.xml.
By 2018 or later, it could integrate a small AI program that would communicate with other S@h managers (with similar hardware) over something like P2P in order to figure out what the best config is for max throughput (under certain constraints from the human pimp).
This is just a thought in progress, and the only one I've had the beginning of a convo with is Jimbocous, since he shared his small farm mgt vision before me. I can't seem to find that thread though.
Cheers, Rob ;-)
PS: If there's an important variable that I haven't identified, please let me know so that I can evolve this thought-in-progress. ID: 1810154 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1810156 - Posted: 18 Aug 2016, 15:34:57 UTC - in response to Message 1810059. Thanks... I will have a look at that. Will have time this weekend to check it over in detail (thankfully no plans for a change.) One other thing that I think I noticed, and that you might want to consider when you're digging back into your code, is that it appeared to me that all blc/guppi tasks are treated as VLARs and eligible for relocation to the CPU. However, I recall from the MESSIER031 files that were processed back in April, that guppis can, on rare occasions, actually be normal AR tasks, too. And since it appears that they're on the verge of splitting another batch of MESSIER031 files, there probably wouldn't be anything to gain in moving such non-VLAR guppis to the CPU. In fact, if any of those get originally assigned to a CPU, they would benefit from a move to a GPU. So, I guess what I'm saying is that it would probably be better to key on "vlar" in the task name, rather than "blc", in determining what tasks to move where. ID: 1810156 ·
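
The suggestion above can be sketched roughly as follows. This is an illustration only: is_vlar and preferred_device are hypothetical names, and the assumption that a VLAR task always has "vlar" somewhere in its name comes from the post, not from any BOINC documentation.

```cpp
#include <string>

enum class Target { CPU, GPU };

// Assumption from the post: VLAR tasks are marked by "vlar" in the task
// name, whether or not the name also starts with "blc".
bool is_vlar(const std::string& task_name) {
    return task_name.find("vlar") != std::string::npos;
}

// VLAR work is steered to the CPU and everything else to the GPU, so a
// normal-AR guppi is treated like any other non-VLAR task.
Target preferred_device(const std::string& task_name) {
    return is_vlar(task_name) ? Target::CPU : Target::GPU;
}
```

With this test a "blc" task without the "vlar" marker stays on (or moves to) the GPU, which is the behaviour the post argues for.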

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1810163 - Posted: 18 Aug 2016, 15:47:48 UTC - in response to Message 1810097. Is it impossible for a 32-bit program to access and edit client_state.xml on 64-bit Windows? No, I have my own VLAR rescheduler that runs as a 32-bit program on all my boxes, both 32-bit and 64-bit. The one tricky thing, that took me a while to figure out, is that Registry access (to retrieve the DATADIR and INSTALLDIR) on the 64-bit machines requires an extra bit of code to overcome the "redirection" of Registry keys that normally prevents a 32-bit program from accessing many of the entries stored by a 64-bit program, in this case BOINC. ID: 1810163 ·

Stubbles Volunteer tester Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0	Message 1810166 - Posted: 18 Aug 2016, 15:58:02 UTC - in response to Message 1810156. Last modified: 18 Aug 2016, 15:59:05 UTC As a side note for future development: with BoincTasks when I look at the "Time Left" column of the tasks in the GPU queue, I can sort-of-tell which ones will be shorties; they have a "Time Left" of ~12mins instead of the regular 30+mins. My guess is that it would be better to run these on the CPU because of the loading times involved in running them on the GPU. Has anyone else seen what I am noticing? I remember in early July, I ran all the shortest "Time Left" tasks and about 90% were shorties and the others were those that take half the avg time. I didn't even think back then of running them on the CPU to see what happens. If anyone thinks this is another subset of tasks that is better run on a certain device, let me know and I'll move these to the CPU queue to see whether they are actually just as fast on that device. ID: 1810166 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1810174 - Posted: 18 Aug 2016, 16:33:06 UTC - in response to Message 1810166. As a side note for future development: with BoincTasks when I look at the "Time Left" column of the tasks in the GPU queue, I can sort-of-tell which ones will be shorties; they have a "Time Left" of ~12mins instead of the regular 30+mins. You can also tell the true 'shorties' (VHAR) by their deadline - 21 days from issue, so this morning's issued shorty tasks had a deadline of 07 September. Note that Astropulse tasks, as and when available, have a different and much more onerous ratio of 'work to be done' against deadline. ID: 1810174 ·

Stubbles Volunteer tester Send message Joined: 29 Nov 99 Posts: 358 Credit: 5,909,255 RAC: 0	Message 1810175 - Posted: 18 Aug 2016, 16:51:45 UTC - in response to Message 1810174. Last modified: 18 Aug 2016, 16:56:50 UTC As a side note for future development: with BoincTasks when I look at the "Time Left" column of the tasks in the GPU queue, I can sort-of-tell which ones will be shorties; they have a "Time Left" of ~12mins instead of the regular 30+mins. You can also tell the true 'shorties' (VHAR) by their deadline - 21 days from issue, so this morning's issued shorty tasks had a deadline of 07 September. Note that Astropulse tasks, as and when available, have a different and much more onerous ratio of 'work to be done' against deadline. Very cool! Thanks, Richard. I had hidden the "Deadline" column since I thought they were simply ludicrous values. That'll teach me to hide fields! lol Should I bother testing shorties on CPU vs GPU? Also, if Arecibo tasks disappear during the next few months, are there any subsets of tasks from Green Bank that could see a *much smaller benefit* to run on GPU or CPU? ID: 1810175 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1810181 - Posted: 18 Aug 2016, 17:38:09 UTC - in response to Message 1810175. I believe - though this is where we really miss Joe Segur - that VHAR shorties, when processed on CPU, place particularly heavy stress on the memory subsystems. So, if you have a modern CPU with good on-die L2 and L3 cache (backed up by good system memory, of course), then your CPU will like them. Poor memory support will slow them significantly. This was particularly a problem with my E5320 dual Xeon workstation when it was originally supplied in dual-channel memory mode: increasing the memory to quad-channel was a noticeable help. Shorties were also loved by NVidia GPUs during the original v6 processing days. The additional processing added for v7 and now v8 has hit them particularly hard. My rule of thumb has been: on a good CPU, shorties run in about a third of the time of mid-AR work with the same application. On an NV GPU in the days of v6 processing, they ran in a quarter of the time. Now, with the current v8 processing, they take half the time. Just generalisations, and I have no experience with AMD CPUs or ATi GPUs: but an attempt to explain why the only one-word answer is - [it] depends. ID: 1810181 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380	Message 1810182 - Posted: 18 Aug 2016, 17:54:09 UTC My estimates for AMD CPUs suggest a similar ratio to yours, Richard. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1810182 ·

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 1810888 - Posted: 21 Aug 2016, 2:08:01 UTC Last modified: 21 Aug 2016, 2:15:52 UTC Version 0.5 uploaded, which should address the AP work unit issue... not a lot of AP work right now so it's hard to verify with live data. Bonus: Added --b (Linux) and -b (Windows) command line switch to skip the warning that SETI Beta is present; helpful for automating GUPPIRescheduler, e.g. with Stubbles' script. I would have liked to work on other things, e.g. the AMD variation that reverses the reassignment direction, but I have to get the fix posted ASAP. I'll be watching this thread now that an issue was noted, so please post for any issues and PM me if there's no response. If any problems are noted I should now be able to resolve them the same day. ID: 1810888 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1810891 - Posted: 21 Aug 2016, 2:16:22 UTC - in response to Message 1810888. Thanks for the update Mr. Kevvy. I have one AP GPU task currently on one machine that prevents me from rescheduling. Good chance to try the new app out. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1810891 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1810896 - Posted: 21 Aug 2016, 2:33:03 UTC - in response to Message 1810891. Good chance to try the new app out. You can try it offline first (in new empty directory): http://setiathome.berkeley.edu/forum_thread.php?id=80141&postid=1810871#1810871 - ALF - "Find out what you don't do well ..... then don't do it!" :) ID: 1810896 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1810897 - Posted: 21 Aug 2016, 2:38:08 UTC - in response to Message 1810888. Bonus: Added --b (Linux) and -b (Windows) command line switch to skip the warning that SETI Beta is present Did you consider adding some code to search only from the start to the end of the SETI@home section (and not the entire client_state.xml)? http://setiathome.berkeley.edu/forum_thread.php?id=79954&postid=1807421#1807421 - ALF - "Find out what you don't do well ..... then don't do it!" :) ID: 1810897 ·
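
One way the question above could be approached, as a sketch under stated assumptions rather than a description of GUPPIRescheduler. It assumes that client_state.xml lists each project as a <project> block containing a <master_url>, and that everything up to the next <project> belongs to that project; project_section is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Returns the half-open range [begin, end) of the part of `state` that
// belongs to the project whose <master_url> starts with `url`.
// Both values are std::string::npos when the project is not present.
std::pair<std::size_t, std::size_t> project_section(const std::string& state,
                                                    const std::string& url) {
    const std::size_t npos = std::string::npos;
    const std::size_t url_pos = state.find("<master_url>" + url);
    if (url_pos == npos) {
        return {npos, npos};
    }
    // Walk back to the opening tag of the block that holds this URL.
    std::size_t begin = state.rfind("<project>", url_pos);
    if (begin == npos) {
        begin = url_pos;
    }
    // The section ends where the next project starts, or at end of file.
    std::size_t end = state.find("<project>", url_pos);
    if (end == npos) {
        end = state.size();
    }
    return {begin, end};
}
```

Every later find or rfind would then be limited to that range, so tasks from SETI Beta or any other attached project are never examined.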

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1810898 - Posted: 21 Aug 2016, 2:41:00 UTC - in response to Message 1810891. Of course, only one task and a GPU type to boot, but I ran the new rescheduler and no errors and no dumped task. I think the true test is with AP CPU tasks as I believe I ran the old beta rescheduler on two machines earlier today without realizing there were AP GPU tasks on board and I didn't dump tasks. Either I got lucky or there is more to the failure mechanism than simply whether there is any kind of AP work on board. So I don't think the true test will come until there is both AP CPU and GPU work on board. If it moves GUPPIs and non-VLARs successfully without dumping work when both types of AP tasks are on board, I think we can call the new beta a success. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1810898 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1810907 - Posted: 21 Aug 2016, 3:24:34 UTC - in response to Message 1810898. Either I got lucky or there is more to the failure mechanism than simply whether there is any kind of AP work on board. Mr. Kevvy would have to confirm, but from what I originally saw, an AP task (CPU or GPU) would have to have been the first (i.e., oldest) non-CPU task in your queue (following at least one CPU task) in order to trigger the problem. If any other type of GPU task (i.e., a task with a plan class) filled that spot, everything was fine. That's why it wasn't consistent across your machines. A true test would only come when that condition is met again, so watch to see where your current AP task is in your queue and try the Rescheduler when it bubbles up near the top with nothing but CPU tasks ahead of it. Of course, now the guppis are drying up, so maybe you won't have anything to reschedule for a while, anyway. ;^) ID: 1810907 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1810908 - Posted: 21 Aug 2016, 3:32:49 UTC - in response to Message 1810888. but I have to get the fix posted ASAP. That part of the code looks strange to me:
    tempposition = sched_request.rfind("<name>", currentposition) + 6; // Find backwards to eliminate picking up "<name>ap" which is an AstroPulse version number
    if ( client_state.substr(tempposition, 2) != "ap" ) {
You find/set tempposition in sched_request and then check what is at that tempposition in client_state (?)
- ALF - "Find out what you don't do well ..... then don't do it!" :) ID: 1810908 ·
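
The mismatch described above (an offset found in one buffer, then used to read another) can be made harder to write by keeping the search and the read inside one helper. This is only a sketch: name_before and is_astropulse_name are hypothetical names, and the two-character "ap" prefix test simply mirrors the check in the quoted snippet.

```cpp
#include <cstddef>
#include <string>

// Returns the text of the last <name>...</name> element that opens at or
// before `pos` in `xml`, or an empty string if there is none. The offset
// and the buffer it indexes are always the same string.
std::string name_before(const std::string& xml, std::size_t pos) {
    const std::string open = "<name>";
    std::size_t start = xml.rfind(open, pos);
    if (start == std::string::npos) {
        return "";
    }
    start += open.size();
    const std::size_t end = xml.find("</name>", start);
    if (end == std::string::npos) {
        return "";
    }
    return xml.substr(start, end - start);
}

// Same "ap" prefix check as in the quoted code, applied to the name itself.
bool is_astropulse_name(const std::string& name) {
    return name.compare(0, 2, "ap") == 0;
}
```

Because the helper also checks for std::string::npos before adding the tag length, a failed rfind cannot silently turn into a small positive offset.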

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1810913 - Posted: 21 Aug 2016, 4:07:12 UTC - in response to Message 1810888. Last modified: 21 Aug 2016, 4:58:13 UTC I just noticed this program can't be used with BOINC 6.x. This is from my sched_request_setiathome.berkeley.edu.xml on BOINC 6.10.58:
    <other_result>
        <name>07my10aa.24998.22181.15.42.119_1</name>
        <plan_class>opencl_ati5_sah</plan_class>
    </other_result>
    <other_result>
        <name>14mr09ac.1350.7434.13.40.186_2</name>
        <plan_class>opencl_ati5_sah</plan_class>
    </other_result>
    </other_results>
As you can see, there is no </app_version> here, so on BOINC 6.x these lines will not find tasks:
    currentposition = sched_request.find("</app_version>\n </other_result>", currentposition); // No <plan_class> between these two lines indicates a CPU workunit
    currentposition = sched_request.find("</app_version>\n <plan_class>"); // Now do the same as the block above but for app_versionGPU
Also, the second line fails to start at currentposition (infinite loop if "ap" is found). It should be:
    currentposition = sched_request.find("</app_version>\n <plan_class>", currentposition);
EDIT: Some protection against infinite loops could be:
    #define MAXLOOP 10000
    currentposition = 0;
    endloop = 0;
    while (currentposition >= 0 && endloop < MAXLOOP) {
        endloop++;
        ...
- ALF - "Find out what you don't do well ..... then don't do it!" :) ID: 1810913 ·
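
A small sketch of the loop shape the post is asking for, under the assumption that the positions are std::string offsets. find_all and max_hits are hypothetical names. Two details matter: each search resumes past the previous hit, and the loop tests against std::string::npos. If currentposition were an unsigned std::string::size_type, a test such as currentposition >= 0 would always be true, so the iteration cap would be the only thing ending the loop.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects the offset of every occurrence of `marker` in `text`.
// `max_hits` is a safety cap in the spirit of the MAXLOOP idea above.
std::vector<std::size_t> find_all(const std::string& text,
                                  const std::string& marker,
                                  std::size_t max_hits = 10000) {
    std::vector<std::size_t> hits;
    if (marker.empty()) {
        return hits;  // an empty marker would match at every offset
    }
    std::size_t pos = text.find(marker);
    while (pos != std::string::npos && hits.size() < max_hits) {
        hits.push_back(pos);
        // Resume after the current hit so the same match is never found twice.
        pos = text.find(marker, pos + marker.size());
    }
    return hits;
}
```

With forward progress guaranteed on every iteration, the cap should never be reached on well-formed input and acts purely as a backstop.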

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1810914 - Posted: 21 Aug 2016, 4:37:43 UTC So, I've run 0.5 on all 5 crunchers here, with no issues. Thanks! ID: 1810914 ·
