OK, BOINCers, riddle me this

Author	Message
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1105258 - Posted: 11 May 2011, 19:52:56 UTC I have two huge anomalies running 6.10.56 on my machine "unimatrix001" and 6.10.58 on my machine "BOINCBox" (both have essentially the same hardware). I am running the optimized apps. 1) On both machines, the servers give S@H Enhanced (CPU) an Average Processing Rate of about 9, and AP (CPU) an APR of about 15. However, unimatrix001 gets an APR for GPU of ~74, and BOINCBox gets an APR of ~24. How can this be? (The machines have both processed about 45,000 CUDA WUs according to the display). 2) On unimatrix001, I have it running a CUDA WU on High Priority now with a due date of 6/12 (yes, June) despite the fact that I have many, many CUDA WUs due before that one (literally hundreds due before then). Again, how can this be? ID: 1105258 ·

skildude Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60	Message 1105260 - Posted: 11 May 2011, 20:05:58 UTC - in response to Message 1105258. APR, that's new. Please describe/define the quantity: is that WUs, or time, or something else? The computers are clearly not working on the same WUs, so there's a very good chance that one got a crapload of shorties while the other was running midrange WUs. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope ID: 1105260 ·

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1105267 - Posted: 11 May 2011, 20:52:50 UTC - in response to Message 1105258. Have you got flops values in the app_info's? Because they'll make a difference to the final Average Processing Rate. Claggy ID: 1105267 ·

Fred J. Verster Volunteer tester Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 1105272 - Posted: 11 May 2011, 21:03:21 UTC - in response to Message 1105267. Yep, it's better to remove all FLOPS entries; I don't know if they are used server side. Apart from that, the FLOPS of GPUs are hard to measure, and S.P. gives higher FLOPS than D.P., and not even linearly. ID: 1105272 ·

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1105279 - Posted: 11 May 2011, 21:34:04 UTC Skildude: Average Processing Rate is something the servers calculate (I don't know how). If you go to the Account page -> Computers on This Account -> Details -> Application Details, you will see it. Also, processing 40K or more WUs each should even out the WU properties for the two machines, I would think. Claggy: I have no FLOPS in my app_info.xml on either machine. So I am mystified. Any more ideas out there? Any idea what that number (APR) is used for? ID: 1105279 ·

Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0	Message 1105283 - Posted: 11 May 2011, 21:46:11 UTC - in response to Message 1105258. I have two huge anomalies running 6.10.56 on my machine "unimatrix001"... Where did you get a Borg computer? Did you assimilate it? ;-) Gruß, Gundolf ID: 1105283 ·

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1105296 - Posted: 11 May 2011, 22:45:11 UTC - in response to Message 1105279. Skildude: Average Processing Rate is something the servers calculate (I don't know how). If you go to the Account page -> Computers on This Account -> Details -> Application Details, you will see it. Also, processing 40K or more WUs each should even out the WU properties for the two machines, I would think. Claggy: I have no FLOPS in my app_info.xml on either machine. So I am mystified. Any more ideas out there? Any idea what that number (APR) is used for? The servers keep an average called "et" for each application version, which is simply a declining exponential average of validated task run times divided by the original rsc_fpops_est produced by the splitters. After a fast adaptation during the first 20 validated tasks, the average is adjusted by 0.01x the difference between a new task ratio and the old average, so it can change by no more than 1% in response to each validation. That rate of change is glacial for hosts which do less than 1 task per day, but fairly quick for the most productive hosts. It could theoretically change by slightly more than 50% as the result of 70 validations. APR is simply the inverse of "et" scaled down by 1e9. It is in that form simply because humans seem to think bigger is better. That "et" average is not affected by <flops> settings, etc., but it is combined with them to generate the scaling factor used for adjusting the rsc_fpops_est and rsc_fpops_bound values for tasks sent to a host. The rsc_fpops_est values the mb_splitters produce are based on the relative speeds we measured in the Estimates and Deadlines revisited thread. 
Those measurements were for CPU processing, and looking at the amount of variability of individual CPU host times against the estimate curve you can see the estimates are likely to be off considerably for any specific task on any specific host. For GPU processing the curve certainly is a worse fit. Things like result_overflow tasks which complete much faster than estimated, or tasks of fairly low angle range which take relatively longer on CUDA, get no special handling in that averaging. If distribution of tasks to a host could be evenly spread across the various "tape" files being split, the averaging might not be affected much. But mb_splitter processes are very bursty: they take in 107.37 seconds of raw 2.5 MHz sample data (256 Mebisamples), do the needed FFTs to split it into 256 subbands, then output 256 WUs in a burst, which causes generation of 512 tasks at the end of the "Results ready to send" queue. IOW, what's put in the queue is largely blobs of similar tasks when the mb_splitters are active, leavened somewhat by the fairly constant production of AP tasks and reissues of both kinds. When a host asks for a large amount of work at a time when one of those blobs has reached the front of the queue, it may get many tasks all from a single recording time. The effect on the average would be really bad if it were calculated immediately when a result is reported; the variability of when a wingmate reports helps spread the effect, but a host running a large enough cache that it is usually second to report is more affected. RAC (a 7 day half life) is more stable than APR (a 69 validated task half life) for hosts which do more than 10 tasks a day. Joe ID: 1105296 ·
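
A minimal sketch of the averaging Joe describes, in Python. The function and constant names are hypothetical and are not taken from the BOINC server code. The plain running mean used for the first 20 tasks is an assumption: the post only says adaptation is fast during that window.

```python
import math

ALPHA = 0.01        # per-validation weight, as stated in the post
WARMUP_TASKS = 20   # fast-adaptation window, as stated in the post


def update_et(et, n_validated, elapsed_seconds, rsc_fpops_est):
    """Return the new "et" average after one more validated task.

    n_validated is the number of tasks already folded into et.
    """
    ratio = elapsed_seconds / rsc_fpops_est
    if n_validated < WARMUP_TASKS:
        # Assumption: a plain running mean during the warm-up window.
        return et + (ratio - et) / (n_validated + 1)
    # Afterwards, move 0.01x the difference toward the new task's ratio.
    return et + ALPHA * (ratio - et)


def apr_gflops(et):
    """APR is the inverse of et, scaled down by 1e9."""
    return 1.0 / (et * 1e9)


def half_life_tasks(alpha=ALPHA):
    """Validations needed for an old sample's weight to fall to one half."""
    return math.log(0.5) / math.log(1.0 - alpha)


def max_shift_fraction(n, alpha=ALPHA):
    """Fraction of the gap to a new steady ratio that is closed after n validations."""
    return 1.0 - (1.0 - alpha) ** n
```

Under these assumptions, half_life_tasks() comes out just under 69 validations, in line with the 69 validated task half life quoted for APR, and max_shift_fraction(70) is slightly above 0.5, matching the "slightly more than 50%" figure.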

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1105340 - Posted: 12 May 2011, 3:58:26 UTC - in response to Message 1105283. I have two huge anomalies running 6.10.56 on my machine "unimatrix001"... Where did you get a Borg computer? Did you assimilate it? ;-) Gruß, Gundolf No - it assembled itself, like all good automata do. ID: 1105340 ·

Miep Volunteer moderator Joined: 23 Jul 99 Posts: 2412 Credit: 351,996 RAC: 0	Message 1105379 - Posted: 12 May 2011, 10:55:16 UTC - in response to Message 1105258. 1) On both machines, the servers give S@H Enhanced (CPU) an Average Processing Rate of about 9, and AP (CPU) an APR of about 15. However, unimatrix001 gets an APR for GPU of ~74, and BOINCBox gets an APR of ~24. How can this be? (The machines have both processed about 45,000 CUDA WUs according to the display). I suppose Joe wants to say that a large batch of particular tasks going to one machine and not to the other makes it entirely possible for the APRs to be that far apart. 2) On unimatrix001, I have it running a CUDA WU on High Priority now with a due date of 6/12 (yes, June) despite the fact that I have many, many CUDA WUs due before that one (literally hundreds due before then). Again, how can this be? Enable cpu_sched_debug and rr_simulation in the log flags part of cc_config, get a very large, unreadable debugging output, and come to the conclusion that your BOINC version thinks that if it does that task now, it will finish most/all of them in time. Probably your DCF got misaligned by the shorty storm and the runtime estimates are way off. Unless you run a far too large cache it shouldn't be a problem. Carola ------- I'm multilingual - I can misunderstand people in several languages! ID: 1105379 ·
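
For reference, the two log flags named above go in the <log_flags> section of cc_config.xml. The sketch below builds that fragment with Python's standard library. The helper name build_cc_config is hypothetical, and where the file is saved (normally the BOINC data directory) and how the client is told to re-read it depend on the installation.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags=("cpu_sched_debug", "rr_simulation")):
    """Return a cc_config.xml fragment that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is its own element; "1" turns it on.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config())
```

Both flags are verbose, so it is probably worth switching them off again once the scheduling decision has been captured in the log.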

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1105382 - Posted: 12 May 2011, 11:36:15 UTC - in response to Message 1105379. Unless you run a far too large cache it shouldn't be a problem. I run an 8 or 10 day cache, but I don't play games with my connect status to make it more than that; I am always connected. That wouldn't seem to be too much in light of the one month lead time on that "High Priority" task, would it? It still seems rather bizarre to have that occur. ID: 1105382 ·

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1105384 - Posted: 12 May 2011, 12:00:15 UTC - in response to Message 1105258. Looking at your two hosts, I see a few differences. One has Vista x64 and one has XP x64; I'd expect the XP machine to be faster and have a higher RAC, and it does. The MB GPU APR figures show 54.45 for the XP machine and 30.44 for the Vista machine, though I would have expected them to be closer. Looking at a few tasks on each machine, I see the Vista machine's tasks get restarted from checkpoints while the XP machine's don't (of the tasks I looked at). I suspect one host gets used by you and has either 'Suspend work while computer is in use?' or 'Suspend GPU work while computer is in use?' enabled, and the other host doesn't get used by you so often, so it doesn't get interrupted. The thing about GPU computation is that if the task gets suspended, it is unloaded from memory and you lose the computation since the last checkpoint; while the computer is being used the GPU is idle, so APR will be lower. The XP host also has over twice as many tasks in its list, suggesting it has had far more shorties than the Vista host. All these differences add up: resultid=1905362466 on XP host, run time on AR 1.51 WU 487.52 secs; resultid=1905416279 on Vista host, run time on AR 1.51 WU 630.56 secs (and it was interrupted for how long?). Claggy ID: 1105384 ·

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1105575 - Posted: 13 May 2011, 0:25:48 UTC - in response to Message 1105384. Looking at your two hosts, I see a few differences. One has Vista x64 and one has XP x64; I'd expect the XP machine to be faster and have a higher RAC, and it does. The MB GPU APR figures show 54.45 for the XP machine and 30.44 for the Vista machine, though I would have expected them to be closer. Looking at a few tasks on each machine, I see the Vista machine's tasks get restarted from checkpoints while the XP machine's don't (of the tasks I looked at). I suspect one host gets used by you and has either 'Suspend work while computer is in use?' or 'Suspend GPU work while computer is in use?' enabled, and the other host doesn't get used by you so often, so it doesn't get interrupted. The thing about GPU computation is that if the task gets suspended, it is unloaded from memory and you lose the computation since the last checkpoint; while the computer is being used the GPU is idle, so APR will be lower. The XP host also has over twice as many tasks in its list, suggesting it has had far more shorties than the Vista host. All these differences add up: resultid=1905362466 on XP host, run time on AR 1.51 WU 487.52 secs; resultid=1905416279 on Vista host, run time on AR 1.51 WU 630.56 secs (and it was interrupted for how long?). Claggy Actually, neither is used by me, except to check in occasionally; I do not have "suspend while in use" set. Because the TDCF on both machines fluctuates quite a bit (from .5 to 1.4 over a day or two), the machines download lots of tasks at times; the XP machine recently had close to 3K tasks active or waiting in BOINC, while the Vista machine had fewer than 1500. The "restarts" are precisely because the damn machine has been screwing around with high priority on various WUs. It currently has 27 tasks active (some Running, some Running in High Priority, some Waiting to Run). 
THAT's what I was questioning!!! Some of this has to do with rescheduling VHARs to CPU, but I did that recently on BOTH machines because I was running out of CPU work. ID: 1105575 ·
