Daily midnight traffic peaks

Message boards : Number crunching : Daily midnight traffic peaks
Rudy
Volunteer tester

Joined: 23 Jun 99
Posts: 189
Credit: 794,998
RAC: 0
Canada
Message 872049 - Posted: 4 Mar 2009, 15:05:50 UTC - in response to Message 872044.  

Since both these hosts are running stock, should BOINC perhaps have a feature that re-issues the apps when a computer loses enough quota?
ID: 872049
kittyman (Crowdfunding Project Donor; Special Project $75 and $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51445
Credit: 1,018,363,574
RAC: 1,004
United States
Message 872085 - Posted: 4 Mar 2009, 16:59:44 UTC - in response to Message 872049.  

Since both these hosts are running stock, should BOINC perhaps have a feature that re-issues the apps when a computer loses enough quota?

That is not a bad idea... at least for those running the stock apps.
BOINC should be able to send a code back to the servers indicating that the stock app is corrupt or missing, and initiate a reload of the app.
It would have to recognize the presence of an app_info file (indicating that the user is running some kind of optimized app) and take no action in that case, since the use of optimized apps is the user's responsibility.
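A minimal sketch of that check, in Python. This is hypothetical logic with made-up names (`stock_app_status`, `expected_md5`), not anything the real BOINC client does; it just illustrates the rule proposed above: skip entirely when app_info.xml is present, otherwise verify the stock app against a known checksum.

```python
import hashlib
import os

def stock_app_status(project_dir, app_file, expected_md5):
    """Sketch of the reload idea from this thread (hypothetical, not
    actual BOINC client code): if the user supplies app_info.xml they
    are running an optimized app, so take no action; otherwise compare
    the stock app against its known MD5 and request a re-download if
    it is missing or corrupt."""
    if os.path.exists(os.path.join(project_dir, "app_info.xml")):
        return "user app_info present: no action"
    path = os.path.join(project_dir, app_file)
    if not os.path.exists(path):
        return "stock app missing: request re-download"
    with open(path, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()
    if actual != expected_md5:
        return "stock app corrupt: request re-download"
    return "stock app ok"
```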

Another thought, either in addition to that or instead of it, would be to have the SETI servers send an automated email to the user's registered account after a certain percentage of client errors has been returned, saying something to the effect of: "Your computer ID XXXXXXX, which is contributing to the SETI project, has returned XXXX results with errors in the last XX hours. You may want to check it for problems." This could be done even for hosts running optimized apps, as it would also alert users of otherwise properly running rigs that hardware problems may have arisen over time.
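Server-side, that alert could be a simple threshold rule. A sketch in Python; all names and thresholds here are illustrative, not anything SETI's servers actually implement:

```python
def should_alert(error_count, total_count, min_results=10, error_rate=0.5):
    """Hypothetical alert rule for the email idea above: only notify
    once the host has returned a meaningful number of results in the
    window, and more than half of them were client errors."""
    if total_count < min_results:
        return False
    return error_count / total_count > error_rate

def alert_text(host_id, error_count, window_hours):
    # Illustrative wording modelled on the suggestion in this post.
    return (f"Your computer ID {host_id}, which is contributing to the "
            f"SETI project, has returned {error_count} results with "
            f"errors in the last {window_hours} hours. You may want to "
            f"check it for problems.")
```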
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 872085
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 872140 - Posted: 4 Mar 2009, 19:01:40 UTC - in response to Message 872036.  

Could there be a large number of crunchers with corrupt AP applications or app info still out there?


My answer would be YES

2 out of the 6 WUs I got have hosts that are trashing every WU they get.

Hosts 4077778 and 4069312.

Yes, 4077778 is trashing every WU, and its quota is down to 1/day. But 4069312 has a quota of 93/day and had successfully done at least one AP_v5 5.03 task before reporting the 7 errors. The two WUs it is now working on haven't errored out; I'd judge the user fixed whatever problem cropped up March 1, and future results will be OK. It takes about 1/3 of the deadline time to complete an AP_v5 WU, so it will be about another week before those WUs are finished.
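The quota numbers here are consistent with a one-unit penalty per errored result. A Python sketch of that arithmetic; the recovery rule (doubling on a good result, up to the cap) is an assumption on my part, not something stated in this thread:

```python
def update_quota(quota, result_ok, max_quota=100):
    """Daily-quota adjustment consistent with the numbers above: each
    errored result costs one unit of quota, with a floor of 1/day
    (where chronic failers like 4077778 end up); a good result doubles
    the quota back toward the cap (assumed recovery rule)."""
    if result_ok:
        return min(max_quota, quota * 2)
    return max(1, quota - 1)

q = 100
for _ in range(7):      # the seven errors host 4069312 reported
    q = update_quota(q, result_ok=False)
# q is now 93, matching the quota observed above
```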
                                                         Joe
ID: 872140
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 872198 - Posted: 4 Mar 2009, 21:13:44 UTC - in response to Message 872049.  

Since both these hosts are running stock, should BOINC perhaps have a feature that re-issues the apps when a computer loses enough quota?

Good idea, but what if it isn't the app?

What if it's a dirty HSF and the machine is throwing errors because it's just too hot?

... or the memory is failing?
ID: 872198
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 872213 - Posted: 4 Mar 2009, 22:03:16 UTC - in response to Message 872140.  

Shouldn't that be they haven't failed yet? Looking at his tasks I see this....

Task 1170594009 (WU 417760101) · sent 21 Feb 2009 20:15:02 UTC · reported 1 Mar 2009 20:51:41 UTC · Over, Client error, Compute error · CPU time 464,726.30 s · claimed credit 614.09


PROUD MEMBER OF Team Starfire World BOINC
ID: 872213
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 872279 - Posted: 5 Mar 2009, 1:20:58 UTC - in response to Message 872213.  

Shouldn't that be they haven't failed yet? Looking at his tasks I see this....

Task 1170594009 (WU 417760101) · sent 21 Feb 2009 20:15:02 UTC · reported 1 Mar 2009 20:51:41 UTC · Over, Client error, Compute error · CPU time 464,726.30 s · claimed credit 614.09

That's the first one of the seven which reduced the host quota to 93. Another was reported as a failure at the same time, the host then got five more within a few minutes which failed quickly. Then there's the last two which I assume are still running OK after several days. Anything is possible though, the host could have totally died and can't report any more failures.

Truly any result within deadline which hasn't yet been reported from any host is in a "hasn't failed yet" state as far as the servers know. I think most of us have occasionally had a problem which trashed some work one way or another.
                                                              Joe
ID: 872279
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 872297 - Posted: 5 Mar 2009, 2:51:51 UTC - in response to Message 872279.  

Shouldn't that be they haven't failed yet? Looking at his tasks I see this....

Task 1170594009 (WU 417760101) · sent 21 Feb 2009 20:15:02 UTC · reported 1 Mar 2009 20:51:41 UTC · Over, Client error, Compute error · CPU time 464,726.30 s · claimed credit 614.09

That's the first one of the seven which reduced the host quota to 93. Another was reported as a failure at the same time, the host then got five more within a few minutes which failed quickly. Then there's the last two which I assume are still running OK after several days. Anything is possible though, the host could have totally died and can't report any more failures.

Truly any result within deadline which hasn't yet been reported from any host is in a "hasn't failed yet" state as far as the servers know. I think most of us have occasionally had a problem which trashed some work one way or another.
                                                              Joe

Actually... I've done 11 WUs now and none have failed yet.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 872297
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 873473 - Posted: 7 Mar 2009, 18:05:45 UTC

This morning I did a little exercise of looking at failing quorum partners for 20 recent AstroPulse v5 results provided to the Frozen Nehi. These results were sent between 1:46 UTC on March 5 and 15:04 UTC on March 6, and were appreciably scattered in that interval. So although the sample is small, I think it approaches a random look at very recent health.

In that sample of 20 issues, the already failed quorum partners were many, and the symptoms were surprisingly diverse. Nearly all the failing quorum partners had failed repeatedly in recent time, with many beaten down to between 1 and 30 daily quota/CPU. Most failed this particular download within moments of receiving it.

Based on text in the stderr out portion of the task page, here is a summary, including host IDs:

5 of these (4386944, 4722208, 4709402, 4367590, 4770983):
too many exit(0)s

3 of these (4609430, 4828328, 2157900):
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>ap_graphics_5.03_windows_intelx86.exe</file_name>
  <error_code>-200</error_code>

2 of these (2845114, 1807096):
process got signal 11

1 of these (4710603):
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>astropulse_5.03_AUTHORS</file_name>
  <error_code>-200</error_code>

1 of these (2344739):
One or more missing files

1 of these (4316501):
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.

1 of these (4456674):
Can't get shared memory segment name: shmget() failed

1 of these (4293747):
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>ap_21ja09af_B4_P1_00268_20090305_00538.wu</file_name>
  <error_code>-200</error_code>

1 of these (4615895):
boinc_graphics_make_shmem failed: 0
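A tally like this can be automated by bucketing each stderr text on a distinctive substring. A Python sketch, with buckets chosen from the symptoms listed above and the sample data abbreviated for illustration:

```python
from collections import Counter

# Substring -> symptom bucket, taken from the summary above.
BUCKETS = [
    ("too many exit(0)s", "too many exit(0)s"),
    ("app_version download error", "app file download error (-200)"),
    ("WU download error", "WU file download error (-200)"),
    ("process got signal 11", "segfault (signal 11)"),
    ("One or more missing files", "missing files"),
    ("Incorrect function", "exit code 1"),
    ("shmget() failed", "shared-memory failure"),
    ("boinc_graphics_make_shmem", "graphics shmem failure"),
]

def classify(stderr_text):
    """Return the first matching symptom bucket, or 'other'."""
    for needle, bucket in BUCKETS:
        if needle in stderr_text:
            return bucket
    return "other"

# Abbreviated stand-ins for the pasted stderr texts.
samples = [
    "too many exit(0)s",
    "process got signal 11",
    "app_version download error: couldn't get input files",
]
tally = Counter(classify(s) for s in samples)
```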


Comments: In looking at failed quorum partners for just twenty recently issued results, I found about 16 distinct hosts, most of which have been repeatedly failing. While the definition of a distinct symptom is a bit hazy, and is likely altered by the OS, science app, and client in use on the host, there appear to be multiple distinct symptoms.

While I did not log CPU time, the great majority of these were time-zero fails, though at least one host was repeatedly failing after using enough CPU time to represent a small fraction of a WU.

I'm not really presenting this as specific to the Midnight Burst phenomenon, but rather as one flashlight shining into the dark pit of general health for hosts running Astropulse v5.

I doubt such a diverse group of individual users are going to do something to "fix" their hosts, so if this situation is to get better, something more global needs to be done. In the meantime we can be grateful that the eventual limitation to 1/day/CPU limits the share of total traffic these hosts consume, and fear that a short period of MP availability may revive their appetites.



ID: 873473
Rudy
Volunteer tester

Joined: 23 Jun 99
Posts: 189
Credit: 794,998
RAC: 0
Canada
Message 874309 - Posted: 10 Mar 2009, 13:41:57 UTC - in response to Message 873473.  

Great analysis, archae86.

Looking at the errors, they still all look like application errors that may have come from corrupted app downloads. The variety of errors could just be a symptom of different forms of file corruption.

On the AstroPulse graphs, the midnight errors are continuing, not decreasing. The results returned continue to spike up after midnight, and the crunch time continues to spike down, indicating an inrush of bad results.

Scarecrow AP graphs
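As a rough illustration, the pattern described here (hour-0 results returned well above the rest of the day) could be flagged like this. The threshold is arbitrary, and this is not how Scarecrow's graphs are generated:

```python
def midnight_spike(hourly_returns, factor=2.0):
    """Flag a midnight burst: True when the hour-0 count exceeds
    `factor` times the average of the other 23 hours."""
    hour0, rest = hourly_returns[0], hourly_returns[1:]
    return hour0 > factor * (sum(rest) / len(rest))

# Example: a flat day of ~1000 results/hour with a 3000-result burst at 00:00.
day = [3000] + [1000] * 23
```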

I agree that until something is done globally, these hosts will continue filling their quotas with errors every night (and day).

Hopefully a new version of the stock APv5 app, which would fix this problem, is not too far down the road.
ID: 874309


 
©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.