Daily midnight traffic peaks

Rudy
Volunteer tester

Joined: 23 Jun 99
Posts: 189
Credit: 794,998
RAC: 0
Canada
Message 872049 - Posted: 4 Mar 2009, 15:05:50 UTC - in response to Message 872044.  

Since both these hosts are running stock, should BOINC perhaps have a feature that re-issues the apps when a computer loses enough quota?
ID: 872049
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 872085 - Posted: 4 Mar 2009, 16:59:44 UTC - in response to Message 872049.  

Since both these hosts are running stock, should BOINC perhaps have a feature that re-issues the apps when a computer loses enough quota?

That is not a bad idea...at least for those running the stock apps...
Boinc should be able to send a code back to the servers indicating that the stock app is corrupt or missing and initiate a reload of the app.
It would have to recognize the presence of an app_info file indicating that the user is running some kind of opti app, and not take any action in that case, as the use of opti apps is the user's responsibility.

Another thought, either in addition to that, or instead of that, would be to have the Seti servers send an automated email to the user's registered account saying something to the effect that "Your computer ID XXXXXXX which is contributing to the Seti project, has returned XXXX results with errors in the last XX hours. You may want to check it for problems." after a certain percentage of client errors have been returned. This could be done even for hosts running optimized apps, as it would also alert users of otherwise properly running rigs that they may have some hardware problems that have arisen over time.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 872085
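The automated email suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `HostReport` structure, the `build_alert` function, and the 0.5 threshold are all hypothetical and are not part of BOINC or the SETI@home servers. Only the wording of the message comes from the post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostReport:
    """Recent result counts for one host (hypothetical structure)."""
    host_id: int
    results_returned: int
    results_with_errors: int
    window_hours: int


def error_fraction(report: HostReport) -> float:
    """Fraction of returned results that were client errors."""
    if report.results_returned == 0:
        return 0.0
    return report.results_with_errors / report.results_returned


def build_alert(report: HostReport, threshold: float = 0.5) -> Optional[str]:
    """Return the notification text if the host reaches the threshold, else None.

    The 0.5 default is an arbitrary illustrative value, not a project setting.
    """
    if error_fraction(report) < threshold:
        return None
    return (
        f"Your computer ID {report.host_id}, which is contributing to the "
        f"Seti project, has returned {report.results_with_errors} results "
        f"with errors in the last {report.window_hours} hours. "
        "You may want to check it for problems."
    )
```

Because the check looks only at the error fraction, it would apply equally to hosts running stock or optimized apps, which matches the point that hardware problems can affect either.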
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 872140 - Posted: 4 Mar 2009, 19:01:40 UTC - in response to Message 872036.  

Could there be a large number of crunchers with corrupt AP applications or app info still out there?


My answer would be YES

2 out of 6 WU's I got have hosts that are trashing every WU they get.

Hosts 4077778 and 4069312.

Yes, 4077778 is trashing every WU, and its quota is down to 1/day. But 4069312 has a quota of 93/day and had successfully done at least one AP_v5 5.03 task before reporting the 7 errors. The two WUs it is now working on haven't errored out, so I'd judge the user fixed whatever problem cropped up March 1 and future results will be OK. It takes about 1/3 of the deadline time to complete an AP_v5 WU, so it will be about another week before those WUs are finished.
                                                         Joe
ID: 872140
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 872198 - Posted: 4 Mar 2009, 21:13:44 UTC - in response to Message 872049.  

Since both these hosts are running stock, should BOINC perhaps have a feature that re-issues the apps when a computer loses enough quota?

Good idea, but what if it isn't the app?

What if it's a dirty HSF and the machine is throwing errors because it's just too hot?

... or the memory is failing?
ID: 872198
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 872213 - Posted: 4 Mar 2009, 22:03:16 UTC - in response to Message 872140.  

Shouldn't that be they haven't failed yet? Looking at his tasks I see this....

1170594009 417760101 21 Feb 2009 20:15:02 UTC 1 Mar 2009 20:51:41 UTC Over Client error Compute error 464,726.30 614.09


PROUD MEMBER OF Team Starfire World BOINC
ID: 872213
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 872279 - Posted: 5 Mar 2009, 1:20:58 UTC - in response to Message 872213.  

Shouldn't that be they haven't failed yet? Looking at his tasks I see this....

1170594009 417760101 21 Feb 2009 20:15:02 UTC 1 Mar 2009 20:51:41 UTC Over Client error Compute error 464,726.30 614.09

That's the first one of the seven which reduced the host quota to 93. Another was reported as a failure at the same time; the host then got five more within a few minutes, which failed quickly. Then there are the last two, which I assume are still running OK after several days. Anything is possible though: the host could have died completely and be unable to report any more failures.

Truly any result within deadline which hasn't yet been reported from any host is in a "hasn't failed yet" state as far as the servers know. I think most of us have occasionally had a problem which trashed some work one way or another.
                                                              Joe
ID: 872279
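The quota arithmetic in the post above (seven errors leaving a quota of 93, and a persistently failing host ending up at 1/day) can be sketched as below. This is a minimal sketch with assumptions: the starting quota of 100 is inferred from 93 + 7 and is not stated in the thread, the one-unit-per-error rule is likewise inferred, and `quota_after_errors` is a hypothetical helper, not BOINC scheduler code. How the quota recovers after successful results is not modelled.

```python
def quota_after_errors(start_quota: int, errors: int, floor: int = 1) -> int:
    """Daily quota after a run of errored results.

    Assumes one unit is lost per error and the quota never drops
    below `floor` (both assumptions inferred from the thread).
    """
    if start_quota < floor or errors < 0:
        raise ValueError("start_quota must be >= floor and errors must be >= 0")
    return max(floor, start_quota - errors)


print(quota_after_errors(100, 7))    # 93, the figure quoted for host 4069312
print(quota_after_errors(100, 500))  # 1, the floor that host 4077778 has reached
```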
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 872297 - Posted: 5 Mar 2009, 2:51:51 UTC - in response to Message 872279.  

Shouldn't that be they haven't failed yet? Looking at his tasks I see this....

1170594009 417760101 21 Feb 2009 20:15:02 UTC 1 Mar 2009 20:51:41 UTC Over Client error Compute error 464,726.30 614.09

That's the first one of the seven which reduced the host quota to 93. Another was reported as a failure at the same time; the host then got five more within a few minutes, which failed quickly. Then there are the last two, which I assume are still running OK after several days. Anything is possible though: the host could have died completely and be unable to report any more failures.

Truly any result within deadline which hasn't yet been reported from any host is in a "hasn't failed yet" state as far as the servers know. I think most of us have occasionally had a problem which trashed some work one way or another.
                                                              Joe

Actually..I've done 11 WUs now and none have failed yet.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 872297
Rudy
Volunteer tester

Joined: 23 Jun 99
Posts: 189
Credit: 794,998
RAC: 0
Canada
Message 872305 - Posted: 5 Mar 2009, 3:34:42 UTC - in response to Message 872297.  

ID: 872305
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 873473 - Posted: 7 Mar 2009, 18:05:45 UTC

This morning I did a little exercise: looking at the failing quorum partners for 20 recent AstroPulse v5 results provided to the Frozen Nehi. These results were sent between 1:46 UTC on March 5 and 15:04 on March 6, and were appreciably scattered across that interval. So although the sample is small, I think it approaches a random look at very recent health.

In that sample of 20 issues, the already failed quorum partners were many, and the symptoms were surprisingly diverse. Nearly all the failing quorum partners had failed repeatedly in recent time, with many beaten down to between 1 and 30 daily quota/CPU. Most failed this particular download within moments of receiving it.

Based on text in the stderr out portion of the task page, here is a summary, including host IDs:

5 of these 4386944, 4722208, 4709402, 4367590, 4770983:
too many exit(0)s

3 of these 4609430, 4828328, 2157900:
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>ap_graphics_5.03_windows_intelx86.exe</file_name>
  <error_code>-200</error_code>

2 of these 2845114, 1807096 :
process got signal 11

1 of these 4710603 :
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>astropulse_5.03_AUTHORS</file_name>
  <error_code>-200</error_code>

1 of these 2344739 :
One or more missing files

1 of these 4316501:
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.

1 of these 4456674 :
Can't get shared memory segment name: shmget() failed

1 of these 4293747 :
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>ap_21ja09af_B4_P1_00268_20090305_00538.wu</file_name>
  <error_code>-200</error_code>

1 of these 4615895 :
boinc_graphics_make_shmem failed: 0


Comments: In looking at failed quorum partners for just twenty recently issued results, I found about 16 distinct hosts, most of which have been failing repeatedly. While the distinction of what one thinks of as a distinct symptom is a bit hazy, and is likely altered by the OS, science app, and client in use on the host, this appears to be multiple distinct symptoms.

While I did not log CPU time, the great majority of these were time-zero failures, though at least one host was repeatedly failing after using enough CPU time to represent a small fraction of a WU.

I'm not really presenting this as specific to the Midnight Burst phenomenon, but rather as one flashlight shining into the dark pit of general health for hosts running Astropulse v5.

I doubt such a diverse group of individual users is going to do something to "fix" their hosts, so if this situation is to get better, something more global needs to be done. In the meantime we can be grateful that the eventual limitation to 1/day/CPU limits the share of total traffic these hosts consume, and fear that a short period of MP availability may revive their appetites.



ID: 873473
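The symptom summary in the post above can be tallied with a short script. The host IDs and error strings are copied from the post; the grouping labels, variable names, and the `"download error"` substring test are illustrative choices only.

```python
# Symptom -> host IDs, as listed in the post. Labels are shortened for
# readability; the file name in parentheses is the one that failed to download.
symptoms = {
    "too many exit(0)s": [4386944, 4722208, 4709402, 4367590, 4770983],
    "app_version download error (ap_graphics_5.03_windows_intelx86.exe)":
        [4609430, 4828328, 2157900],
    "process got signal 11": [2845114, 1807096],
    "app_version download error (astropulse_5.03_AUTHORS)": [4710603],
    "One or more missing files": [2344739],
    "Incorrect function. (0x1) - exit code 1 (0x1)": [4316501],
    "Can't get shared memory segment name: shmget() failed": [4456674],
    "WU download error": [4293747],
    "boinc_graphics_make_shmem failed: 0": [4615895],
}

# Hosts per symptom, most common first (ties keep the order of the post).
counts = sorted(
    ((name, len(hosts)) for name, hosts in symptoms.items()),
    key=lambda item: -item[1],
)

# Distinct hosts across all symptoms.
all_hosts = {host for hosts in symptoms.values() for host in hosts}

# Hosts whose failure was some kind of download error.
download_hosts = {
    host
    for name, hosts in symptoms.items()
    if "download error" in name
    for host in hosts
}

for name, n in counts:
    print(f"{n:2d}  {name}")
print(f"distinct hosts: {len(all_hosts)}")
print(f"hosts with download errors: {len(download_hosts)}")
```

The total agrees with the "about 16 distinct hosts" in the comments, with five of those hosts failing on a download rather than during computation.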
Rudy
Volunteer tester

Joined: 23 Jun 99
Posts: 189
Credit: 794,998
RAC: 0
Canada
Message 874309 - Posted: 10 Mar 2009, 13:41:57 UTC - in response to Message 873473.  

great analysis archae86

Looking at the errors, they still all look like application errors that may have come from corrupted app downloads. The variety of errors could just be a symptom of different forms of file corruption.

On the Astropulse graphs, the midnight errors are continuing, not decreasing. The results returned continue to spike up after midnight, and the crunch time continues to spike down, indicating an inrush of bad results.

Scarecrow AP graphs

I agree that until something is done globally, these hosts will continue filling their quotas with errors every night (and day).

Hopefully a new version of APv5 stock is not too far down the road, which would hopefully fix this problem.
ID: 874309


 