The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)

Previous · 1 . . . 63 · 64 · 65 · 66 · 67 · 68 · 69 . . . 107 · Next

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043483 - Posted: 7 Apr 2020, 10:45:34 UTC - in response to Message 2043467.  

With 2-month and even 3-month+ deadlines, that's how long it will take for the current stuff in limbo to eventually be resent. Then of course a few of those will probably end up being picked up by a system that won't return them, so add another 3+ months before those are likely to finally be returned (unless they end up with yet another black hole system).


. . I am hoping that the majority of the delinquent hosts were among the systems that were shut down, so very few resends will end up in such black holes again.

Stephen.

:)
ID: 2043483 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043489 - Posted: 7 Apr 2020, 11:39:33 UTC - in response to Message 2043485.  

We'll see. The first day or so after the hibernation, I did get quite a few resends. Now though, it's been many hours since I got any.


. . The same here, but I have had half a dozen or maybe a dozen in the last 24 hours. Looking at some of my pendings, there should be significant resends in the next few days. So maybe there will be a change in the direction of the different stage numbers.

Stephen

< fingers crossed >
ID: 2043489 · Report as offensive     Reply Quote
Scrooge McDuck
Avatar

Send message
Joined: 26 Nov 99
Posts: 711
Credit: 1,674,173
RAC: 54
Germany
Message 2043490 - Posted: 7 Apr 2020, 11:46:42 UTC - in response to Message 2043467.  

With 2-month and even 3-month+ deadlines, that's how long it will take for the current stuff in limbo to eventually be resent. Then of course a few of those will probably end up being picked up by a system that won't return them, so add another 3+ months before those are likely to finally be returned (unless they end up with yet another black hole system).

Raising the initial task replication from two tasks per work unit to maybe four would solve this problem much faster. This can also be done for workunits already sent out into the field. I've seen raised replication numbers for some "blc guppi" work units recently (they started with two, and another two were sent out on March 31st)...

...like this one.

As soon as at least two results have been validated for a WU, an authorized result exists and it can be assimilated into the science DB. Later returned results still get credit, but they aren't necessary for assimilation. So after some weeks only such non-relevant late tasks are out in the field, maybe lost in black holes or on machines that run the screensaver two hours per day... At that point in the foreseeable future the guys in Berkeley will freeze their science database and cut the connections to their BOINC DB servers.

And I think they want to retire, or at least switch off, many machines then, which they don't want to herd during the project's hiatus.
ID: 2043490 · Report as offensive     Reply Quote
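The quorum idea Scrooge describes above can be sketched roughly like this. This is a minimal illustration, not actual BOINC server code; the function name and the simple equality check between results are assumptions made for the sketch:

```python
# Minimal sketch of BOINC-style quorum validation (not real server code).
# A workunit is "validated" once min_quorum returned results agree; the
# agreeing result becomes canonical and can be assimilated. Later results
# still earn credit but are no longer needed for the science DB.

def find_canonical(results, min_quorum=2):
    """Return a canonical result once min_quorum results agree, else None."""
    for candidate in results:
        agreeing = [r for r in results if r == candidate]
        if len(agreeing) >= min_quorum:
            return candidate
    return None

returned = ["signal:none", "signal:none"]   # two matching results
print(find_canonical(returned))             # prints: signal:none
print(find_canonical(["signal:none"]))      # prints: None (quorum not met)
```

Raising the initial replication from two to four just means more results arrive sooner, so the quorum is reached earlier even if some hosts never report back.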
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043495 - Posted: 7 Apr 2020, 12:27:45 UTC - in response to Message 2043490.  

Raising the initial task replication from two tasks per work unit to maybe four would solve this problem much faster. This can also be done for workunits already sent out into the field. I've seen raised replication numbers for some "blc guppi" work units recently (they started with two, and another two were sent out on March 31st)...
...like this one.
As soon as at least two results have been validated for a WU, an authorized result exists and it can be assimilated into the science DB. Later returned results still get credit, but they aren't necessary for assimilation. So after some weeks only such non-relevant late tasks are out in the field, maybe lost in black holes or on machines that run the screensaver two hours per day... At that point in the foreseeable future the guys in Berkeley will freeze their science database and cut the connections to their BOINC DB servers.
And I think they want to retire, or at least switch off, many machines then, which they don't want to herd during the project's hiatus.


. . Actually it's the other way around. Despite having multiple validated results, the workunit will not advance to the assimilators until that last host returns a result or times out. So increasing the replications will actually slow things down unless the existing unreturned task(s) are timed out. That was the purpose of the script that sadly had a glitch and caused multiple unnecessary replications. I still believe it would be worth another look.

Stephen

. .
ID: 2043495 · Report as offensive     Reply Quote
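Stephen's point, that the workunit lingers until every outstanding task is returned or timed out, can be sketched as follows. This is a hypothetical model of the behaviour described in the thread, not actual server logic:

```python
# Hypothetical model of the behaviour Stephen describes: even with a
# validated quorum, the workunit cannot be retired while any task is
# still "in progress" - so each extra replication is one more thing
# that must resolve (return or deadline timeout) before the WU clears.

TERMINAL = {"returned", "timed_out", "error"}

def can_retire(task_states, quorum_validated):
    """A WU clears the system only when validated AND no task is outstanding."""
    return quorum_validated and all(s in TERMINAL for s in task_states)

print(can_retire(["returned", "returned"], True))                 # prints: True
print(can_retire(["returned", "returned", "in_progress"], True))  # prints: False
```

Under this model, adding replicas to an already-validated WU only widens the set of tasks that must reach a terminal state, which is exactly why Stephen argues it slows things down.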
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2043500 - Posted: 7 Apr 2020, 12:47:44 UTC
Last modified: 7 Apr 2020, 13:13:50 UTC

At the current return rate of 7.5 K WU/hr we will have our answer in about 2 weeks.
By then, the hosts that still have WUs (slow hosts, bunkered, spoofed or whatever) should be at the end of their crunching season, and the few left will be the never-returning hosts or the hosts that were powered off by their users.
Then, around May 15, the resends from them will start to appear again in larger quantities, nothing huge of course.

Replica seconds behind master: 305,895. Something has improved, at least.
ID: 2043500 · Report as offensive     Reply Quote
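juan's two-week estimate is simple drain-time arithmetic: tasks still in the field divided by the observed return rate. The in-the-field count below (~2.5 million) is an assumption chosen only to illustrate the calculation; it is not a number from the post:

```python
# Back-of-envelope drain time for the remaining tasks in the field.
# The 2.5M in-the-field figure is an assumed illustration, not a
# number reported on the server status page.
return_rate_per_hour = 7_500       # "7.5 K WU/hr" from the post
tasks_in_field = 2_500_000         # assumption for illustration

hours = tasks_in_field / return_rate_per_hour
days = hours / 24
print(f"{days:.1f} days")          # prints: 13.9 days, i.e. roughly 2 weeks
```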
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043506 - Posted: 7 Apr 2020, 13:57:29 UTC - in response to Message 2043500.  

[quote]Replica seconds behind master: 305,895. Something has improved, at least.[/quote]
. . Now under 300,000, catching up at last ...

Stephen

:)
ID: 2043506 · Report as offensive     Reply Quote
AllgoodGuy

Send message
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2043532 - Posted: 7 Apr 2020, 17:03:24 UTC - in response to Message 2043479.  

AV software creating Ghosts comes to mind.
But even though that system doesn't actually have them, they're in the Seti system as though they are until they eventually timeout & get re-issued.

AV shouldn't be doing this. There's nothing I can see that would cause it to reject file transfers. As far as I can see, it is all http(s) transfers of gridded binary data. Nothing there should be of concern to AV software.
Every other update seems to result in AV software of one brand or another considering BOINC activity to be malicious, and it's not at all unusual for someone to complain they aren't getting any work, when they have been- it's just that the AV software has been intercepting it as suspect activity.

Sounds more like people struggling to configure their AV software. There isn't anything there that would trigger the various heuristic engines, and all transfers use established TCP connections.
ID: 2043532 · Report as offensive     Reply Quote
AllgoodGuy

Send message
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2043535 - Posted: 7 Apr 2020, 17:07:49 UTC - in response to Message 2043480.  
Last modified: 7 Apr 2020, 17:33:09 UTC

Because of systems like this one.
Last contact, April 1
Tasks in progress, 14,409!
14k tasks in progress on a machine with 1 GPU?
AV software creating Ghosts comes to mind.
But even though that system doesn't actually have them, they're in the Seti system as though they are until they eventually timeout & get re-issued.
AV shouldn't be doing this. Not anything I can see which would cause it to reject file transfers. As far as I can see, it is all http(s) transfers of gridded binary data. Nothing there should be of concern to AV software.
Actually, many people have wound up being the victims of overzealous AV software, and unless the BOINC folders are excluded from being scanned (and its internet connection from being monitored), those results are indeed very possible. I've troubleshot many users here who have suffered from them as well. ;-)

Cheers.

Again, I'd point the finger at user configurations. I have literally overseen billions of transfers of this type of file, over every type of connection from 2400 baud modems to OC-48 links. It is absolutely illogical that these files or the transfer process would trigger AV software. GRIB data doesn't tend to trigger the engines, and every damn piece of AV software in the world comes with a default rule to allow http and established TCP connections. User configurations... that I can believe.

What's worse is that recommending a specific area of storage be added to an exclusion list leaves the machine with another vector for attack, and it isn't necessary. If, as you suggest, the AV were having a difficult time, the first recommendation would be a reboot. The AV software might have hiccoughed because it was looking at new executable files in the BOINC directory, not the data directory.

In 30 years of working with these files, I have seen a total of zero instances of AV software having an issue with GRIB data, on every Microsoft operating system from DOS 3.1 to Windows 10, hundreds of Linux distributions, SunOS 2.3 to Solaris 10 (including Trusted Solaris), Irix, Trix, and HP-UX. I've dealt with each one of these and never seen what you're suggesting.
ID: 2043535 · Report as offensive     Reply Quote
Scrooge McDuck
Avatar

Send message
Joined: 26 Nov 99
Posts: 711
Credit: 1,674,173
RAC: 54
Germany
Message 2043538 - Posted: 7 Apr 2020, 17:09:54 UTC - in response to Message 2043495.  
Last modified: 7 Apr 2020, 17:14:28 UTC

. . Actually it's the other way around. Despite having multiple validated results, the workunit will not advance to the assimilators until that last host returns a result or times out. So increasing the replications will actually slow things down unless the existing unreturned task(s) are timed out. That was the purpose of the script that sadly had a glitch and caused multiple unnecessary replications. I still believe it would be worth another look.

Uuhh, I wasn't aware of this long-suffering behaviour of the assimilators. Sorry for my misleading remark.

But without insight into BOINC's backend logic, I would suspect that such months-long cascades of timeouts and resends and timeouts and... could really be shortened when a project is stopped, or similarly when a certain app version within it (e.g. setiathome_v7) is retired. Raising the initial replication over the past few months would have sped things up, provided that result "assimilation" is done in two steps: (1) feed authorized results into the science DB, despite any missing late results from additionally replicated tasks, so this DB can be frozen at a near date (even today, such late results are only credited after reporting and then ignored until final deletion); and (2) keep the corresponding workunit in the BOINC DB as long as it is actively "living" (has active pending tasks). That would require only a small change in the assimilator logic to allow early assimilation.

Sure there will be some drawbacks: a raised replication rate means increased stress for the already overloaded DB servers...

But... most important... such ultralong-living workunits are an endless and invaluable source of excitement for us setizens. Some of us love to count them down in forum posts over months, until an overjoyed spectator can announce the final hurray. The recent hurray about the last s@h_v7 workunit celebrated the demise of a superseded app while its successor was doing great. Now it will be the last app and the last workunit for the foreseeable future.

Now it's getting philosophical... What is the essence of being, from the perspective of a work unit? Its natural existence should not be harmed, not even by the SETI triumvirate's expressed will to dissolve their kingdom. Let them live on! Let's protest! Let's write some appeals! Someone has to convince the guys in Berkeley not to interfere with a workunit's natural life cycle. Maybe the BOINC DB can be compressed a little and moved to some small energy-efficient box to allow the last mammoth workunits to die naturally.

Eric wrote on Mar 31:
... We'll be accepting results and resending results that didn't validate for a while.

And after that "while"? After 20 years of their natural existence, can we allow him to exterminate the last survivors of an endangered species: seti workunits? Isn't there an ethical obligation to accompany their foreseeable demise with dignity until the very last day?

I therefore advocate a raised initial replication instead of pulling the plug on a Judgement Day. ;-)

Cheers, Scrooge.
ID: 2043538 · Report as offensive     Reply Quote
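Scrooge's two-step assimilation idea above can be sketched as follows. This is purely hypothetical logic illustrating the proposal, not anything the BOINC assimilator actually implements:

```python
# Hypothetical two-step variant of the proposal above: assimilate the
# canonical result into the science DB as soon as the quorum validates,
# but keep the workunit alive in the BOINC DB while late tasks remain
# (they still earn credit, but no longer block a science DB freeze).

def step_workunit(quorum_validated, assimilated, pending_tasks):
    """Return (assimilate_now, keep_in_boinc_db) for one scheduler pass."""
    assimilate_now = quorum_validated and not assimilated   # step 1: early
    keep_in_boinc_db = pending_tasks > 0                    # step 2: linger
    return assimilate_now, keep_in_boinc_db

# Quorum met, 2 late replicas still out: assimilate now, keep the WU around.
print(step_workunit(True, False, 2))   # prints: (True, True)
# Everything back and already assimilated: the WU can finally be purged.
print(step_workunit(True, True, 0))    # prints: (False, False)
```

Decoupling the two decisions is the whole point of the proposal: the science DB could be frozen early while the BOINC DB keeps the long-tail workunits alive.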
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043540 - Posted: 7 Apr 2020, 17:13:45 UTC

Most resends in the future will go to hosts with empty caches that will crunch and return them almost immediately, so the chance for a resend to end up in a black hole will be significantly reduced.
ID: 2043540 · Report as offensive     Reply Quote
Scrooge McDuck
Avatar

Send message
Joined: 26 Nov 99
Posts: 711
Credit: 1,674,173
RAC: 54
Germany
Message 2043546 - Posted: 7 Apr 2020, 18:41:20 UTC - in response to Message 2043540.  

Most resends in the future will go to hosts with empty caches that will crunch and return them almost immediately, so the chance for a resend to end up in a black hole will be significantly reduced.

Sure, most but not all. We can start a bet: in May enough resends will again end up in black holes, and in July some of their resends too... and so on. There is some funny thread on the final seti_v7 resends somewhere in this forum. Can't find the thread, sorry. They counted them down to zero in it. How long did that take? Half a year, a full year? I don't remember the dates.
ID: 2043546 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2043555 - Posted: 7 Apr 2020, 19:58:33 UTC - in response to Message 2043546.  

There is some funny thread on the final seti_v7 resends somewhere in this forum. Can't find the thread, sorry. Counting them finally down to zero. How long does it take? Half a year, a full year? I don't remember the dates.
71 of them were still showing on the SSP until early this year - they only cleared when we took desperate measures to clean up the Christmas holiday mess.
ID: 2043555 · Report as offensive     Reply Quote
Profile Kissagogo27 Special Project $75 donor
Avatar

Send message
Joined: 6 Nov 99
Posts: 716
Credit: 8,032,827
RAC: 62
France
Message 2043560 - Posted: 7 Apr 2020, 20:32:19 UTC

just received 2 WUs, one GPU and another CPU ...

07-Apr-2020 20:06:47 [SETI@home] Scheduler request completed: got 1 new tasks
07-Apr-2020 20:20:19 [SETI@home] Scheduler request completed: got 1 new tasks


before the deadline of the _1 wingman, a _2 was sent to a third one,
and before the deadline of the _2, I received the _3!
3842707158

it's not the first one we've seen, is this new server behavior?
ID: 2043560 · Report as offensive     Reply Quote
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043564 - Posted: 7 Apr 2020, 20:42:20 UTC - in response to Message 2043546.  

There is some funny thread on the final seti_v7 resends somewhere in this forum. Can't find the thread, sorry. Counting them finally down to zero. How long does it take? Half a year, a full year? I don't remember the dates.
That isn't really comparable to this situation. Hosts that crunched the v7 resends had caches full of v8 work, so those resends had the same turnaround time as any task. The longer a task sits in someone's cache, the higher the chance of something fatal happening to it, like a filesystem failure or the user quitting.
ID: 2043564 · Report as offensive     Reply Quote
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043565 - Posted: 7 Apr 2020, 20:50:07 UTC - in response to Message 2043555.  

71 of them were still showing on the SSP until early this year - they only cleared when we took desperate measures to clean up the Christmas holiday mess.
Those 71 had all been returned, validated, and assimilated and were waiting for purging, so the 'guilty' party was the server, not the users. I guess the v7 db purger had been taken out of service before those 71 tasks were validated, so they ended up in limbo.
ID: 2043565 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043573 - Posted: 7 Apr 2020, 22:00:51 UTC - in response to Message 2043538.  

But without insight into BOINCs backend logic, I would suspect that such months-long cascades of timeouts and resends and timeouts and..., if a project is stopped, or similarly, a certain app version within (e.g. setiathome_v7) could really be shortened. Raising the initial replication over the past few month(s) would speed things up, provided that the result "assimilation" is done in two steps: (1) feed authorized results into the science DB - despite any missing late results from additional replicated tasks - to freeze this DB at a near date. Even today, such late results are only credited after reporting and then ignored until final deletion. And a small change in the assimilator logic that allows early assimilation, while (2) keeping the corresponding workunit in the boinc DB as long as it is actively "living" (active pending tasks).

Cheers, Scrooge.


. . Actually, all they need to do is shorten the b(&&^y deadlines. Things will clear up much faster in a fairly 'normal' fashion, so all the outstanding work can be completed.

Stephen

. .
ID: 2043573 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043575 - Posted: 7 Apr 2020, 22:05:36 UTC - in response to Message 2043560.  

just received 2 WUs, one GPU and another CPU ...

07-Apr-2020 20:06:47 [SETI@home] Scheduler request completed: got 1 new tasks
07-Apr-2020 20:20:19 [SETI@home] Scheduler request completed: got 1 new tasks


before the deadline of the _1 wingman, a _2 was sent to a third one,
and before the deadline of the _2, I received the _3!
3842707158

it's not the first one we've seen, is this new server behavior?


. . I have noticed this also. Not sure what it means, but unless the old task is terminated (timed out), the WU will remain active until it is. If the timeout is within 24 hours, then chances are the extra replication will be returned by then and the WU can move through the system.

Stephen

:)
ID: 2043575 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2043586 - Posted: 7 Apr 2020, 23:38:24 UTC

Another system (from another thread).

In progress 25281
And it's been a week since last contact with the server.

And with the user being Anonymous, for all we know they could have several systems in the same state.
Grant
Darwin NT
ID: 2043586 · Report as offensive     Reply Quote
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2043587 - Posted: 7 Apr 2020, 23:45:43 UTC - in response to Message 2043535.  
Last modified: 7 Apr 2020, 23:46:38 UTC

Again, I'd point the finger at user configurations. I have literally overseen billions of transfers of this type of file, over every type of connection from 2400 baud modems to OC-48 links. It is absolutely illogical that these files or the transfer process would trigger AV software.
*shrug*
All I know is that every few months we would get someone complaining about not getting any work; we'd look at their system and see way more work in progress than was possible. We'd ask if they had recently updated their AV/malware software. Answer: yes. A bit later they'd say they are now getting work.
Then there are the ones who would post here saying they are leaving Seti@home because their AV/malware software claims it's a virus or worm of some description. Sometimes it's the result of an upgrade to the software, other times just a normal daily AV definition update, and the activity of BOINC now sets off its alarms.
It's happened. Repeatedly. For years.
*shrug*
Grant
Darwin NT
ID: 2043587 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043592 - Posted: 8 Apr 2020, 0:15:14 UTC - in response to Message 2043587.  
Last modified: 8 Apr 2020, 0:22:17 UTC

Again, I'd point the finger at user configurations. I have literally overseen billions of transfers of this type of file, over every type of connection from 2400 baud modems to OC-48 links. It is absolutely illogical that these files or the transfer process would trigger AV software.
*shrug*
All I know is that every few months we would get someone complaining about not getting any work; we'd look at their system and see way more work in progress than was possible. We'd ask if they had recently updated their AV/malware software. Answer: yes.
It's happened. Repeatedly. For years.
*shrug*


. . Exactly ... happens kind of regularly.

Stephen

. .
ID: 2043592 · Report as offensive     Reply Quote



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.