The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027202 - Posted: 10 Jan 2020, 20:59:32 UTC - in response to Message 2027157.  

. . But if all cards agree the sample is an overflow (noisy) then it does not matter as there is no result going into the database right or wrong. If you are worried about the 1 or 2 credits then what can I say ....
It's not about Credit, it's about science, so the type of overflow is important: the number of pulses, triplets, or whatever.
It is important to get it right.
Grant
Darwin NT
ID: 2027202 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027203 - Posted: 10 Jan 2020, 21:04:29 UTC

I see the Scheduler is taking another of its random breaks, and the splitters also took a time out from work production for almost an hour.
MB In progress is falling off, and my requests for work are mostly "Project has no tasks available" and occasionally 1 or 2 WUs when reporting 10-15.

I am surprised the system held up as long as it did under the present load.
Grant
Darwin NT
ID: 2027203 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027204 - Posted: 10 Jan 2020, 21:06:52 UTC - in response to Message 2027110.  
Last modified: 10 Jan 2020, 21:08:32 UTC

First time this year I have seen the returns past 7 million.
This year, yes. Last year they reached 10.8 million for a while.

Presently there is a validation/assimilation/deletion backlog, which along with the huge return rate & Scheduler issues could explain the Splitters taking a break for a while earlier.
Grant
Darwin NT
ID: 2027204 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1647
Credit: 12,921,799
RAC: 89
New Zealand
Message 2027208 - Posted: 10 Jan 2020, 21:13:34 UTC - in response to Message 2027204.  

First time this year I have seen the returns past 7 million.
This year, yes. Last year they reached 10.8 million for a while.

Presently there is a validation/assimilation/deletion backlog, which along with the huge return rate & Scheduler issues could explain the Splitters taking a break for a while earlier.

The splitters are currently in overdrive. I think a while ago I saw that they were producing over 300/second; it is currently over 150/second.
ID: 2027208 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2027222 - Posted: 10 Jan 2020, 22:42:07 UTC - in response to Message 2027202.  

. . But if all cards agree the sample is an overflow (noisy) then it does not matter as there is no result going into the database right or wrong. If you are worried about the 1 or 2 credits then what can I say ....
It's not about Credit, it's about science, so the type of overflow is important: the number of pulses, triplets, or whatever.
It is important to get it right.


. . I think someone needs to make it clear, are noise bombs added to the science database or NOT?

. . My understanding is that they do not contribute to the database at all.

Stephen

ID: 2027222 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2027227 - Posted: 10 Jan 2020, 22:55:41 UTC - in response to Message 2027222.  

. . I think someone needs to make it clear, are noise bombs added to the science database or NOT?
They are added to the database.

'Noise bomb' is an awkward term. The technical term is 'overflow' (reached 30 signals, no space for any more). That can happen at any point in a task's run - from <1%, to 99%. The 'Late onset' overflows (my own term for them) can certainly contain useful data - immediate overflows, less likely. But they should be treated with respect, and validated properly, whenever the axe comes down.
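
The overflow rule described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `process_task`, the per-chunk input format, and the progress calculation are hypothetical and are not taken from the actual SETI@home application; only the cap of 30 signals comes from the post.

```python
MAX_SIGNALS = 30  # cap mentioned in the post; everything else here is hypothetical


def process_task(signals_per_chunk):
    """Scan chunks in order and stop as soon as the signal cap is reached.

    Returns (reported_signal_count, fraction_of_task_processed, overflowed).
    """
    reported = 0
    total_chunks = len(signals_per_chunk)
    for index, found in enumerate(signals_per_chunk, start=1):
        # Signals beyond the cap are dropped: there is no space for them.
        reported = min(MAX_SIGNALS, reported + found)
        if reported >= MAX_SIGNALS:
            return reported, index / total_chunks, True
    return reported, 1.0, False


# An immediate overflow: the very first chunk already fills the result.
immediate = process_task([40] + [0] * 99)
# A "late onset" overflow: 98% of the task is quiet, then the cap is hit.
late = process_task([0] * 97 + [29, 5, 0])
# A normal task that never reaches the cap.
clean = process_task([1] * 10)

for name, result in (("immediate", immediate), ("late", late), ("clean", clean)):
    print(name, result)
```

In this toy model the late overflow has examined 99% of the task before stopping, which is why such results may still carry useful data, while the immediate one has examined only 1%.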
ID: 2027227 · Report as offensive
Wiggo
Joined: 24 Jan 00
Posts: 38193
Credit: 261,360,520
RAC: 489
Australia
Message 2027232 - Posted: 10 Jan 2020, 23:18:24 UTC

1 thing that I am wondering about is when MB v7 is finally going to be put to bed?

Those 71 tasks still listed were finished years ago and that could free up a few cycles and processes on the servers. ;-)

Cheers.
ID: 2027232 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2027234 - Posted: 10 Jan 2020, 23:25:57 UTC - in response to Message 2027227.  

. . I think someone needs to make it clear, are noise bombs added to the science database or NOT?
They are added to the database.

'Noise bomb' is an awkward term. The technical term is 'overflow' (reached 30 signals, no space for any more). That can happen at any point in a task's run - from <1%, to 99%. The 'Late onset' overflows (my own term for them) can certainly contain useful data - immediate overflows, less likely. But they should be treated with respect, and validated properly, whenever the axe comes down.


. . Well, I think the term 'noise bomb' is fairly explicit: they are noisy signal samples that go off like a bomb, 'pop', and are gone before any serious processing is done, which seems to describe the spurious results the faulty drivers are returning. Certainly late overflows have a fair proportion of the signal analysed and would be expected to be treated with more gravity. I am surprised that the 'unmentionables' are added to the database, since almost none of the sample has even been looked at.

Stephen

ID: 2027234 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2027235 - Posted: 10 Jan 2020, 23:29:19 UTC - in response to Message 2027232.  

1 thing that I am wondering about is when MB v7 is finally going to be put to bed?

Those 71 tasks still listed were finished years ago and that could free up a few cycles and processes on the servers. ;-)

Cheers.


. . And maybe close off a few dead ends ...

Stephen

ID: 2027235 · Report as offensive
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1859
Credit: 268,616,081
RAC: 1,349
United States
Message 2027255 - Posted: 11 Jan 2020, 2:53:38 UTC - in response to Message 2027232.  

1 thing that I am wondering about is when MB v7 is finally going to be put to bed?

Those 71 tasks still listed were finished years ago and that could free up a few cycles and processes on the servers. ;-)

Probably when they need the column for MB v9 :)
But I doubt there's any appreciable impact from MB v7's presence . . .
ID: 2027255 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027285 - Posted: 11 Jan 2020, 10:31:29 UTC
Last modified: 11 Jan 2020, 10:33:53 UTC

Web site & forums very slow, are we about to crash?

Edit- and Scheduler is back to mostly "Project has no tasks available" responses with the occasional release of 1 or 2 no matter how many are being reported.
Grant
Darwin NT
ID: 2027285 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2027292 - Posted: 11 Jan 2020, 11:26:56 UTC

It's very variable. Two different machines:

11/01/2020 11:22:01 | SETI@home | [sched_op] NVIDIA GPU work request: 11753.94 seconds; 0.00 devices
11/01/2020 11:22:05 | SETI@home | Scheduler request completed: got 24 new tasks

11/01/2020 11:23:35 | SETI@home | [sched_op] NVIDIA GPU work request: 18778.74 seconds; 0.00 devices
11/01/2020 11:23:37 | SETI@home | Scheduler request completed: got 0 new tasks

The more you ask for, the less you get!
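
For anyone wanting to track this over a longer stretch of the event log, here is a rough Python sketch that pairs each work request with the scheduler reply that follows it. It assumes the log lines look exactly like the four quoted above (with the `sched_op` debug flag enabled); the regular expressions and the `pair_requests` helper are illustrative, not part of BOINC.

```python
import re

# The four log lines quoted in the post, used as sample input.
LOG = """\
11/01/2020 11:22:01 | SETI@home | [sched_op] NVIDIA GPU work request: 11753.94 seconds; 0.00 devices
11/01/2020 11:22:05 | SETI@home | Scheduler request completed: got 24 new tasks
11/01/2020 11:23:35 | SETI@home | [sched_op] NVIDIA GPU work request: 18778.74 seconds; 0.00 devices
11/01/2020 11:23:37 | SETI@home | Scheduler request completed: got 0 new tasks
"""

REQUEST = re.compile(r"work request: ([\d.]+) seconds")
REPLY = re.compile(r"got (\d+) new tasks")


def pair_requests(log_text):
    """Pair each work request (seconds asked for) with the next reply (tasks granted)."""
    pairs = []
    pending = None
    for line in log_text.splitlines():
        request = REQUEST.search(line)
        reply = REPLY.search(line)
        if request:
            pending = float(request.group(1))
        elif reply and pending is not None:
            pairs.append((pending, int(reply.group(1))))
            pending = None
    return pairs


pairs = pair_requests(LOG)
for seconds, tasks in pairs:
    print(f"asked for {seconds:.2f} s of work, got {tasks} tasks")
```

Two samples prove nothing about a trend, of course; the sketch only makes it easier to collect more pairs before drawing conclusions.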
ID: 2027292 · Report as offensive
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2027305 - Posted: 11 Jan 2020, 13:31:04 UTC

Sat 11 Jan 2020 07:28:31 AM CST |  | Project communication failed: attempting access to reference site
Sat 11 Jan 2020 07:28:31 AM CST | SETI@home | Temporarily failed upload of 11oc10aa.23426.13973.6.33.30_2_r320875423_0: transient HTTP error


Got up this morning with a bunch of download retries hanging on one box. Hit the retry button and down they came. Then I got this.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2027305 · Report as offensive
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2027310 - Posted: 11 Jan 2020, 14:03:21 UTC
Last modified: 11 Jan 2020, 14:07:52 UTC

Could be just a coincidence, but each time the sum of the WUs reaches 22-23 million, weird things start happening.

Who knows?

Sometimes it is wise to take a step back and rethink the change to the WU limits. Maybe, just maybe, 200 CPU/300 GPU is too much for the servers to handle.

My 2 cents.
ID: 2027310 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027353 - Posted: 11 Jan 2020, 20:07:52 UTC

Well, the Scheduler's back up after its random break.
Now I'm getting more uploads instantly timing out on their first attempt than usual, and even getting some timing out on their second.
Grant
Darwin NT
ID: 2027353 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2027355 - Posted: 11 Jan 2020, 20:12:10 UTC - in response to Message 2027353.  

Been having that issue for the past day or so.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2027355 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027360 - Posted: 11 Jan 2020, 20:23:07 UTC
Last modified: 11 Jan 2020, 20:26:06 UTC

I see new files have been loaded, and they're lower in number than the BLC35s. So as the present BLC35 files finish, the BLC35s clear out of caches, and the resends work their way through, the load on the servers should drop by a huge amount.
So the upload issue should return to its normal level, and the next time the Scheduler takes its random break, it won't last as long (hopefully).
Grant
Darwin NT
ID: 2027360 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2027372 - Posted: 11 Jan 2020, 22:30:59 UTC - in response to Message 2027227.  

immediate overflows, less likely. But they should be treated with respect, and validated properly, whenever the axe comes down.
The position where the axe blade makes contact can be inconsistent. CPU apps that process the data serially hit the 30th signal at the same point if they see the same signals but GPU apps that do parallel processing discover the signals in unpredictable order. So they may report a completely different subset of 30 signals out of potentially thousands that exist in the file.

At least this is my theory of why my GPU scores invalids from overflows quite regularly but non-overflow results are always valid.
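
A toy Python model of this theory, for illustration only. It assumes a hypothetical task with 1000 detectable signals and models parallel discovery as a plain random shuffle, which is a crude stand-in and not how any real GPU app orders its results; the function names are made up for the sketch.

```python
import random

MAX_SIGNALS = 30  # overflow cap discussed earlier in the thread


def serial_report(signals):
    """Serial app: signals are found in file order, so the first 30 are reported."""
    return signals[:MAX_SIGNALS]


def parallel_report(signals, rng):
    """Parallel app (crudely modelled): discovery order is shuffled before the cap applies."""
    discovered = list(signals)
    rng.shuffle(discovered)
    return discovered[:MAX_SIGNALS]


noisy = list(range(1000))  # hypothetical overflow task: far more than 30 signals
quiet = list(range(10))    # hypothetical normal task: fewer than 30 signals

cpu_a = serial_report(noisy)
cpu_b = serial_report(noisy)
gpu_a = parallel_report(noisy, random.Random(1))
gpu_b = parallel_report(noisy, random.Random(2))
quiet_gpu = parallel_report(quiet, random.Random(3))

print("serial runs agree:", cpu_a == cpu_b)
print("signals shared by the two parallel runs:", len(set(gpu_a) & set(gpu_b)), "of", MAX_SIGNALS)
print("quiet task reports every signal:", sorted(quiet_gpu) == quiet)
```

Under these assumptions the two serial runs always report the same 30 signals, the two shuffled runs report largely different subsets, and a task below the cap reports the same set of signals regardless of discovery order.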
ID: 2027372 · Report as offensive
Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 2027376 - Posted: 11 Jan 2020, 22:48:17 UTC - in response to Message 2027372.  

At least this is my theory of why my GPU scores invalids from overflows quite regularly but non-overflow results are always valid.
I have very few invalids (I won't say none; never say never), CPU or GPU.

I fear there is something going on at your end, Ville.
ID: 2027376 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2027377 - Posted: 11 Jan 2020, 22:49:37 UTC

Something's broken- Received-last-hour for both MB & AP have plummeted, as has the splitter output. All at the same time, around 21:20 UTC.
Although it looks like things might just be coming back to life again; splitter output is no longer 0, Returned-last-hour is climbing again.

Very odd; usually I have to make the post before the problem clears. This time I just had to start typing about it.
Grant
Darwin NT
ID: 2027377 · Report as offensive


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.