The Server Issues / Outages Thread - Panic Mode On! (119)

rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22205
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2038601 - Posted: 17 Mar 2020, 23:24:44 UTC

There are ways.....
But less impressive is the number of errors due to late returns - over 300 and counting :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2038601
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2038603 - Posted: 17 Mar 2020, 23:30:35 UTC - in response to Message 2038582.  

The One Wingman didn't agree with the One Overflow marked Valid

Most of my six pages of Quorum=1 tasks are just regular, normal tasks and not overflows. Also not many are AMD either. A lot are just regular cpu tasks.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2038603
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2038611 - Posted: 18 Mar 2020, 0:29:13 UTC - in response to Message 2038582.  

. . Wouldn't that introduce a large number of rubbish results into the science database? This way may be slow and painful but it seems to me it is necessary ...

Stephen

? ?
ID: 2038611
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2038612 - Posted: 18 Mar 2020, 0:34:36 UTC - in response to Message 2038591.  

Looking at my older pendings, I found this interesting computer:
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8473422
4 core processor with 1 gpu and 15000 tasks ???
His oldest pending was returned 3/12.
He received thousands of tasks today.
No wonder things are not working


. . It is hosts such as this one that justify the suggested change of forcing all existing outstanding WUs to a deadline of one week at this stage, which would push them into resends and maybe get results back before the curtain comes down. His return rates are 17 days for CPU tasks and 55 days for SoG tasks on a 1080 Ti. That is very, very wrong. 15,000 cached tasks? Crazy!!

Stephen

:(
ID: 2038612
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2038614 - Posted: 18 Mar 2020, 0:38:18 UTC - in response to Message 2038603.  

The One Wingman didn't agree with the One Overflow marked Valid

Most of my six pages of Quorum=1 tasks are just regular, normal tasks and not overflows. Also not many are AMD either. A lot are just regular cpu tasks.


. . But how hard would it be to sort the 'normal' CPU or SoG results from those of less confident hosts? Because it would be unwise to just accept single host results from such machines. Not just AMD 57nn cards but any hosts with a suspect result quality.

Stephen

? ?
ID: 2038614
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2038638 - Posted: 18 Mar 2020, 3:15:00 UTC - in response to Message 2038621.  

Lots of new files added to the splitters now.

The question is, will we get all of these processed within the next 13 to 14 days?
In my eyes progress has been made, because "results waiting DB purge" has risen to over a million. I cannot remember the last time I saw this.
ID: 2038638
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2038639 - Posted: 18 Mar 2020, 3:18:45 UTC - in response to Message 2038611.  

. . Wouldn't that introduce a large number of rubbish results into the science database? This way may be slow and painful but it seems to me it is necessary ...

Stephen

? ?
The Results from the "Bad" AMD 5700s will Never be used to Search for ET as the Results from those machines are ALL at Chirp = 0, and will be Removed as RFI even if they do make it to the Database. The other Machines sent the minimum quorum = 1 WUs are reliable machines and do Not produce "rubbish", that's Why they were sent those WUs. At this point it would be Much better to FIX the Database than try to save a few WUs that will be trashed as RFI. Trying to save a few WUs is keeping Thousands from being completed. Why try to save a few if it means Missing Thousands? I think any changes to the quorums should be reverted to where they were before the problem started in early Dec. I'd much rather have a few WUs listed as RFI than have the fastest machines go DAYS producing Nothing instead of Thousands of completed tasks a day. You do realize the Problem with the AMD 5700s was going on for Months before the changes were made to the quorum count, right? It's not as if they haven't already loaded the database with loads of RFI WUs.
ID: 2038639
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2038642 - Posted: 18 Mar 2020, 4:19:10 UTC - in response to Message 2038591.  

Looking at my older pendings, I found this interesting computer:

https://setiathome.berkeley.edu/show_host_detail.php?hostid=8473422

4 core processor with 1 gpu and 15000 tasks ???

His oldest pending was returned 3/12.

He received thousands of tasks today.

No wonder things are not working

How does this happen? 15,000 tasks for a 1K credit person?
ID: 2038642
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2038643 - Posted: 18 Mar 2020, 4:24:09 UTC - in response to Message 2038642.  

The client operation can be broken in a multitude of ways with a host that does not fit within the norms. So the default braking mechanism fails spectacularly in some cases.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2038643
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2038650 - Posted: 18 Mar 2020, 4:55:11 UTC - in response to Message 2038643.  

The client operation can be broken in a multitude of ways with a host that does not fit within the norms. So the default braking mechanism fails spectacularly in some cases.

Obviously :). Spectacularly is a very fitting and proper word here. LOL.
ID: 2038650
Unixchick (Project Donor)
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2038654 - Posted: 18 Mar 2020, 5:45:15 UTC

There is a healthy-sized ready-to-send queue and the system is running well. The replica is also keeping up. I'm puzzled, as certain numbers seem high, so I don't know what magic was worked. I'm glad it is running well now, though.
ID: 2038654
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2038655 - Posted: 18 Mar 2020, 5:56:24 UTC - in response to Message 2038521.  

No outage again this Tuesday?
It would appear not.
So far things are continuing to struggle along. Whenever we have an outage it takes over 12 hours once the system comes back up for things to settle down anyway.
And until no new work is issued (or the variable Quorum's changed back to a fixed 2), none of the backlogs or bloat are going to clear.
So we might as well just stagger along towards the finish line, with only 2 weeks to go now.
Grant
Darwin NT
ID: 2038655
Unixchick (Project Donor)
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2038724 - Posted: 18 Mar 2020, 17:10:34 UTC

How are files added to be split??

I think the Arecibo files are automatically added. Not sure how the blc files arrive. Can the seti crew add files to be split remotely??

How is this affecting the data collection sites? Scientists who have time on telescopes may not be able to travel to get there.
ID: 2038724
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2038728 - Posted: 18 Mar 2020, 17:36:58 UTC - in response to Message 2038724.  

Can the seti crew add files to be split remotely??
Server computers are generally boxes stacked in server racks with no keyboards or monitors, so they are always accessed remotely.
ID: 2038728
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2038730 - Posted: 18 Mar 2020, 17:40:20 UTC
Last modified: 18 Mar 2020, 17:42:57 UTC

According to the SSP, all unstarted tapes have been removed. The beginning of the end?
ID: 2038730
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2038768 - Posted: 18 Mar 2020, 21:17:51 UTC
Last modified: 18 Mar 2020, 21:18:50 UTC

even if we run out of BLC work, it looks like the Arecibo automation is still running. 2 more tapes have been added.

it also looks like they removed any throttling. they put the brick on the gas pedal and let it go! hahaha.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2038768
Freewill (Project Donor)
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2038780 - Posted: 18 Mar 2020, 22:18:09 UTC - in response to Message 2038768.  

even if we run out of BLC work, it looks like the Arecibo automation is still running. 2 more tapes have been added.

it also looks like they removed any throttling. they put the brick on the gas pedal and let it go! hahaha.

If the database doesn't hold together, we may indeed go out with a bang!
ID: 2038780
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2038782 - Posted: 18 Mar 2020, 22:20:08 UTC - in response to Message 2038768.  

I'll accept 16mr20af as automation, but I think 16se11ab must have been manually chosen.
ID: 2038782
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2038783 - Posted: 18 Mar 2020, 22:20:49 UTC - in response to Message 2038768.  

even if we run out of BLC work, it looks like the Arecibo automation is still running. 2 more tapes have been added.

it also looks like they removed any throttling. they put the brick on the gas pedal and let it go! hahaha.

Also, I just noticed after the latest update of the SSP that the deleters seem to be doing their job as well, because results waiting to be purged is back under a million.
ID: 2038783
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2038789 - Posted: 18 Mar 2020, 22:40:40 UTC - in response to Message 2038780.  

we may indeed go out with a bang!

Where is the kaboom? It's supposed to be a kaboom!

New WUs are flowing without restrictions, the DB WU count is >20MM and there are no problems at all.

Did they finally find the fix for the DB size problem? Just now when the curtains are ready to close?
ID: 2038789
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.