The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)

TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2034066 - Posted: 26 Feb 2020, 18:13:04 UTC

I've already cut down to just 3 machines running SETI, and now one of those can't get enough work to keep busy. Since the outage yesterday morning it has only had work for about 7 hours. It's currently out of work again: https://setiathome.berkeley.edu/results.php?hostid=6813106
Perhaps I should cut back to just 2 machines?
ID: 2034066
Lazydude
Volunteer tester
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2034069 - Posted: 26 Feb 2020, 18:43:28 UTC

I use RTT as an early warning:
32h and above is, in my eyes, OK
below 31h: WARNING
under 30h: "Houston, we have a small problem"

As of the time I wrote this:
Result turnaround time (last hour average): 29.60 hours
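As a rough sketch, the rule of thumb above maps to thresholds like the following. This is my reading of the poster's heuristic, not anything official, and it interpolates the unstated band between 31h and 32h as part of the warning zone:

```python
def rtt_status(rtt_hours):
    """Classify the server's average result turnaround time (RTT).

    Thresholds follow the rule of thumb quoted above (a user heuristic,
    not an official metric): 32 h and above is OK, under 30 h is trouble,
    and the band in between is treated as a warning.
    """
    if rtt_hours >= 32:
        return "OK"
    if rtt_hours >= 30:
        return "WARNING"
    return "Houston, we have a small problem"

print(rtt_status(29.60))  # the value quoted above -> "Houston, we have a small problem"
```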
ID: 2034069
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034070 - Posted: 26 Feb 2020, 19:11:39 UTC - in response to Message 2034069.  
Last modified: 26 Feb 2020, 19:18:58 UTC

I use RTT as an early warning:
32h and above is, in my eyes, OK
below 31h: WARNING
under 30h: "Houston, we have a small problem"

As of the time I wrote this:
Result turnaround time (last hour average): 29.60 hours

By your scale, now at 31.34 hours (just 30 min after your post), we are in the WARNING stage.

<edit> Reached 32.25 hours at 19:10:04 UTC, a few minutes later.

At this rate of increase: are we doomed?
ID: 2034070
Lazydude
Volunteer tester
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2034071 - Posted: 26 Feb 2020, 19:28:32 UTC

At this rate of increase: are we doomed?

No - it's a warning when the trend is towards shorter times.
Now it's in recovery mode, since the trend is upwards.

I may add that when it goes over roughly 36h, then we have had an outage.
ID: 2034071
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2034108 - Posted: 27 Feb 2020, 0:11:39 UTC

Now the 2nd out of 3 machines has run out of work: https://setiathome.berkeley.edu/results.php?hostid=6796479
That leaves 1 machine still working. I suppose when that one runs out of work I'll just shut everything down and brag about how much money I'm saving on electricity.
ID: 2034108
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034116 - Posted: 27 Feb 2020, 0:45:07 UTC - in response to Message 2034108.  

Now the 2nd out of 3 machines has run out of work: https://setiathome.berkeley.edu/results.php?hostid=6796479
That leaves 1 machine still working. I suppose when that one runs out of work I'll just shut everything down and brag about how much money I'm saving on electricity.

I can only assume it is a problem at your end. I've had very few problems since 08:00 UTC on the 26th.
ID: 2034116
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2034123 - Posted: 27 Feb 2020, 1:02:31 UTC - in response to Message 2034116.  

That's what you get for assuming. Since making that post, the machine that had run out of work now has 400 tasks instead of zero. I had absolutely nothing to do with it.
ID: 2034123
Boiler Paul
Joined: 4 May 00
Posts: 232
Credit: 4,965,771
RAC: 64
United States
Message 2034125 - Posted: 27 Feb 2020, 1:12:09 UTC

Work can be hard to come by. All I've gotten over the past few hours is "Project has no tasks available" in the log. Just need to be patient.
ID: 2034125
Boiler Paul
Joined: 4 May 00
Posts: 232
Credit: 4,965,771
RAC: 64
United States
Message 2034126 - Posted: 27 Feb 2020, 1:17:51 UTC - in response to Message 2034125.  

and, of course, after I post, I receive work!
ID: 2034126
Jimbocous
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2034131 - Posted: 27 Feb 2020, 1:41:16 UTC

I'm still convinced that somehow, whether it be intent or just net result, the higher your RAC is the lower you are in the priority stack in terms of actually getting work during a recovery from outage. This is entirely too consistent to be the luck of the draw.
ID: 2034131
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 2034135 - Posted: 27 Feb 2020, 2:20:40 UTC - in response to Message 2034131.  

I'm still convinced that somehow, whether it be intent or just net result, the higher your RAC is the lower you are in the priority stack in terms of actually getting work during a recovery from outage. This is entirely too consistent to be the luck of the draw.


+1
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 2034135
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034141 - Posted: 27 Feb 2020, 2:54:08 UTC

Could it be that the assimilation process is the problem?

How difficult is it to translate the data we produce, plus all the other necessary details, and put it into the science database?

This is what the Server Status page says:
sah_assimilator/ap_assimilator : Takes scientific data from validated results and puts them in the SETI@home (or Astropulse) database for later analysis.
ID: 2034141
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2034144 - Posted: 27 Feb 2020, 3:09:55 UTC - in response to Message 2034131.  

I'm still convinced that somehow, whether it be intent or just net result, the higher your RAC is the lower you are in the priority stack in terms of actually getting work during a recovery from outage. This is entirely too consistent to be the luck of the draw.
It is an illusion. Everyone has the same priority, but the higher your RAC, the more successful scheduler requests you need to keep your cache from depleting.

If every 12th request wins the lottery and gets some work, then you get work roughly once an hour. That may be all a slow host needs to refill its cache to the brim, but it is nowhere near the one-hour production of a fast host.
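That lottery effect can be sketched in a few lines. All the numbers here are illustrative assumptions (a request every 5 minutes, a 1-in-12 success rate, 20 tasks per successful request, hypothetical consumption rates), not SETI@home's actual limits:

```python
import random

def simulate(tasks_per_hour, cache_size, hours=24, p_success=1 / 12, tasks_per_grant=20):
    """Track a host's cache when scheduler requests succeed at random.

    Assumed model: the client asks for work every 5 minutes (12 requests
    per hour); each request independently succeeds with probability
    p_success and grants tasks_per_grant tasks; the host consumes
    tasks_per_hour tasks per hour. Returns hours spent with an empty cache.
    """
    random.seed(42)  # fixed seed for a reproducible run
    cache = float(cache_size)
    dry_hours = 0.0
    for _ in range(hours * 12):          # one scheduler request per 5 minutes
        if random.random() < p_success:  # this request "wins the lottery"
            cache = min(cache_size, cache + tasks_per_grant)
        cache = max(0.0, cache - tasks_per_hour / 12)  # work done in 5 minutes
        if cache == 0.0:
            dry_hours += 1 / 12
    return dry_hours

# A slow host refills to the brim; a fast host spends most of the day dry.
slow_dry = simulate(tasks_per_hour=5, cache_size=150)
fast_dry = simulate(tasks_per_hour=100, cache_size=150)
print(f"slow host dry for {slow_dry:.1f} h, fast host dry for {fast_dry:.1f} h")
```

With the same request stream, the slow host never empties its cache while the fast host starves, which is exactly the "illusion" described: equal priority, unequal appetite.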
ID: 2034144
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2034146 - Posted: 27 Feb 2020, 3:23:54 UTC - in response to Message 2034141.  

Could it be that the assimilation process is the problem?
There has clearly been some problem with assimilation for the last several weeks, but it could lie in many different places. It could be the throughput of the BOINC database somehow hitting the assimilator harder than the other processes. Or it could be a problem in the assimilator program itself. Or it could be the throughput of the science databases, or of the upload filesystem where the result files the assimilator needs to read are stored.
ID: 2034146
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19118
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034167 - Posted: 27 Feb 2020, 8:17:43 UTC - in response to Message 2034131.  
Last modified: 27 Feb 2020, 8:25:22 UTC

I'm still convinced that somehow, whether it be intent or just net result, the higher your RAC is the lower you are in the priority stack in terms of actually getting work during a recovery from outage. This is entirely too consistent to be the luck of the draw.

Maybe related, but I think it has more to do with how much work the host requests.
When I get up on Wednesday mornings (UTC times rule in the UK winter), if the computer hasn't started receiving work, I set the cache to a very low level. I find that usually works after a few attempts, and as I receive work I increase the cache in steps up to 0.6 days, which, unless the servers give me oodles of AP, fills the GPU cache to 150 tasks.

Also, here are some numbers on tasks downloaded and validated in the 24 hours since 08:06:31 on 26th Feb, ~24 hours ago.
After 12 hours, at ~20:00 26th:
Downloaded = 345; In Progress = 150; Valid = 86
Processed = 345 - 150 = 195
Percentage of tasks downloaded and validated in 12 hours = 100 * 86 / 195 = 44.1%

After 24 hours, at ~08:00 27th:
Downloaded = 523; In Progress = 150; Valid = 253
Processed = 523 - 150 = 373
Percentage of tasks downloaded and validated in 24 hours = 100 * 253 / 373 = 67.8%
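The arithmetic above can be reproduced in a few lines, using the figures quoted in this post:

```python
def validated_percentage(downloaded, in_progress, valid):
    """Share of processed tasks that have validated.

    Processed = tasks no longer in the cache (downloaded minus in progress);
    the percentage compares validated results against that processed count.
    """
    processed = downloaded - in_progress
    return processed, 100 * valid / processed

# The two snapshots quoted above, ~12 h and ~24 h after the outage.
p12, pct12 = validated_percentage(downloaded=345, in_progress=150, valid=86)
p24, pct24 = validated_percentage(downloaded=523, in_progress=150, valid=253)
print(f"12 h: processed {p12}, validated {pct12:.1f}%")  # 195, 44.1%
print(f"24 h: processed {p24}, validated {pct24:.1f}%")  # 373, 67.8%
```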

I only crunch on the GPU, so it is fairly simple to scroll through the pages, count each page, and add up the totals.

[edit] Prior to 08:06 yesterday the SETI cache was empty.
ID: 2034167
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034169 - Posted: 27 Feb 2020, 8:32:03 UTC - in response to Message 2034167.  

I'm still convinced that somehow, whether it be intent or just net result, the higher your RAC is the lower you are in the priority stack in terms of actually getting work during a recovery from outage. This is entirely too consistent to be the luck of the draw.
Maybe related, but I think it has more to do with how much work the host requests.
When I get up on Wednesday mornings (UTC times rule in the UK winter), if the computer hasn't started receiving work, I set the cache to a very low level. I find that usually works after a few attempts, and as I receive work I increase the cache in steps up to 0.6 days, which, unless the servers give me oodles of AP, fills the GPU cache to 150 tasks.
That's my experience, too. I now have two machines in the 'high RAC' category (top 100): they were both completely dry yesterday morning. I did a little Einstein backup work while the servers were sorting themselves out, but once work started flowing, I ramped them up gently by requesting an hour of work at a time (0.05 days) and increasing the cache a step at a time as they filled up. Reached full cache by evening, with just a little tweak any time I happened to be passing.
ID: 2034169
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2034224 - Posted: 27 Feb 2020, 18:48:43 UTC - in response to Message 2034169.  

Validation Pending is still steadily growing; it looks like around 23 million objects waiting to be satisfied. Still getting work though, despite the RTS showing a pretty steady 0. I even fell asleep in the wrong configuration the night before last, which dropped my Pending column below its normal average, but I'm well over that again. This poor system needs a break.
ID: 2034224
Jimbocous
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2034292 - Posted: 27 Feb 2020, 22:28:05 UTC - in response to Message 2034169.  
Last modified: 27 Feb 2020, 22:35:36 UTC

I'm still convinced that somehow, whether it be intent or just net result, the higher your RAC is the lower you are in the priority stack in terms of actually getting work during a recovery from outage. This is entirely too consistent to be the luck of the draw.
Maybe related, but I think it has more to do with how much work the host requests.
When I get up on Wednesday mornings (UTC times rule in the UK winter), if the computer hasn't started receiving work, I set the cache to a very low level. I find that usually works after a few attempts, and as I receive work I increase the cache in steps up to 0.6 days, which, unless the servers give me oodles of AP, fills the GPU cache to 150 tasks.
That's my experience, too. I now have two machines in the 'high RAC' category (top 100): they were both completely dry yesterday morning. I did a little Einstein backup work while the servers were sorting themselves out, but once work started flowing, I ramped them up gently by requesting an hour of work at a time (0.05 days) and increasing the cache a step at a time as they filled up. Reached full cache by evening, with just a little tweak any time I happened to be passing.

Sounds like a reality, not "an illusion". Main cruncher cache here is around 25% to ~50% [my error]. No heartburn; work is getting assigned and completed, but it seems clear that there's more than "first come, first served" going on.
ID: 2034292
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22241
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034295 - Posted: 27 Feb 2020, 22:33:04 UTC

I can't help wondering if the splitters are being deliberately throttled in an attempt to reduce the amount of work sitting around in the various queues. After all, work not being split will have that effect.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034295
Wiggo
Joined: 24 Jan 00
Posts: 35004
Credit: 261,360,520
RAC: 489
Australia
Message 2034303 - Posted: 27 Feb 2020, 23:04:48 UTC - in response to Message 2034295.  
Last modified: 27 Feb 2020, 23:05:59 UTC

I can't help wondering if the splitters are being deliberately throttled in an attempt to reduce the amount of work sitting around in the various queues. After all, work not being split will have that effect.
Yes they are. Eric stated that this is being done to try to keep the system within its RAM limits; I just can't remember where that post was made, or whether Eric actually made it or it was passed along, ATM.

Cheers.
ID: 2034303


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.