The Server Issues / Outages Thread - Panic Mode On! (118)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031383 - Posted: 8 Feb 2020, 10:53:40 UTC - in response to Message 2031378.  

Think about a computer with a 100k RAC suddenly starting to throw invalids, set the trigger at 1% - it needs to throw 1000 before it is trapped.
Now you're the one muddling the units!

A RAC of 100K - measured in credits, that's what the 'C' stands for - is probably a count of ~1K. So a 1% trigger is only ~10 tasks.

Or so a pedant might say. The underlying principle of keeping the maths as simple as possible for a server which has to process 150,000 tasks per hour is still valid.
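As a back-of-envelope check of the units, here is a minimal sketch. It assumes roughly 100 credits per task, a figure inferred only from the "100K credits is about 1K tasks" estimate above; the function name is hypothetical and not part of any BOINC code.

```python
# Assumption: ~100 credits per task, implied by "100K credits ~ 1K tasks" above.
CREDITS_PER_TASK = 100


def tasks_to_trigger(rac_credits, trigger_fraction, credits_per_task=CREDITS_PER_TASK):
    """Convert a credit figure to a task count, then apply the trigger fraction."""
    task_count = rac_credits / credits_per_task
    return task_count * trigger_fraction


# Units kept straight: credits -> tasks -> 1% of tasks (about 10).
print(tasks_to_trigger(100_000, 0.01))

# Units muddled: applying 1% directly to credits gives the 1000 figure.
muddled = 100_000 * 0.01
print(muddled)
```

The two printed values differ by the credits-per-task factor, which is the whole of the disagreement.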
ID: 2031383
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031389 - Posted: 8 Feb 2020, 11:37:12 UTC - in response to Message 2031378.  

Why then did YOU mention RAC? - It is YOU who proposed a system based on the RAC of a host.
Where did I mention that? The first time 'RAC' appears in any of my posts in this discussion is when I was replying to YOU bringing that up.

If the host produced only invalids, then its invalid score would rise very fast!
Yes, its invalid COUNT would rise, use that as the control, not some derived variable - far simpler, and far more able to catch the event early.
And the count would stay high forever after the host stops producing invalids. That's why a decay mechanism is needed.

Think about a computer with a 100k RAC suddenly starting to throw invalids, set the trigger at 1% - it needs to throw 1000 before it is trapped - now consider a computer with a RAC of 1M (and they do exist) - that figure now becomes 10k - both of which are far more than "just a few"
RAC would have absolutely nothing to do with it. A higher-RAC host would trigger it sooner because it chews through tasks faster, but any host would trigger it after exactly the same number of tasks.

If we assume a decay multiplier of 0.999, a limit of 1%, a host producing only invalids, and the value starting from zero, then the value would evolve like this:
0.001, 0.001999, 0.002997, 0.003994, 0.004990, 0.005985, 0.006979, 0.007972, 0.008964, 0.009955, 0.010945. So the 11th consecutive invalid result would trigger the trap.
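The sequence above can be reproduced with a short sketch. The update rule used here (score = score * decay + (1 - decay) for an invalid result, score * decay for a valid one) is an assumption inferred from the listed values, and the function names are hypothetical; this is not actual server code.

```python
def update_score(score, is_invalid, decay=0.999):
    """One step of an exponentially decaying invalid score (assumed update rule)."""
    return score * decay + (1.0 - decay) * (1.0 if is_invalid else 0.0)


def tasks_until_trapped(results, limit=0.01, decay=0.999):
    """Return (1-based index of the result that pushes the score over the limit, trace).

    The index is None if the limit is never exceeded.
    """
    score = 0.0
    trace = []
    for n, is_invalid in enumerate(results, start=1):
        score = update_score(score, is_invalid, decay)
        trace.append(score)
        if score > limit:
            return n, trace
    return None, trace


# A host returning nothing but invalid results, starting from a score of zero.
trapped_at, trace = tasks_until_trapped([True] * 20)
print(trapped_at)
print([round(v, 6) for v in trace])

# Once the host recovers, each valid result shrinks the score by the decay factor.
recovered = update_score(trace[-1], is_invalid=False)
print(recovered < trace[-1])
```

Because the score depends only on the sequence of results, the number of tasks needed to trip the limit is the same for every host; a faster host simply reaches it sooner in wall-clock time.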
ID: 2031389
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19406
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2031399 - Posted: 8 Feb 2020, 12:52:15 UTC - in response to Message 2031389.  

But a host will probably be producing good results on the CPU and, if fitted, on other GPUs.
ID: 2031399
bluestar

Joined: 5 Sep 12
Posts: 7264
Credit: 2,084,789
RAC: 3
Message 2031400 - Posted: 8 Feb 2020, 12:56:26 UTC - in response to Message 2031381.  
Last modified: 8 Feb 2020, 12:59:09 UTC

Sorry, but noise is simply something that comes our way when a radio telescope is pointed at the sky to detect radio signals that could be intelligent in nature.

Hence the blc35_2bit_guppi tasks among these hold only a bit of noise, apart from a possible transmission that could also be intentional.

Again, sorry that the discussion here is only about RAC; credit belongs here too, as does the scheduled maintenance that happens at times.
ID: 2031400
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031404 - Posted: 8 Feb 2020, 13:02:55 UTC - in response to Message 2031399.  

But a host will probably be producing good results on the CPU and, if fitted, on other GPUs.
Even those buggy GPUs/drivers only fail on some tasks. Not all of them.
ID: 2031404
rob smith Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22536
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2031413 - Posted: 8 Feb 2020, 14:42:14 UTC - in response to Message 2031383.  

Thanks Richard - I had a look at that post and thought "there's something wrong there, but what...."
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2031413
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2031424 - Posted: 8 Feb 2020, 15:33:00 UTC - in response to Message 2031413.  

Thanks Richard - I had a look at that post and thought "there's something wrong there, but what...."
I was trained - a long, long, time ago - at the Cavendish lab, Cambridge university. Two pieces of teaching stuck:

* No number has a meaning unless the units are stated.
* Do every calculation twice. Once, using the most accurate mechanical/electronic device available - and then again, on the back of an envelope or a chalk board. The second only has to be done using order-of-magnitude approximations, but it proves that the decimal point hasn't slipped in the first one.
ID: 2031424
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031555 - Posted: 9 Feb 2020, 10:41:28 UTC - in response to Message 2031554.  

Sunday, and the replica is falling behind again.
As of now, 3,176 seconds (53 minutes) behind, and that number is getting bigger fast.
(no wonder it looks as if I had not returned anything for a while, when looking at my task list)
This has happened on many weekends, at approximately the same time as the total outages we experienced on many consecutive Sundays last September and October. I wonder if the causes are related.
ID: 2031555
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2031755 - Posted: 10 Feb 2020, 6:09:50 UTC

The Replica has had its break & is now catching up again.
Grant
Darwin NT
ID: 2031755
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2031766 - Posted: 10 Feb 2020, 9:03:04 UTC

A few noisy WUs in the current Arecibo group, and more than the usual number of uploads timing out instantly.
Grant
Darwin NT
ID: 2031766
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031817 - Posted: 10 Feb 2020, 16:21:19 UTC

The assimilation queue is still going down at a steady rate. If the same rate continues, the backlog will be gone in about 8 days.
ID: 2031817
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2031876 - Posted: 10 Feb 2020, 21:07:00 UTC
Last modified: 10 Feb 2020, 21:14:11 UTC

Did anyone get anything other than noise bombs for ap_29ja16ad? Must have been something going on that day.

Edit: only thing I see on any cosmic calendars is the moon being at its apogee.
ID: 2031876
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2031888 - Posted: 10 Feb 2020, 22:04:51 UTC - in response to Message 2031876.  

Did anyone get anything other than noise bombs for ap_29ja16ad? Must have been something going on that day.

Edit: only thing I see on any cosmic calendars is the moon being at its apogee.

Haven't done any of them yet. Still in progress.

It could be caused by any number of things. The radar could have been on. Terrestrial interference. Any number of transmitting satellites could have passed by in the aperture capture window. etc. etc.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2031888
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2031896 - Posted: 10 Feb 2020, 22:35:51 UTC - in response to Message 2031888.  

It could be caused by any number of things. The radar could have been on. Terrestrial interference. Any number of transmitting satellites could have passed by in the aperture capture window. etc. etc.


Thanks, just unusual to get an sse, sse2, and 4 OpenCL_ati_mac WUs, just to have them all bomb out.
ID: 2031896
Wiggo
Joined: 24 Jan 00
Posts: 36850
Credit: 261,360,520
RAC: 489
Australia
Message 2031899 - Posted: 10 Feb 2020, 22:44:41 UTC - in response to Message 2031896.  

It could be caused by any number of things. The radar could have been on. Terrestrial interference. Any number of transmitting satellites could have passed by in the aperture capture window. etc. etc.
Thanks, just unusual to get an sse, sse2, and 4 OpenCL_ati_mac WUs, just to have them all bomb out.
It's nothing uncommon when the tasks are 100% blanked.

Cheers.
ID: 2031899
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031938 - Posted: 11 Feb 2020, 4:18:32 UTC - in response to Message 2031888.  

It could be caused by any number of things. The radar could have been on. Terrestrial interference. Any number of transmitting satellites could have passed by in the aperture capture window. etc. etc.
This will become more and more common over time as Elon Musk and his competitors spam low Earth orbit with thousands of internet satellites.
ID: 2031938
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19406
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2031951 - Posted: 11 Feb 2020, 7:48:15 UTC

Milestone As of 11 Feb 2020, 7:30:05 UTC

"Results returned and awaiting validation" is below 10,000,000.
exact figure 9,990,238.
ID: 2031951
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2031952 - Posted: 11 Feb 2020, 7:53:25 UTC - in response to Message 2031951.  
Last modified: 11 Feb 2020, 7:56:11 UTC

Milestone As of 11 Feb 2020, 7:30:05 UTC

"Results returned and awaiting validation" is below 10,000,000.
exact figure 9,990,238.
Only another 5.2 million to go (and another 2.41 million to go to clear the Assimilation backlog).
Grant
Darwin NT
ID: 2031952
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2031970 - Posted: 11 Feb 2020, 13:26:03 UTC - in response to Message 2031952.  

Only another 5.2 million to go (and another 2.41 million to go to clear the Assimilation backlog).
Those are essentially the same thing. Each workunit in the assimilation queue is preventing, on average, about 2.2 results from transitioning to the 'waiting for db purging' state: 5.2 / 2.41 = 2.16.
ID: 2031970
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2031971 - Posted: 11 Feb 2020, 22:00:49 UTC

and we are back....
ID: 2031971


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.