The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)

Previous · 1 . . . 74 · 75 · 76 · 77 · 78 · 79 · 80 . . . 107 · Next

Profile doublechaz

Joined: 17 Nov 00
Posts: 90
Credit: 76,455,865
RAC: 735
United States
Message 2044769 - Posted: 14 Apr 2020, 19:27:23 UTC

None of my ghosts were from AV. But through manual methods they have all been recovered. The effort sure made a mess of my hosts screen, but at least all the work is serviced.
ID: 2044769
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2044800 - Posted: 14 Apr 2020, 22:10:13 UTC - in response to Message 2044769.  
Last modified: 14 Apr 2020, 22:10:59 UTC

None of my ghosts were from AV. But through manual methods they have all been recovered. The effort sure made a mess of my hosts screen, but at least all the work is serviced.

. . Well done, it can be a little tedious but worth the doing.

Stephen

:)
ID: 2044800
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2044802 - Posted: 14 Apr 2020, 22:24:04 UTC
Last modified: 14 Apr 2020, 22:25:24 UTC

Finally the splitters' status shows as disabled on the SSP.
ID: 2044802
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2044807 - Posted: 14 Apr 2020, 22:59:25 UTC - in response to Message 2044802.  

Finally the splitters status shows disabled in the SSP.


I was wondering when that would happen.

The next SSP milestone is probably "Results out in the field" dropping below 2,000,000, which should occur in a little more than 12 hours.

"Ready to send" seems to be back in touch with reality after the 4,000+ myths a few hours ago.

In other news, the "1337" cruncher now has under 50K tasks in progress, currently around 49.4K, and I can see all of my tasks for both active crunchers on a single page.

My SETI RAC has now dropped back below 1,000 after a few weeks above it for the first time ever; the peak was around 1,300.
ID: 2044807
Profile doublechaz

Joined: 17 Nov 00
Posts: 90
Credit: 76,455,865
RAC: 735
United States
Message 2044811 - Posted: 14 Apr 2020, 23:08:28 UTC

Yes, I was pretty upset to have a lot of ghosts and *really* didn't want everyone to have to wait for them to time out. It's only about 150, but still...

Looks like I got about 40 units today, not counting my ghosts. If I can keep up that rate and don't have a lot of noisy units, I might make 76M.
ID: 2044811
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2044880 - Posted: 15 Apr 2020, 5:26:46 UTC

There is still some weird stuff on SSP.

There are only 4,136 WUs waiting for purging but 63,435 results. Fewer than 10,000 results can fit in those 4,136 WUs, so it looks like the database is holding over 50,000 "orphan" results with no matching WU.
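
For what it's worth, here's a back-of-the-envelope version of that estimate. It's only a sketch: the ~2.2 results-per-WU average replication is the figure quoted a few posts further down, not something the SSP reports directly.

# Rough estimate of "orphan" results from the SSP counts quoted above.
wus_waiting_purge = 4_136
results_waiting_purge = 63_435
avg_replication = 2.2  # historical results-per-WU ratio (see a later post)

expected_results = wus_waiting_purge * avg_replication  # roughly 9,100
orphans = results_waiting_purge - expected_results      # roughly 54,300

print(f"expected results for {wus_waiting_purge:,} WUs: ~{expected_results:,.0f}")
print(f"apparent orphan results: ~{orphans:,.0f}")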
ID: 2044880
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2044886 - Posted: 15 Apr 2020, 7:25:42 UTC

They won't be ghost tasks in the way you describe, because every single task carries the id of its parent work unit as part of its structure, so it is traceable to a work unit.
Just a thought: since there are far more tasks than the number of work units would suggest, has the project decided on different delays for tasks and work units?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2044886
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2044889 - Posted: 15 Apr 2020, 7:52:13 UTC
Last modified: 15 Apr 2020, 7:58:28 UTC

If the result is in the purging queue, then the WU it belongs to should be too, as validation and assimilation happen a whole WU at a time, not for individual results. Back when things were running normally and the purging queue was kept at 24 hours, the result count in there was very close to 2.2 times the WU count, which matched the average replication.

The numbers not matching may just mean that the SSP is displaying bogus data, which would not be unheard of. But it could also mean there are results without matching WUs in the database, which could mean database corruption.

Both numbers have increased since I posted them, and if I calculate the differences, those agree pretty nicely: the result count has increased by 1,528 and the WU count by 719, making the ratio 2.13.
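
If it really is corruption rather than an SSP display bug, the check itself is conceptually simple, since (as rob notes above) every result carries the id of its parent WU. A toy sketch of that lookup, with illustrative field names rather than the project's actual schema:

# Hypothetical orphan check: a result is an orphan if the WU id it
# carries no longer matches any work unit. Field names are illustrative.
def find_orphans(results, workunits):
    wu_ids = {wu["id"] for wu in workunits}
    return [r for r in results if r["wu_id"] not in wu_ids]

workunits = [{"id": 1}]
results = [
    {"id": 10, "wu_id": 1},
    {"id": 11, "wu_id": 1},
    {"id": 12, "wu_id": 2},  # orphan: WU 2 is gone
]
print(find_orphans(results, workunits))  # -> [{'id': 12, 'wu_id': 2}]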
ID: 2044889
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044890 - Posted: 15 Apr 2020, 7:54:27 UTC

Something or someone is playing around with the replication number again. I've received a small block of tasks all similar to WU 3898363809.

Normal two-replication WU, one returned, one timed out this morning. I got the replacement for the timeout - so far, so good.

But an additional replication was created at 20:06:19 UTC last night, for no very obvious reason.

My copy will be returned this afternoon (it was picked up by a slow machine), and it looks like this extra wingmate is good, too. But why?
ID: 2044890
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2044891 - Posted: 15 Apr 2020, 7:57:57 UTC - in response to Message 2044890.  

Something or someone is playing around with the replication number again. I've received a small block of tasks all similar to WU 3898363809.

Normal two-replication WU, one returned, one timed out this morning. I got the replacement for the timeout - so far, so good.

But an additional replication was created at 20:06:19 UTC last night, for no very obvious reason.

My copy will be returned this afternoon (it was picked up by a slow machine), and it looks like this extra wingmate is good, too. But why?

I've had a handful of these too. I've just given up on any of the things happening.
ID: 2044891
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2044892 - Posted: 15 Apr 2020, 8:02:33 UTC

It's probably Eric's script sending those extra replications.

The same script that bugged out at the end of last month, replicating the same tasks again and again and creating these.
ID: 2044892
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044896 - Posted: 15 Apr 2020, 8:22:35 UTC - in response to Message 2044892.  

It's probably Eric's script sending those extra replications.
Yes, we know about that - he owned up to making a mistake in the heat of the moment. But now that everything's calmed down a lot and we have time to think, this doesn't look like a very clever strategy.

Sending out a buckshee extra replication less than 12 hours before the normal timeout only helps if it goes to a host which will return it within those 12 hours - thus preempting the normal resend. But if it doesn't get an almost-instant turnaround - as in this case - it doubles the chance of the WU being delayed to early June by an AWOL host.

Multiple extra replications only work with severely truncated deadlines.
ID: 2044896
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 2044898 - Posted: 15 Apr 2020, 8:29:35 UTC - in response to Message 2044896.  

Multiple extra replications only work with severely truncated deadlines.

Or with canceling as soon as one of them is returned (and the other host connects to the server).

it doubles the chance of the WU being delayed to early June by an AWOL host.

Well, it also doubles the chance that at least one of them will be returned successfully and the WU won't need to be resent again in June.
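
Both effects can be put into rough numbers with a simple toy model. This is only a sketch: assume each outstanding copy independently misses its deadline with probability p, which is not a measured figure.

# Toy comparison of one vs two outstanding copies of a resend,
# where each host independently misses the deadline with probability p.
def p_no_return(p, copies):
    # chance that no copy comes back at all (WU must be resent yet again)
    return p ** copies

def p_some_copy_awol(p, copies):
    # chance that at least one copy is still out at its deadline,
    # so the WU hangs around in the database until then
    return 1 - (1 - p) ** copies

for p in (0.1, 0.3):
    print(f"p={p}: no return {p_no_return(p, 1):.2f} -> {p_no_return(p, 2):.2f}, "
          f"copy still AWOL {p_some_copy_awol(p, 1):.2f} -> {p_some_copy_awol(p, 2):.2f}")

# With two copies the chance of getting nothing back drops from p to p^2,
# while the chance that something is still outstanding rises to roughly 2p.

Which of those two numbers matters more depends on whether the concern is getting the WU validated or getting it out of the database.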
ID: 2044898
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044899 - Posted: 15 Apr 2020, 8:37:03 UTC - in response to Message 2044898.  

Well, it doubles also the chance, that at least one of them will be returned successfully and the WU won't need to be resent again in June.
I think it would be better to send the extras serially, rather than in parallel. Send out one, deadline say 1 week. If that fails, send another. And so on.

Look at the original timeout that caused the resend in the first place.

Owner			Anonymous
Created			20 Feb 2020, 22:20:29 UTC
CPU type		GenuineIntel Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz [Family 6 Model 85 Stepping 4]
Number of processors	12
Coprocessors		[4] NVIDIA Quadro RTX 5000 (4095MB) driver: 418.74 OpenCL: 1.2
Tasks			407
Number of times client has contacted server	5
Last contact		22 Feb 2020
That looks like either a burn-in test, or a cluster node filling in some spare time. We want to avoid those during this clean-up.
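
Going back to the serial idea, a minimal toy sketch of that policy. The one-week resend deadline and the 50% per-attempt return probability are made-up numbers for illustration, not anything the scheduler actually uses.

import random

# Toy model of "send the extras serially": issue one short-deadline copy
# at a time and only send another if the previous one times out.
def serial_resend(p_return=0.5, deadline_days=7, max_tries=5):
    elapsed = 0
    for attempt in range(1, max_tries + 1):
        elapsed += deadline_days
        if random.random() < p_return:
            return attempt, elapsed   # someone returned it within this window
    return None, elapsed              # still outstanding after max_tries

attempt, days = serial_resend()
if attempt:
    print(f"resolved on attempt {attempt}, after at most {days} days")
else:
    print(f"still unresolved after {days} days")

The appeal of the serial version is that a dud host only costs one short deadline instead of tying the WU up until June.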
ID: 2044899
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2044902 - Posted: 15 Apr 2020, 8:48:13 UTC - in response to Message 2044896.  
Last modified: 15 Apr 2020, 8:57:36 UTC

It looks like some, but not all, WUs are getting a preemptive extra task around 24 hours before the timeout.

My Windows machine (https://setiathome.berkeley.edu/results.php?hostid=8917043) currently has 3 like this from yesterday.

https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine, shows where this policy should work. My Windows machine is a relative tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive WUs.
ID: 2044902
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2044912 - Posted: 15 Apr 2020, 9:09:53 UTC - in response to Message 2044902.  
Last modified: 15 Apr 2020, 9:10:24 UTC

https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine shows where this policy should work. My Windows machine is relatively a tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive Wus
Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well.

But I don't hold out a lot of hope for your new extra wingmate on WU 3893947696.
ID: 2044912
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2044918 - Posted: 15 Apr 2020, 9:26:02 UTC - in response to Message 2044912.  
Last modified: 15 Apr 2020, 9:28:01 UTC

https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine shows where this policy should work. My Windows machine is relatively a tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive Wus
Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well.

But I don't hold out a lot of hope for your new extra wingmate on WU 3893947696.


995 should return in less than 2 hours, it's currently running on my single Intel GPU, wall clock time is counting down at about the correct rate :-)
ETA before 11:00 UTC
ID: 2044918
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2044921 - Posted: 15 Apr 2020, 9:29:53 UTC - in response to Message 2044912.  

https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine shows where this policy should work. My Windows machine is relatively a tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive Wus
Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well.

But I don't hold out a lot of hope for your new extra wingmate on WU 3893947696.

My only recommendation at this point is to call out the people who have the wingman position for your work unit here... Pretty lame recommendation, especially if they don't frequent the thread, or they go Anonymous like a lot of these tend to be.
ID: 2044921
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2044923 - Posted: 15 Apr 2020, 9:34:11 UTC - in response to Message 2044921.  


My only recommendation at this point is to call out the people who have the wingman position for your work unit here... Pretty lame recommendation, especially if they don't frequent the thread, or they go Anonymous like a lot of these tend to be.

I mean, you give me the last two digits of the machine, and give me a task name, and I'm all over crunching a task for anybody here. Outside of that....well...
ID: 2044923
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2044931 - Posted: 15 Apr 2020, 11:30:16 UTC - in response to Message 2044918.  

https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine shows where this policy should work. My Windows machine is relatively a tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive Wus
Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well.

But I don't hold out a lot of hope for your new extra wingmate on WU 3893947696.


995 should return in less than 2 hours, it's currently running on my single Intel GPU, wall clock time is counting down at about the correct rate :-)
ETA before 11:00 UTC


https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 Completed and Validated
My _3 was the deciding tie-breaker

Results will probably be purged in under 2 hours
ID: 2044931