Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (119)
Author | Message |
---|---|
doublechaz Send message Joined: 17 Nov 00 Posts: 90 Credit: 76,455,865 RAC: 735 |
None of my ghosts were from AV. But through manual methods they have all been recovered. The effort sure made a mess of my hosts screen, but at least all the work is serviced. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
None of my ghosts were from AV. But through manual methods they have all been recovered. The effort sure made a mess of my hosts screen, but at least all the work is serviced. . . Well done, it can be a little tedious but worth the doing. Stephen :) |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Finally the splitters status shows disabled in the SSP. |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
Finally the splitters status shows disabled in the SSP. I was wondering when that would happen. The next SSP milestone is probably "Results out in the field" dropping below 2,000,000, which will probably occur in a little more than 12 hours. "Ready to send" seems to be back in touch with reality after the 4,000+ myths a few hours ago. In other news, the "1337" cruncher now has well under 50K tasks in progress, currently around 49.4K, and I can see all of my tasks for both active crunchers on a single page. My SETI RAC has now dropped below 1,000 after a few weeks above it for the first time ever; the peak was around 1,300. |
doublechaz Send message Joined: 17 Nov 00 Posts: 90 Credit: 76,455,865 RAC: 735 |
Yes, I was pretty upset to have a lot of ghosts and *really* didn't want everyone to have to wait for them to time out. It's only about 150, but still... Looks like I got about 40 units today not counting my ghosts. If I can keep up that rate and I don't have a lot of noisy units I might make 76M. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
There is still some weird stuff on the SSP. There are only 4,136 WUs waiting for purging but 63,435 results. Fewer than 10,000 results can fit in those 4,136 WUs, so it looks like the database is holding over 50,000 "orphan" results with no matching WU. |
rob smith Send message Joined: 7 Mar 03 Posts: 22404 Credit: 416,307,556 RAC: 380 |
They won't be ghost tasks in the way you describe, because every single task has as part of its structure the parent work-unit id, so it is traceable to a work unit. Just a thought: as there are far more tasks than the number of work units would suggest, has the project decided on different purge delays for tasks and work units? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
If the result is in the purging queue, then the WU it belongs to should be too, as validation and assimilation happen a whole WU at a time, not with individual results. Back when things were running normally and the purging queue was kept at 24 hours, the result count in there was very close to 2.2 times the WU count, which matched the average replication. The numbers not matching may mean just that the SSP is displaying bogus data, which would not be unheard of. But it could also mean there are results without matching WUs in the database, which could mean database corruption. Both numbers have increased since I posted them, and if I calculate the difference, those do agree pretty nicely: the result count has increased by 1,528 and the WU count by 719, making the ratio 2.13. |
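[Editor's note: the delta check above can be reproduced directly from the two counter increases quoted in the post; the numbers are Ville's, the script is just an illustrative sketch.]

```python
# Reproducing the sanity check from the post above: if results and their
# parent WUs enter the purge queue together, the ratio of the two counter
# deltas should track the average replication (~2.2 in normal operation).
results_delta = 1528   # increase in "results waiting for purging"
wu_delta = 719         # increase in "WUs waiting for purging"

ratio = results_delta / wu_delta
print(round(ratio, 2))  # -> 2.13, close enough to the expected ~2.2
```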
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14667 Credit: 200,643,578 RAC: 874 |
Something or someone is playing around with the replication number again. I've received a small block of tasks all similar to WU 3898363809. Normal two-replication WU, one returned, one timed out this morning. I got the replacement for the timeout - so far, so good. But an additional replication was created at 20:06:19 UTC last night, for no very obvious reason. My copy will be returned this afternoon (it was picked up by a slow machine), and it looks like this extra wingmate is good, too. But why? |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
Something or someone is playing around with the replication number again. I've received a small block of tasks all similar to WU 3898363809. I've had a handful of these too. I've just given up on any of the things happening. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
It's probably Eric's script sending those extra replications. The same script that went buggy at the end of last month, replicating the same tasks again and again and creating these. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14667 Credit: 200,643,578 RAC: 874 |
It's probably Eric's script sending those extra replications. Yes, we know about that - he owned up to making a mistake in the heat of the moment. But now that everything's calmed down a lot and we have time to think, this doesn't look like a very clever strategy. Sending out a buckshee extra replication less than 12 hours before the normal timeout only helps if it goes to a host which will return it within those 12 hours - thus preempting the normal resend. But if it doesn't get an almost-instant turnaround - as in this case - it doubles the chance of the WU being delayed to early June by an AWOL host. Multiple extra replications only work with severely truncated deadlines. |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
Multiple extra replications only work with severely truncated deadlines. Or with cancelling as soon as one of them is returned (and the other host connects to the server). it doubles the chance of the WU being delayed to early June by an AWOL host. Well, it also doubles the chance that at least one of them will be returned successfully and the WU won't need to be resent again in June. |
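[Editor's note: both claims in this exchange are consistent, as a toy probability check shows. The per-host probability p below is an assumed illustrative value, not project data.]

```python
# With two independent wingmates, each returning in time with probability p:
# the chance that at least one succeeds goes up (Link's point), but so does
# the chance that at least one goes AWOL past the deadline (Richard's point).
p = 0.7  # assumed probability a single host returns in time (illustrative)

p_at_least_one_success = 1 - (1 - p) ** 2   # neither host fails
p_at_least_one_awol = 1 - p ** 2            # not both hosts succeed

print(round(p_at_least_one_success, 2))  # -> 0.91, up from 0.7 with one host
print(round(p_at_least_one_awol, 2))     # -> 0.51, up from 0.3 with one host
```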
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14667 Credit: 200,643,578 RAC: 874 |
Well, it also doubles the chance that at least one of them will be returned successfully and the WU won't need to be resent again in June. I think it would be better to send the extras serially, rather than in parallel. Send out one, deadline say 1 week. If that fails, send another. And so on. Look at the original timeout that caused the resend in the first place:
Owner: Anonymous
Created: 20 Feb 2020, 22:20:29 UTC
CPU type: GenuineIntel Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz [Family 6 Model 85 Stepping 4]
Number of processors: 12
Coprocessors: [4] NVIDIA Quadro RTX 5000 (4095MB) driver: 418.74 OpenCL: 1.2
Tasks: 407
Number of times client has contacted server: 5
Last contact: 22 Feb 2020
That looks like either a burn-in test, or a cluster node filling in some spare time. We want to avoid those during this clean-up. |
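[Editor's note: the serial strategy Richard proposes can be sketched as a simple loop. This is not BOINC's actual scheduler code; `send_task` and `task_succeeded` are hypothetical callbacks standing in for the server's resend and validation machinery.]

```python
from datetime import datetime, timedelta

def serial_resend(wu, send_task, task_succeeded, max_tries=3,
                  deadline=timedelta(weeks=1)):
    """Issue one replacement task at a time with a short deadline,
    only sending the next if the previous one fails or times out."""
    for attempt in range(max_tries):
        # send_task / task_succeeded are assumed hooks, not real BOINC APIs
        task = send_task(wu, deadline=datetime.utcnow() + deadline)
        if task_succeeded(task):   # blocks until result or deadline
            return task            # WU can now validate and purge
    return None                    # still unresolved after all tries
```

Compared with parallel extra replications, at most one outstanding copy can hold the WU in the database at any time, at the cost of a longer worst-case resolution.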
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
It looks like some, but not all, WUs are getting a preemptive extra task around 24 hours before the timeout. https://setiathome.berkeley.edu/results.php?hostid=8917043 - my Windows machine currently has 3 like this from yesterday. https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine, shows where this policy should work. My Windows machine is a relative tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive WUs. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14667 Credit: 200,643,578 RAC: 874 |
https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine, shows where this policy should work. My Windows machine is a relative tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive WUs. Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well. But I don't hold out a lot of hope for your new extra wingmate on WU 3893947696. |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine, shows where this policy should work. My Windows machine is a relative tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive WUs. Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well. 995 should return in less than 2 hours; it's currently running on my single Intel GPU, and wall clock time is counting down at about the correct rate :-) ETA before 11:00 UTC. |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine, shows where this policy should work. My Windows machine is a relative tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive WUs. Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well. My only recommendation at this point is to call out the people who have the wingman position for your work unit here... Pretty lame recommendation, especially if they don't frequent the thread, or they go Anonymous like a lot of these tend to be. |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
I mean, you give me the last two digits of the machine, and give me a task name, and I'm all over crunching a task for anybody here. Outside of that....well... |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 , also one of mine, shows where this policy should work. My Windows machine is a relative tortoise compared to some of the "big boys", but it usually produces the decider on many Validation Inconclusive WUs. Yes, that's the way it should go - and it happens to be a short-deadline VHAR, as well. https://setiathome.berkeley.edu/workunit.php?wuid=3947300995 Completed and Validated. My _3 was the deciding tie-breaker. Results will probably be purged in under 2 hours. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.