Message boards : News : Low available work
KWSN - Sir Nutsalot · Joined: 4 Jun 99 · Posts: 5 · Credit: 22,114,565 · RAC: 47
I have machines still getting work and others now getting nothing at all. For example:
AMD 3600 with AMD RX 580: empty, no units coming in for CPU or GPU.
AMD 2600 with Nvidia 1660 Ti: empty, no units coming in for CPU or GPU.
Intel 2600K with AMD RX 580: still occasionally getting units and crunching plenty.
Intel 6700K with AMD RX 590: still occasionally getting units and crunching, but running out.
Intel 8750H with Nvidia 1060: still occasionally getting units and crunching, but nearly out.
Clearly there are still issues, and for many months I never had a problem.
Kissagogo27 · Joined: 6 Nov 99 · Posts: 716 · Credit: 8,032,827 · RAC: 62
Have all the completed WUs been reported?
grote · Joined: 7 May 01 · Posts: 2 · Credit: 2,056,727 · RAC: 0
Is this issue still present? I noticed most of the servers show as running, but I still have zero tasks; I keep getting Communication Deferred. Still waiting for the tasks to begin!!
Bernie Vine · Joined: 26 May 99 · Posts: 9958 · Credit: 103,452,613 · RAC: 328
"Is this issue still present? I noticed most of the servers show as running, but I still have zero tasks; I keep getting Communication Deferred."
Your machine has not contacted the servers since the 16th of January.
Mr. Kevvy · Joined: 15 May 99 · Posts: 3806 · Credit: 1,114,826,392 · RAC: 3,319
(cc'ed from NC) I contacted Dr. Korpela and he indicated that there is still some throttling going on to keep the total results below 20M (lest we have the same issue where the results table exceeds memory), which is probably why the BLC splitters were disabled earlier. No doubt the "shorty storm" from blc35_2bit_guppi_58691_* is causing this. In the interim things seem to be improving and I'm getting just enough work to keep my machines busy, so it should be over soon.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13854 · Credit: 208,696,464 · RAC: 304
(cc'ed from NC) I thought the point of the extended-outage database reorganisation was to rejig the database so it could deal with the increased load, process the work, and keep those backlogs from occurring?
Grant, Darwin NT
KWSN - Sir Nutsalot · Joined: 4 Jun 99 · Posts: 5 · Credit: 22,114,565 · RAC: 47
All my machines are now full to the apparent 150-unit limit for both GPU and CPU :) so I am now back to normal. Apart from the real scheduled outages, uploading units was never an issue on my machines.
John Ellis · Joined: 22 Jun 05 · Posts: 4 · Credit: 23,932,244 · RAC: 4
Greetings. This issue seems to have lasted longer than originally expected. Is there any "official" estimate for returning to normal operations? Only one of my two SETI rigs is processing; the other has been idling for hours now. Thanks, John
Miklos M. · Joined: 5 May 99 · Posts: 955 · Credit: 136,115,648 · RAC: 73
I am eager for the day when work will be plentiful again. Only two of my four rigs get work, and even then only rarely.
Mike Ryan · Joined: 24 Jun 99 · Posts: 46 · Credit: 24,363,752 · RAC: 47
Not sure where to post this (it's really getting hard to find a specific tree in the forest), but I'll try here because it seems to be a "dishing out the work" issue... perhaps related to allowing more work (yeah!) to be sent to each device.
I had three work units time out as "no response", but the time given to compute (which is typically a month or so) was less than 7 minutes (6:55 to be exact). Never seen that happen before. With a current queue on this machine of 300 units in progress (50/50 split between CPU and Nvidia GPU), it does seem a bit odd to expect three of the units to be finished within 7 minutes when all the other tasks downloaded before them would run first. Here are the work units in question:
Work unit / Sent / Time reported or deadline / Status
3857435802 / 27 Jan 2020, 22:52:34 UTC / 27 Jan 2020, 22:59:29 UTC / Time out - no response
3857435646 / 27 Jan 2020, 22:52:34 UTC / 27 Jan 2020, 22:59:29 UTC / Time out - no response
3857435722 / 27 Jan 2020, 22:52:34 UTC / 27 Jan 2020, 22:59:29 UTC / Time out - no response
Not a huge deal, but it seems like if this were widespread there would be a LOT of unnecessary duplicate work units sent out for processing.
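For concreteness, here is a quick check of that 6:55 figure using only the "Sent" and deadline timestamps quoted above. The snippet is just illustrative arithmetic, not anything from the BOINC client, and a normal SETI@home deadline would be weeks away rather than minutes.

```python
from datetime import datetime

# Timestamps quoted in the post above (sent vs. deadline/reported).
fmt = "%d %b %Y, %H:%M:%S UTC"
sent = datetime.strptime("27 Jan 2020, 22:52:34 UTC", fmt)
deadline = datetime.strptime("27 Jan 2020, 22:59:29 UTC", fmt)

# The gap these tasks were given before being declared "no response".
print(deadline - sent)  # 0:06:55
```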
Wiggo · Joined: 24 Jan 00 · Posts: 36764 · Credit: 261,360,520 · RAC: 489
"Not sure where to post this (it's really getting hard to find a specific tree in the forest), but I'll try here because it seems to be a "dishing out the work" issue... perhaps related to allowing more work (yeah!) to be sent to each device."
They were ghosts that never got delivered to you, and as they're in red they don't count against you. ;-)
Cheers.
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
"They were ghosts that never got delivered to you, and as they're in red they don't count against you. ;-)"
Instantly timed-out ghosts do get counted against you. I've had a lot of them, because every time I have tried to re-download ghosts using the ghost recovery protocol, the server has decided to instantly expire them instead of letting me download them. When a big bunch of those happens, the host and app involved get their max tasks per day heavily penalized. When that drops lower than the number of tasks already downloaded that day, the server stops giving the host any more work for that app :(
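A rough sketch of the "max tasks per day" penalty being described, for anyone unfamiliar with it. This is not the actual BOINC scheduler code; the starting quota of 100 and the halve-on-error / double-on-success rule are assumptions chosen only to illustrate how a burst of instantly-expired ghosts can cut off a host's work supply.

```python
# Illustrative sketch of a per-host, per-app daily task quota.
# NOT the real BOINC scheduler; constants and rules are assumptions.

STARTING_QUOTA = 100  # hypothetical per-app daily limit for a host

class HostAppQuota:
    def __init__(self):
        self.max_tasks_per_day = STARTING_QUOTA
        self.tasks_sent_today = 0

    def on_bad_result(self):
        """Errored or instantly-expired (ghost) results shrink the quota."""
        self.max_tasks_per_day = max(1, self.max_tasks_per_day // 2)

    def on_valid_result(self):
        """Valid results let the quota recover toward its starting value."""
        self.max_tasks_per_day = min(STARTING_QUOTA, self.max_tasks_per_day * 2)

    def can_send_more_work(self):
        """Once today's downloads exceed the shrunken quota, no more work is sent."""
        return self.tasks_sent_today < self.max_tasks_per_day

# A burst of expired ghosts can push the quota below what was already sent today,
# which is when the host stops receiving work for that app.
q = HostAppQuota()
q.tasks_sent_today = 40
for _ in range(3):
    q.on_bad_result()
print(q.max_tasks_per_day, q.can_send_more_work())  # 12 False
```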
Chris · Joined: 3 Apr 06 · Posts: 1 · Credit: 2,824,282 · RAC: 3
My Raspberry Pi cluster has not been getting work. Two machines are idle and the others are just finishing up what they have. |
Nostra · Joined: 18 Feb 01 · Posts: 2 · Credit: 1,612,863 · RAC: 7
Same here. No work 😠|
Mr. Kevvy · Joined: 15 May 99 · Posts: 3806 · Credit: 1,114,826,392 · RAC: 3,319
This thread (and its future descendants) in the Number Crunching forum is a good go-to whenever there are problems like this, as it's regularly updated with findings on the causes of issues. Also, Computing > Server Status is helpful for seeing which project machines are disabled or down, and for the work stats. In this case, work generation has been turned off because there are too many results in the field, thanks to the millions of "shorties" or "noise bombs" from the most recent data files... work units that take only a few seconds to be determined as noise and completed. For example, "Results returned and awaiting validation" is currently over 13M.
betreger · Joined: 29 Jun 99 · Posts: 11415 · Credit: 29,581,041 · RAC: 66
Almost out of CPU work and the GPU is back to Einstein. |
bigSPAM · Joined: 12 May 00 · Posts: 4 · Credit: 62,452,292 · RAC: 292
My most productive machine (and likely most power-hungry) has about two hundred WUs in transfers, 'Download pending', and no work for the last 18 hours. Are there likely to be some sent, or should I just shut this one down until Wednesday? Thanks!
Just looked at my #2 machine, and it too has a few hundred WUs in 'Download pending' state and is running only GPU work.
Edit: Noticed #1 PC did download, but mostly GUPPI WUs. Those all ran in a couple of minutes, with half my CPU working the sole 8 remaining AstroPulse and 'normal' WUs. Also got a bunch of GPU WUs to crunch. The GUPPI units have issues, I gather?
Gary Easton · Joined: 14 Nov 00 · Posts: 9 · Credit: 12,118,453 · RAC: 65
No Work coming in again, this just makes me wanna go Hmmmmm! |
bigSPAM · Joined: 12 May 00 · Posts: 4 · Credit: 62,452,292 · RAC: 292
"My most productive machine (and likely most power-hungry) has about two hundred WUs in transfers, 'Download pending', and no work for the last 18 hours."
Blew through all the downloads except the remaining two AstroPulses on #1. #2 is in the same boat.
Mr. Kevvy · Joined: 15 May 99 · Posts: 3806 · Credit: 1,114,826,392 · RAC: 3,319
"I've only been a member since the 10th of January, so I have to ask: are these maintenance shutdowns and work slowdowns a regular weekly occurrence? It seems just as I settle into a routine, everything gets screwed up."
The maintenance shutdowns prior to this problem lasted only a few hours; the last two have lasted a day or more. It takes close to as long for things to stabilize after a shutdown as the shutdown itself lasted, because the longer the project is down, the more the participants' computers run short of work. The recent issues, two weeks of long outages and little to no work, are an anomaly. They seem to be caused by a backlog: the servers are not processing returned work fast enough, so it accumulates. The hope is that eventually the SETI@Home team will run a script or equivalent to clear the logjam and let things get back to normal.
Edit: To add some details, this is from the last look at Computing > Server status. "Results out in the field" = 3,932,277; together with the results returned and awaiting validation and those waiting for db purging, the total of these three is 21,310,599, which is too large. Dr. Korpela has noted that the functional limit of the active result table is 20M. Beyond that it may no longer fit into memory (which is maxed out) on the machine it runs on. At that point virtual memory will probably start paging it to disk, which is extremely slow, hundreds or thousands of times slower than memory bandwidth, so everything grinds to a halt. Thus the splitters are disabled from generating more work until this total falls below 20M.
However, this is a band-aid. The total is not falling nearly as fast nor as low as it should, hence the need to run a process to get the validation and purging done (the latter should be very close to zero). That process/script may not even exist yet, requiring someone (inevitably the already-overloaded project director, Dr. Korpela) to develop it. Hope this explains things a little. :^)
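To make the arithmetic above concrete, here is a minimal sketch of the throttle being described, assuming the three server-status counts are available as plain numbers. The function names are made up; the only figures taken from the post are the 20M limit and the 21,310,599 total.

```python
# Minimal sketch of the splitter throttle described above.
# The 20M limit and the 21,310,599 total come from the post;
# everything else (names, structure) is illustrative.

ACTIVE_RESULT_LIMIT = 20_000_000  # functional limit of the in-memory result table

def active_result_total(out_in_field, awaiting_validation, awaiting_purge):
    """Sum the three server-status categories that occupy the result table."""
    return out_in_field + awaiting_validation + awaiting_purge

def splitters_should_run(total):
    """Work generation stays off while the result table would exceed the limit."""
    return total < ACTIVE_RESULT_LIMIT

# With the total quoted above, the splitters remain disabled.
total = 21_310_599
print(splitters_should_run(total))  # False
```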