Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
Another way to reduce database size would be if the servers paired hosts crunching the same workunit smarter. Send the wu to hosts with similar average turnaround times.
Kiska · Joined: 31 Mar 12 · Posts: 302 · Credit: 3,067,762 · RAC: 0
The size reduction wouldn't be that much. 56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago. I am definitely not helping in that regard... the SSD my BOINC install is on has died unexpectedly, so trying to recover it is a nightmare...
Joined: 1 Apr 13 · Posts: 1859 · Credit: 268,616,081 · RAC: 1,349
> The size reduction wouldn't be that much. 56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago.

Facts are always better: just did a spot check of one of my boxes. Of ~5600 waiting validation, more than 2000 are older than 1 Jan, 25 days ago. Many go back to mid-October. Somehow, I think that's probably typical, and it certainly seems significant to me.
rob smith · Joined: 7 Mar 03 · Posts: 22815 · Credit: 416,307,556 · RAC: 380
On your highest scoring computer 8 tasks date back to October, or 0.14% of your pendings!!!!

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
Is it more efficient for all the splitters to bunch up splitting the same file? If they spread out over different files, it would probably dilute these overflow storms significantly.
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> On your highest scoring computer 8 tasks date back to October, or 0.14% of your pendings!!!!

The impact on database size scales with the length of time. A single task lingering for 3 months has the same effect as 90 one-day tasks.
Stephen "Heretic" ![]() ![]() ![]() ![]() Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 ![]() ![]() |
> Another way to reduce database size would be if the servers paired hosts crunching the same workunit smarter. Send the wu to hosts with similar average turnaround times.

. . I have been thinking about that. Maybe if hosts were grouped in about half a dozen classes based on daily returns. Such as Class A up to 50 WUs per day, Class B 50 - 150/day etc. And then assign work with the guideline to not send the second copy to any host that is more than 1 or 2 classes different from the first. That should reduce a large part of the prolonged pending backlog.
. . Just a thought.
Stephen
? ?
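A minimal sketch of how such a class scheme could be expressed, purely for illustration: only the "up to 50" and "50 - 150" boundaries come from the post above, the higher edges and the pairing tolerance are invented, and nothing like this exists in the real BOINC scheduler.

```python
# Illustrative sketch of the class idea above. Only the first two class
# boundaries come from the post; the remaining edges and the max_gap
# tolerance are invented so the example runs. Not real scheduler code.

CLASS_EDGES = [50, 150, 400, 1000, 3000]   # WUs returned per day, upper edges

def host_class(daily_returns: float) -> int:
    """Class A = 0 (up to 50/day), Class B = 1 (50-150/day), and so on."""
    for cls, upper in enumerate(CLASS_EDGES):
        if daily_returns <= upper:
            return cls
    return len(CLASS_EDGES)                # fastest hosts fall past the last edge

def ok_to_pair(first_daily: float, second_daily: float, max_gap: int = 2) -> bool:
    """Send the second copy only to a host within 1-2 classes of the first."""
    return abs(host_class(first_daily) - host_class(second_daily)) <= max_gap

print(ok_to_pair(40, 120))     # True  - a Class A host paired with a Class B host
print(ok_to_pair(40, 5000))    # False - Class A paired with the very fastest class
```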
rob smith · Joined: 7 Mar 03 · Posts: 22815 · Credit: 416,307,556 · RAC: 380
There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - using random diversity in processing is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results, and were "ganging up" on other devices producing correct results.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Stephen "Heretic" ![]() ![]() ![]() ![]() Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 ![]() ![]() |
> There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - using random diversity in processing is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results, and were "ganging up" on other devices producing correct results.

. . Which is why I believe the grouping would be more effective. The tasks would be paired with different devices, not identical devices, but they would NOT be so wildly different that tasks sit pending for weeks or months.
Stephen . .
W-K 666 · Joined: 18 May 99 · Posts: 19714 · Credit: 40,757,560 · RAC: 67
> There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - using random diversity in processing is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results, and were "ganging up" on other devices producing correct results.

It will increase the chances of an AMD/ATI GPU match as they are liable to have similar times.
rob smith · Joined: 7 Mar 03 · Posts: 22815 · Credit: 416,307,556 · RAC: 380
Not so - it is highly probable that similar devices would end up in the same set, which would INCREASE the chances of similar devices pairing with each other and INCREASE the probability of another event like the one we have just suffered from, where pairs of similar AMD devices were matched and dumped what is almost certainly incorrect data into the database.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
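Some toy arithmetic for the risk being described here, with invented numbers rather than anything measured from the project: the question is how likely the second replica is to land on a host that shares the first host's defect.

```python
# Invented numbers only. Suppose 2% of all hosts share a defect (say, a
# driver bug) that makes them agree on the same wrong result, and that
# inside a pool of "similar" hosts those defective machines make up 20%.
# Given that the first replica already went to a defective host, what is
# the chance the second one does too, so the wrong answer validates?

p_defective_overall = 0.02    # defective hosts as a share of all hosts
p_defective_in_pool = 0.20    # their share inside a pool of similar hosts

p_wrong_validates_random  = p_defective_overall   # second host drawn from everyone: 2%
p_wrong_validates_grouped = p_defective_in_pool   # second host drawn from the same pool: 20%

print(p_wrong_validates_random, p_wrong_validates_grouped)   # 0.02 vs 0.2, ten times worse
```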
rob smith · Joined: 7 Mar 03 · Posts: 22815 · Credit: 416,307,556 · RAC: 380
Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space. Re-sends, for whatever reason, are very expensive in terms of the number of queries they require, and that is actually a far higher figure than that required for a validation & purge cycle - hence the desire to make sure as many results as possible are not re-sent by having long deadlines; one result of decreasing deadlines would be to increase the re-send rate. If you think back a couple of months, there was something of a hiatus after the aborted server software update when re-send was turned on and the qps on the server went up dramatically.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> ... there was something of a hiatus after the aborted server software update when re-send was turned on and the qps on the server went up dramatically.

That was 're-send of existing task' after the original sending was lost - whether by operator fumble or comms failure. That requires an exhaustive comparison of server records against the 'other tasks' reported by hosts on RPC - that's the expensive one. Deadlines - short or long - are easy: "Not coming back? OK, bye-bye." Create a replacement task and bung it on the back of the to-send queue for anyone to collect.
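A rough sketch of the asymmetry described above, as I read it; this is not the actual BOINC scheduler code, and the function names and data shapes are invented. The point is only that a deadline miss is a cheap append to the to-send queue, while 'resend lost tasks' has to reconcile the server's records against the 'other tasks' list each host reports in its scheduler RPC.

```python
# Invented sketch, not real BOINC code: just the shape of the two operations.
from collections import deque

to_send = deque()   # stand-in for the queue the feeder hands out work from

def handle_deadline_miss(result_id: int) -> None:
    """Deadline passed: create a replacement and bung it on the back of
    the queue for anyone to collect. Effectively O(1) per result."""
    to_send.append(("replacement_for", result_id))

def resend_lost_tasks(server_in_progress: set[int],
                      reported_by_host: set[int]) -> list[int]:
    """'Resend lost tasks': compare every task the server thinks this host
    holds against what the host actually reported in its RPC, and re-issue
    whatever is missing - a per-host reconciliation on every contact,
    which is where the extra queries come from."""
    lost = server_in_progress - reported_by_host
    for result_id in lost:
        to_send.append(("resend_of", result_id))
    return sorted(lost)

handle_deadline_miss(101)                            # cheap path
print(resend_lost_tasks({201, 202, 203}, {202}))     # expensive path -> [201, 203]
```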
Joined: 6 Jun 99 · Posts: 233 · Credit: 200,655,462 · RAC: 212
> I am sure it will all be fixed during the next "maintenance" on tuesday....

See? I said it would all be fixed during the next "maintenance", but since I received no response or acknowledgment, I responded to myself - now my clique is more exclusive than yours! I will be here another 20 years NOT waiting for an answer. Thanks again to everyone at sah for keeping things running for over twenty years. I whine a lot, but I really do understand how much you have done...

Member of the 20 Year Club
Joined: 26 Jan 15 · Posts: 88 · Credit: 280,183 · RAC: 1
@Jimbocous thanks for the reply. I re-ran the Lunatics installer and re-generated the app_info.xml, then tried getting new tasks again, but I keep getting the same notices:
- Project communication failed: attempting access to reference site
- Internet access OK - project servers may be temporarily down.
This seems like a longer than usual time to go with no tasks at all.
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> . . I have been thinking about that. Maybe if hosts were grouped in about half a dozen classes based on daily returns. Such as Class A up to 50 WUs per day, Class B 50 - 150/day etc. And then assign work with the guideline to not send the second copy to any host that is more than 1 or 2 classes different from the first. That should reduce a large part of the prolonged pending backlog.

This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results in 12 hours from obtaining them, what does it matter if it processed two or two thousand tasks during those 12 hours?
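With invented numbers, the distinction looks like this: a host's average turnaround is roughly its cache size divided by its throughput, so two hosts with identical daily returns can make their wingmen wait very different lengths of time.

```python
# Invented numbers, just to show why turnaround rather than throughput is
# the quantity that matters for how long a wingman's result sits pending.

def avg_turnaround_days(cache_size_tasks: float, tasks_per_day: float) -> float:
    """Rough steady-state estimate: a new task waits behind the cache."""
    return cache_size_tasks / tasks_per_day

# Two hosts with the same throughput, 200 tasks per day...
print(avg_turnaround_days(100, 200))     # 0.5 days - half-day cache
print(avg_turnaround_days(2000, 200))    # 10.0 days - ten-day cache

# ...so grouping by daily returns would call them equal, while their
# wingmen wait half a day versus ten days for validation.
```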
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
For Me the problems with the Failing Uploads seem to be getting Worse. This morning I found all machines, except the fastest one, working fine. I found the top Mining machine was clogged with failed Uploads, dozens of them. The only machine without any Uploads waiting on retries was the slowest one. Trying to clear the Uploads on the one machine also Failed, countless times. I tried everything, then tried using my USB/Ethernet adapter, which finally allowed the Uploads to clear. But even with the USB adapter I now have an average of 6 retries waiting on that machine. It seems if you get very many, they just Fail altogether and then rapidly start piling up until the Downloads stop. At that point it becomes difficult to get the Uploads to clear. It's Not getting any better...
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space.

Space and spacetime are different things. One result lasting 90 days consumes one unit of space every day for 90 days, which could instead have supported 90 results each lasting only a day. If results on average lasted twice as long, then the average number of results in the database at any time would also double. There are other kinds of resources consumed when results are created, deleted or have their state changed, which don't depend on the time the results spend in the database, but the recent problems were caused by the database swelling too big to fit in RAM, which severely affected server performance. That depends purely on the row counts, so long-lived results are bad.
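The same point as a back-of-the-envelope calculation, with an invented creation rate: by Little's law, the number of result rows the database holds at any moment is roughly the rate at which results are created times how long each one lives.

```python
# Back-of-the-envelope only; the creation rate is invented, not a project
# statistic. Little's law: rows_in_db ~= creation_rate * average_lifetime.

results_created_per_day = 1_000_000    # hypothetical creation rate

rows_if_avg_lifetime_1_day  = results_created_per_day * 1.0   # ~1 million rows
rows_if_avg_lifetime_2_days = results_created_per_day * 2.0   # ~2 million rows

# Doubling how long the average result lingers doubles the rows the
# database must hold, which is exactly the fit-in-RAM problem above.
print(int(rows_if_avg_lifetime_1_day), int(rows_if_avg_lifetime_2_days))
```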
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> > . . I have been thinking about that. Maybe if hosts were grouped in about half a dozen classes based on daily returns. Such as Class A up to 50 WUs per day, Class B 50 - 150/day etc. And then assign work with the guideline to not send the second copy to any host that is more than 1 or 2 classes different from the first. That should reduce a large part of the prolonged pending backlog.
>
> This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results in 12 hours from obtaining them, what does it matter if it processed two or two thousand tasks during those 12 hours?

There is an easy way to solve this: just make all resends go to the top 50 hosts by daily production. Those hosts have a return rate fast enough to clear the pending backlogs.
Joined: 24 Jan 00 · Posts: 38189 · Credit: 261,360,520 · RAC: 489
> @Jimbocous

Your old version of BOINC doesn't contain the updated certificates needed to make contact with the servers. I believe there is a workaround for that, but it'll be easier just to update to a later BOINC version. ;-)
Cheers.