The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029176 - Posted: 25 Jan 2020, 12:23:15 UTC

Another way to reduce database size would be for the servers to pair hosts crunching the same workunit more intelligently: send the wu to hosts with similar average turnaround times.
ID: 2029176
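A rough Python sketch of that matchmaking idea (the names and the 2x threshold are hypothetical, not actual BOINC scheduler code; the server does already track an average turnaround time per host):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    host_id: int
    avg_turnaround_hours: float  # per-host average the server maintains

def pick_partner(first: Host, candidates: list[Host],
                 max_ratio: float = 2.0) -> Optional[Host]:
    # Pick the candidate whose average turnaround is closest to the
    # first host's, but only if the two differ by at most max_ratio.
    best: Optional[Host] = None
    best_ratio = max_ratio
    for h in candidates:
        slow = max(h.avg_turnaround_hours, first.avg_turnaround_hours)
        fast = min(h.avg_turnaround_hours, first.avg_turnaround_hours)
        ratio = slow / fast if fast > 0 else float("inf")
        if ratio <= best_ratio:
            best, best_ratio = h, ratio
    return best  # None = no close match; fall back to normal feeder order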
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2029177 - Posted: 25 Jan 2020, 12:23:33 UTC - in response to Message 2029175.  

The size reduction wouldn't be that much.
56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago.


I am definitely not helping in that regard... the SSD my BOINC install is on died unexpectedly, and trying to recover it is a nightmare...
ID: 2029177
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1859
Credit: 268,616,081
RAC: 1,349
United States
Message 2029178 - Posted: 25 Jan 2020, 12:24:02 UTC - in response to Message 2029175.  

The size reduction wouldn't be that much.
56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago.

Facts are always better:
Just did a spot check of one of my boxes.
Of ~5600 tasks awaiting validation, more than 2000 are older than 1 Jan, 25 days ago. Many date back to mid-October.
Somehow, I think that's probably typical, and it certainly seems significant to me.
ID: 2029178
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029179 - Posted: 25 Jan 2020, 12:35:23 UTC - in response to Message 2029178.  

On your highest-scoring computer, 8 tasks date back to October - that's 0.14% of your pendings!!!!
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029179
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029180 - Posted: 25 Jan 2020, 12:39:01 UTC

Is it more efficient for all the splitters to bunch up on the same file? If they spread out over different files, it would probably dilute these overflow storms significantly.
ID: 2029180
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029181 - Posted: 25 Jan 2020, 12:41:57 UTC - in response to Message 2029179.  

On your highest-scoring computer, 8 tasks date back to October - that's 0.14% of your pendings!!!!
The impact on database size scales with the length of time. A single task lingering for 3 months has the same effect as 90 one-day tasks.
ID: 2029181
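Ville's arithmetic, written out in row-days:

\[
1\ \text{row} \times 90\ \text{days} = 90\ \text{row-days} = 90\ \text{rows} \times 1\ \text{day}
\]

One 90-day straggler occupies as much cumulative table space as ninety one-day results.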
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2029197 - Posted: 25 Jan 2020, 17:14:47 UTC - in response to Message 2029176.  

Another way to reduce database size would be for the servers to pair hosts crunching the same workunit more intelligently: send the wu to hosts with similar average turnaround times.


. . I have been thinking about that. Maybe hosts could be grouped into about half a dozen classes based on daily returns - say Class A up to 50 WUs per day, Class B 50-150/day, etc. Then assign work with the guideline that the second copy never goes to a host more than 1 or 2 classes away from the first. That should clear a large part of the prolonged pending backlog.

. . Just a thought.

Stephen

? ?
ID: 2029197
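A minimal sketch of that class scheme in Python (only the Class A and B boundaries come from Stephen's post; the remaining boundaries are invented placeholders):

# Upper bounds in WUs returned per day: Class A = up to 50,
# Class B = 50-150 (Stephen's examples); the rest are made up.
CLASS_BOUNDS = [50, 150, 500, 1500, 5000]

def host_class(daily_returns: float) -> int:
    # 0 = Class A, 1 = Class B, and so on; hosts above the last
    # boundary land in the top class.
    for i, bound in enumerate(CLASS_BOUNDS):
        if daily_returns <= bound:
            return i
    return len(CLASS_BOUNDS)

def may_pair(daily_a: float, daily_b: float, max_gap: int = 2) -> bool:
    # Send the second copy of a workunit only to a host within
    # max_gap classes of the first host.
    return abs(host_class(daily_a) - host_class(daily_b)) <= max_gap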
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029198 - Posted: 25 Jan 2020, 17:25:42 UTC

There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - the random diversity in pairings is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results and "ganging up" on other devices that produced correct results.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029198
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2029199 - Posted: 25 Jan 2020, 17:38:49 UTC - in response to Message 2029198.  

There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - the random diversity in pairings is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results and "ganging up" on other devices that produced correct results.


. . Which is why I believe the grouping would be more effective. The tasks would be paired with different devices, not identical ones, but they would NOT be so wildly different that tasks sit pending for weeks or months.

Stephen

. .
ID: 2029199
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19714
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029202 - Posted: 25 Jan 2020, 17:49:25 UTC - in response to Message 2029199.  

There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - the random diversity in pairings is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results and "ganging up" on other devices that produced correct results.


. . Which is why I believe the grouping would be more effective. The tasks would be paired with different devices, not identical ones, but they would NOT be so wildly different that tasks sit pending for weeks or months.

Stephen

. .

It will increase the chances of an AMD/ATI GPU match as they are liable to have similar times.
ID: 2029202
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029203 - Posted: 25 Jan 2020, 17:59:26 UTC - in response to Message 2029199.  

Not so - it is highly probable that similar devices would end up in the same set, which would INCREASE the chances of similar devices pairing with each other, and so INCREASE the probability of another event like the one we have just suffered from, where pairs of similar AMD devices were paired and dumped what is almost certainly incorrect data into the database.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029203
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029205 - Posted: 25 Jan 2020, 18:12:04 UTC - in response to Message 2029181.  

Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space.

Re-sends, for whatever reason, are very expensive in terms of the number of queries they require - actually a far higher figure than that required for a validation & purge cycle - hence the desire to make sure as many results as possible are not re-sent, by having long deadlines; and one result of decreasing deadlines would be to increase the re-send rate. If you think back a couple of months, there was something of a hiatus after the aborted server software update when re-send was turned on and the qps on the server went up dramatically.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029205
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029208 - Posted: 25 Jan 2020, 18:42:17 UTC - in response to Message 2029205.  

... there was something of a hiatus after the aborted server software update when re-send was turned on and the qps on the server went up dramatically.
That was 're-send of existing task' after the original sending was lost - whether by operator fumble or comms failure. That requires an exhaustive comparison of server records and 'other tasks' reported by hosts on RPC - that's the expensive one.

Deadlines - short or long - are easy. "Not coming back? OK, bye-bye". Create a replacement task and bung it on the back of the to-send queue for anyone to collect.
ID: 2029208
Oz
Joined: 6 Jun 99
Posts: 233
Credit: 200,655,462
RAC: 212
United States
Message 2029210 - Posted: 25 Jan 2020, 18:50:15 UTC - in response to Message 2028145.  

I am sure it will all be fixed during the next "maintenance" on tuesday....

ROTFLMAO

See? I said it would be all fixed during the next "maintenance" but since I received no response or acknowledgment, I responded to myself - now my clique is more exclusive than yours! I will be here another 20 years NOT waiting for an answer.

Thanks again to everyone at sah for keeping things running for over twenty years. I whine a lot but I really do understand how much you have done...
Member of the 20 Year Club
ID: 2029210
xpozd
Joined: 26 Jan 15
Posts: 88
Credit: 280,183
RAC: 1
Canada
Message 2029213 - Posted: 25 Jan 2020, 19:43:12 UTC - in response to Message 2029023.  

@Jimbocous
Thanks for the reply.
I re-ran the Lunatics installer and re-generated the app_info.xml,
then tried getting new tasks again, but I keep getting the same notices:

- Project communication failed: attempting access to reference site
- Internet access OK - project servers may be temporarily down.

This seems like a longer than usual time to go with no tasks at all.

  • win7starter
  • boinc: 7.14.2
  • boinc tasks: 1.78
  • Lunatics Win32 v0.44

ID: 2029213
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029215 - Posted: 25 Jan 2020, 19:54:22 UTC - in response to Message 2029197.  

. . I have been thinking about that. Maybe hosts could be grouped into about half a dozen classes based on daily returns - say Class A up to 50 WUs per day, Class B 50-150/day, etc. Then assign work with the guideline that the second copy never goes to a host more than 1 or 2 classes away from the first. That should clear a large part of the prolonged pending backlog.
This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results within 12 hours of obtaining them, what does it matter whether it processed two or two thousand tasks during those 12 hours?
ID: 2029215
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2029216 - Posted: 25 Jan 2020, 20:01:21 UTC

For Me the problems with the Failing Uploads seem to be getting Worse. This morning I found all machines, except the fastest one, working fine. The top Mining machine was clogged with failed Uploads, dozens of them. The only machine without any Uploads waiting on retries was the slowest one. Trying to clear the Uploads on the one machine also Failed, countless times. I tried everything, then tried using my USB/Ethernet adapter, which finally allowed the Uploads to clear. But even with the USB adapter I now have an average of 6 retries waiting on that machine. It seems that if you get very many they just Fail altogether and then rapidly start piling up until the Downloads stop. At that point it becomes difficult to get the Uploads to clear.
It's Not getting any better...
ID: 2029216
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029217 - Posted: 25 Jan 2020, 20:05:55 UTC - in response to Message 2029205.  

Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space.
Space and spacetime are different things. One result lasting 90 days consumes one unit of row-days every day for 90 days - the same total that could have supported 90 results lasting one day each. If results on average lasted twice as long, the average number of results in the database at any time would also double.

There are other kinds of resources consumed when results are created, deleted or have their state changed, which don't depend on the time the results spend in the database. But the recent problems were caused by the database swelling too big to fit in RAM, which severely hurt server performance. That depends purely on the row counts, so long-lived results are bad.
ID: 2029217
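What Ville is describing is, in effect, Little's law from queueing theory (the label is mine, not his): if results enter the database at rate \(\lambda\) and live an average of \(\bar{T}\) days, the average number of rows is

\[
\bar{N} = \lambda \, \bar{T},
\]

so at a fixed creation rate, halving the average lifetime halves the steady-state row count - and the RAM the result table needs.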
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2029218 - Posted: 25 Jan 2020, 20:06:32 UTC - in response to Message 2029215.  
Last modified: 25 Jan 2020, 20:11:00 UTC

. . I have been thinking about that. Maybe hosts could be grouped into about half a dozen classes based on daily returns - say Class A up to 50 WUs per day, Class B 50-150/day, etc. Then assign work with the guideline that the second copy never goes to a host more than 1 or 2 classes away from the first. That should clear a large part of the prolonged pending backlog.
This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results within 12 hours of obtaining them, what does it matter whether it processed two or two thousand tasks during those 12 hours?

There is an easy way to solve this. Just make all resends go to the top 50 hosts by daily production. Those hosts return work fast enough to clear the pending backlogs.
ID: 2029218
Wiggo
Joined: 24 Jan 00
Posts: 38189
Credit: 261,360,520
RAC: 489
Australia
Message 2029219 - Posted: 25 Jan 2020, 20:16:20 UTC - in response to Message 2029213.  

@Jimbocous
Thanks for the reply.
I re-ran the Lunatics installer and re-generated the app_info.xml,
then tried getting new tasks again, but I keep getting the same notices:

- Project communication failed: attempting access to reference site
- Internet access OK - project servers may be temporarily down.

This seems like a longer than usual time to go with no tasks at all.
Your old version of BOINC doesn't contain the updated certificates needed to make contact with the servers. I believe there is a workaround for that, but it'll be easier just to update to a later BOINC version. ;-)

Cheers.
ID: 2029219