The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2046289 - Posted: 22 Apr 2020, 10:02:37 UTC - in response to Message 2046287.  

Exactly. They seem to be using brute force, rather than nuanced sophistication. Let's see if we can be more sophisticated than them, and help them out by returning any resends as quickly as possible.
ID: 2046289
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046293 - Posted: 22 Apr 2020, 10:40:34 UTC

Also, what I have read in the Boinc source suggests that Boinc has a mechanism where tasks can be flagged to be sent to 'reliable' hosts only. 'Reliable' means a host that doesn't produce a lot of errors or invalids and doesn't have a long turnaround time. That's exactly what should be used for these 'presends', but it doesn't seem to be used.
ID: 2046293
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2046294 - Posted: 22 Apr 2020, 10:41:29 UTC - in response to Message 2046289.  
Last modified: 22 Apr 2020, 10:52:43 UTC

and help them out by returning any resends as quickly as possible.

I will complain of course, but you need to test whatever bad liquor you are drinking.
Yesterday you asked exactly the inverse. LOL
 I'd also say it's time to switch off the 'process resends first' option - 

Anyway, the cache level is at 2.8k. It will take about half a day to complete all the tasks, mainly because all that are left are VLARs, mostly Arecibo.
ID: 2046294
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2046295 - Posted: 22 Apr 2020, 10:47:34 UTC - in response to Message 2046293.  
Last modified: 22 Apr 2020, 11:01:12 UTC

Also, what I have read in the Boinc source suggests that Boinc has a mechanism where tasks can be flagged to be sent to 'reliable' hosts only. 'Reliable' means a host that doesn't produce a lot of errors or invalids and doesn't have a long turnaround time. That's exactly what should be used for these 'presends', but it doesn't seem to be used.

Someone will complain that this is an elitist measure, and the fire on the thread will restart.
ID: 2046295
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2046297 - Posted: 22 Apr 2020, 11:02:36 UTC - in response to Message 2046294.  

and help them out by returning any resends as quickly as possible.
I will complain of course, but you need to test whatever bad liquor you are drinking.
Yesterday you asked exactly the inverse. LOL
 I'd also say it's time to switch off the 'process resends first' option - 
LOL. Yesterday's remark was aimed at (a limited number of) bunkerers. Today's remark was aimed at the broader population, who don't have a choice because they've only got resends - our first-run tasks ran out three weeks ago!

I had some white wine last night, but this morning I've only touched coffee. Hic.
ID: 2046297
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 24922
Credit: 3,081,182
RAC: 7
Ireland
Message 2046298 - Posted: 22 Apr 2020, 11:07:13 UTC - in response to Message 2046295.  

Also, what I have read in the Boinc source suggests that Boinc has a mechanism where tasks can be flagged to be sent to 'reliable' hosts only. 'Reliable' means a host that doesn't produce a lot of errors or invalids and doesn't have a long turnaround time. That's exactly what should be used for these 'presends', but it doesn't seem to be used.

Someone will complain that this is an elitist measure, and the fire on the thread will restart.
In the past, that may have been true, but not now.
I stopped crunching on the 10th, after the last initial task completed. My thoughts were, and still are: let the fast hosts get the resends so that the project can be completed.
My only fear is that if/when Seti returns, it'll be part of Science United rather than Boinc. :-(
ID: 2046298
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046299 - Posted: 22 Apr 2020, 11:07:58 UTC

I had to turn that option off in my client at the end of March because I had collected enough tasks to reach the deadline limit. Processing resends with deadlines beyond the choke point first would have made me miss some deadlines.
ID: 2046299
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14686
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2046300 - Posted: 22 Apr 2020, 11:08:03 UTC - in response to Message 2046295.  

Also, what I have read in the Boinc source suggests that Boinc has a mechanism where tasks can be flagged to be sent to 'reliable' hosts only. 'Reliable' means a host that doesn't produce a lot of errors or invalids and doesn't have a long turnaround time. That's exactly what should be used for these 'presends', but it doesn't seem to be used.
Someone will complain that this is an elitist measure, and the fire on the thread will restart.
I think that our general experience, over the years since Credit New was introduced, is that the 'reliable host' flag is itself unreliable, and too often allows new work to be sent to hosts who haven't yet returned the old work - or have only returned a small proportion of it. We had a sticky thread about people sending PMs to the owners of unreliable hosts, until recently.
ID: 2046300
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2046301 - Posted: 22 Apr 2020, 11:12:58 UTC - in response to Message 2046297.  

I've only touched coffee. Hic.

Coffee + aspirins for me too. My head hurts.
Have no idea why it hurts each morning.
ID: 2046301
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046302 - Posted: 22 Apr 2020, 11:19:45 UTC - in response to Message 2046298.  

My only fear is that if/when Seti returns, it'll be part of Science United rather than Boinc. :-(
SU isn't an alternative to Boinc. It's an alternative to account managers. Projects that SU makes your host crunch are normal Boinc projects you can attach 'manually' too.
ID: 2046302
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2046303 - Posted: 22 Apr 2020, 11:19:55 UTC - in response to Message 2046300.  
Last modified: 22 Apr 2020, 11:30:15 UTC

Also, what I have read in the Boinc source suggests that Boinc has a mechanism where tasks can be flagged to be sent to 'reliable' hosts only. 'Reliable' means a host that doesn't produce a lot of errors or invalids and doesn't have a long turnaround time. That's exactly what should be used for these 'presends', but it doesn't seem to be used.
Someone will complain that this is an elitist measure, and the fire on the thread will restart.
I think that our general experience, over the years since Credit New was introduced, is that the 'reliable host' flag is itself unreliable, and too often allows new work to be sent to hosts who haven't yet returned the old work - or have only returned a small proportion of it. We had a sticky thread about people sending PMs to the owners of unreliable hosts, until recently.

Maybe it would be wise in these last days to lower the WU limit a lot more while continuing to send the resends, to something like 15 per device instead of the current 150. That would let a user pick up a small batch of files, return them and pick up more, instead of a large batch (like the 300 WUs some received) that takes days or weeks to crunch on regular hosts. By doing this the data would be crunched a lot faster, since the WUs would be spread across a lot more hosts and no WU would rest for a long time on any host.
ID: 2046303
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046304 - Posted: 22 Apr 2020, 11:22:22 UTC - in response to Message 2046300.  

I think that our general experience, over the years since Credit New was introduced, is that the 'reliable host' flag is itself unreliable, and too often allows new work to be sent to hosts who haven't yet returned the old work - or have only returned a small proportion of it.
How can the option prove itself unreliable if it hasn't been used?
ID: 2046304
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2046305 - Posted: 22 Apr 2020, 11:36:55 UTC - in response to Message 2046298.  
Last modified: 22 Apr 2020, 11:37:23 UTC

My only fear is that if/when Seti returns, it'll be part of Science United rather than Boinc. :-(

<Panic mode ON> I hope not, or we will definitely be doomed.
ID: 2046305
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046307 - Posted: 22 Apr 2020, 11:43:45 UTC - in response to Message 2046303.  
Last modified: 22 Apr 2020, 11:44:27 UTC

Maybe it would be wise in these last days to lower the WU limit a lot more while continuing to send the resends, to something like 15 per device instead of the current 150. That would let a user pick up a small batch of files, return them and pick up more, instead of a large batch (like the 300 WUs some received) that takes days or weeks to crunch on regular hosts. By doing this the data would be crunched a lot faster, since the WUs would be spread across a lot more hosts and no WU would rest for a long time on any host.
The main problem is that their script is sending the resends in big bunches and the scheduler request cooldown is 30 minutes. So some 'lucky' hosts will receive a ridiculous amount and the majority gets nothing.

They should remove this raffle by making the scheduler buffer a lot smaller, so that a host can get only a few tasks per request and everyone would get a little. If everyone has a little instead of just the lucky ones having a lot, then a much bigger proportion of the setiathome distributed supercomputer capacity would be in use.

I'm sure they have statistics about how many hosts make contact on average in the time between two script runs. If they limited the number of tasks one scheduler request can receive to the number of tasks one script run resends divided by the number of scheduler requests happening between two runs, then everyone would get something and every task would still be taken by someone.

If they just reduced the server side limit for max tasks in cache to some small number, then "spoofers" would get most of the stuff. Good for me and my team but not good in general.
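
The per-request cap described above is simple arithmetic. A minimal Python sketch, assuming made-up figures; the function name and the numbers are hypothetical and do not come from the Boinc source or from the SETI@home servers:

```python
import math


def per_request_cap(tasks_per_script_run: int, requests_between_runs: int) -> int:
    """Hypothetical cap on tasks handed out per scheduler request.

    Divides the tasks released by one resend-script run by the number of
    scheduler requests expected before the next run. Rounding up keeps
    the total capacity at or above the number of tasks released.
    """
    if requests_between_runs <= 0:
        raise ValueError("requests_between_runs must be positive")
    return max(1, math.ceil(tasks_per_script_run / requests_between_runs))


# Illustrative numbers only, not real server statistics.
cap = per_request_cap(tasks_per_script_run=50_000, requests_between_runs=20_000)
print(cap)
```

Rounding up favours "every task is taken" over "every host gets something": the last few requests in an interval may find nothing left. Rounding down would reverse that trade-off.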
ID: 2046307
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2046310 - Posted: 22 Apr 2020, 11:55:50 UTC - in response to Message 2046307.  
Last modified: 22 Apr 2020, 12:00:42 UTC

If they just reduced the server side limit for max tasks in cache to some small number, then "spoofers" would get most of the stuff. Good for me and my team but not good in general.

It's hard for any spoofer to build even a small cache downloading 15 WUs every 30 minutes, since most of us have fast hosts that crunch those 15 WUs in less than 30 minutes. That is why I suggest a very small number of WUs (maybe 2 for each CPU and 10 for each GPU instead). Then the supercomputer could activate all its nodes (hosts) and finish the job ASAP. The idea is to distribute the work across all possible hosts at the same time. As each host returns its work, a new set could be sent to it.
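
The cache argument above can be checked with a toy model. This is a rough sketch under stated assumptions: every request is granted the full cap, the host crunches at a constant rate, and the per-host task limit is ignored. The function and its parameters are hypothetical.

```python
def cache_after(requests: int, cap_per_request: int, crunched_per_interval: int) -> int:
    """Tasks left in a host's cache after a number of scheduler requests.

    Each loop iteration is one cooldown interval (30 minutes in the
    thread): the server grants the cap, then the host crunches until
    its next request.
    """
    cache = 0
    for _ in range(requests):
        cache += cap_per_request                    # server grants the full cap
        cache -= min(cache, crunched_per_interval)  # host works until the next request
    return cache


# One day of requests at a 30-minute cooldown is 48 requests.
fast_host = cache_after(48, cap_per_request=15, crunched_per_interval=20)
slow_host = cache_after(48, cap_per_request=15, crunched_per_interval=5)
print(fast_host, slow_host)
```

In this model a cache only grows on hosts that crunch fewer tasks per interval than the cap allows, which is the point being made about fast hosts.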
ID: 2046310
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2046315 - Posted: 22 Apr 2020, 12:11:42 UTC
Last modified: 22 Apr 2020, 12:15:31 UTC

@Richard H

Is there an option in the BOINC code to cancel unstarted tasks which have already met quorum?

I'm fairly sure that I have seen this used before on some other projects, possibly LHC.

This would help some of the owners of machines with very large, multi day caches to continue to reduce the backlog quicker, while not needing to process unnecessary work.

I have a few examples here https://setiathome.berkeley.edu/results.php?hostid=8800543&offset=0&show_names=1&state=0&appid= from 31/3.
There are only 7, so they are quite easy to see.
ID: 2046315
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2046321 - Posted: 22 Apr 2020, 12:36:42 UTC - in response to Message 2046318.  
Last modified: 22 Apr 2020, 12:44:55 UTC


I see that regularly on WCG, so there is such an option. But the client in question must be alive, and actually do the cancelling after the server request.


Thanks, I thought I had seen it before.

I think there may be over 1 million tasks in this condition now ("Results waiting for db purging"), out of the total "Results returned and awaiting validation".

I'm sure that value has been increasing significantly over the last few days, but I can't find a Munin graph for it. https://munin.kiska.pw/munin/setiathome-day.html
ID: 2046321
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046323 - Posted: 22 Apr 2020, 12:48:00 UTC - in response to Message 2046318.  
Last modified: 22 Apr 2020, 12:50:22 UTC

Is there an option in the BOINC code to cancel unstarted tasks which have already met quorum?
I'm fairly sure that I have seen this used before on some other projects, possibly LHC.
The message the client produces when this happens is

"Server requested abort of unknown task %s"

And the 'unknown' here suggests that this mechanism is used to abort tasks that have no matching result row in the database. So more like a bug resolving mechanism than an option for server admins to use.

Edit - actually I misread the code. The message is only printed when the client failed to find the task it was asked to abort. The 'unknown' refers to this.
ID: 2046323
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2046328 - Posted: 22 Apr 2020, 13:15:05 UTC - in response to Message 2046323.  

I think the Unknowns may be Ghost tasks that have been cancelled.

I understand that this will only be helpful for big caches, but it should be helpful to reduce wasted effort, by ensuring that their tasks waiting to be processed are still needed.

The remainder will eventually time out, in weeks or months, if they never contact the servers again.
ID: 2046328
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2046330 - Posted: 22 Apr 2020, 13:29:30 UTC

If they really wanted to clear those results from the database faster, they would have reduced the ridiculously long deadlines. Some of the extra resends they are sending now have deadlines in late July.
ID: 2046330


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.