Panic Mode On (108) Server Problems?


Profile Chris S Crowdfunding Project Donor, Special Project $75 donor
Volunteer tester
Joined: 19 Nov 00
Posts: 40776
Credit: 41,422,439
RAC: 1,236
United Kingdom
Message 1904369 - Posted: 2 Dec 2017, 10:42:03 UTC

The Grand Old Duke of Seti
He had 10,000 crunchers
He led them to believe in lots of work
And he let them down again
And when it was up it was up
And when it was down it was down
And when it was only halfway up
It was neither up nor down
ID: 1904369
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12194
Credit: 125,032,810
RAC: 38,954
United Kingdom
Message 1904382 - Posted: 2 Dec 2017, 12:30:35 UTC

It's unfortunate that, as has happened so often after past outages, the luck of the draw has thrown tapes full of Arecibo shorties into the splitter just as we need a steady supply of good, chewy work. One of my GPU machines has finally filled its 200-task queue, and it's got precisely 100 shorties and 100 guppies - nothing in between.
ID: 1904382
Profile Advent42
Joined: 23 Mar 17
Posts: 175
Credit: 4,015,683
RAC: 0
Ireland
Message 1904395 - Posted: 2 Dec 2017, 13:48:58 UTC - in response to Message 1904382.  

Yeah!!
The search continues...:-)
ID: 1904395
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 3705
Credit: 84,737,600
RAC: 130,955
Australia
Message 1904421 - Posted: 2 Dec 2017, 16:04:15 UTC - in response to Message 1904066.  

It's all fine. The project will be back in January. Maybe a little earlier than that.
Which January?


. . :)

Stephen

:)
ID: 1904421
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 3705
Credit: 84,737,600
RAC: 130,955
Australia
Message 1904425 - Posted: 2 Dec 2017, 16:18:19 UTC - in response to Message 1904225.  

I too have wondered why SETI has stuck with the extremely long deadlines, which I assume were implemented for the original hardware used on the project. That kind of hardware is 18 years in the past and does not need to continue to be supported. I agree with you, Jeff; I would expect the sizes of the databases, and the strain they put on the project, would be greatly lessened if the deadlines were reduced by a month, let's say from the current 2-month deadline.


. . That would get my support :)

Stephen

..
ID: 1904425
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 3705
Credit: 84,737,600
RAC: 130,955
Australia
Message 1904427 - Posted: 2 Dec 2017, 16:27:54 UTC - in response to Message 1904227.  

I too have wondered why SETI has stuck with the extremely long deadlines, which I assume were implemented for the original hardware used on the project. That kind of hardware is 18 years in the past and does not need to continue to be supported. I agree with you, Jeff; I would expect the sizes of the databases, and the strain they put on the project, would be greatly lessened if the deadlines were reduced by a month, let's say from the current 2-month deadline.

As I understand it, the reason there has been no adjustment is that Eric does not wish to disenfranchise anybody from participating in this project.

And that would include folks with very meager hardware resources. Not everybody can afford what some of us are able to.

That is why.


. . Hi Mark,

. . I have heard that reasoning before, but there is no rational reason for any rig, no matter how slow, to download more work than it can process in a month. If the gear can only process 2 tasks per week, then set the cache so that you only download half a dozen jobs, and a one-month deadline is still not an issue. Since every time you upload/report results you get fresh work (OK, I'm an optimist 8^} ), such a rig would still be productive. Why should any rig have 100 tasks cached if it would take that rig 6 months to process them? I completely agree that such disproportionate downloading makes a great argument for shorter deadlines.

Stephen

..
ID: 1904427
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 3705
Credit: 84,737,600
RAC: 130,955
Australia
Message 1904430 - Posted: 2 Dec 2017, 16:38:32 UTC - in response to Message 1904244.  

Don't forget it isn't just the raw crunching time you need to consider for deadlines - it's also all the dead time when the computer is switched off or in use. And for Android, when it's away from the charger.


. . But if any given device cannot process a single task within a month, whether due to insufficient crunching power or lack of run time, then is that device really fit for purpose ??

Stephen

??
ID: 1904430
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12194
Credit: 125,032,810
RAC: 38,954
United Kingdom
Message 1904439 - Posted: 2 Dec 2017, 17:10:34 UTC - in response to Message 1904427.  

Why should any rig have 100 tasks cached if it would take that rig 6 months to process them?
If the pattern is persistent, it wouldn't be able to. Work is requested by time (use the <sched_op_debug> flag, and really read the Event Log): the maximum time request is 20 days, not 6 months.
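
Here's the arithmetic, as a minimal sketch (plain Python, not the actual BOINC client code - the names are mine; the 20-day ceiling is the figure quoted above):

SECONDS_PER_DAY = 86400
MAX_REQUEST_DAYS = 20   # assumed server-side ceiling on one time request

def work_request_seconds(min_buffer_days, extra_buffer_days):
    # The client asks for seconds of work, clamped to the maximum.
    wanted_days = min(min_buffer_days + extra_buffer_days, MAX_REQUEST_DAYS)
    return wanted_days * SECONDS_PER_DAY

print(work_request_seconds(10, 10))   # 1728000 seconds = the full 20 days

However large a backlog the cache settings imply, one scheduler request can never ask for more than those 20 days' worth.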
ID: 1904439
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904447 - Posted: 2 Dec 2017, 17:44:16 UTC

Given the arguments in favor of the roughly 8-week deadlines for normal AR and VLAR MB tasks, in order to accommodate even the most laggardly of hosts, can anyone then explain the 3-week deadlines for AP tasks? On my machines, regardless of the CPU or GPU or OS, AP tasks take longer to run than the longest-running of those MB tasks, in some cases, 2 or 3 times as long. If 3 weeks is an adequate deadline for APs, why not for MBs? Mind you, I'm not advocating for that short a deadline for either of those categories of tasks, but that's always struck me as a glaring inconsistency.
ID: 1904447
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 3705
Credit: 84,737,600
RAC: 130,955
Australia
Message 1904448 - Posted: 2 Dec 2017, 18:00:28 UTC - in response to Message 1904439.  
Last modified: 2 Dec 2017, 18:08:07 UTC

Why should any rig have 100 tasks cached if it would take that rig 6 months to process them?
If the pattern is persistent, it wouldn't be able to. Work is requested by time (use the <sched_op_debug> flag, and really read the Event Log): the maximum time request is 20 days, not 6 months.


. . But that is the system's weakness. If a rig can crunch a WU in 2 hours (say an old i7 using just one CPU core) and it sets its work request to the 20-day maximum allowed, it will get the full server-limited allocation of 100 tasks, even if it is returning only a few per week, or only invalid results. I have come across many wingmen like that. If the allocation were based on average return time, as others have suggested, rather than average run time, then the allocation numbers would be appropriately reduced.
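
. . To put numbers on that scenario (a rough sketch in Python; the formula is my assumption about how estimated run time drives the allocation, and the 100-task cap is the server limit discussed here):

HOURS_PER_DAY = 24
SERVER_TASK_LIMIT = 100   # per-device in-progress cap discussed above

def tasks_allocated(request_days, est_hours_per_task):
    # Allocation is sized from estimated run time, not actual return rate.
    wanted = (request_days * HOURS_PER_DAY) / est_hours_per_task
    return min(int(wanted), SERVER_TASK_LIMIT)

print(tasks_allocated(20, 2))   # wants 240 tasks, capped at 100 - cache full

. . So the slow rig still fills its 100-task allocation, even though at a few results per week it can never work through them before the deadline.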

. . BTW, just how do you set flags in the event log??

Stephen

. .
ID: 1904448
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12194
Credit: 125,032,810
RAC: 38,954
United Kingdom
Message 1904449 - Posted: 2 Dec 2017, 18:05:19 UTC - in response to Message 1904447.  

Given the arguments in favor of the roughly 8-week deadlines for normal AR and VLAR MB tasks, in order to accommodate even the most laggardly of hosts, can anyone then explain the 3-week deadlines for AP tasks? On my machines, regardless of the CPU or GPU or OS, AP tasks take longer to run than the longest-running of those MB tasks, in some cases, 2 or 3 times as long. If 3 weeks is an adequate deadline for APs, why not for MBs? Mind you, I'm not advocating for that short a deadline for either of those categories of tasks, but that's always struck me as a glaring inconsistency.
No, but...

Some of it is covered in the (extremely ancient) Astropulse FAQ page. Astropulse had been around as a concept for some years, but this particular implementation was written as a grad-student project by Josh von Korff. Because it formed part of his examined coursework for whichever degree it was, Eric - as his supervisor - very deliberately:

(a) left him to work out the solutions to his own problems
(b) required him to handle deployment, snagging, and dealing with user feedback as part of his training.

Josh got it working and deployed, passed his degree, and moved on to continue his academic career at another institution.

I suppose the justification at the time (about 10 years ago, just before GPUs were added to the crunching mix) was that, since you could opt in or out of AP, only those with the most powerful CPUs would choose to opt in - others with deadline trouble could content themselves with MB only.
ID: 1904449
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904453 - Posted: 2 Dec 2017, 18:37:57 UTC - in response to Message 1904449.  

No, but...

Some of it is covered in the (extremely ancient) Astropulse FAQ page. Astropulse had been around as a concept for some years, but this particular implementation was written as a grad-student project by Josh von Korff. Because it formed part of his examined coursework for whichever degree it was, Eric - as his supervisor - very deliberately:

(a) left him to work out the solutions to his own problems
(b) required him to handle deployment, snagging, and dealing with user feedback as part of his training.

Josh got it working and deployed, passed his degree, and moved on to continue his academic career at another institution.

I suppose the justification at the time (about 10 years ago, just before GPUs were added to the crunching mix) was that, since you could opt in or out of AP, only those with the most powerful CPUs would choose to opt in - others with deadline trouble could content themselves with MB only.
Actually, it looks like it was, at least initially, only an opt-out choice, unless you were running optimized apps. All others got AP tasks automatically, if their systems met the requirements.

I'm intrigued by a couple of statements in there. First, that "The initial deadline for Astropulse tasks will be 14 days.", and then there's "... If our server judges that your computer cannot complete an Astropulse workunit in 22.5 days (75% of the maximum 30 days)...". So, the 14 days apparently got bumped up early on, but what's the meaning of "maximum 30 days"? Was that the maximum for any S@h task in those olden days? If so, how did we jump to 8 weeks, even as CPUs got faster and GPUs came into the mix?
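
Written out, the quoted rule is just a threshold test (a sketch of the stated rule only, not the actual scheduler code; the function name is invented):

def would_send_ap_task(est_completion_days, deadline_days=30.0, fraction=0.75):
    # Don't send a task the host can't finish within 75% of the deadline.
    return est_completion_days <= fraction * deadline_days

print(would_send_ap_task(22.0))   # True  - 22 days is inside the 22.5-day cutoff
print(would_send_ap_task(25.0))   # False - judged unable to finish in time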

The bottom line, to me, is that decisions regarding task deadlines are among those that were made a long time ago, and no longer take into account the processing environment as it exists today, both in terms of end-user hardware and the project's periodic database woes. Many aspects of the project are moving forward. These legacy decisions should be revisited to evaluate whether the reasons behind them still make sense.
ID: 1904453
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 10345
Credit: 138,919,040
RAC: 84,144
Australia
Message 1904457 - Posted: 2 Dec 2017, 18:48:21 UTC - in response to Message 1904453.  

So, the 14 days apparently got bumped up early on, but what's the meaning of "maximum 30 days"? Was that the maximum for any S@h task in those olden days? If so, how did we jump to 8 weeks, even as CPUs got faster and GPUs came into the mix?

Those numbers were probably based on the original pre-BOINC Seti work times, then beefed up for the original BOINC Seti. Since then, there have been several versions of Seti, each one involving more processing, and hence longer runtimes on given hardware, than the version before. Hence the long deadlines, based on the crunching time of that much older hardware.

The bottom line, to me, is that decisions regarding task deadlines are among those that were made a long time ago, and no longer take into account the processing environment as it exists today, both in terms of end-user hardware and the project's periodic database woes. Many aspects of the project are moving forward. These legacy decisions should be revisited to evaluate whether the reasons behind them still make sense.

That's the best argument yet for changing the deadlines, IMHO. Current basic Android devices would be on par, computation-wise, with what was a high-end P4. More recent Android devices are not only higher-performing but multi-core as well; let alone current CPUs with their AVX, AVX2 and IPC (instructions per clock) improvements. And then there are GPUs.
Many of the deadlines were based on WU run times for what was the low-end hardware of the day. Current low-end hardware matches, or even exceeds, the high-end hardware of that period.
Grant
Darwin NT
ID: 1904457
Grumpy Old Man
Volunteer tester
Joined: 1 Nov 08
Posts: 7269
Credit: 45,283,118
RAC: 1,412
Sweden
Message 1904458 - Posted: 2 Dec 2017, 18:50:36 UTC

Sending resent tasks primarily to reliable hosts would really do the trick. Waiting months for unreliable hosts to time out isn't the best way to handle resends.
Pairing reliable hosts for _0 and _1 tasks would also help very much.
ID: 1904458
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12194
Credit: 125,032,810
RAC: 38,954
United Kingdom
Message 1904460 - Posted: 2 Dec 2017, 18:52:44 UTC - in response to Message 1904453.  

The bottom line, to me, is that decisions regarding task deadlines are among those that were made a long time ago, and no longer take into account the processing environment as it exists today, both in terms of end-user hardware and the project's periodic database woes. Many aspects of the project are moving forward. These legacy decisions should be revisited to evaluate whether the reasons behind them still make sense.
I totally agree. But they need to be rational, considered revisitations, taking into account the needs of everyone - the project itself, the users who post here, the users who don't post here, the users with the latest hardware, the users with one clunky hand-me-down, the users who are exclusively dedicated to SETI, the users who spread themselves thinly across multiple projects.....

And everyone in between.

What the project needs most of all is time to think, and data to base decisions on - which means fixing broken web pages like client types and science status, which haven't updated since before Matt Lebofsky was diverted away from us.

And that means more people. And more people means more money. Thinking caps on, please.
ID: 1904460
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 5567
Credit: 386,149,532
RAC: 1,034,072
United States
Message 1904490 - Posted: 2 Dec 2017, 21:05:55 UTC - in response to Message 1904458.  

Sending resent tasks primarily to reliable hosts would really do the trick. Waiting months for unreliable hosts to time out isn't the best way to handle resends.
Pairing reliable hosts for _0 and _1 tasks would also help very much.

That's a question I've always had about SETI. Does the project use, or have in place, any reliable-host check mechanism? I know that both Einstein and Milkyway, the other projects I crunch for, use mechanisms to evaluate hosts to make sure they produce accurate and reliable data.

I suspect not, though, because of the ever-growing "Invalid Host Messaging" thread.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
ID: 1904490
Grumpy Old Man
Volunteer tester
Joined: 1 Nov 08
Posts: 7269
Credit: 45,283,118
RAC: 1,412
Sweden
Message 1904492 - Posted: 2 Dec 2017, 21:22:11 UTC - in response to Message 1904490.  
Last modified: 2 Dec 2017, 21:28:02 UTC

Sending resent tasks primarily to reliable hosts would really do the trick. Waiting months for unreliable hosts to time out isn't the best way to handle resends.
Pairing reliable hosts for _0 and _1 tasks would also help very much.

That's a question I've always had about SETI. Does the project use, or have in place, any reliable-host check mechanism? I know that both Einstein and Milkyway, the other projects I crunch for, use mechanisms to evaluate hosts to make sure they produce accurate and reliable data.

I suspect not, though, because of the ever-growing "Invalid Host Messaging" thread.

Yeah, the "Consecutive valid tasks" check and the "Max tasks per day", is really pretty useless, since the next day you will get another chance, to trash more WU's.
And if you complete just a few ones OK, you'll get more to trash until your "Max tasks per day" gets too low, but then again, a few more OK tasks will give you more tasks again.

It must be a better way to stop unreliable hosts from trashing millons of tasks combined.
A longtime reliability check would IMO be much better, and then first pair reliable hosts with each other, and of course sending resends to reliable hosts only.

Edit: And also, send resends to reliable hosts which also returns their tasks pretty quick.
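
To make the idea concrete, something like this - purely hypothetical Python, every threshold and field name invented for illustration, not an existing SETI@home/BOINC mechanism:

from dataclasses import dataclass

@dataclass
class Host:
    valid: int                  # lifetime validated results
    invalid: int                # lifetime invalid results
    avg_turnaround_days: float  # average time to return a task

def is_reliable(host, min_valid_frac=0.95, max_turnaround_days=7.0):
    # Long-term reliability: mostly-valid results AND quick returns.
    total = host.valid + host.invalid
    if total == 0:
        return False
    return (host.valid / total >= min_valid_frac
            and host.avg_turnaround_days <= max_turnaround_days)

def pick_resend_host(candidates):
    # Route a resend to the fastest-returning reliable host, if any.
    reliable = [h for h in candidates if is_reliable(h)]
    return min(reliable, key=lambda h: h.avg_turnaround_days) if reliable else None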
ID: 1904492
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1094
Credit: 9,307,398
RAC: 4,639
New Zealand
Message 1904508 - Posted: 2 Dec 2017, 22:19:53 UTC - in response to Message 1904361.  

But, as expected, there's trouble in paradise.
SSP stopped updating: [As of 2 Dec 2017, 9:40:04 UTC]

And what usually follows after that.....well, you all know that :-(

Well, I am hoping the success was not that short lived, and the SSP snag is just due to the heavy load things must be under.
Kitties are hopeful.
Meow.

The SSP just updated. Thanks Dog for that :-)

When you next feed your Dog, maybe you could give it some extra food as a reward. Just an idea.
ID: 1904508
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 257
Credit: 2,544,358
RAC: 142
Australia
Message 1904535 - Posted: 3 Dec 2017, 1:55:50 UTC - in response to Message 1904448.  
Last modified: 3 Dec 2017, 1:56:21 UTC

Why should any rig have 100 tasks cached if it would take that rig 6 months to process them?
If the pattern is persistent, it wouldn't be able to. Work is requested by time (use the <sched_op_debug> flag, and really read the Event Log): the maximum time request is 20 days, not 6 months.

***SNIP***
. . BTW, just how do you set flags in the event log??

Stephen

. .


In the Manager (assuming you use the "Advanced View"), go to Options -> Event Log Options and check sched_op_debug.

Or edit your cc_config.xml and add <sched_op_debug>1</sched_op_debug> (or change the 0 to a 1 if sched_op_debug already exists).

I'll provide a sample:
<cc_config>
    <log_flags>
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
    <!--- Insert other configuration stuff here -->
</cc_config>
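
After saving cc_config.xml, the client only picks the change up when it re-reads the file: use Options -> Read config files in the Manager (menu location may vary by client version) or restart BOINC. The [sched_op] lines in the Event Log will then show, among other things, how many seconds of work each scheduler request asks for.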
ID: 1904535
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1904550 - Posted: 3 Dec 2017, 3:50:18 UTC - in response to Message 1904460.  

I totally agree. But they need to be rational, considered revisitations, taking into account the needs of everyone - the project itself, the users who post here, the users who don't post here, the users with the latest hardware, the users with one clunky hand-me-down, the users who are exclusively dedicated to SETI, the users who spread themselves thinly across multiple projects.....

And everyone in between.

What the project needs most of all is time to think, and data to base decisions on - which means fixing broken web pages like client types and science status, which haven't updated since before Matt Lebofsky was diverted away from us.

And that means more people. And more people means more money. Thinking caps on, please.
I think I would slightly modify "everyone" to "almost everyone". As a practical matter, and in a changing environment that is, hopefully, moving forward, some categories of users and/or hardware need to be recognized as no longer viable from time to time. After all, I'm sitting here with a Win98 machine a few inches from my left knee. I still use it occasionally because that was the last OS with a driver that supports my preferred scanner. I seem to recall reading somewhere that there actually is still a way to use that machine for crunching in a very limited way. However, I'm pretty sure that, even if I could, it would require far too much effort on my part. Why? Because the project chose to no longer support such "obsolete" hardware in its "stock" environment. There's just no simple "download and go" option there. That means that not every user with a "clunky hand-me-down" should expect that the project will forever support their (relatively) diminishing contribution.

Ah, yes, actual DATA. That's a commodity that's sadly lacking for most of us down here on the mushroom farm. Virtually every post discussing some aspect of how the project should be run, or should be changed, is laden with guesswork, supposition, and assumptions about what the original "intent" of the ancients, I mean, the designers of the system, was. References to actual evidence to support those assumptions are pretty rare. That's certainly been true regarding the debate over task deadlines that has taken place from time to time, and which I've never participated in before. However, seeing a task count in the RTS buffer which seemed like it could only represent timed-out resends presented an opportunity to perhaps inject some real data into the discussion. But I agree that, ideally anyway, more data is needed to make a decision. That data is probably all available, in some form or another, on the servers, in the databases, or in the logs, repositories not readily available to mushrooms. :^)

A thinking cap......hmmmmm. Now, where did I put mine? Is that the one with the little propeller on top?
ID: 1904550
