Okay..so, time for me to bit*h and moan


Message boards : Number crunching : Okay..so, time for me to bit*h and moan

Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1310749 - Posted: 27 Nov 2012, 13:22:41 UTC

First and foremost - nothing I say here is personal or an attack on anyone that runs this project - you do a heck of a job and are underappreciated - you give a TON! My respect.
However....
With the frequent outages - please - open up the queues - I want to crunch 24/7 and - well...I'm out except for one Astro unit. 200 workunits doesn't cut it for this box - 6 cores, 12 threads, and a 680. It's..getting hungry...
Question - when I look at other folks' queues - why do I see...MANY more units - several hundred to thousands....? All I get - no matter what I set - is 200..

Why do you send me workunits that expire 5 minutes AFTER you send them - hello?

A lot of the recent failures have been with the scheduler - so - why not open up the queue - give me 5 days of data - and set a MANDATORY "do not contact" setting until my systems are half full? I would still have 2 and a half days of data at that point - enough to get through most failures - and it would lower stress on the scheduler. It doesn't make ANY sense that my system contacts the server and asks for more work when my queue is nearly topped off - reporting 4 completed and asking for more (when...I still have the 196 to go). If there is a way to set the system to NOT allow the user to request communication until their queue is half full - it would be great.
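To make the half-full idea concrete, here is a minimal sketch in Python. It is only a toy model of the rule described above, not how the BOINC client or the project scheduler actually behaves; `WorkQueue`, `count_contacts` and the "top up after every task" baseline are all made-up names and simplifications for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkQueue:
    """Toy model of a client-side work queue (hypothetical, not BOINC code)."""
    capacity: int  # tasks held when the queue is full
    on_hand: int   # unfinished tasks currently on the machine

    def should_contact_scheduler(self) -> bool:
        # Stay quiet until the queue has drained to half of its capacity.
        return self.on_hand * 2 <= self.capacity

    def tasks_to_request(self) -> int:
        # When contact is allowed, ask for enough work to refill the queue.
        if not self.should_contact_scheduler():
            return 0
        return self.capacity - self.on_hand


def count_contacts(capacity: int, tasks_done: int, half_full_rule: bool) -> int:
    """Count scheduler contacts while crunching tasks_done tasks one by one."""
    queue = WorkQueue(capacity=capacity, on_hand=capacity)
    contacts = 0
    for _ in range(tasks_done):
        queue.on_hand -= 1
        if half_full_rule:
            request = queue.tasks_to_request()
        else:
            # Simplified baseline: top the queue up after every finished task.
            request = queue.capacity - queue.on_hand
        if request > 0:
            contacts += 1
            queue.on_hand += request
    return contacts
```

In this toy model, a 200-task queue that crunches 1000 tasks contacts the scheduler 10 times under the half-full rule, against 1000 times for the top-up-every-task baseline. The trade-off it ignores is that finished results would also be reported later.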

Again though - my respect. These are just the thoughts of someone that wants to crunch 24/7 and my systems are hungry. I donated a little money - but - if there is any other way I can help - I'm here. I have a background in communications (20 years in the Air Force) and can do anything from ordering circuits to installing them, troubleshooting them and their systems, and engineering comm (fiber, DWDM, whatever). Here to help if needed (doubtful - but hey...ya just never know). I'm not doubting what you do at all - but - as I have learned through my career - sometimes - fresh eyes....

____________

Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1310750 - Posted: 27 Nov 2012, 13:41:37 UTC - in response to Message 1310749.

And - with the above - I know there is a concern regarding bandwidth - however - slow and steady is just fine. Nobody needs to download at 500KB/sec - all we ever need is a rate high enough to have the data on hand when it is ready to be crunched. Data sitting idle on the system doesn't make it crunch faster - all that needs to happen is that when the system is ready to crunch - the data is there. Whether the data arrived at 500KB/sec or 30KB/sec is irrelevant - as long as it is there.
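A quick back-of-the-envelope version of that argument, as a Python sketch. The task size and daily task count below are assumed numbers chosen for illustration, not measured values from this project, and both function names are made up.

```python
SECONDS_PER_DAY = 86_400


def required_rate_kb_per_sec(tasks_per_day: float, task_size_kb: float) -> float:
    """Average download rate needed to keep up with the crunching rate."""
    return tasks_per_day * task_size_kb / SECONDS_PER_DAY


def download_seconds(task_size_kb: float, rate_kb_per_sec: float) -> float:
    """Time to fetch one task at a given download rate."""
    return task_size_kb / rate_kb_per_sec


# Assumed figures: 200 tasks crunched per day, 366 KB per task.
needed = required_rate_kb_per_sec(tasks_per_day=200, task_size_kb=366)
slow = download_seconds(task_size_kb=366, rate_kb_per_sec=30)
fast = download_seconds(task_size_kb=366, rate_kb_per_sec=500)
```

With those assumed numbers, the machine needs under 1 KB/sec on average, and a single task takes about 12 seconds at 30KB/sec versus under a second at 500KB/sec. Either way the data is there long before it is needed.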
____________

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11668
Credit: 14,367,185
RAC: 13,301
United States
Message 1310827 - Posted: 27 Nov 2012, 21:11:11 UTC - in response to Message 1310749.

> First and foremost - nothing I say here is personal or an attack on anyone that runs this project - you do a heck of a job and are underappreciated - you give a TON! My respect.
> However....
> With the frequent outages - please - open up the queues - I want to crunch 24/7 and - well...I'm out except for one Astro unit. 200 workunits doesn't cut it for this box - 6 cores, 12 threads, and a 680. It's..getting hungry...
> Question - when I look at other folks' queues - why do I see...MANY more units - several hundred to thousands....? All I get - no matter what I set - is 200..

They probably got those before the limits were put in place. Many of those units may even be ghosts that aren't really on the machines in question.

> Why do you send me workunits that expire 5 minutes AFTER you send them - hello?

I thought everyone who is a regular in the forums knew this by now... You get short timeouts like that when your computer asks for both CPU and GPU work and the scheduler assigns it but the message doesn't get back to your computer. Five minutes later, it asks again, but this time only for CPU. The scheduler realizes it can't send the previously assigned GPU tasks on a CPU-only request, so it times them out immediately.

> A lot of the recent failures have been with the scheduler - so - why not open up the queue - give me 5 days of data - and set a MANDATORY "do not contact" setting until my systems are half full? I would still have 2 and a half days of data at that point - enough to get through most failures - and it would lower stress on the scheduler. It doesn't make ANY sense that my system contacts the server and asks for more work when my queue is nearly topped off - reporting 4 completed and asking for more (when...I still have the 196 to go). If there is a way to set the system to NOT allow the user to request communication until their queue is half full - it would be great.

There should be a way to set your cache configuration to make it work this way, but I'm not sure how you would do it or if it would matter with the limits on.

> Again though - my respect. These are just the thoughts of someone that wants to crunch 24/7 and my systems are hungry. I donated a little money - but - if there is any other way I can help - I'm here. I have a background in communications (20 years in the Air Force) and can do anything from ordering circuits to installing them, troubleshooting them and their systems, and engineering comm (fiber, DWDM, whatever). Here to help if needed (doubtful - but hey...ya just never know). I'm not doubting what you do at all - but - as I have learned through my career - sometimes - fresh eyes....

Fresh eyes would probably help. Even getting Matt's eyes back from Europe would probably be a shot in the arm right now.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2713
Credit: 6,148,344
RAC: 5,726
Bulgaria
Message 1310910 - Posted: 28 Nov 2012, 4:14:30 UTC - in response to Message 1310827.


> > Why do you send me workunits that expire 5 minutes AFTER you send them - hello?
>
> I thought everyone who is a regular in the forums knew this by now... You get short timeouts like that when your computer asks for both CPU and GPU work and the scheduler assigns it but the message doesn't get back to your computer. Five minutes later, it asks again, but this time only for CPU. The scheduler realizes it can't send the previously assigned GPU tasks on a CPU-only request, so it times them out immediately.

:) You are close to the explanation, but not exactly.
Yes - it happens only if those tasks were ghosts and only if they were VLARs (the first request was for CPU, and some VLARs were assigned to the CPU but became ghosts).
Then the second request was for GPU (and VLARs are not sent to GPUs), "so it times them out immediately".
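The sequence can be sketched as a toy model in Python. This is only one reading of the explanation in this thread, not the project's actual scheduler code; `Task`, `Resource`, `can_run_on` and `handle_request` are hypothetical names, and the model assumes every resend reaches the host.

```python
from dataclasses import dataclass
from enum import Enum


class Resource(Enum):
    CPU = "cpu"
    GPU = "gpu"


@dataclass
class Task:
    name: str
    is_vlar: bool
    delivered: bool = False  # False: assigned, but the reply never reached the host (a ghost)
    timed_out: bool = False


def can_run_on(task: Task, resource: Resource) -> bool:
    # Per the explanation above: VLARs are not sent to GPUs.
    return not (task.is_vlar and resource is Resource.GPU)


def handle_request(assigned: list[Task], requested: Resource) -> list[Task]:
    """Resend ghosts that fit the new request; time out the ones that do not."""
    resent = []
    for task in assigned:
        if task.delivered or task.timed_out:
            continue  # not a ghost, nothing to do
        if can_run_on(task, requested):
            task.delivered = True  # assumption: the resend arrives this time
            resent.append(task)
        else:
            task.timed_out = True  # shows up as a task that expired almost at once
    return resent
```

For example, if a VLAR and a non-VLAR task both became ghosts and the next request is for GPU work, this model resends the non-VLAR task and times out the VLAR; a CPU request would have resent both.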


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)


Copyright © 2014 University of California