Panic Mode On (97) Server Problems?

kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1688843 - Posted: 7 Jun 2015, 7:50:08 UTC - in response to Message 1688840.  

Meow! The kitties have 157 to play with...............


You lucky sod!... :)

I would have to agree. Could I suggest that the work unit numbers get shifted to their own thread? I feel they have no bearing on how the servers perform. Don't get me wrong, I am pleased that people are getting these work units.

The servers have been performing very well over the past week and a half, if not more.

??
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1688843
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1688858 - Posted: 7 Jun 2015, 8:23:10 UTC
Last modified: 7 Jun 2015, 8:23:33 UTC

Oh, I got 1 AP.

Nice.


With each crime and every kindness we birth our future.
ID: 1688858
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1688865 - Posted: 7 Jun 2015, 8:40:45 UTC - in response to Message 1688843.  

Meow! The kitties have 157 to play with...............


You lucky sod!... :)

I would have to agree. Could I suggest that the work unit numbers get shifted to their own thread? I feel they have no bearing on how the servers perform. Don't get me wrong, I am pleased that people are getting these work units.

The servers have been performing very well over the past week and a half, if not more.

??

Thinking about it, maybe a week and a half was a bit generous. The servers have definitely been keeping the ready-to-send buffer full for the last 3 or so days. I forgot about when we almost ran out of work.
ID: 1688865
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1688867 - Posted: 7 Jun 2015, 9:01:19 UTC - in response to Message 1688865.  

Meow! The kitties have 157 to play with...............


You lucky sod!... :)

I would have to agree. Could I suggest that the work unit numbers get shifted to their own thread? I feel they have no bearing on how the servers perform. Don't get me wrong, I am pleased that people are getting these work units.

The servers have been performing very well over the past week and a half, if not more.

??

Thinking about it, maybe a week and a half was a bit generous. The servers have definitely been keeping the ready-to-send buffer full for the last 3 or so days. I forgot about when we almost ran out of work.

The GPUs got a bit shy, but with 9 rigs going, the CPU buffer is always doing pretty well. I can run the GPUs dry and still have days' worth of work on the CPUs.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1688867
Profile Cactus Bob
Joined: 19 May 99
Posts: 209
Credit: 10,924,287
RAC: 29
Canada
Message 1688869 - Posted: 7 Jun 2015, 9:10:21 UTC

Well, I got 14 of the AP puppies. The most I have ever had. Mind you, I just returned a couple of months ago after being AWOL for several years.

This is on 1 machine, so 157 on 9 machines seems close. I haven't tweaked anything to get more APs, but maybe it would be worth a shot.

May the AP force be with you

Bob
------------------------
Sig files are overrated, maybe
Sometimes I wonder, what happened to all the people I gave directions to?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SETI@home classic workunits 4,321
SETI@home classic CPU time 22,169 hours
ID: 1688869
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1688872 - Posted: 7 Jun 2015, 9:33:26 UTC

My APs are buried in cache.

But, I'll turn them in sooner than most.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1688872
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1688901 - Posted: 7 Jun 2015, 12:59:06 UTC

I managed to snag 80 APs before they ran out. I disagree that workload distribution doesn't belong in this thread. It should be the function of the scheduling server to distribute work efficiently. If that is not occurring, for whatever reason, it's germane to this thread.

Perhaps a re-think of the scheduling program to take into account the number of errors produced by a user and apportion the work units accordingly, e.g. 0 errors = full distribution, a 25% error rate = a 25% reduction in WUs delivered to those machines, etc.

Just an idea.
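To make that concrete, here is a minimal sketch of the apportioning rule (purely hypothetical; the function and the numbers are invented for illustration, and nothing like this exists in the current scheduler):

```python
# Hypothetical sketch of the error-rate apportioning idea above.
# The function name and all numbers are illustrative only.

def apportion(requested_wus, errored, completed):
    """Scale a host's allocation down by its recent error rate."""
    total = errored + completed
    if total == 0:
        return requested_wus              # no history yet: full distribution
    error_rate = errored / total          # e.g. 0.25 for a 25% error rate
    return max(int(requested_wus * (1 - error_rate)), 1)

# A host asking for 100 WUs with a 25% error rate would get 75.
print(apportion(100, errored=25, completed=75))   # -> 75
```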

"Sour Grapes make a bitter Whine." <(0)>
ID: 1688901
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1688904 - Posted: 7 Jun 2015, 13:19:53 UTC
Last modified: 7 Jun 2015, 13:26:12 UTC

What I think: if you get an error, you should get a test file (with known results), and until you return a valid result for the "test" you don't get more work, you just get "test" files.

Send in a valid test WU, then you get work.

Basically: prove to me you can provide scientific results. I don't want crap!
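A sketch of how such a gate might behave (hypothetical only; BOINC has no such mechanism, and the host fields below are invented):

```python
# Hypothetical sketch of the "prove yourself with a test WU" gate above.
# BOINC has no such mechanism; the host dictionary fields are invented.

def next_task(host):
    """Hand out only known-result test work until the host validates one."""
    return "test_workunit" if host["needs_test"] else "real_workunit"

def on_result(host, task, valid):
    if task == "real_workunit" and not valid:
        host["needs_test"] = True         # an error triggers the test gate
    elif task == "test_workunit" and valid:
        host["needs_test"] = False        # a valid test re-opens real work

host = {"needs_test": False}
on_result(host, "real_workunit", valid=False)
print(next_task(host))                    # -> test_workunit
on_result(host, "test_workunit", valid=True)
print(next_task(host))                    # -> real_workunit
```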
ID: 1688904
wiesel111
Joined: 5 Jan 08
Posts: 9
Credit: 1,227,675
RAC: 1
Germany
Message 1688918 - Posted: 7 Jun 2015, 14:22:41 UTC - in response to Message 1688144.  
Last modified: 7 Jun 2015, 14:25:53 UTC

Wow, I got an AP from computer 7567951. I did some checking and found computers 7567951, 7567912, 7568646, 7568596 and 7567941. Their computer information is identical right down to the floating point and integer speeds. Hmmm, interesting.


Add 2 more: 7568519 and 7568642.


Add 8 more computers:

7568512, 7567924, 7568508, 7568640, 7568643, 7568513, 7568516, 7568501

So 15 identical computers so far...



Here are some more: 7568511, 7568499, 7567886, 7568508, 7567913

Now we have 20 identical...


Now 21, adding 7568637.

Here is a small update on the identical computers. One was listed twice (7568508), but I found some more: 7572890, 7567931, 7568597, 7568503, 7569519, 7572867.


After today's new APs I found 8 more (7567911, 7567922, 7567938, 7568510, 7568518, 7568644, 7569479, 7572925).

All 35 identical computers, in chronological order of creation:

7567886, 7567911, 7567912, 7567913, 7567922, 7567924, 7567931, 7567938, 7567941, 7567951, 7568499, 7568501, 7568503, 7568504, 7568508, 7568510, 7568511, 7568512, 7568513, 7568516, 7568518, 7568519, 7568596, 7568597, 7568637, 7568640, 7568642, 7568643, 7568644, 7568646, 7569479, 7569519, 7572867, 7572890, 7572925
ID: 1688918
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1688926 - Posted: 7 Jun 2015, 14:44:18 UTC - in response to Message 1688808.  

I don't know if this can be considered exploiting the system, but it's not really a new concept (I've known about it for several years, myself).

So with APs being split presently, I tried snagging as many as I could get. When asking for 2.6M seconds of work, I got 6 consecutive "no work available" replies. I changed the 10-day cache down to 1-day, and the next request resulted in getting a single AP. And so did the second, and third.

I figured now that I've got something, I could go back to 10-day and just let it continue filling, but that resulted in three consecutive "no work available" replies. Dropped back to 1-day, and got three more consecutive replies, each with one AP.

Once I got to a point where BOINC decided my 1-day cache was full, I changed it to two days, to keep the "requesting work for [x] seconds" value fairly low, and again, success.

I don't know what the rough cut-off value is, but I know that it is quite difficult to get work when asking for 2.6M seconds of work on an empty cache, unless there are 10+ tapes being split, but asking for smaller amounts of work seems to end up with a higher success-rate.

I'm not sure it has any effect. Looking over the logs from while I was asleep, I managed 92 new APs while they were going out, and I have a 10-day cache set for my venues.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[
ID: 1688926
Seahawk
Volunteer tester
Joined: 8 Jan 08
Posts: 937
Credit: 8,157,029
RAC: 5
United States
Message 1688927 - Posted: 7 Jun 2015, 14:46:55 UTC

I got 10 wu_0 APs buried in 190 MB tasks. It's been so long since I've seen a wu_0 that I had to look several times to make sure.
I used to be a cruncher like you, then I took an arrow to the knee.
ID: 1688927
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689005 - Posted: 7 Jun 2015, 19:31:08 UTC - in response to Message 1688926.  

I'm not sure it has any effect. Looking over the logs while I was asleep. I managed 92 new AP while they were going out & I set a 10 day cache for my venues.

But did you already have something in your cache (MBs)? I had more success getting something on an empty cache by requesting a small amount of work at a time, rather than 2.6M seconds. I remember that when there were 10+ tapes available, it often took 10-20 "no work available" replies before I'd start to get some APs, but once they started coming in, the cache would fill moderately quickly.
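The rough arithmetic behind those request sizes looks something like this (a sketch only, with an example core count; the real figure comes from the client's cache-days setting and what is already buffered, not from a script):

```python
# Rough illustration of the "requesting work for N seconds" figures above.
# The core count and cache sizes are example values.

SECONDS_PER_DAY = 86400

def rough_request_seconds(cache_days, ncpus, buffered_seconds=0):
    """Very rough estimate of how much work the client asks for."""
    wanted = cache_days * SECONDS_PER_DAY * ncpus
    return max(wanted - buffered_seconds, 0)

# An empty cache on a 3-core host:
print(rough_request_seconds(10, ncpus=3))   # ~2.6M seconds, often refused outright
print(rough_request_seconds(1, ncpus=3))    # ~0.26M seconds, much easier to satisfy
```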



I managed to snag 80 APs before they ran out. I disagree that workload distribution doesn't belong in this thread. It should be the function of the scheduling server to distribute work efficiently. If that is not occurring, for whatever reason, it's germane to this thread.

Perhaps a re-think of the scheduling program to take into account the number of errors produced by a user and apportion the work units accordingly, e.g. 0 errors = full distribution, a 25% error rate = a 25% reduction in WUs delivered to those machines, etc.

Just an idea.

Here's an idea: how about the quota system? It's already in-place, it just needs to have the minimum value be able to go down to 1, and more importantly, be enforced (I'm about 98% sure it is not enforced at all presently).

But that's a rant I've had a few times in another thread already. I'll stop now.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689005
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1689007 - Posted: 7 Jun 2015, 19:36:15 UTC - in response to Message 1689005.  


Here's an idea: how about the quota system? It's already in-place, it just needs to have the minimum value be able to go down to 1, and more importantly, be enforced (I'm about 98% sure it is not enforced at all presently).


I'm pretty sure that some older incarnations of Boinc do not work with the limits in place on the servers at all.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1689007
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689039 - Posted: 7 Jun 2015, 23:18:43 UTC - in response to Message 1689007.  


Here's an idea: how about the quota system? It's already in-place, it just needs to have the minimum value be able to go down to 1, and more importantly, be enforced (I'm about 98% sure it is not enforced at all presently).


I'm pretty sure that some older incarnations of Boinc do not work with the limits in place on the servers at all.

I think I've heard that. Old builds like 5.10.45 can get thousands of WUs presently, and I think it is due to the fact that those really old builds don't send a list of what they have back to the server during scheduler contacts. I think that started in the 6.x series. I know 6.2.19 sends its list during contact.

But other than that, I don't think the quota system is enforced. Shortly after APv7 was released, my "max tasks per day" was in the mid-40s, and I got close to 70 APs in the course of an hour or two. If the quota system was enforced.. I shouldn't have been able to get more than "max tasks per day." That's what I'm basing my theory off of, so.. I could be wrong.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689039
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1689042 - Posted: 8 Jun 2015, 0:12:29 UTC - in response to Message 1689039.  

... I don't think the quota system is enforced. Shortly after APv7 was released, my "max tasks per day" was in the mid-40s, and I got close to 70 APs in the course of an hour or two. If the quota system was enforced.. I shouldn't have been able to get more than "max tasks per day." That's what I'm basing my theory off of, so.. I could be wrong.

If that was a single core system, perhaps you're right. Otherwise, the "max tasks per day" is multiplied by the number of cores for CPU tasks, and for GPU is further multiplied by the project's "gpu_multiplier" setting.

I fully agree that the quota system is woefully inadequate for hosts which turn in a lot of results as successfully processed that are subsequently found to be invalid. I can remember at least two extended discussions on the boinc_dev mailing list where various arguments were presented concerning possible adjustments. Those were probably before GPU processing became a major factor. In any case, IMO it would take concern from multiple projects to convince Dr. Anderson a change should be made.
                                                                   Joe
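Reading that description literally, the arithmetic works out roughly as below (the base quota, core count and gpu_multiplier are example values, not the project's actual settings):

```python
# Rough illustration of the daily quota arithmetic described above.
# The base quota, core count and gpu_multiplier are example values.

def effective_daily_quota(max_tasks_per_day, ncpus, gpu_multiplier):
    cpu_quota = max_tasks_per_day * ncpus    # "multiplied by the number of cores"
    gpu_quota = cpu_quota * gpu_multiplier   # "further multiplied by gpu_multiplier"
    return cpu_quota, gpu_quota

# A "max tasks per day" in the mid-40s on a 4-core host:
cpu, gpu = effective_daily_quota(45, ncpus=4, gpu_multiplier=10)
print(cpu, gpu)   # 180 and 1800, which leaves plenty of headroom for ~70 APs
```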
ID: 1689042
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1689046 - Posted: 8 Jun 2015, 1:08:38 UTC - in response to Message 1689042.  

... I don't think the quota system is enforced. Shortly after APv7 was released, my "max tasks per day" was in the mid-40s, and I got close to 70 APs in the course of an hour or two. If the quota system was enforced.. I shouldn't have been able to get more than "max tasks per day." That's what I'm basing my theory off of, so.. I could be wrong.

If that was a single core system, perhaps you're right. Otherwise, the "max tasks per day" is multiplied by the number of cores for CPU tasks, and for GPU is further multiplied by the project's "gpu_multiplier" setting.

I fully agree that the quota system is woefully inadequate for hosts which turn in a lot of results as successfully processed that are subsequently found to be invalid. I can remember at least two extended discussions on the boinc_dev mailing list where various arguments were presented concerning possible adjustments. Those were probably before GPU processing became a major factor. In any case, IMO it would take concern from multiple projects to convince Dr. Anderson a change should be made.
                                                                   Joe

Okay, yeah, that's kind of what I thought was going on in the other thread where I was discussing/ranting about this. The quotas should be per application, not per device/cores.

So that makes it even scarier for those runaway machines that spew out thousands upon thousands of -9 overflow tasks, because the quota will not go below 33, and with multiple GPUs times the gpu_multiplier it is entirely useless to even have a quota system in the first place.

Thanks for clearing that one up. I've been trying to wrap my head around the problem to try to suggest reasonable solutions, and I needed more information.

So those are my suggestions: make the quotas per application, not per device, and let them go down to 1. That won't fix the problem on runaway machines, but it will do some serious damage control and essentially keep them from being a problem.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1689046
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1689453 - Posted: 9 Jun 2015, 11:35:07 UTC

As of 9 Jun 2015, 11:30:04 UTC

Replica seconds behind master: Offline (0m)

:(
Aloha, Uli

ID: 1689453
WezH
Volunteer tester
Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1689498 - Posted: 9 Jun 2015, 15:04:12 UTC - in response to Message 1688918.  

After today's new APs I found 8 more (7567911, 7567922, 7567938, 7568510, 7568518, 7568644, 7569479, 7572925).

All 35 identical computers, in chronological order of creation:

7567886, 7567911, 7567912, 7567913, 7567922, 7567924, 7567931, 7567938, 7567941, 7567951, 7568499, 7568501, 7568503, 7568504, 7568508, 7568510, 7568511, 7568512, 7568513, 7568516, 7568518, 7568519, 7568596, 7568597, 7568637, 7568640, 7568642, 7568643, 7568644, 7568646, 7569479, 7569519, 7572867, 7572890, 7572925


The last 5 are not identical to the first 30.

Anyway, the first 30 (feels like a classroom) identical computers currently have 537 AP units in progress. That's almost 8% of all workunits in the field.

I did see 20 valid WUs and 558 WUs with exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI.

Is there a problem with the GTX 660 driver, version 350.12, on every machine?

Or what?
ID: 1689498
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1689505 - Posted: 9 Jun 2015, 15:21:50 UTC

Is there a problem with the GTX 660 driver, version 350.12, on every machine?



From what I have seen on the threads.....YES.

"Sour Grapes make a bitter Whine." <(0)>
ID: 1689505
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1689510 - Posted: 9 Jun 2015, 15:36:46 UTC - in response to Message 1689498.  
Last modified: 9 Jun 2015, 15:37:20 UTC



Is there a problem with the GTX 660 driver, version 350.12, on every machine?

Or what?



Version 350.12 is OpenCL 1.2.

If you want to crunch APs with this driver version, you need to replace your apps with the ones provided by Raistmer.

Otherwise you need to roll back to an earlier driver version that still has OpenCL 1.1.

I think that is version 347.88.

Zalster
ID: 1689510