Panic Mode On (58) Server problems?


Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2356
Credit: 8,947,854
RAC: 3,984
United States
Message 1160251 - Posted: 8 Oct 2011, 17:06:10 UTC

All seems well (aside from the much nicer limits for you power crunchers). The avian friend is happy, and my cache is full, with the correct ETAs I might add.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4347
Credit: 1,125,998
RAC: 841
United States
Message 1160253 - Posted: 8 Oct 2011, 17:09:38 UTC - in response to Message 1160186.


As you know, I reverted my host 3751792 to stock when you reported the 'no work to stock GPUs' bug earlier this week.

I'm seeing correct runtime estimates with a DCF currently around 1.2 (it'll drop back again when I reach the next batch of shorties). Since I'm running GPU only, and the card is well fast enough to show APR/estimate anomalies, I'm assuming that the 'non-anonymous-platform' bits of [24217] were never applied here, even though we know the change in the ratio limit from 2 to 10 is active. I think the verdict on the APR outlier code is the old Scottish standby of 'not proven', but we ought to test it sometime, for both the stock and anonymous platform cases.
...

For any host with more than 20 validations for an app_version, it's nearly inconceivable that APR/estimates anomalies will make much difference.

For non-anonymous hosts the APR is sent to the host as the <flops> for that app version, so the ratio limit of 10 would have to be exceeded between two requests to the Scheduler for any capping to take place. 1.01^231 = 9.96 so even 231 near zero validated runtimes wouldn't get into the capping. OTOH assuming the APR was about right before that, that shift of ~10 would put the host on the border of the range where -177 elapsed time exceeded errors could happen.
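
As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes, as the paragraph implies, that each validated result can shift the APR by at most a factor of 1.01 and that capping starts once the cumulative ratio reaches 10. The function name `validations_to_exceed` is hypothetical and is not part of BOINC.

```python
def validations_to_exceed(step: float, limit: float) -> int:
    """Count how many compounding steps of size `step` are needed
    before the cumulative ratio reaches `limit`."""
    if step <= 1.0:
        raise ValueError("step must be greater than 1 for the ratio to grow")
    ratio = 1.0
    count = 0
    while ratio < limit:
        ratio *= step
        count += 1
    return count


# Assumed figures taken from the post: 1% per validation, ratio limit of 10.
print(round(1.01 ** 231, 2))            # cumulative shift after 231 validations
print(validations_to_exceed(1.01, 10))  # first count at which the limit is reached
```

Under those assumptions the cumulative shift after 231 validations is still just short of the limit, which matches the figure quoted in the post.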

For anonymous hosts where the users are allowing totally inadequate <flops> based on the Whetstone benchmark to be sent by the core client, it is effectively guaranteed that estimates for most GPU processing will be a mess. Those who are getting by because a DCF down near 0.02 is enough to compensate for the bad <flops> are merely in danger of DCF going even lower and restricting work fetch. Whatever happens to APR matters not, since the user has chosen to operate in the zone where the server estimate is always capped.

For all host app versions, the first 10 validations are critical. If the APR isn't somewhat reasonable at that point it will take a lot of good runtimes to shift it to a better approximation. That's where the runtime_outlier logic will be helpful, and I too hope the Astropulse validator code will be updated soon. The project gets about 600 new hosts a day (either really new or new hostID), it's not nice to leave them exposed to a known weakness in the system.
Joe

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,521,691
RAC: 46,080
United Kingdom
Message 1160263 - Posted: 8 Oct 2011, 17:49:35 UTC - in response to Message 1160253.

For all host app versions, the first 10 validations are critical. If the APR isn't somewhat reasonable at that point it will take a lot of good runtimes to shift it to a better approximation. That's where the runtime_outlier logic will be helpful, and I too hope the Astropulse validator code will be updated soon. The project gets about 600 new hosts a day (either really new or new hostID), it's not nice to leave them exposed to a known weakness in the system.
Joe

And when v7 goes live on SETI, the project will get about a quarter of a million new hosts in the first month - at least, as far as the application_details are concerned. We really ought to prevail on David to consider that number before the event.....

Anonymous_platform hosts return their GPU hardware characteristics in sched_request. I really don't see why that can't be used to seed APR with a first approximation, instead of using a totally irrelevant CPU metric.

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,647,395
RAC: 516
United States
Message 1160264 - Posted: 8 Oct 2011, 17:59:24 UTC - in response to Message 1160263.

Are there any plans of making version 6.12 usable by high end hosts before trying to roll to V7?
____________

Janice

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,521,691
RAC: 46,080
United Kingdom
Message 1160267 - Posted: 8 Oct 2011, 18:05:18 UTC - in response to Message 1160264.

Are there any plans of making version 6.12 usable by high end hosts before trying to roll to V7?

Versions of what?

The discussion with Joe was about the SETI science application - currently at v6.03 for CPUs, v6.08/09/10 for CUDA GPUs.

Version 6.12 sounds like a BOINC version number - I'm not having any problems with BOINC v6.12.34, though I don't run what you would call a 'high end host'.

What issues make it unusable? I haven't seen any reported on the boinc_alpha mailing list: that would be a better venue for discussing boinc issues than here, though I can pass on messages if needed.

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 928
Credit: 12,571,227
RAC: 10,962
United Kingdom
Message 1160271 - Posted: 8 Oct 2011, 18:15:34 UTC - in response to Message 1160267.

Still nothing uploading from here. I do not believe Bruno is allowing them. I see his vote monitor function is disabled.
____________

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3752
Credit: 21,468,810
RAC: 14,716
Sweden
Message 1160275 - Posted: 8 Oct 2011, 18:21:31 UTC - in response to Message 1160271.

Still nothing uploading from here. I do not believe Bruno is allowing them. I see his vote monitor function is disabled.


Uploads work fine for almost everyone else. You're having other problems, perhaps the HE issue.
____________

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 928
Credit: 12,571,227
RAC: 10,962
United Kingdom
Message 1160277 - Posted: 8 Oct 2011, 18:25:49 UTC

Why would it start now? Never happened before.

____________

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3752
Credit: 21,468,810
RAC: 14,716
Sweden
Message 1160283 - Posted: 8 Oct 2011, 18:31:15 UTC - in response to Message 1160277.
Last modified: 8 Oct 2011, 18:33:28 UTC

Why would it start now? Never happened before.


It never happened before to others either, until it suddenly did. My car hasn't broken down before either, but it will sooner or later.

The HE connection problem hits randomly it seems. Some are hit by it, others never.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,521,691
RAC: 46,080
United Kingdom
Message 1160285 - Posted: 8 Oct 2011, 18:33:59 UTC - in response to Message 1160277.

Why would it start now? Never happened before.

We can't answer the 'why' until we've worked out what 'it' is.

For starters, have you tried the basic network tests (ping and tracert) to the upload server? Check

ping 208.68.240.16
tracert setiboincdata.ssl.berkeley.edu

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 928
Credit: 12,571,227
RAC: 10,962
United Kingdom
Message 1160286 - Posted: 8 Oct 2011, 18:34:16 UTC - in response to Message 1160283.

Why would it start now? Never happened before.


It never happened before to others either, until it suddenly did. My car hasn't broken down before either, but it will sooner or later.

Not very helpful! If you are right, then I have no idea how to do anything about it. Other threads are gobbledegook to me!
____________

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4347
Credit: 1,125,998
RAC: 841
United States
Message 1160293 - Posted: 8 Oct 2011, 18:57:39 UTC - in response to Message 1160264.

Are there any plans of making version 6.12 usable by high end hosts before trying to roll to V7?

I presume you're referring to the increased backoffs in BOINC 6.12.x, and as that's a fundamental design of the series I don't expect the BOINC devs to modify it. They're in bugfixing only mode for that branch, and of course assuming that because 6.12 is the recommended version it's reasonable to consider its effects as if all users have adopted the recommendation.

The issue isn't really the backoffs so much as work delivery here, and I hope that some progress in being able to deliver what is assigned can be made before S@h v7 is rolled out. I don't know what's possible within the University of California hierarchy though.
Joe

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 928
Credit: 12,571,227
RAC: 10,962
United Kingdom
Message 1160294 - Posted: 8 Oct 2011, 18:59:29 UTC - in response to Message 1160285.

Why would it start now? Never happened before.

We can't answer the 'why' until we've worked out what 'it' is.

For starters, have you tried the basic network tests (ping and tracert) to the upload server? Check

ping 208.68.240.16
tracert setiboincdata.ssl.berkeley.edu

You are dealing with an idiot here.
How do I do that and what do any results mean?
____________

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4347
Credit: 1,125,998
RAC: 841
United States
Message 1160296 - Posted: 8 Oct 2011, 19:03:13 UTC - in response to Message 1160263.

... when v7 goes live on SETI, the project will get about a quarter of a million new hosts in the first month - at least, as far as the application_details are concerned. We really ought to prevail on David to consider that number before the event.....

And v7 should have CPU, OpenCL ATI, CUDA NVIDIA, and maybe OpenCL NVIDIA application versions, so multiply that quarter of a million by maybe 2 to get the effective "active applications" count...
Joe

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 928
Credit: 12,571,227
RAC: 10,962
United Kingdom
Message 1160300 - Posted: 8 Oct 2011, 19:12:28 UTC - in response to Message 1160294.

I found out how to do it!
What next?

Check

ping 208.68.240.16
tracert setiboincdata.ssl.berkeley.edu

You are dealing with an idiot here.
How do I do that and what do any results mean?

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation. All rights reserved.

C:\Windows\system32>ping 208.68.240.16

Pinging 208.68.240.16 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.

Ping statistics for 208.68.240.16:
Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),

C:\Windows\system32>tracert setiboincdata.ssl.berkeley.edu

Tracing route to setiboincdata.ssl.berkeley.edu [208.68.240.16]
over a maximum of 30 hops:

1 57 ms 99 ms 99 ms 192.168.254.254
2 46 ms 48 ms 49 ms anchor-hg-3-lo100.router.demon.net [194.159.161.34]
3 47 ms 47 ms 47 ms anchor-access-4-s2010.router.demon.net [194.217.23.37]
4 48 ms 46 ms 47 ms gi7-0-0-dar3.lah.uk.cw.net [194.159.161.90]
5 47 ms 48 ms 46 ms xe-0-1-0-xur1.lns.uk.cw.net [193.195.25.70]
6 52 ms 48 ms 48 ms lonap.he.net [193.203.5.128]
7 134 ms 130 ms 130 ms 10gigabitethernet6-3.core1.ash1.he.net [72.52.92.137]
8 207 ms 210 ms 201 ms 10gigabitethernet7-4.core1.pao1.he.net [184.105.213.177]
9 * * * Request timed out.
10 * * * Request timed out.
11 * * * Request timed out.
12 * * * Request timed out.
13 * * * Request timed out.
14 * * * Request timed out.
15 * * * Request timed out.
16 * * * Request timed out.
17 * * * Request timed out.
18 * * * Request timed out.
19 * * * Request timed out.
20 * * * Request timed out.
21 * * * Request timed out.
22 * * * Request timed out.
23 * * * Request timed out.
24 * * * Request timed out.
25 * * * Request timed out.
26 * * * Request timed out.
27 * * * Request timed out.
28 * * * Request timed out.
29 * * * Request timed out.
30 * * * Request timed out.

Trace complete.

C:\Windows\system32>
____________

Kevin Olley
Joined: 3 Aug 99
Posts: 368
Credit: 35,506,046
RAC: 11,923
United Kingdom
Message 1160301 - Posted: 8 Oct 2011, 19:15:28 UTC - in response to Message 1160267.

Are there any plans of making version 6.12 usable by high end hosts before trying to roll to V7?

Versions of what?

The discussion with Joe was about the SETI science application - currently at v6.03 for CPUs, v6.08/09/10 for CUDA GPUs.

Version 6.12 sounds like a BOINC version number - I'm not having any problems with BOINC v6.12.34, though I don't run what you would call a 'high end host'.

What issues make it unusable? I haven't seen any reported on the boinc_alpha mailing list: that would be a better venue for discussing boinc issues than here, though I can pass on messages if needed.


ATM only 5 of the 40 "Top Hosts" are running v6.12; the rest are running v6.10.

I think the main problem is the increased back-off times. It's no good looking at a backed-off download queue when your CPUs and GPUs are sitting back scratching their nether regions.

What could show interesting figures is if the "in progress" figures on the "tasks" page for each machine showed how many in progress tasks were still awaiting download.



____________
Kevin


Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,521,691
RAC: 46,080
United Kingdom
Message 1160303 - Posted: 8 Oct 2011, 19:19:44 UTC - in response to Message 1160294.

Why would it start now? Never happened before.

We can't answer the 'why' until we've worked out what 'it' is.

For starters, have you tried the basic network tests (ping and tracert) to the upload server? Check

ping 208.68.240.16
tracert setiboincdata.ssl.berkeley.edu

You are dealing with an idiot here.
How do I do that and what do any results mean?

I refuse to accept that I'm dealing with an idiot. I may very well be dealing with someone who has expertise in some subject area different from computing, but that's not the same thing at all.

OK, one at a time.

First open a "Command Prompt" window - similar to what we used to use as a 'DOS prompt'. There are many ways of doing that, so - since I don't know whether you'll be using your Vista machine or one of your XP machines for this - here's a way which should work with the default settings on any of them.

Click the 'Start' button, click on 'All programs'. From the list, click on 'Accessories' (yellow folder icon), and you should see 'Command Prompt' near the top of the (alphabetical) list. Click it.

In the command prompt window which opens, type that first line I gave you, exactly as it stands:

ping 208.68.240.16

and press the return key at the end. Then wait.

After a few seconds, you should see four lines of results.

Either: lines of numbers, starting with 'Reply from'. That's good.
Or: "Request timed out". That's bad.

Which do you get?
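
The "good" and "bad" cases above can also be told apart mechanically. The following is only a sketch, assuming the English-language Windows `ping` wording shown in this thread ("Reply from" and "Request timed out"); the function name `summarize_ping` is hypothetical, and other outcomes such as "Destination host unreachable" are not handled.

```python
def summarize_ping(output: str) -> dict:
    """Count reply and timeout lines in captured Windows ping output."""
    replies = 0
    timeouts = 0
    for line in output.splitlines():
        text = line.strip()
        if text.startswith("Reply from"):
            replies += 1      # the 'good' case
        elif text.startswith("Request timed out"):
            timeouts += 1     # the 'bad' case
    return {"replies": replies, "timeouts": timeouts, "reachable": replies > 0}


# Sample text in the shape reported later in this thread.
sample = """Pinging 208.68.240.16 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
"""
print(summarize_ping(sample))
```

Four timeouts and no replies is the "bad" result: nothing came back from the upload server's address.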

KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 928
Credit: 12,571,227
RAC: 10,962
United Kingdom
Message 1160306 - Posted: 8 Oct 2011, 19:21:48 UTC - in response to Message 1160303.

Either: lines of numbers, starting with 'Reply from'. That's good.
Or: "Request timed out". That's bad.

Which do you get?

All bad then!
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,521,691
RAC: 46,080
United Kingdom
Message 1160311 - Posted: 8 Oct 2011, 19:39:45 UTC - in response to Message 1160300.

I found out how to do it!
What next?

There! I said I wasn't dealing with an idiot - you beat me to it :-)

Both of those are classic symptoms of the Hurricane Electric connection problem - especially, since you get the line referencing "10gigabitethernet7-4.core1.pao1.he.net", and nothing but asterisks below that.
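
To spot that pattern without reading every line, one could pick out the last hop that answered before the asterisks begin. This is a minimal sketch, assuming Windows `tracert` output with each hop on a single unwrapped line; `last_responding_hop` is a hypothetical helper name, not a standard tool.

```python
from typing import Optional, Tuple


def last_responding_hop(tracert_output: str) -> Optional[Tuple[int, str]]:
    """Return (hop number, host text) for the last hop that answered any probe."""
    last = None
    for line in tracert_output.splitlines():
        tokens = line.split()
        # Hop lines start with the hop number; skip headers and blank lines.
        if not tokens or not tokens[0].isdigit():
            continue
        # A hop where every probe timed out has no 'ms' timing at all.
        if "ms" not in tokens:
            continue
        # The host name or address follows the final 'ms' token.
        last_ms = len(tokens) - 1 - tokens[::-1].index("ms")
        last = (int(tokens[0]), " ".join(tokens[last_ms + 1:]))
    return last


sample = """Tracing route to setiboincdata.ssl.berkeley.edu [208.68.240.16]
over a maximum of 30 hops:

  7 134 ms 130 ms 130 ms 10gigabitethernet6-3.core1.ash1.he.net [72.52.92.137]
  8 207 ms 210 ms 201 ms 10gigabitethernet7-4.core1.pao1.he.net [184.105.213.177]
  9 * * * Request timed out.
 10 * * * Request timed out.

Trace complete.
"""
print(last_responding_hop(sample))
```

If the last responding hop is an he.net router and everything after it times out, that matches the symptom described above.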

You could wait until Jeff's new memory has arrived, and until they've figured out a way to break into the security cage - or, since we're on a roll, you could try using a proxy.

Look in the 'Temporary Fix...' thread, and see what proxies have been mentioned as working recently. The newest one (at the time of writing) seems to be

216.24.193.211:8080

Open BOINC Manager, in Advanced View. Assuming it's one of your BOINC v6.12.34 machines, go to the Tools menu, and click on 'Display and network options'.

Click on the third tab, 'HTTP Proxy'.

Check 'Connect via HTTP proxy server'
Put 216.24.193.211 in the address box.
Put 8080 in the Port box.

(that's the two halves of the proxy line above, splitting it at the ':'. If you try a different proxy, do the same thing - splitting it into 'address' and 'port' - with any other proxy description)
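
The same split can be written as a small Python sketch, for anyone scripting this rather than typing into the dialog. The helper name `split_proxy` is hypothetical; it assumes the proxy is given in plain `address:port` form, as in the line above.

```python
from typing import Tuple


def split_proxy(description: str) -> Tuple[str, int]:
    """Split an 'address:port' proxy description at the last ':'."""
    address, separator, port = description.strip().rpartition(":")
    if not separator or not address or not port.isdigit():
        raise ValueError("expected 'address:port', got %r" % description)
    return address, int(port)


address, port = split_proxy("216.24.193.211:8080")
print(address)  # goes in the address box
print(port)     # goes in the Port box
```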

Leave the rest blank, and click 'OK'. Now retry your uploads. Judging by what people have said in the threads, you may need to experiment with different proxies until you find one which works for you. It may also be slow, but if it works at all, that's better than nothing.

James Sotherden
Joined: 16 May 99
Posts: 9124
Credit: 37,610,342
RAC: 35,117
United States
Message 1160312 - Posted: 8 Oct 2011, 19:40:22 UTC

I was down to my last work unit, which was an AP, and couldn't get any work all day. Now this might be a coincidence, but I pinged per John's instructions and now I have a ton of work downloading. But I also finished that AP at the same time.

Whatever happened, I'm happy.
____________

Old James


Copyright © 2014 University of California