The Server Issues / Outages Thread - Panic Mode On! (118)

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 2024567 - Posted: 23 Dec 2019, 4:10:09 UTC - in response to Message 2024563.  

The replica is still 11 hours behind so it's likely that you have no ghosts at all. ;-).

Ah, right.

Didn't even notice that part.

Just wanted to make sure something wasn't broken on my end. I see 700k for RTS.. so I figured I'd get at least one or two here and there.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 2024567 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 27600
Credit: 261,360,520
RAC: 489
Australia
Message 2024568 - Posted: 23 Dec 2019, 4:13:32 UTC - in response to Message 2024567.  

The replica is still 11 hours behind so it's likely that you have no ghosts at all. ;-).
Ah, right.

Didn't even notice that part.

Just wanted to make sure something wasn't broken on my end. I see 700k for RTS.. so I figured I'd get at least one or two here and there.
Also, the return rate is down to less than 1/10th of normal due to Anonymous platforms being ignored. ;-)

Cheers.
ID: 2024568 · Report as offensive
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024569 - Posted: 23 Dec 2019, 4:14:46 UTC - in response to Message 2024539.  

Yes it does only show a single card now so it either didn't reinitialise after a reboot and/or Iona removed it, but with the replica still being a good 11hrs behind only Iona can tell us sooner what her solution was and if it's working (and I didn't waste my time going through a large random selection of her errored and validated tasks Stderr outputs for nothing). ;-)
Cheers.


. . Fingers still crossed ....

Stephen

:)
ID: 2024569 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 2024570 - Posted: 23 Dec 2019, 4:16:55 UTC - in response to Message 2024568.  

The replica is still 11 hours behind so it's likely that you have no ghosts at all. ;-).
Ah, right.

Didn't even notice that part.

Just wanted to make sure something wasn't broken on my end. I see 700k for RTS.. so I figured I'd get at least one or two here and there.
Also, the return rate is down to less than 1/10th of normal due to Anonymous platforms being ignored. ;-)

Cheers.

Okay, so something IS broken.

I haven't been here in a while and tried skimming through the most recent posts but it didn't look applicable.

I did see a comment about "maybe they'll fix it for the new year" though, so maybe that was related.

Hopefully things get fixed.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 2024570 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4263
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024572 - Posted: 23 Dec 2019, 4:48:20 UTC - in response to Message 2024557.  

I did it. It has no effect.


it looks like you're right. whether it's in the app_config or the cmdline text file, it looks like -nobs isn't being applied. GPU utilization is down.

oh well. the price to pay until anonymous platform is fixed.

this method is certainly better than the stock apps, and a lot better than nothing.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024572 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11154
Credit: 29,581,041
RAC: 66
United States
Message 2024574 - Posted: 23 Dec 2019, 4:58:09 UTC
Last modified: 23 Dec 2019, 4:58:52 UTC

IMHO, all that can be done from this end has been done. Juan has made a personal sacrifice at the pub, hair has been set on fire, the moon has been howled at and even breath has been held until faces have turned purple. https://boinc.berkeley.edu/dev/forum_thread.php?id=8105&postid=94471
My 2 Seti hosts are now crunching Einstein joining my sole Einstein host so good science is being done here.
Much to my great pleasure life remains quite OK.
ID: 2024574 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9949
Credit: 103,452,613
RAC: 328
United Kingdom
Message 2024586 - Posted: 23 Dec 2019, 5:58:12 UTC

I just shutdown my one current Linux host, and "removed" the app_info.xml from the two Windows machines.

The Windows machines immediately started downloading and processing tasks. As I am running the latest Nvidia drivers anyway I may as well leave these two running stock.

Obviously getting a massive mix of all the task "types", but as yet unable to see any results till the replica catches up.
ID: 2024586 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 27600
Credit: 261,360,520
RAC: 489
Australia
Message 2024587 - Posted: 23 Dec 2019, 6:21:27 UTC
Last modified: 23 Dec 2019, 6:21:54 UTC

I've got about 6hrs worth of CPU tasks left to do on my old 2500K system before I shut that down in the morning and take it downstairs to strip down and blow out the dust and soot (I can't imagine what I'll find after the last 3 months since its last blowout, but I'm imagining the worst), and then I'll switch over to it before doing the same to my almost as old 3570K main system.

Cheers.
ID: 2024587 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13373
Credit: 208,696,464
RAC: 304
Australia
Message 2024593 - Posted: 23 Dec 2019, 7:23:37 UTC - in response to Message 2024438.  
Last modified: 23 Dec 2019, 7:29:06 UTC

The response time to a request is very slow. It used to be so fast that I couldn't read to keep up with the log, now it pauses for so long, that I wonder if it is still doing something. 20-30 seconds sounds about right.
I think the slow response time is purely because this glitch has also turned 'resend lost tasks' back on, when we have a huge number of tasks in the database.
Which is why I suggested we get "Resend lost tasks" turned off, and then see if Anonymous hosts can get work again.

There appear to be 2 issues at play, one is a bug in the code that stops Anonymous hosts from getting work when there are long delays with the Scheduler response, and as you determined under certain conditions even with quick responses on another project.
The other being the fact that the Scheduler is taking an extremely long time to respond, hence my suggestion to disable Resend lost tasks again & see if that helps with the Scheduler response times (and sorting out the Validation, Assimilation, Deletion, Purge issues may also help).
Grant
Darwin NT
ID: 2024593 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 64945
Credit: 55,293,173
RAC: 49
United States
Message 2024595 - Posted: 23 Dec 2019, 7:46:38 UTC

10hrs here, it's really hexing. Gpus are on empty.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 2024595 · Report as offensive
halfempty
Joined: 2 Jun 99
Posts: 97
Credit: 35,236,901
RAC: 114
United States
Message 2024598 - Posted: 23 Dec 2019, 7:57:14 UTC

Running stock apps and the systems are crunching again. I don't remember Cuda50 being so painful, but at least they're downloading.
ID: 2024598 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024599 - Posted: 23 Dec 2019, 8:17:08 UTC - in response to Message 2024516.  
Last modified: 23 Dec 2019, 8:31:40 UTC

Some initial thoughts on Retvari's workround:

1) Sounds good. I'm going to test it myself later. So far, this is just theoretical.
2) It is likely only to work if you have a single, monolithic, executable. Apps which rely on external libraries - FFTW, CudaFFT - (and many SETI apps do) won't have the right links made in the slot (working) directory. They may work if you can put the libraries in the directory search path.
3) There are only two ways of specifying which GPU to use - command line and init_data.xml. It depends on the API version declared in the app_version. Command line is ancient history and should have been phased out years ago. If the app is checking init_data.xml, multiple GPUs should work as normal.
4) It should be possible to pass command line parameters like -nobs via app_config.xml
5) Disregard Retvari's references to BOINC Manager - it's the client which has to be stopped and restarted. This may require action at the service control level, if BOINC has been installed that way.

I'll let you know how I get on, for both Linux and Windows.
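
For anyone wanting to try point 4, here is a minimal sketch of generating an app_config.xml that carries a command line. The app_version, app_name, plan_class and cmdline elements follow the BOINC client configuration format. The app name setiathome_v8 and the plan class cuda60 are assumptions and must match whatever your client actually reports for the app version in use.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, cmdline: str) -> str:
    """Return the text of an app_config.xml with one <app_version> entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "cmdline").text = cmdline
    # ET.indent only exists on Python 3.9+; without it the output is one line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed values for illustration only; check your own client's event log.
config_text = build_app_config("setiathome_v8", "cuda60", "-nobs")
print(config_text)
```

The resulting text would be saved as app_config.xml in the project directory, and the client then has to re-read its config files or be restarted before the command line takes effect.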
ID: 2024599 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024606 - Posted: 23 Dec 2019, 11:01:12 UTC

OK, I've applied Retvari's workround on my Linux Mint host running special sauce and the spoofed client. There's good news and bad news.

Good news (1): It's running
Good news (2): It's picking up the command line. -nobs isn't specifically acknowledged, but I've got an -unroll in there too, and that's reported.

Bad news: It's not received the instruction to run on device 0 / device 1

The device number is being passed correctly in init_data.xml, but the special sauce app isn't looking in the right place. I see Petri's app_info.xml doesn't contain an API version specifier, and it worked before, so I assume it's only listening for a command line. That's ancient, and should be corrected. Petri needs to compile against a newer BOINC API library and make the appropriate coding adjustments.

But at least my machine is running 2-up on device 0 - so at about half normal speed - and some warmth is creeping back into my workroom.
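
To illustrate the lookup order I'd expect from a modern app, here is a rough sketch in Python (the real apps are compiled code linked against the BOINC API, so this is only a model of the logic). It assumes init_data.xml carries the device in a gpu_device_num element and that the legacy route is a "--device N" pair on the command line; both names are assumptions here, and the sample XML is made up rather than copied from a real slot directory.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down stand-in for a slot directory's init_data.xml.
SAMPLE_INIT_DATA = """<app_init_data>
  <gpu_type>NVIDIA</gpu_type>
  <gpu_device_num>1</gpu_device_num>
</app_init_data>"""


def device_from_init_data(xml_text):
    """Return the device number from init_data-style XML, or None if absent."""
    node = ET.fromstring(xml_text).find("gpu_device_num")
    if node is None or node.text is None:
        return None
    number = int(node.text)
    return number if number >= 0 else None


def device_from_cmdline(argv):
    """Return the device number from a legacy '--device N' pair, or None."""
    for index, arg in enumerate(argv):
        if arg == "--device" and index + 1 < len(argv):
            return int(argv[index + 1])
    return None


def resolve_device(xml_text, argv):
    """Prefer init_data.xml, fall back to the command line, then device 0."""
    device = device_from_init_data(xml_text)
    if device is None:
        device = device_from_cmdline(argv)
    return 0 if device is None else device


print(resolve_device(SAMPLE_INIT_DATA, []))
```

An app that only implements the command-line branch ends up on device 0 whenever the client stops passing the argument, which matches what I'm seeing.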
ID: 2024606 · Report as offensive
Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 2024609 - Posted: 23 Dec 2019, 11:35:12 UTC

Attn: Richard

Einstein is showing "Server version 611" on my Linux host.
Kevin


ID: 2024609 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2024610 - Posted: 23 Dec 2019, 12:06:33 UTC

My Windows box running the standard Seti@Home apps continues to process.

I believe that is "normal".

After getting advice on another thread I have re-started Einstein@Home to keep my GPUs busy and am running World Community Grid to keep the rest of CPU threads busy.

Now that I think I understand that simply renaming the app_info.xml file in the BOINC project directory will allow me to process some Seti@Home tasks, I may experiment with that on my Linux box(es).
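
A minimal sketch of that rename, assuming the client has been stopped first. The function name and the ".disabled" suffix are made up for illustration, and the demo runs against a temporary directory rather than a real project directory, whose location varies by install. Work already downloaded for the anonymous-platform apps may be lost when the file goes away, so it is probably safest to finish and report it first.

```python
import tempfile
from pathlib import Path


def disable_app_info(project_dir: Path, suffix: str = ".disabled") -> Path:
    """Rename app_info.xml so the client falls back to the stock apps."""
    source = project_dir / "app_info.xml"
    if not source.is_file():
        raise FileNotFoundError(f"no app_info.xml in {project_dir}")
    target = source.with_name(source.name + suffix)
    if target.exists():
        raise FileExistsError(f"{target} already exists, not overwriting")
    source.rename(target)
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory standing in for the project directory.
    with tempfile.TemporaryDirectory() as tmp:
        fake_project = Path(tmp)
        (fake_project / "app_info.xml").write_text("<app_info/>")
        print(disable_app_info(fake_project).name)
```

Renaming the file back (and restarting the client again) would restore the anonymous platform setup once the server side is fixed.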

I am assuming it is unlikely that anything will get "fixed" this week at the server level.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2024610 · Report as offensive
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2024611 - Posted: 23 Dec 2019, 12:09:25 UTC - in response to Message 2024606.  

Bad news: It's not received the instruction to run on device 0 / device 1
Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU.
ID: 2024611 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024616 - Posted: 23 Dec 2019, 12:28:02 UTC - in response to Message 2024609.  

Attn: Richard

Einstein is showing "Server version 611" on my Linux host.
Yup. Einstein stopped using the central BOINC server code about nine years ago. They've gone their own way, and made their own updates, without changing the version number setting. They've also never adopted per-app-version runtime estimates, which means that clients try to normalise everything using a single DCF - that can never work. Estimates are all over the place (and jump up and down) if you run more than one application.
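
A toy illustration of why a single DCF can't settle when two applications are mis-estimated in opposite directions. The update rule below is a simplification I am assuming for the sketch (jump up at once, drift down slowly), not the client's exact code, and the two ratios of actual to estimated runtime are invented numbers.

```python
def update_dcf(dcf: float, ratio: float) -> float:
    """Simplified, assumed rule: rise immediately, decay by 10% steps."""
    if ratio > dcf:
        return ratio
    return 0.9 * dcf + 0.1 * ratio


def simulate(ratios_by_app, schedule, dcf=1.0):
    """Apply one shared DCF across tasks from several apps, in order."""
    history = []
    for app in schedule:
        dcf = update_dcf(dcf, ratios_by_app[app])
        history.append((app, round(dcf, 3)))
    return history


# Invented ratios: app_a takes twice its estimate, app_b half its estimate.
RATIOS = {"app_a": 2.0, "app_b": 0.5}
history = simulate(RATIOS, ["app_a", "app_b", "app_b", "app_a", "app_b"])
for app, dcf in history:
    print(app, dcf)
```

With these numbers the shared factor stays between about 1.7 and 2.0, so app_b estimates come out more than three times too long, and every app_a completion pushes the factor straight back up.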
ID: 2024616 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024618 - Posted: 23 Dec 2019, 12:32:31 UTC - in response to Message 2024611.  
Last modified: 23 Dec 2019, 13:09:00 UTC

Bad news: It's not received the instruction to run on device 0 / device 1
Please try to force your system to ask for cuda tasks only. I think you can achieve that by uninstalling opencl. Judging by the stderr output, if the special app runs instead of the original CUDA60 app, it will run on the designated GPU.
No. It took me bloody ages to get that driver installed (about the first thing I ever tried to do in Linux), and I'm not changing it for a temporary glitch. Running at half-throttle is fine, and kinder on the servers. I have one other host running stock so I can keep an eye on things, and all the others are waiting for the recovery. Which will be fun in itself...

Edit - that was a bit harsh. I'm feeling better now I've had a bite of lunch. Don't let me stop anybody else testing this aspect of Retvari's suggestion, if they're prepared to sacrifice their OpenCL driver.

However, it depends whether the Linux Cuda app was compiled against the modern API or not. If it's even modestly modern, it'll tell BOINC to use the modern calls, and we're back at square 1. I'll check it out if the server ever chooses to send me Cuda work, but so far it's alternating between sah and SoG.
ID: 2024618 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2024619 - Posted: 23 Dec 2019, 13:05:46 UTC

Even running stock, the ratio between GPU and CPU task times still fits the famous rule that GPUs run "3X or more" faster than CPUs.

It's looking like I am getting 1.5 to 2.5 hours on the CPU tasks (previously around 1 to 1.5 hours).
And upwards of 8 minutes on the GPU tasks (previously 1.5 to 3.5 minutes, mostly 1.5 minutes).
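
A quick back-of-envelope check on those numbers, as a sketch. It treats the quoted times as simple ranges in minutes and takes the GPU stock time as a flat 8 minutes, which is an approximation of "8 minutes+".

```python
def speedup_range(cpu_minutes, gpu_minutes):
    """Return (lowest, highest) CPU-to-GPU time ratio for two (min, max) ranges."""
    cpu_low, cpu_high = cpu_minutes
    gpu_low, gpu_high = gpu_minutes
    return cpu_low / gpu_high, cpu_high / gpu_low


# Stock apps now: CPU 1.5 to 2.5 hours, GPU roughly 8 minutes.
stock = speedup_range((90, 150), (8, 8))
# Before: CPU 1 to 1.5 hours, GPU 1.5 to 3.5 minutes.
before = speedup_range((60, 90), (1.5, 3.5))

print("stock:  %.1fx to %.1fx" % stock)
print("before: %.1fx to %.1fx" % before)
```

Either way the GPU tasks come out well beyond the 3X mark, though the two task types are not the same amount of work, so this is only a rough comparison.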

So I am crunching Seti@Home with "all my might" (and one hand tied behind my back).
:)
Tom
A proud member of the OFA (Old Farts Association).
ID: 2024619 · Report as offensive
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024623 - Posted: 23 Dec 2019, 13:33:25 UTC - in response to Message 2024598.  

Running stock apps and the systems are crunching again. I don't remember Cuda50 being so painful, but at least they're downloading.


. . We quickly get spoiled by the faster apps that have superseded cuda50.

Stephen

:)
ID: 2024623 · Report as offensive


 
©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.