The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2026799 - Posted: 8 Jan 2020, 12:07:58 UTC

Referring back to the server issue of 20 December (Anonymous Platform failure after upgrade), I have written up the story so far at #3419. Nils Høimyr of LHC has produced some useful diagnostics, but LHC now feel that they have to refer the problem back to David Anderson.
ID: 2026799 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2026806 - Posted: 8 Jan 2020, 13:18:37 UTC - in response to Message 2026795.  

I'm going to hang out in the wilds for a bit. I'll do some howling at the moon and dancing around the fire in hopes that these silly rituals help keep the seti machines working well :-).


. . Have fun and don't bring any stray coyotes home :)
Stephen


What about "friendly" stray Coyotes?


. . If she is running about in the woods howling you just never know ...

Stephen

ID: 2026806 · Report as offensive
jdzukley Project Donor

Joined: 6 Apr 11
Posts: 19
Credit: 26,357,809
RAC: 74
United States
Message 2026828 - Posted: 8 Jan 2020, 14:59:43 UTC

From my perspective, and while I do not want to discount that there are issues and problems out there... If you look at the server status page, both lately and right now, we have maxed out the site. We are crunching all the work the site can give us. The ready-to-send queue is very low, as it generally is, and the counts of tasks being created are high. This is all great stuff as far as I am concerned!
ID: 2026828 · Report as offensive
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2026830 - Posted: 8 Jan 2020, 15:19:43 UTC - in response to Message 2026828.  

From my perspective, and while I do not want to discount that there are issues and problems out there... If you look at the server status page, both lately and right now, we have maxed out the site. We are crunching all the work the site can give us. The ready-to-send queue is very low, as it generally is, and the counts of tasks being created are high. This is all great stuff as far as I am concerned!


Munin graphs kinda confirm this:

[Munin graph images not preserved]
ID: 2026830 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19065
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2026891 - Posted: 9 Jan 2020, 0:34:48 UTC - in response to Message 2026859.  

Beta is back, or at least there's life there. It's slowly coming to life, but don't expect that everything works yet,
It's been down since Dec 20, so it will take time to recover.

Reported 50 tasks, got two new tasks from Beta.
ID: 2026891 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026911 - Posted: 9 Jan 2020, 2:49:47 UTC - in response to Message 2026859.  

Beta is back, or at least there's life there. It's slowly coming to life, but don't expect that everything works yet,
It's been down since Dec 20, so it will take time to recover.
Curious as to what the issue turned out to be.
Grant
Darwin NT
ID: 2026911 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026913 - Posted: 9 Jan 2020, 2:52:13 UTC
Last modified: 9 Jan 2020, 2:54:14 UTC

The ready-to-send (RTS) buffer is still not refilling, mostly due to the sustained very high return rate over the last few hours (165k+); the splitters just aren't capable of the sustained output required to meet that demand and fill the RTS buffer as well.
Just as well the return rate isn't that little bit higher, or we'd be struggling to get work (more than we already are: I just had a period of "Project has no tasks available", or of reporting 10 WUs and only getting 2 back).
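
As a rough illustration of the arithmetic behind this, the sketch below estimates how long a ready-to-send buffer lasts when tasks go out faster than the splitters create them. It assumes the 165k figure above is a per-hour rate and that every returned result is replaced by one new task. The buffer size and splitter output are made-up numbers for illustration, not readings from the server status page, and `hours_until_empty` is a hypothetical helper name.

```python
def hours_until_empty(buffer_tasks, demand_per_hour, supply_per_hour):
    """Return hours until the buffer empties, or None if it is not draining."""
    net_drain = demand_per_hour - supply_per_hour
    if net_drain <= 0:
        return None  # supply keeps up, so the buffer holds steady or refills
    return buffer_tasks / net_drain


# 165_000 per hour is the return rate quoted in the post (assumed hourly);
# the buffer size and splitter output are hypothetical values chosen only
# to make the arithmetic visible.
example = hours_until_empty(
    buffer_tasks=10_000,
    demand_per_hour=165_000,
    supply_per_hour=155_000,
)
```

With these illustrative numbers the buffer drains at 10,000 tasks per hour, so a small shortfall in splitter output is enough to empty a modest buffer quickly.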
Grant
Darwin NT
ID: 2026913 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34770
Credit: 261,360,520
RAC: 489
Australia
Message 2026917 - Posted: 9 Jan 2020, 3:15:12 UTC

If a way can be found to stop splitters from bunching up on 1 file (such as blc54_2bit_guppi_58692_62680_HIP24094_0035 ATM) then that would help a great deal.

Cheers.
ID: 2026917 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026926 - Posted: 9 Jan 2020, 3:48:26 UTC - in response to Message 2026917.  

If a way can be found to stop splitters from bunching up on 1 file (such as blc54_2bit_guppi_58692_62680_HIP24094_0035 ATM) then that would help a great deal.
It's been a long, long time since the 1 file, 1 splitter days.
Grant
Darwin NT
ID: 2026926 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2026957 - Posted: 9 Jan 2020, 10:51:24 UTC

I guess this would be a Server Issue, though it may need to be moved to another thread if it gets out of hand.
Bringing up a new (Ubuntu) client and, due to some hardware issues I was fighting, I aborted the initial 5 CPU tasks I was given so as to eliminate CPU load as a cause for the slow-downs I was seeing.
Set the box to a venue that allows GPU only. So far, so good.
Finally solved (I believe) the hardware issue, and decided it was time to get some CPU tasks going as well.
Set the box to a different venue that allows both CPU and GPU work, but got only GPU work.
If I recall correctly, CPU tasks won't be sent until the GPU cache is full or other limits kick in.
Set the box to a third venue that allows only CPU work. Had the box updated to the new venue, and it correctly asks for only CPU work now.
So far, so good. (And I do understand that it takes 2 updates for this to happen: the first for the server to change the venue at the client, and the second for the client to make a request under the new venue.)
However, when I hit the server with a request for CPU tasks, it responds by sending new work which is still classed as GPU work. This is not a one-time thing: it stays consistent after restarting the client, resetting the venue flags in the preferences several times, and changing venues again. For example:
(client ID: 8881207)

977	SETI@home	01/09/20 3:24:20 AM	Sending scheduler request: To report completed tasks.	
978	SETI@home	01/09/20 3:24:20 AM	Reporting 2 completed tasks	
979	SETI@home	01/09/20 3:24:20 AM	Requesting new tasks for CPU
980	SETI@home	01/09/20 3:24:21 AM	Scheduler request completed: got 3 new tasks	
981	SETI@home	01/09/20 3:24:24 AM	Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar	
982	SETI@home	01/09/20 3:24:24 AM	Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar	
983	SETI@home	01/09/20 3:24:24 AM	Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar	
984	SETI@home	01/09/20 3:24:30 AM	Finished download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar	
985	SETI@home	01/09/20 3:24:31 AM	Finished download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar	
986	SETI@home	01/09/20 3:24:31 AM	Finished download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar	

I look in the project directory, the files are present, but no CPU work is processing. Finding the task names via BOINCTasks shows them as CUDA90 tasks waiting to run.
I get that I might be in the penalty box for CPU work, as I aborted 100% (5) of the CPU tasks I had, but I can't see how client configuration could cause the scheduler to send me GPU work while telling me it's CPU work.
I verified the venue selections were correct by watching for logged scheduler requests as above, which correctly reflected my venue settings after updates reset them at the client.
Just can't think of anything I might do that would cause the Scheduling server to do this.
Am I missing something?
ID: 2026957 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2026958 - Posted: 9 Jan 2020, 10:57:13 UTC - in response to Message 2026957.  

Looking at that computer on the website - All tasks for computer 8881207 - you currently have at least three tasks on the front page assigned to the CPU. That looks good, but I'm not looking through all 504 tasks in progress...
ID: 2026958 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2026960 - Posted: 9 Jan 2020, 11:09:01 UTC - in response to Message 2026958.  

Looking at that computer on the website - All tasks for computer 8881207 - you currently have at least three tasks on the front page assigned to the CPU. That looks good, but I'm not looking through all 504 tasks in progress...

Yeah, naturally the moment that message posted I got 10 supposed CPU tasks that were GPU tasks, and the next session I got 4 that were for real.
It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek.
ID: 2026960 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2026965 - Posted: 9 Jan 2020, 12:45:24 UTC - in response to Message 2026960.  

... I got 10 supposed CPU tasks that were GPU tasks ...
What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation.

<sched_op_debug> in the Event Log gives a useful summary:

09/01/2020 12:09:04 | SETI@home | Sending scheduler request: To report completed tasks.
09/01/2020 12:09:04 | SETI@home | Reporting 14 completed tasks
09/01/2020 12:09:04 | SETI@home | Requesting new tasks for NVIDIA GPU
09/01/2020 12:09:04 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
09/01/2020 12:09:04 | SETI@home | [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices
09/01/2020 12:09:04 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
09/01/2020 12:09:07 | SETI@home | Scheduler request completed: got 15 new tasks
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 8702 seconds
09/01/2020 12:09:07 | SETI@home | [sched_op] estimated total Intel GPU task duration: 0 seconds
That's a proper 'GPU only' transaction.
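
For anyone who wants to check a longer Event Log the same way, here is a minimal sketch that pulls the per-device work requests out of `[sched_op]` lines. It assumes the line layout shown in the excerpt above; `requested_seconds` and `devices_asked_for` are hypothetical helper names, not part of BOINC.

```python
import re

# Matches lines such as:
#   ... | [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices
REQUEST_RE = re.compile(
    r"\[sched_op\] (?P<device>.+?) work request: "
    r"(?P<seconds>\d+(?:\.\d+)?) seconds"
)


def requested_seconds(log_lines):
    """Map each device class to the seconds of work requested for it."""
    requests = {}
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            requests[match.group("device")] = float(match.group("seconds"))
    return requests


def devices_asked_for(log_lines):
    """Return the device classes whose request was greater than zero."""
    return sorted(d for d, s in requested_seconds(log_lines).items() if s > 0)


# Lines copied from the Event Log excerpt above.
SAMPLE = [
    "09/01/2020 12:09:04 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices",
    "09/01/2020 12:09:04 | SETI@home | [sched_op] NVIDIA GPU work request: 8165.73 seconds; 0.00 devices",
    "09/01/2020 12:09:04 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices",
]
```

Calling `devices_asked_for(SAMPLE)` on the excerpt leaves only the NVIDIA GPU entry, which is what a 'GPU only' request should look like. A CPU-only request would show a non-zero CPU figure and zeros for the GPUs.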
ID: 2026965 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026990 - Posted: 9 Jan 2020, 18:15:10 UTC - in response to Message 2026960.  
Last modified: 9 Jan 2020, 18:16:09 UTC

It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek.
It will always fill the GPU cache before it starts on the CPU (unless you have a really fast CPU and a really, really slow GPU). You just need to make sure to set the cache settings small enough that it doesn't take long to fill the GPU, to get it started on the CPU. Once the settings are changed back, it will continue to fill the new GPU cache, then start filling the CPU cache again.
Grant
Darwin NT
ID: 2026990 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026992 - Posted: 9 Jan 2020, 18:20:03 UTC

Oh, and the splitter output has fallen off to not much more than nothing, so we should be out of work in the next 30 min or so if the splitters don't come back to life.
Grant
Darwin NT
ID: 2026992 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2027000 - Posted: 9 Jan 2020, 19:45:17 UTC - in response to Message 2026965.  

... I got 10 supposed CPU tasks that were GPU tasks ...
What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation.


I understand that the filename indicates nothing. Not going from that. The evidence was:
1) requested CPU work, not GPU work, as indicated by log, because
2) venue flag excluded GPU work by not being selected, and
3) tasks were received, and their presence verified in project directory, yet
4) listing shows tasks as GPU jobs in a) boincmgr and b) boinctasks.

Agreed that turning the debug on would have been useful, but I have to assume the log is correct in showing that CPU tasks were the only ones requested. Not a big thing, but my only point in noting this was that the work type flag was (apparently) overridden by other considerations in the scheduler, which is strange.
ID: 2027000 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2027001 - Posted: 9 Jan 2020, 19:49:30 UTC - in response to Message 2026990.  

It looks to me like perhaps regardless of the venue it was going to fill the cache of GPU before releasing any more CPU. Just seemed strange that it would send that way ignoring the venue. Strange interaction, but thanks for taking a peek.
It will always fill the GPU cache before it starts on the CPU (unless you have a really fast CPU and a really, really slow GPU). You just need to make sure to set the cache settings small enough that it doesn't take long to fill the GPU, to get it started on the CPU. Once the settings are changed back, it will continue to fill the new GPU cache, then start filling the CPU cache again.
And I did note my awareness of that in the original message. But the point of the message was that it ignored my exclusion of GPU tasks via venue flag, and responded to a request for CPU work by sending tasks for GPU. FWIW, my cache settings are .33 days.
ID: 2027001 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2027020 - Posted: 9 Jan 2020, 22:30:46 UTC - in response to Message 2026965.  

... I got 10 supposed CPU tasks that were GPU tasks ...
What's your evidence for that? You can't tell from the download file names - all tasks are identical until they are assigned to a particular class of device, by the server on allocation.

(client ID: 8881207)

977	SETI@home	01/09/20 3:24:20 AM	Sending scheduler request: To report completed tasks.	
978	SETI@home	01/09/20 3:24:20 AM	Reporting 2 completed tasks	
979	SETI@home	01/09/20 3:24:20 AM	Requesting new tasks for CPU
980	SETI@home	01/09/20 3:24:21 AM	Scheduler request completed: got 3 new tasks	
981	SETI@home	01/09/20 3:24:24 AM	Started download of blc75_2bit_guppi_58693_06659_HIP100511_0136.26291.0.22.45.111.vlar
https://setiathome.berkeley.edu/workunit.php?wuid=3831002783
	
982	SETI@home	01/09/20 3:24:24 AM	Started download of blc64_2bit_guppi_58693_08596_HIP98819_0142.23980.409.22.45.58.vlar
https://setiathome.berkeley.edu/workunit.php?wuid=3831002683
	
983	SETI@home	01/09/20 3:24:24 AM	Started download of blc64_2bit_guppi_58693_07938_HIP100511_0140.25367.409.22.45.71.vlar
https://setiathome.berkeley.edu/workunit.php?wuid=3831002789

Seems to me this pretty well proves the point ... please note that the WU assignment clearly shows them assigned to GPU.
ID: 2027020 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2027023 - Posted: 9 Jan 2020, 22:40:54 UTC - in response to Message 2027020.  

Thanks - that's why I asked the question. Seems like there is something screwy with the scheduler, and we need hard evidence like this to track it down.

I spent some time exchanging emails with Eric this evening, using Beta to confirm the 'Anonymous Platform' error that plagued us before Christmas. He started Beta, I said 'nada'. He said it had fallen over, and restarted it again. I crashed it. He said 'damn', but we got a log out of it.
ID: 2027023 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2027026 - Posted: 9 Jan 2020, 23:09:29 UTC - in response to Message 2027023.  
Last modified: 9 Jan 2020, 23:10:18 UTC

Thanks - that's why I asked the question. Seems like there is something screwy with the scheduler, and we need hard evidence like this to track it down.

I spent some time exchanging emails with Eric this evening, using Beta to confirm the 'Anonymous Platform' error that plagued us before Christmas. He started Beta, I said 'nada'. He said it had fallen over, and restarted it again. I crashed it. He said 'damn', but we got a log out of it.

Yup, it's all good. Took me a bit to realize all I had to do was search for the tasks on the site and correlate them. I learned that if you ask the site to display tasks by name it auto-sorts the results. Makes it pretty quick to find them.
Hard to say if it's worth spending much effort to fix this, but, on the other hand it's often true that what causes this might be causing other things. At least it's another piece of info that might be useful in future. And each time I learn a bit more, which is the main reason I'm here, after all, besides the science of it. Have to keep the brain cells engaged, else they rust. If you ever need someone to help bang on stuff or help prove a case, just ask. If it's within my skill set I'm always willing.
Appreciate your taking time to help dig it out!
ID: 2027026 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.