The Server Issues / Outages Thread - Panic Mode On! (117)


Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2022018 - Posted: 7 Dec 2019, 5:32:05 UTC

I've set myself to no new task for a while, until the situation gets better. I hope everyone who needs WUs can get them.
ID: 2022018 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2022025 - Posted: 7 Dec 2019, 5:52:11 UTC - in response to Message 2022018.  
Last modified: 7 Dec 2019, 5:55:13 UTC

I've set myself to no new task for a while, until the situation gets better.
Just done that for my Windows system, it's got more than usual to chew on.
Just hoping the Linux system can start picking up some work more frequently before it runs out of work again.

Edit- too late, it's out of GPU work again.
It's odd how one system gets work almost every time, the other almost never.
Grant
Darwin NT
ID: 2022025 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2022027 - Posted: 7 Dec 2019, 6:12:15 UTC

At what point do the results out in the field begin to be an issue? I'm assuming the db will get too large and the system will crash or slow to a crawl. It is late in California (Friday, 10pm-ish), so hopefully tomorrow someone can look at the issue.

I'll just add the disclaimer that no one guarantees us WUs, and I'm certainly not demanding they come in on the weekend and fix things, but I'm sure they want to keep the project up.
ID: 2022027 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36876
Credit: 261,360,520
RAC: 489
Australia
Message 2022033 - Posted: 7 Dec 2019, 6:52:02 UTC - in response to Message 2022020.  

I've set myself to no new task for a while, until the situation gets better. I hope everyone who needs WUs can get them.
I did that with my 2500K rig when it hit 750 in progress.
Well I just set my 3570K rig to no new tasks with both rigs having over 600 tasks each in progress now. :-O

Cheers.
ID: 2022033 · Report as offensive
Cherokee150

Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 2022038 - Posted: 7 Dec 2019, 7:33:52 UTC - in response to Message 2022033.  
Last modified: 7 Dec 2019, 7:34:40 UTC

One of my computers now has 94% more tasks than it is supposed to have. My other one is now 122% over the limit. That's 16 hours of non-stop processing for one, and 45 hours for the other one!
Of course, I have now set them to no new tasks.

Much more important, however, is that I think someone really should contact the staff right away. The splitters are now creating as many as 80 tasks per second, and there are over 5.8 million units in the field. If we remember back to when they put the limits on, it was because the number of units out in the field and being returned overloaded the system. SETI was literally choking on units! It caused them a lot of grief and took a long time to clear things out.

So, I reiterate, I think that whoever can reach the staff should contact them as soon as possible.
Does anyone else concur?
ID: 2022038 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2022039 - Posted: 7 Dec 2019, 7:44:26 UTC - in response to Message 2022029.  

All hosts are down on gpu work because of the scheduler sending 0 tasks upon request.
Yeah, my Linux host is still struggling to get work. Anywhere from 4-8 requests to get any and then sometimes it's 45, other times only 2.
Grant
Darwin NT
ID: 2022039 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2022041 - Posted: 7 Dec 2019, 8:29:14 UTC - in response to Message 2022033.  

I've set myself to no new task for a while, until the situation gets better. I hope everyone who needs WUs can get them.
I did that with my 2500K rig when it hit 750 in progress.
Well I just set my 3570K rig to no new tasks with both rigs having over 600 tasks each in progress now. :-O
Cheers.


. . Well, with the system WU allocation limiters out on strike, what I am seeing is that I am being assigned the full allocation I am requesting, that is, half a day's work as set in the work fetch settings. So if anyone wishes to limit the size of their cache (and this is recommended), try reducing your work fetch setting to a value that is close to the limits that should be in place.

Stephen

:)
ID: 2022041 · Report as offensive
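
For anyone wanting to turn the advice above into a number, here is a rough, illustrative Python sketch (not an official calculator) for estimating the "days of work" setting that corresponds to a given count of tasks in progress. The task duration and concurrency figures are assumptions; substitute your own host's values.

    # Rough estimate of the BOINC work-buffer ("store at least X days of work")
    # setting that corresponds to a target number of tasks in progress.
    # All inputs below are illustrative assumptions; use your own host's figures.

    def work_buffer_days(target_tasks, avg_task_minutes, concurrent_tasks):
        """Days of work represented by target_tasks on this host."""
        total_minutes = target_tasks * avg_task_minutes / concurrent_tasks
        return total_minutes / (60 * 24)

    # Example: aim for about 600 tasks at ~10 minutes each, running 2 at a time.
    print(round(work_buffer_days(600, 10, 2), 2))  # ~2.08 days
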
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2022048 - Posted: 7 Dec 2019, 9:38:42 UTC - in response to Message 2022043.  
Last modified: 7 Dec 2019, 9:52:25 UTC

Now I have another host with more than 100 CPU tasks. But the host that has been out of CPU work for hours never gets any when it requests some, and then sets a 1400 second backoff timer. Strange.

And still no CPU tasks.
I reduced my cache setting so my Windows system could get some CPU tasks, but still no joy with the Linux system - the fact is, for every time my Linux system gets work, my Windows system gets it 3-5 times more often.
I'll probably have to reduce the cache setting even further so the Linux system can get some CPU work, then gradually bump it up till I can get 24 hours' worth (or as close to it as the new limits will allow) - the problem is the ridiculous number of requests it takes to actually get any work...
Grant
Darwin NT
ID: 2022048 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2022050 - Posted: 7 Dec 2019, 9:52:03 UTC - in response to Message 2022043.  

Sat 07 Dec 2019 12:57:51 AM PST | SETI@home | Project requested delay of 303 seconds
That's a server backoff.

Sat 07 Dec 2019 12:57:51 AM PST | SETI@home | [work_fetch] backing off CPU 1400 sec
and that's a client backoff. Normal, and unrelated to the server's troubles.
ID: 2022050 · Report as offensive
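
A minimal sketch (not part of BOINC, just an illustration built on the two log lines quoted above) of how the two kinds of backoff message can be told apart when scanning an event log:

    import re

    # Server backoff: the project asked the client to wait before contacting it again.
    SERVER_RE = re.compile(r"Project requested delay of (\d+) seconds")
    # Client backoff: the client's own work-fetch logic is backing off a resource.
    CLIENT_RE = re.compile(r"\[work_fetch\] backing off (\w+) (\d+) sec")

    def classify(line):
        if m := SERVER_RE.search(line):
            return ("server backoff", int(m.group(1)))
        if m := CLIENT_RE.search(line):
            return ("client backoff ({})".format(m.group(1)), int(m.group(2)))
        return None

    print(classify("SETI@home | Project requested delay of 303 seconds"))   # ('server backoff', 303)
    print(classify("SETI@home | [work_fetch] backing off CPU 1400 sec"))    # ('client backoff (CPU)', 1400)
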
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2022070 - Posted: 7 Dec 2019, 11:24:27 UTC - in response to Message 2022064.  

It kinda doesn't make sense. My newer, faster Linux PC should have WAY more WUs than my older one then. It's got less than half what the older one has.
If the system isn't meeting the requests for work, then that will happen regardless of what your cache settings & the server limits may be.

My systems are a good example- the slower Windows system gets work on pretty much every other request. The faster Linux system on every 4-10 requests. End result- Windows system has a full cache, Linux system can't get close to a full cache and is about to run out of CPU work.
Grant
Darwin NT
ID: 2022070 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 2022073 - Posted: 7 Dec 2019, 11:50:58 UTC

Results-out-in-the-field is over 6 million, and still no sign of the Ready-to-send buffer refilling.
Grant
Darwin NT
ID: 2022073 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2022115 - Posted: 7 Dec 2019, 14:54:23 UTC - in response to Message 2022113.  
Last modified: 7 Dec 2019, 14:57:07 UTC

Well, I'm down below 800 tasks on my 2 x GPU machine, and I'm back to 'no work available', with no mention of a limit. Looks like "Zalster's Theorem" is right, but I won't know for certain until I get back from lunch.


I think I concur with the 4x GPU limit now.

my "slow" system with [64] spoofed GPUs started going over the 6400 task limit. So I changed the system to [16] spoofed GPUs and now it's reporting that it has reached the limit for tasks in progress.

If that is right, then we reach a new limit for the cache size: 64 x 400 = 25600 WU!!! LOL
Now we are ready for the next generation of GPUs.

That raises 2 questions:

- What is the new limit for the CPU? 400 WU too?
- If you run a >100 thread CPU, will it run a WU on each thread?
ID: 2022115 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2022118 - Posted: 7 Dec 2019, 15:03:02 UTC - in response to Message 2022115.  
Last modified: 7 Dec 2019, 15:03:14 UTC

Based on Cliff's info, it appears that the limits are now 200CPU + 400*[nGPU]
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2022118 · Report as offensive
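
Taking that community-inferred formula at face value (it has not been confirmed by the project), the per-host ceiling works out as in this small illustrative sketch, which reproduces the figures quoted in the surrounding posts:

    def task_limit(n_gpu, cpu_limit=200, per_gpu=400):
        """Apparent per-host in-progress limit: 200 CPU tasks + 400 per GPU (unconfirmed)."""
        return cpu_limit + per_gpu * n_gpu

    print(task_limit(2))   # 1000  (800 GPU tasks on a 2 x GPU host, as Richard observed)
    print(task_limit(16))  # 6600  (16 spoofed GPUs -> 6400 GPU task limit)
    print(task_limit(64))  # 25800 (64 spoofed GPUs -> 25600 GPU tasks, juan's figure)
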
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2022121 - Posted: 7 Dec 2019, 15:12:31 UTC - in response to Message 2022118.  

Based on Cliff's info, it appears that the limits are now 200CPU + 400*[nGPU]

OK, thanks. To test I will change my spoofed count to 24, which will give a 9600 GPU + 200 CPU WU cache size...
Below the 10000 limit... just in case...

One point to be observed on the next outage: what impact could a large number of hosts with such big caches have on the DB?
ID: 2022121 · Report as offensive
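
Running the same (assumed) formula backwards shows how the 24-GPU, 9800-task figure above falls out when trying to stay under a 10000-task total:

    def max_spoofed_gpus(total_cap, cpu_limit=200, per_gpu=400):
        """Largest GPU count that keeps 200 + 400*n at or below total_cap (assumed limits)."""
        return (total_cap - cpu_limit) // per_gpu

    n = max_spoofed_gpus(10000)
    print(n, 200 + 400 * n)  # 24 GPUs -> 9800 tasks in total, just under 10000
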
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2022127 - Posted: 7 Dec 2019, 15:54:51 UTC - in response to Message 2022113.  

Well, I'm down below 800 tasks on my 2 x GPU machine, and I'm back to 'no work available', with no mention of a limit. Looks like "Zalster's Theorem" is right, but I won't know for certain until I get back from lunch.
I think I concur with the 4x GPU limit now.
So do I - I returned from lunch to find exactly 800 GPU tasks on the 2 x GPU machine - I don't run SETI CPU tasks on that machine, which makes it easier.
ID: 2022127 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2022134 - Posted: 7 Dec 2019, 16:35:07 UTC - in response to Message 2022123.  

I doubt the project DBs have been sufficiently upgraded/fixed to be able to cope with this in the long run.
I fear it will show the same problems we had before the 100/100 limits were put in place.
Crashes, slowdowns, and general crappiness.
I don't think it's quite as bad as that, but I do agree that we should be proceeding with caution, and keeping a careful eye on the database.

We had a similar - but much worse - situation in November 2013. That one was worse because comms delays caused a huge number of ghost workunits to be created: they existed in the database, but were not downloaded, so they didn't inhibit clients from requesting more.

That was also a weekend, and by the time I wrote (with some trepidation) to Eric, there were over 10.5 million tasks supposedly 'out in the field'. That stat comes from message 1302186: from memory, it took about a week to recover from the mess.

That mess-up arose from an ill-judged intervention by David Anderson, and so far David is the only admin to have participated in captainiom / JSM's thread at GitHub. I've posted a consolidated concern which will be notified to both David and Eric, to say "Please keep an eye on the effects of this". Not a nice thing to have to do, again, at a weekend.
ID: 2022134 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2022136 - Posted: 7 Dec 2019, 16:38:28 UTC
Last modified: 7 Dec 2019, 16:41:50 UTC

No official answers, but it looks like the group has figured out the new limits. I'll have to think about where I want to set my own limits.

In the last 9 hours someone on the SETI end added more files to be split, so someone has been keeping an eye on things.
The results out in the field are over 6 million and it is still splitting fine (at a high rate). It still can't keep up, as it has a large backlog of "holes" to fill. I usually set my cache at 10 days, because I knew I wanted the max allowed and figured it would never reach that.

Is there a daily limit? Will this allow machines with high error rates to go wild?

Edit - we are now splitting the blc14s that they pulled a full day ago, so this could add to the issues.
ID: 2022136 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2022140 - Posted: 7 Dec 2019, 16:47:42 UTC - in response to Message 2022134.  

I've posted a consolidated concern which will be notified to both David and Eric, to say "Please keep an eye on the effects of this". Not a nice thing to have to do, again, at a weekend.


Thank you Richard.
ID: 2022140 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 2022153 - Posted: 7 Dec 2019, 18:25:18 UTC

I looked at my old log entries, and for me (Eastern time) it seems to have started at:
06-Dec-2019 23:55:00 [SETI@home] Sending scheduler request: To report completed tasks.
06-Dec-2019 23:55:00 [SETI@home] Reporting 2 completed tasks
06-Dec-2019 23:55:00 [SETI@home] Requesting new tasks for CPU and NVIDIA GPU
06-Dec-2019 23:55:02 [SETI@home] Scheduler request completed: got 66 new tasks


I don't buy computers, I build them!!
ID: 2022153 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 2022154 - Posted: 7 Dec 2019, 18:27:38 UTC

Not a nice thing to have to do, again, at a weekend.


At my last company, no new or changed software was ever rolled out on a Friday; Wednesday was preferred, being the farthest point from a weekend.

Monday, people are recovering from the weekend; Tuesday, getting up to speed; Thursday, thinking about the weekend; Friday, winding down. ;-)
ID: 2022154 · Report as offensive


 