Posts by Ian&Steve C.

161) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039398)
Posted: 21 Mar 2020 by Ian&Steve C.
Post:
down to just 2 BLC tapes now.
162) Message boards : Number crunching : Bitcoin GPU-based Mining Machines good for BOINC / SETI? (Message 2039370)
Posted: 21 Mar 2020 by Ian&Steve C.
Post:
Sounds like you’ve now taken the record for the most GPUs on a single system, usurping Tbar.
163) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039331)
Posted: 21 Mar 2020 by Ian&Steve C.
Post:
Alas, no good ANSWER, only ideas. Or we could just limp along and close out with this. I don't think the SETI team at Berkeley cares one way or the other. Either way they get their data and close down the task of processing incoming data, perhaps occasionally throwing a little work out there for those of us who don't just disconnect on that day.


They really don’t have a choice. They are in forced self quarantine in California. They can’t go in to mess with anything in person. And they may not have 100% access to everything remotely.
164) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039323)
Posted: 21 Mar 2020 by Ian&Steve C.
Post:
It's very clear that the database size grew out of control after the first cache limit changes and BEFORE the quorum change. Notice how there was a slight lag before it really took off. It was unintentionally "saved" by the server change to 715, which prevented a large number of the fastest hosts on the project from getting work, and it started growing out of control again shortly after the server was reverted and the AP hosts could get work again.

This all fits my theory that the database was already near its limits in terms of RAM utilization; going over that limit makes everything slower. For an entire year the assimilation and validation queue never went over ~6 million, and was usually around 4 million. Then once they changed the limits, it shot up to over twice that in 2-3 weeks. The issues won't go away until that number goes back down to where it used to be, and what's preventing it from going back down is the large assimilation queue hogging space in the database. But it's a catch-22, because the server can't knock the queue down fast enough while the database is so full and running so slow, all while still serving out new tasks.

Even if all task distribution stopped right now, it would take longer than the remaining life of the project to get back to where it used to be. So it's all pointless.

Remember, the majority of people at SETI run Stock with a 10 day cache and have machines that in NO WAY resemble yours.

Nor yours. *shrug* Not sure what your point is.

But I'm curious where you get the idea that the majority of SETI users run a 10-day cache. Running stock, I agree with you; I'm sure most people just download BOINC for their OS, attach whatever project they want, and let it run. But the default cache setting is quite low, something like 0.3 days, and I doubt most people manually change that to 10 days.
165) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039314)
Posted: 21 Mar 2020 by Ian&Steve C.
Post:
Also, just ignore the Fact that the only change to the Server was the quorum settings.


Several things were changed at many different times. Not sure why "fact" is worthy of a capital F.

device limits raised from 100/100->200/400 ~ December 7th 2019
device limits lowered from 200/400->200/300 ~ December 10th 2019
Server version changed from 709 to 715 ~ December 20th 2019
"change to the validator that raises the effective quorum for overflow results" ~ December 21st 2019
Server version changed from 715 to 709 + unnamed "fix" ~ December 26th 2019
device limits lowered from 200/300->150/250 ~ January 16th 2020
device limits lowered from 150/250->150/150 ~ January 16th 2020


Pictures for those who need them. I've added the above events over the SSP history. Where does it look like the problems started?

1st red dashed line = Dec 7th
2nd red dashed line = Dec 10th
1st blue dotted line = Dec 20th
2nd blue dotted line = Dec 26th
3rd red dashed line = Jan 16th



The values climbed drastically after the first two global device limit changes.
The values peaked and started to freefall when the server was changed to 715 and the quorum change was implemented the next day, most likely because AP users could no longer get work for several days.
The server was reverted back to the normal 709 about midway through the drop, and the values continued to fall afterward; it takes time for the AP hosts to fill back up.
The queues start rising again because the large cache settings are still in place.
The staff tried to mitigate the problem by reducing the limits, but the damage was already done. The sink's drain is already clogged and can't drain as fast as it used to.
166) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039269)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
Only 528 blc channels left to split, and then we can have some peace here :-)
Arecibo work will go away so quickly that we won't even notice when new Arecibo work is added.


It’ll definitely be weird seeing a stats page with no BLC tapes on it.
167) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039239)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
Looks pretty Obvious to Me the lines start going up after Dec 21st. Why are you defending something that adds tasks to the database when the problem is tasks being added to the database? Are you married to that change to the point you can't let it go? If you remember a number of people couldn't get work when at Server version 715, it wasn't until the change back to 709 on the 26th that workflow resumed to normal. Where is Dec 26th on that chart?


Follow the link I provided, click the charts, and zoom in to see the dates. The green line suddenly rises on Dec 7th. I trust you're smart enough that you don't need me to hold your hand.

I'm not married to any change. I don't care if they back out the quorum change; I just don't think it'll help. Honestly, nothing will help this late in the game, even if the whole staff weren't in self-quarantine.
168) Message boards : Number crunching : Top Participant (Message 2039228)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
I didn't notice ;)

But that's not the goal. I might pass his RAC briefly by the end, but I'm really just trying to get 3rd place in total points and kick out CharityEngine1 before the servers poop themselves haha

We're already outproducing W3Perl on daily production; it's just the RAC that's behind.
169) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039225)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
Because along the way "something" happened that affected the assimilation process(es).

It's still assimilating, but not fast enough to keep up with the rate at which new work is being processed, returned, and added to the assimilation queue.

Just look at the SSP history charts. It's pretty obvious.

https://munin.kiska.pw/munin/Munin-Node/Munin-Node/workunits_setiathomev8.html


The sudden rise in the validation and assimilation queues happened on December 7th, right in line with when they first started making changes to the server, and well before any quorum settings were changed (that we know of).

The assimilation numbers briefly came back under control through the beginning of January, and didn't start going out of control again until around Jan 10th; they've been steadily increasing since then. Nothing about our current situation lines up with the changes to the validators on Dec 21st.

My guess is that the server was close to maxing out its RAM, but relatively stable, with enough headroom that things were generally smooth and all systems operated at full speed. When they lifted the limits, plus the fallout from the server downtimes and other issues, it pushed RAM use over the edge, which slowed down ALL processes. That effectively reduced the assimilator speed, while the rate at which new tasks were added to its queue (the return rate) stayed about the same.

It's a sink being filled faster than it can drain. This won't be fixed until the project is over and the faucet filling the sink is turned off.
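The sink analogy can be sketched with a toy queue model. All of the rates below are invented for illustration; they are not measured SSP values.

```python
# Toy model of the assimilation queue as a sink: new results flow in at a
# fixed return rate while the assimilators drain at a (slower) fixed rate.
# Every number here is made up purely to illustrate the dynamics.

def queue_after(days, start, inflow_per_day, drain_per_day):
    """Queue size after `days` days, never dropping below zero."""
    q = start
    for _ in range(days):
        q = max(0, q + inflow_per_day - drain_per_day)
    return q

# If inflow exceeds drain by 300k/day, a 4M queue more than doubles
# in three weeks:
print(queue_after(21, 4_000_000, 1_000_000, 700_000))  # 10,300,000

# Even with the faucet fully off (inflow = 0), draining an 8M backlog
# at 700k/day still takes about 12 days:
print(queue_after(12, 8_000_000, 0, 700_000))  # 0 (backlog cleared)
```

The point of the sketch is just that once the inflow rate exceeds the drain rate, the backlog grows linearly no matter how large the queue already is, and it only shrinks once the faucet is turned off.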
170) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039212)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
thanks, edited.

But it looks like it was December 20th, according to your comments on that report. You reproduced it on LHC Dec 22.
171) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039207)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
Also, just ignore the Fact that the only change to the Server was the quorum settings.


Several things were changed at many different times. Not sure why "fact" is worthy of a capital F.

device limits raised from 100/100->200/400 ~ December 7th 2019
device limits lowered from 200/400->200/300 ~ December 10th 2019
Server version changed from 709 to 715 ~ December 20th 2019
"change to the validator that raises the effective quorum for overflow results" ~ December 21st 2019
Server version changed from 715 to 709 + unnamed "fix" ~ December 26th 2019
device limits lowered from 200/300->150/250 ~ January 16th 2020
device limits lowered from 150/250->150/150 ~ January 16th 2020
172) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2039143)
Posted: 20 Mar 2020 by Ian&Steve C.
Post:
Looking at my own task counts, there does seem to be an unusual discrepancy between what the server thinks I have in progress (2957), and what I can see locally (1742). Ghosts! When/if the server emerges from its typical morning constipation, I'll try to identify which computer(s) is/are most affected, and try my own ghost-busting technique.


Do you really have that many ghosts? Or is it simply a product of the replica delay? It's 67,500 seconds behind, so any task counts for your host viewed from the website are as of 18.75 hours ago. Maybe you just crunched ~1,200 tasks and sent them back, and they haven't been refilled due to the server issues.
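The arithmetic behind the "replica delay, not ghosts" explanation, using the counts quoted in the post:

```python
# Sanity check on the "ghosts vs. replica delay" numbers from the post.
server_in_progress = 2957   # in-progress count per the (lagging) replica
local_in_progress  = 1742   # in-progress count the client actually shows

replica_lag_s = 67_500
lag_hours = replica_lag_s / 3600              # 18.75 hours behind

discrepancy = server_in_progress - local_in_progress   # 1215 tasks
# If the host returned ~1215 tasks over those 18.75 hours without getting
# refills, the gap is fully explained with no ghosts at all:
implied_rate = discrepancy / lag_hours        # ~65 tasks/hour
print(lag_hours, discrepancy, round(implied_rate, 1))
```

An implied completion rate of roughly 65 tasks per hour is plausible for a fast multi-GPU host, which is why the replica delay alone can account for the whole discrepancy.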
173) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038768)
Posted: 18 Mar 2020 by Ian&Steve C.
Post:
Even if we run out of BLC work, it looks like the Arecibo automation is still running; 2 more tapes have been added.

It also looks like they removed any throttling. They put the brick on the gas pedal and let it go! Hahaha.
174) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038538)
Posted: 17 Mar 2020 by Ian&Steve C.
Post:
Should see some improvement on March 23, because that is when thousands of my quorum=1 tasks that validated back at the end of January will have my wingmen time out or finally report their tasks.

That should reduce the size of the database, I would hope, for the last week of SETI.


or crash it from the resends lol
175) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038530)
Posted: 17 Mar 2020 by Ian&Steve C.
Post:
No outage again this Tuesday?


Probably not: https://news.berkeley.edu/coronavirus/
176) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038487)
Posted: 17 Mar 2020 by Ian&Steve C.
Post:
*IF* downtime hits.

https://news.berkeley.edu/coronavirus/

The forums are very sluggish now, though.
177) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038342)
Posted: 16 Mar 2020 by Ian&Steve C.
Post:
I'm looking at the task list on the website, not on the host.

But just as I posted that, the SSP got an update, showing 32,000 now.
178) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038338)
Posted: 16 Mar 2020 by Ian&Steve C.
Post:
The SSP said: Replica seconds behind master 42,550


That value has been frozen for several hours. But if you look at the newest tasks visible in your task list, that gives you an idea of how far behind it really is.

The newest tasks I can see in my lists were reported at 10:43 UTC; it's now 17:50 UTC.

That means the replica is about 7 hours behind right now, or about 25,200 seconds, and it's still catching up.
179) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038331)
Posted: 16 Mar 2020 by Ian&Steve C.
Post:
According to the newest tasks I can see in my list, the replica is at least down to about 25,800s behind, a little over 7 hours. It's making progress there, at least.
180) Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119) (Message 2038194)
Posted: 16 Mar 2020 by Ian&Steve C.
Post:
RTS is nice and plump
splitters are splitting
replica delay is dropping

no panic, no posts.




©2020 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.