The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028419 - Posted: 18 Jan 2020, 23:49:55 UTC

Here is an explanation for the intermittent work generation we are experiencing now: https://setiathome.berkeley.edu/forum_thread.php?id=85093
ID: 2028419
bluestar
Joined: 5 Sep 12
Posts: 7443
Credit: 2,084,789
RAC: 3
Message 2028424 - Posted: 19 Jan 2020, 1:00:45 UTC
Last modified: 19 Jan 2020, 1:02:20 UTC

And there are a billion individual results in the Master Database that are just in the way, with no real value or meaning; perhaps they should be thrown away, given the capacity problems and the intermittent problems as well.

Why be stuck at perhaps 96 GB of RAM when a lookup still needs a query anyway? Put it on disk instead, where we could make it terabytes.
ID: 2028424
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21763
Credit: 7,508,002
RAC: 20
United Kingdom
Message 2028435 - Posted: 19 Jan 2020, 2:49:01 UTC - in response to Message 2028424.  

And there are a billion individual results in the Master Database that are just in the way, with no real value or meaning; perhaps they should be thrown away, given the capacity problems and the intermittent problems as well.

Why be stuck at perhaps 96 GB of RAM when a lookup still needs a query anyway? Put it on disk instead, where we could make it terabytes.

1TByte of RAM anyone?

A new fundraiser??


Keep searchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 2028435
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1647
Credit: 12,921,799
RAC: 89
New Zealand
Message 2028436 - Posted: 19 Jan 2020, 3:48:45 UTC - in response to Message 2028325.  

Validation backlog did shed a digit. It is under 10 million now.

Validation backlog is now a smidge under 9.7 million. Any idea what number we are aiming for, to fit things back into memory?
ID: 2028436
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2028437 - Posted: 19 Jan 2020, 4:06:08 UTC - in response to Message 2028436.  
Last modified: 19 Jan 2020, 4:11:23 UTC

Validation backlog is now a smidge under 9.7 million. Any idea what number we are aiming for, to fit things back into memory?
No idea, but when things are working normally the "Results returned and awaiting validation" is usually a lot less than the "Results out in the field" number (and with the limited availability of work, that will be considerably lower than usual).
So I'd say another 4 million. Then the Assimilators need to do their thing, then the Deleters, then the Purgers, and then things will be back down to manageable levels.

Edit: it has taken 3.5 days to go from 10.66 million to just under 9.8 million, and the limit on work availability has only been around for about two days.
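As a rough sanity check, here is a tiny back-of-the-envelope sketch in Python, using only the figures quoted in this post; the real drain rate obviously varies from day to day.

# Back-of-the-envelope ETA for the validation backlog, based on the figures above.
start_backlog = 10.66e6      # results awaiting validation ~3.5 days ago
current_backlog = 9.8e6      # roughly where it is now
elapsed_days = 3.5
further_drop_needed = 4.0e6  # the additional reduction guessed at above

drain_per_day = (start_backlog - current_backlog) / elapsed_days
eta_days = further_drop_needed / drain_per_day
print(f"Draining ~{drain_per_day / 1e6:.2f} million results/day -> roughly {eta_days:.0f} more days")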
Grant
Darwin NT
ID: 2028437
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028440 - Posted: 19 Jan 2020, 4:27:43 UTC
Last modified: 19 Jan 2020, 4:34:46 UTC

"Results waiting for db purging" is very high too. 'Natural' size of it should be about 24 times the average value of "Results received in last hour" because the validated stuff is supposed to be kept around for 24 hours.

This work availability throttling hasn't yet had any effect on the rate at which my crunchers are returning results. My queues have run very short many times, but the hosts have so far managed to receive a bunch of new work just before running out. Right now my queues are almost full.
ID: 2028440
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2028446 - Posted: 19 Jan 2020, 5:01:29 UTC

My Linux system is mostly out of GPU work; every so often it picks up some (along with a bit more CPU work, somehow). My Windows system is slow enough that it gets more work before it comes even close to running out.
Quite a few batches of 20+ resends.
Grant
Darwin NT
ID: 2028446
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2028453 - Posted: 19 Jan 2020, 5:33:54 UTC
Last modified: 19 Jan 2020, 5:37:03 UTC

Looks like my AMD 3950X is running 21 CPU tasks and had 6 GPU tasks, but it is regressing to E@H again.

Apparently I am getting CPU tasks more reliably than GPU tasks, or I am just chewing through the GPU tasks so fast that I am not catching them in the act.

And my Einstein@Home RAC has topped 80,000. That is way too much for a "backup" project :)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2028453
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2028456 - Posted: 19 Jan 2020, 6:13:29 UTC - in response to Message 2028453.  

And my Einstein@Home RAC has topped 80,000. That is way too much for a "backup" project :)
No, it's just that SETI pays way under the defined rate.
Grant
Darwin NT
ID: 2028456
rob smith - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22816
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2028459 - Posted: 19 Jan 2020, 8:59:06 UTC - in response to Message 2028424.  

The table that Eric is talking about as being held in RAM is the "temporary" one that holds work that is out in the field, so it shouldn't be that big. We've had similar issues in the past, and they were solved by adding more RAM. But there is a question that needs to be answered: will the motherboards of the servers in question support more than 96 GB? If not, then buy a pair of new servers with adequate RAM (and the potential for more) and do a server shuffle, so that lower-powered servers get an upgrade and a couple at the bottom are retired. The trouble with that is it would take a good few hours of work to move everything around, and some things might not like being moved onto new hardware.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2028459
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028461 - Posted: 19 Jan 2020, 9:19:21 UTC - in response to Message 2028459.  

Shrink the database first while they think about what to do next, and keep it lean and mean for the time being! It'll be much easier to move things around if they're smaller. I'd also suggest pausing the reprocessing of old Arecibo tapes while things are rocky - I'm still seeing a lot of shorties (by which I mean VHAR) and overflows from noisy tapes. They both produce a lot of results in very little time, just the opposite of what we need at the moment. But keep the new recordings flowing from Arecibo: recent tapes have included lots of VLAR work, which is both slower and more likely to include interesting data. Win win.
ID: 2028461
Kevin Olley
Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 2028463 - Posted: 19 Jan 2020, 10:13:52 UTC - in response to Message 2028453.  


And my Einstein@Home RAC has topped 80,000. That is way too much for a "backup" project :)

Tom


Yes. Well, wait until you find you are climbing up their top-50 computers list.

Looks like I am at the bottom of the queue when they hand out tasks here :-(
Kevin


ID: 2028463
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1859
Credit: 268,616,081
RAC: 1,349
United States
Message 2028465 - Posted: 19 Jan 2020, 11:13:49 UTC
Last modified: 19 Jan 2020, 11:17:20 UTC

Perhaps I missed it, but I don't see anyone mentioning what might be the best way to shrink things a bit.
When I grab a task from Einstein, the deadline is around 10-14 days.
When I grab a task here, the deadline is 1-2 months. That's too long.
They're very upfront at E@H about not wanting to raise their deadlines, specifically due to database server loading issues.
Perhaps lowering the deadlines here simply makes sense.
SETI is arguably the most visible BOINC project.
As a result, I suspect this project gets the largest percentage of people new to the concept, who are thus more likely to decide it isn't for them and go away, leaving work in the db to time out.
The long deadlines made sense in the earlier days of the project, when task run times were high and computers were weaker. Perhaps it's time to revisit that decision.
ID: 2028465
Brandaan
Joined: 5 Jan 20
Posts: 17
Credit: 384,179
RAC: 0
Belgium
Message 2028466 - Posted: 19 Jan 2020, 11:38:31 UTC - in response to Message 2028465.  
Last modified: 19 Jan 2020, 11:39:08 UTC

Perhaps I missed it, but I don't see anyone mentioning what might be the best way to shrink things a bit.
When I grab a task from Einstein, the deadline is around 10-14 days.
When I grab a task here, the deadline is 1-2 months. That's too long.
They're very upfront at E@H about not wanting to raise their deadlines, specifically due to database server loading issues.
Perhaps lowering the deadlines here simply makes sense.
SETI is arguably the most visible BOINC project.
As a result, I suspect this project gets the largest percentage of people new to the concept, who are thus more likely to decide it isn't for them and go away, leaving work in the db to time out.
The long deadlines made sense in the earlier days of the project, when task run times were high and computers were weaker. Perhaps it's time to revisit that decision.


I agree; there needs to be an overhaul to catch up with modern times.
ID: 2028466
Retvari Zoltan
Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2028472 - Posted: 19 Jan 2020, 12:05:12 UTC - in response to Message 2028465.  

Perhaps I missed it, but I don't see anyone mentioning what might be the best way to shrink things a bit.
When I grab a task from Einstein, the deadline is around 10-14 days.
When I grab a task here, the deadline is 1-2 months. That's too long.
They're very upfront at E@H about not wanting to raise their deadlines, specifically due to database server loading issues.
Perhaps lowering the deadlines here simply makes sense.
SETI is arguably the most visible BOINC project.
As a result, I suspect this project gets the largest percentage of people new to the concept, who are thus more likely to decide it isn't for them and go away, leaving work in the db to time out.
The long deadlines made sense in the earlier days of the project, when task run times were high and computers were weaker. Perhaps it's time to revisit that decision.
The other way to catch up with the computing power that state-of-the-art computers provide is to make the workunits longer.
Provided, that is, that their length is not hard-coded into the apps. (Is the length of the tasks hard-coded into the apps?)
State-of-the-art GPUs can process a workunit (with the special app) in less than a minute (~30 secs), so the overhead of getting that workunit delivered and processed (~3 sec) is becoming comparable to the processing time itself. Longer workunits would lower the impact of this overhead and make the tables shorter at the same time.
The maximum number of queued tasks per GPU/CPU could be reduced as well.
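To make the arithmetic explicit, a small sketch in Python; the 30 s crunch time and 3 s overhead are the rough figures above, and the length factors are purely hypothetical.

# How the per-workunit overhead fraction shrinks as workunits get longer.
crunch_seconds = 30.0    # ~30 s per WU on a top GPU with the special app
overhead_seconds = 3.0   # ~3 s of scheduling/download/report overhead per WU

for length_factor in (1, 4, 10):          # hypothetical "make the WU N times longer"
    crunch = crunch_seconds * length_factor
    overhead_fraction = overhead_seconds / (crunch + overhead_seconds)
    print(f"{length_factor:2d}x longer WU: overhead is {overhead_fraction:.1%} of wall time")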
ID: 2028472
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028475 - Posted: 19 Jan 2020, 12:32:54 UTC - in response to Message 2028472.  

The other way to catch up with the computing power that state-of-the-art computers provide is to make the workunits longer.
That might cause problems with one of the stated aims of the project: the long-term monitoring of repeated observations of signals from the same point in the sky. The signal processing has to be consistent over the entire long-term run for the re-observations to be comparable.

One alternative which Einstein tried was to bundle multiple workunits into a single downloadable task. That would reduce the total number of scheduler requests from fast computers, though I'm not sure how the bundling and unbundling would impact other server tasks. The time spent by volunteers' computers setting up each run will be of minimal concern to the project: fast processors already have the 'mutex' build available to them, but report that the setup time is largely disk-limited, and negligible on SSD drives.
ID: 2028475
juan BFP - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2028476 - Posted: 19 Jan 2020, 12:39:01 UTC

From observation only - I have no solid data to be sure.
Maybe some of you remember what I have posted several times in this thread.
Each time the total number of WUs rises above 23 million, weird things start happening.
At the beginning of the week that number was well above 30 million.
By the last SSP it was at 27-28 million.
At this rate, in a couple of days we will get back under the 23 million barrier and things should start returning to normal.
I hope!
ID: 2028476
juan BFP - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2028478 - Posted: 19 Jan 2020, 12:46:46 UTC - in response to Message 2028475.  
Last modified: 19 Jan 2020, 13:00:57 UTC

The time spent by volunteers' computers setting up each run will be of minimal concern to the project: fast processors already have the 'mutex' build available to them, but report that the setup time is largely disk-limited, and negligible on SSD drives.

Agreed. With the mutex builds that set-up time hardly matters, since your host downloads and prepares the next WU while it keeps crunching the current one. I can say that for sure, and it holds even with an extremely large cache (up to 20k WUs) on slow SSD/HD devices like the ones I have been testing over the last few weeks.

<edit> Off topic, but IMHO the next bottleneck of the project in the coming years is the growth of GPU capacity. Today a top GPU can crunch a WU in less than 30 secs, so a host with 10 of these GPUs produces 100 WUs in each 5-minute "ask for new work" cycle. With the arrival of the Ampere GPUs that number will rise even more. Feeding hosts on that 5-minute cycle will be an impossible task for such multi-GPU monsters, which will probably also run a lot of CPU cores (maybe more than one CPU) too.
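A quick sketch of that feed-rate arithmetic in Python; the 30-second crunch time, 10 GPUs and 5-minute request cycle are just the rough figures from this post, not measurements.

# Rough feed-rate arithmetic for a hypothetical multi-GPU host, using the
# approximate figures quoted above; real numbers vary per host and app.
seconds_per_wu = 30          # ~30 s per workunit on a current top GPU (special app)
gpus_per_host = 10           # hypothetical "monster" host
request_interval_s = 5 * 60  # scheduler is asked for new work roughly every 5 minutes

wu_per_cycle = gpus_per_host * request_interval_s / seconds_per_wu
print(f"Needs ~{wu_per_cycle:.0f} new tasks every {request_interval_s // 60} minutes just to stay busy")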
ID: 2028478
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028479 - Posted: 19 Jan 2020, 12:50:20 UTC - in response to Message 2028461.  

But keep the new recordings flowing from Arecibo: recent tapes have included lots of VLAR work, which is both slower and more likely to include interesting data. Win win.
This has the extra bonus that new Arecibo data would produce Astropulse work too, and since those are many times slower to crunch, that would be the third win.
ID: 2028479
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028482 - Posted: 19 Jan 2020, 13:18:23 UTC - in response to Message 2028476.  

From observation only - I have no solid data to be sure.
Maybe some of you remember what I have posted several times in this thread.
Each time the total number of WUs rises above 23 million, weird things start happening.
At the beginning of the week that number was well above 30 million.
By the last SSP it was at 27-28 million.
At this rate, in a couple of days we will get back under the 23 million barrier and things should start returning to normal.
I hope!
What fields are you summing? I have been tracking the decrease of the sum of all non-overlapping result fields on the SSP (Eric said the problem was the result table not fitting in RAM), and that is currently 22.6 million. The SSP doesn't reveal the total number of workunits because there are no fields for several states, but we can estimate it to be around 10.2 million, assuming the average replication is the same as in the 'waiting for db purging' state (the only state where the SSP reports both result and workunit counts).
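For anyone who wants to reproduce the estimate, a sketch of the calculation in Python; the purge-state figures below are illustrative placeholders, not real SSP readings.

# Estimate the workunit count by scaling total results down by the average
# replication seen in the one state where the SSP shows both counts.
total_results = 22.6e6       # sum of the non-overlapping result fields on the SSP
purge_results = 4.43e6       # hypothetical "waiting for db purging" result count
purge_workunits = 2.0e6      # hypothetical matching workunit count

avg_replication = purge_results / purge_workunits      # ~2.2 results per workunit
estimated_workunits = total_results / avg_replication
print(f"Estimated workunits in the db: ~{estimated_workunits / 1e6:.1f} million")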
ID: 2028482