The Server Issues / Outages Thread - Panic Mode On! (119)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2040264 - Posted: 25 Mar 2020, 8:54:33 UTC - in response to Message 2040263.  

Sweet dreams. I'm on the other end of that timeline - just starting the day. More coffee, to dispel the traces of last night's sleepyjuice.
ID: 2040264
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2040266 - Posted: 25 Mar 2020, 9:06:12 UTC - in response to Message 2040264.  

I'm supposed to be on that side of the world myself. I actually live in Thailand (long story), but I'm back in the US doing something I didn't think I'd have to: completing a degree. It seems the Thai government won't let most foreigners work over there without a STEM degree. I've never needed one, because I have at least half a dozen certifications from the major vendors, not to mention 30 years' experience with nearly every main operating system, up to supercomputers running TRIX or Secure Solaris (now Solaris with security extensions), but the people who write laws don't understand how people in the computer industry actually get jobs. So I find myself in Monterey doing JC stuff and trying to get into Berkeley. So... have a good day, sir, and enjoy that coffee.
Guy
ID: 2040266
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2040269 - Posted: 25 Mar 2020, 9:49:26 UTC

How can this be possible? https://setiathome.berkeley.edu/result.php?resultid=8682469710

The server is sending the host GPU work that its advertised GPU can't run. And the host is not using anonymous platform!
ID: 2040269
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2040270 - Posted: 25 Mar 2020, 10:15:12 UTC - in response to Message 2040269.  

The staff team don't have the time - have never had the time - to program the servers with every transient edge-case, like 'bad driver #xx from manufacturer y'. They don't - can't - bother to even try.

So the system only works at the broad-brush level: got an ATI card? Here's an ATI app. The BOINC system as a whole is designed - not necessarily well designed, but designed - to tolerate the small number of failures that drop through the cracks, without losing any science. That task will be finished off by somebody else.
ID: 2040270
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2040271 - Posted: 25 Mar 2020, 10:22:08 UTC - in response to Message 2040270.  

The staff team don't have the time - have never had the time - to program the servers with every transient edge-case, like 'bad driver #xx from manufacturer y'. They don't - can't - bother to even try.
The Seti@Home staff aren't even responsible for that; the Boinc devs are.
ID: 2040271
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2040278 - Posted: 25 Mar 2020, 10:48:42 UTC - in response to Message 2040271.  

The Seti@Home staff aren't even responsible for that; the Boinc devs are.
No, this one is down to SETI. The BOINC software provides the tools to make the app selection as precise as you like, to match the hardware - see Specifying plan classes in C++.

But it's up to the project - SETI - to specify exactly what the rules are for their particular application set. We have an exceptionally wide set of applications to choose from, and an exceptionally complex set of rules for what will run on what. In this particular example case, the app selected through that process was correct for the broad category of hardware (ATI), but wrong for the specific case (model too old). That would be a SETI-specific rule, if the SETI staff had the time to write it. They didn't, and don't.
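
To make the broad-versus-specific distinction concrete, here is a rough sketch in Python. It is not BOINC's plan class API (real plan classes are written in C++ or in the project's own configuration). The `Host` fields, the generation numbers and the cutoff are all made-up assumptions, used only to show how a broad rule lets an old card through while a narrower rule would catch it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical description of a host; not a real BOINC structure."""
    gpu_vendor: str      # e.g. "ATI"
    gpu_generation: int  # made-up ordinal: a higher number means newer hardware


# Hypothetical cutoff, chosen for illustration only (not a real SETI value).
MIN_SUPPORTED_GENERATION = 5


def broad_rule(host: Host) -> bool:
    """Broad-brush selection: got an ATI card? Here's an ATI app."""
    return host.gpu_vendor == "ATI"


def specific_rule(host: Host) -> bool:
    """Narrower selection: also reject models that are too old for the app."""
    return broad_rule(host) and host.gpu_generation >= MIN_SUPPORTED_GENERATION


old_card = Host(gpu_vendor="ATI", gpu_generation=3)
new_card = Host(gpu_vendor="ATI", gpu_generation=6)

if __name__ == "__main__":
    for host in (old_card, new_card):
        print(host, "broad:", broad_rule(host), "specific:", specific_rule(host))
```

Under the broad rule both hosts are sent the app, and the old one fails; only the narrower rule, which somebody would have to write and maintain, keeps the work away from it.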
ID: 2040278
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19724
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2040283 - Posted: 25 Mar 2020, 11:19:34 UTC

If you want to know about the BOINC code from over 6 years ago, then you might find some links here at RomWorld.
An example is BOINC Client: The evils of 'Returning Results Immediately'.
ID: 2040283
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2040290 - Posted: 25 Mar 2020, 11:49:40 UTC - in response to Message 2040283.  

Good one! I think he wrote that in response to what I was posting at the time.
ID: 2040290
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2040293 - Posted: 25 Mar 2020, 12:27:58 UTC

That confirms that I did the right thing when I modded my client to not report any results if it has reported any in the last 30 minutes.

It contacts the scheduler whenever Boinc thinks it wants new work, which is every 5 minutes (the server-specified cooldown) because my GPU chews through a single task in a lot less than 5 minutes, but it only reports results on approximately every sixth contact.
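
As a rough illustration of that throttling rule (not the actual client modification, which isn't shown in this thread), the logic can be sketched in Python. The 30-minute and 5-minute figures come from the description above; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

REPORT_INTERVAL_MIN = 30  # no report if one was sent in the last 30 minutes
CONTACT_INTERVAL_MIN = 5  # server-specified cooldown between scheduler contacts


@dataclass
class ReportThrottle:
    """Hypothetical helper; not part of the real BOINC client."""
    min_interval: int = REPORT_INTERVAL_MIN
    last_report: Optional[int] = None  # minute of the last report, None if never

    def should_report(self, now: int) -> bool:
        """Allow a report only if the minimum interval has elapsed."""
        if self.last_report is None or now - self.last_report >= self.min_interval:
            self.last_report = now
            return True
        return False


def simulate(total_minutes: int) -> List[int]:
    """Contact the scheduler every cooldown period; return the report times."""
    throttle = ReportThrottle()
    return [
        t
        for t in range(0, total_minutes + 1, CONTACT_INTERVAL_MIN)
        if throttle.should_report(t)
    ]


if __name__ == "__main__":
    print(simulate(120))
```

With contacts exactly 5 minutes apart, the sketch reports on the first contact and then on every sixth one after it; real contact times drift, hence "approximately".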
ID: 2040293
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2040299 - Posted: 25 Mar 2020, 12:50:12 UTC - in response to Message 2040293.  

At the time Rom wrote that blog, the automatic reporting interval - the maximum wait time for a completed task - was 24 hours. The default reporting interval for a standard client is now 1 hour.

I set my machines, where appropriate, to have

Store up to an additional 0.05 days of work
- just over an hour. Sometimes, the client gets hungry first, and sometimes the maximum reporting interval is reached first. Either way, both reporting and work fetch are combined into a single database interaction.
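
For anyone checking the arithmetic, a small Python sketch converts the cache setting to minutes and compares it with the one-hour reporting interval. The variable names are just for illustration; only the 0.05 days and the 1 hour come from the text above.

```python
CACHE_DAYS = 0.05             # "Store up to an additional 0.05 days of work"
MAX_REPORT_INTERVAL_MIN = 60  # default reporting interval quoted above


def days_to_minutes(days: float) -> float:
    """Convert days to minutes, rounded to hide floating-point noise."""
    return round(days * 24 * 60, 6)


cache_minutes = days_to_minutes(CACHE_DAYS)
margin = cache_minutes - MAX_REPORT_INTERVAL_MIN

if __name__ == "__main__":
    print(f"cache: {cache_minutes} min, margin over reporting interval: {margin} min")
```

So the extra cache is 72 minutes, 12 minutes longer than the reporting interval, which is why either trigger can come first.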
ID: 2040299
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2040304 - Posted: 25 Mar 2020, 13:25:00 UTC - in response to Message 2040299.  
Last modified: 25 Mar 2020, 13:29:35 UTC

Either way, both reporting and work fetch are combined into a single database interaction.
If that blog is to be believed, then work fetch takes a fixed number of db queries per task regardless of how many you get at a time. I have also observed that when the servers were heavily loaded after Tuesday downtimes (back when we still had them), the more tasks you asked for, the more likely it was for the entire scheduler request to fail.

That's why I modified the reporting interval only, not the work fetch interval.

Also, grabbing a lot of tasks at once can leave the next host that hits the scheduler after you with nothing.
ID: 2040304
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2040308 - Posted: 25 Mar 2020, 14:21:00 UTC - in response to Message 2040304.  

Again, fair comment. My machines are, in general, not requesting a huge amount of work at the end of the hour. If there's a shortage, they repeat the hourly top-up request five minutes later, and usually catch up quite quickly and resume the pattern.

I'm trying to remember what I might have been posting to provoke Rom's blog. I like to think it might have been returning/reporting work quickly, so it could be assimilated and purged as quickly as possible, keeping the database size down. Some things never change!

Talking of which - kudos to the servers today. They're sending out a shorty storm, and my cache has noticeably increased in task numbers for the same time requested. Return rate is almost back up to 150,000 an hour, and the message boards seem responsive. They're doing something right.
ID: 2040308
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2040329 - Posted: 25 Mar 2020, 15:49:23 UTC - in response to Message 2040308.  

Talking of which - kudos to the servers today. They're sending out a shorty storm, and my cache has noticeably increased in task numbers for the same time requested. Return rate is almost back up to 150,000 an hour, and the message boards seem responsive. They're doing something right.
The database is swelling a lot. It has about 25.6 million results now and the splitters don't seem to be throttled at all so the database is growing without bound.

The last time the database went into disk-thrashing mode after spilling out of RAM, causing a day-long period of no new tasks, the result table was about half a million rows smaller than it is now.
ID: 2040329
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2040441 - Posted: 26 Mar 2020, 1:08:38 UTC

TBar was apparently right with his wild guess that the momentary leveling of the assimilation queue was a turning point. The queue has been shrinking for the last two and a half days, but too slowly to make any meaningful difference before the end of work distribution.
ID: 2040441
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2040451 - Posted: 26 Mar 2020, 2:19:13 UTC - in response to Message 2040450.  
Last modified: 26 Mar 2020, 2:19:33 UTC

And the data on the SSP is gone!!!

Nighty night SSP.


lol wow. the website was being quite laggy the last few mins or so too. guess that explains it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2040451
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2040457 - Posted: 26 Mar 2020, 3:03:10 UTC - in response to Message 2040269.  
Last modified: 26 Mar 2020, 3:06:27 UTC

How can this be possible? https://setiathome.berkeley.edu/result.php?resultid=8682469710

The server is sending the host GPU work that its advertised GPU can't run. And the host is not using anonymous platform!


This kind of goes to the heart of the issue I had with that one plan class they turned on a couple of months ago for me. It tended to error out in most cases for my build, but the replacement was in development and testing, so they opened it up, realizing that a certain set of computers was having issues with that specific code. This is how I made sure those tasks don't run. It is better not to run them and throw bad information into the system; just abort them ASAP so that others can work on them and get them out of the database quickly.
   <app_version>
       <app_name>setiathome_v8</app_name>
       <plan_class>opencl_ati5_SoG_mac</plan_class>
       <max_concurrent>0</max_concurrent>
       <cmdline> -version</cmdline>
       <ngpus>0</ngpus>
       <avg_ncpus>0</avg_ncpus>
   </app_version>


In most situations, the max_concurrent keeps them from running over other tasks in my client. If my client is low or out of other GPU tasks, it will try to run them anyway. That's where the command line kicks in and causes the program to terminate.
ID: 2040457
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1647
Credit: 12,921,799
RAC: 89
New Zealand
Message 2040458 - Posted: 26 Mar 2020, 3:10:25 UTC - in response to Message 2040457.  

I'm not sure I agree with you. As I understand it, every new work unit that is sent out creates another entry in the database, so you're creating extra load on the database.
ID: 2040458
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2040461 - Posted: 26 Mar 2020, 3:36:56 UTC

looks like the replica went kablamo.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2040461
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2040463 - Posted: 26 Mar 2020, 3:42:15 UTC - in response to Message 2040461.  

looks like the replica went kablamo.



I think you are right... I can't get much of a status page. So far I'm still sending and receiving work though
ID: 2040463
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13959
Credit: 208,696,464
RAC: 304
Australia
Message 2040464 - Posted: 26 Mar 2020, 4:03:05 UTC - in response to Message 2040451.  

lol wow. the website was being quite laggy the last few mins or so too. guess that explains it.
Web site was MIA for a while there; now it's just excruciatingly slow, with a very messed-up Server Status page that's lacking in details.
Grant
Darwin NT
ID: 2040464


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.