The Saga Begins (LotsaCores 2.0)
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242
Sorry about that, Bill. My big clumsy fingers late at night, lol. @Al... If you are feeling brave, you can increase the sbs to 1024; that would be the max value. I'm out of here for a while, Al. Good luck. Zalster
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Looks like the new system is about to overtake the original. Grant Darwin NT
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482
It's kind of looking that way. I really haven't changed anything, but over the last week or so I have noticed that the original one's RAC has been dropping. I had been hoping that this would be the first machine I had that would go over 100k RAC, but after hitting around 97k it's been dropping, and according to the log on that machine it's now in the high 80s and still trending down. I don't understand why; maybe there is just a weird string of WUs that it's been crunching through. Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one?
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66
Creditscrew strikes again.
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242
Is this something that is happening across the board, or am I just the lucky one? Yes, something is afoot. I've seen my own RAC dropping from around 115k down to the mid-80ks.
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one? I think it may have been due to 75% of the pfb splitters getting stuck last week. That skewed the WUs heavily in favor of guppi VLARs, with very few Arecibo non-VLAR tasks getting sent out. Hopefully, with that splitter situation corrected on Sunday, and with APs currently flowing, things will get back to normal shortly (whatever "normal" is).
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482
Ahh, ok, that might explain it. Thanks for the heads-up; I couldn't figure out what might have been wrong. Hopefully it starts its march back to at least the mid-90s again.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Actually, all of my machines have been dropping now that I've looked at them. Is this something that is happening across the board, or am I just the lucky one? That was my thought as well; there was very little to almost no Arecibo work for a while there, and that would result in a significant drop in Credit. Grant Darwin NT
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
I notice a few more "Finish file present too long" errors over the last couple of days. Is your system/data drive(s) an SSD or HDD? If you've got Process Explorer on that system, what does it show for CPU load on "Hardware interrupts & DPCs"? Grant Darwin NT
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482
Just installed it; it seems to be running between 0.39 and 0.76. Checked the RAC situation just now: it's dived down to around 78k, and the curve hasn't flattened yet.

Here is one question I have. I am running Mr Kevvy's program on this machine, automatically every 30 mins or so. If I remember correctly, this was sort of the anti-SoG setup, with more active management as opposed to the set-and-forget approach after configuring with SoG. When I was looking at the tasks running this morning, I noticed that 1) they were pretty much all guppis, which isn't surprising after reading about the issues we are having with getting good AR tasks right now, and 2) the CPU tasks are running on version 8.0, while the GPU tasks are running on 8.12 and say SoG. Is this an issue? I didn't think I had installed SoG on this machine, because I was going to try the active management on this one and SoG on the 2.0, but here it is. Might this be part of the problem, trying to actively manage the SoG client, hence causing at least some of the plummeting RAC? Not sure if it has any bearing at all, but thought I'd toss it out there for discussion.
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
First of all, it's not really SoG you're trying to manage, but guppi VLARs and Arecibo non-VLARs. It doesn't matter if you're running Cuda or SoG on the GPUs.

Secondly, yes, the frequency of your rescheduling could have an impact, both on your RAC and on those "finish file present too long" errors you're getting. With so many concurrent tasks running on that box, every time BOINC is shut down during a rescheduling run, you run the risk of catching one or more tasks in their shutdown phase, where the finish file has already been written by the app, but before BOINC has checked its existence. When BOINC comes back up, more than 10 seconds have passed and the error is generated.

Rescheduling every 30 minutes could also hurt your overall throughput if you still have checkpoints set to the default, which I think is 300 seconds. Any time tasks are restarted, they have to go back to the last checkpoint so, on average, you could be losing close to 2.5 minutes of processing time for each task that's restarted after a rescheduling run. That could be significant.

Personally, I would recommend a far longer interval between rescheduling runs and, if you haven't already made such a change, a shorter checkpoint interval. (But not too short, since there is some additional overhead incurred with each checkpoint. Mine is usually set in the 120 to 150 second range.)
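The timing window Jeff describes can be sketched in a few lines. This is only an illustration of the race, not BOINC's actual code; the file name and function are assumptions, though the 10-second grace period comes from his description above:

```python
import os
import time

FINISH_GRACE_SECONDS = 10  # window mentioned above; assumed, not from BOINC source

def check_finish_file(path, grace=FINISH_GRACE_SECONDS):
    """Classify a task by its finish file, illustrating the race:
    if the client only looks after a restart, the file may already be
    older than the grace window and the task gets flagged as an error
    even though it completed normally."""
    if not os.path.exists(path):
        return "running"
    age = time.time() - os.path.getmtime(path)
    if age > grace:
        return "error: finish file present too long"
    return "finished"
```

A rescheduling restart that takes longer than the grace window lands a just-completed task in the error branch, which matches the symptom in this thread.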
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482
Jeff, thanks for the advice, I'll try backing off the schedule to every hour and see what effect that has. Where does one go to shorten the checkpoint interval?
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Where does one go to shorten the checkpoint interval? In BOINC Manager > Options, on the Computing tab, all the way at the bottom there's an option to "Request tasks to checkpoint at most every ___ seconds."
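The same preference can also be set from a file, which is handy on headless crunchers. If I recall BOINC's override format correctly (the element name here is from memory, so treat it as an assumption), a `global_prefs_override.xml` in the BOINC data directory like this requests checkpoints at most every 120 seconds:

```xml
<global_preferences>
   <disk_interval>120</disk_interval>
</global_preferences>
```

The client should pick it up after a restart, or after re-reading the local preferences from the Manager.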
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Jeff has very accurately described the mechanism by which BOINC fails in its file management. I too learned that if you shut down the Manager right before a task finishes up, you risk the "finish file present too long" error message. I am very careful to see where all tasks are in their completion percentage before running the Qopt program. I also know that on the 1070, especially on SoG, the percentage to completion is not tracked accurately from about 90% onward; the actual percentage is likely already at 99% when the Manager indicates 90% or so. If I have any SoG tasks already past 90% completion, I let them finish up and look at the logfile to make sure they have uploaded completely before running Qopt. I haven't had the finish error since I changed my routines. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association)
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
These 'odd things' come up from time to time, and in general they trace back to some poor design decisions. I'd be happy to detail those if requested; however, repeated attempts to submit fixes have been met with narcissistic crap, which I have no time for anymore. [Edit:] Example of thread safety: request shutdown, worker shuts down, acknowledges shutdown, shutdown happens. Example of crap: just kill things. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
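Jason's thread-safety sequence can be sketched generically (Python purely for illustration; none of this is BOINC code): the controller requests a shutdown, the worker reaches a safe point and acknowledges, and only then does the shutdown actually happen, instead of the worker being killed mid-write.

```python
import threading
import time

class Worker(threading.Thread):
    """A worker that shuts down cooperatively instead of being killed."""
    def __init__(self):
        super().__init__()
        self.stop_requested = threading.Event()  # "request shutdown"
        self.stopped = threading.Event()         # "acknowledges shutdown"

    def run(self):
        while not self.stop_requested.is_set():
            time.sleep(0.01)  # stand-in for one safe slice of real work
        # Safe point reached: acknowledge before actually exiting.
        self.stopped.set()

def shutdown(worker, timeout=5.0):
    worker.stop_requested.set()           # 1. request shutdown
    acked = worker.stopped.wait(timeout)  # 2. worker acknowledges
    worker.join(timeout)                  # 3. shutdown happens
    return acked and not worker.is_alive()
```

The "just kill things" alternative is exactly what can leave state half-written, such as a finish file the client never got to acknowledge.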
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482
Just pulled it up, mine is set to 60 seconds. Too often?
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Just pulled it up, mine is set to 60 seconds. Too often? Go the rate you're willing to lose. I set mine to 1 hour (3600 seconds), though individual tasks will obviously write more often. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482
Jason, sorry, I don't understand the mechanism, could you please explain the upsides/downsides to more/less frequent? Thanks! *edit* I mean, I don't want to lose anything, obviously, but I also want it to run as efficiently as possible.
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Jason, sorry, I don't understand the mechanism, could you please explain the upsides/downsides to more/less frequent? Thanks! Sure Al. Let's say all tasks were 2+ hours long and you had a power failure. The last time your tasks saved information would be somewhere between when they started and the power failure; that might be every 60 seconds, or whatever you set. When the tasks resume, they can continue from the last saved point. If you set the interval too long and have failures, you potentially waste energy reprocessing work you already did. Too short, and you thrash your disks. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Just pulled it up, mine is set to 60 seconds. Too often? Considering how frequently you're running that rescheduler, 60 seconds is probably reasonable. Depending on the timing of each rescheduling run relative to the last checkpoint, each of your running tasks would lose between 0 and 59.999+ seconds of processing each time BOINC restarts so, on average, probably about 30 seconds per task. I suspect that task housekeeping, during both the BOINC shutdown and restart, probably adds some to that, though I don't know how much. The tradeoff, of course, is whatever overhead is involved with writing each checkpoint. I don't know what that amounts to, however.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.