Advice on system optimization needed.

Message boards : Number crunching : Advice on system optimization needed.

Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 2008642 - Posted: 21 Aug 2019, 16:31:47 UTC - in response to Message 2008580.  

I have been playing around with both methods of providing CPU support to GPUs today on the 7.16.1 client. All I can say is that if you have an Intel processor, either method works and everything runs fine. If, on the other hand, you have an AMD processor, you will still be cussing the brain-dead Linux AMD CPU thread scheduler and looking for compromises.

Neither way works the way it should. It would be best to set CPU usage to 100%, but you will end up with overcommitted CPU threads. And trying to use max_concurrent breaks things entirely. So the only option is to use the CPU % setting to reduce the number of CPU cores used. But the thread scheduler can't keep a task on the same thread and constantly moves it around, so you end up with both an overcommitted CPU and poor cpu_time/run_time tracking to boot. ...

That's rather odd and unexpected. Linux has long been known to be very good at CPU scheduling, especially for being sympathetic to both Intel and AMD (and others) and for being NUMA-aware.

Which kernel are you running and what are your symptoms?

Or in reality, are you actually hitting a memory bandwidth limit?
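A quick way to gather what Martin is asking about here, assuming a typical Linux host with lscpu available and, optionally, the numactl and sysstat packages installed:

    # Kernel version and distribution
    uname -r
    cat /etc/os-release

    # CPU topology as the scheduler sees it (sockets, cores, threads, NUMA nodes)
    lscpu

    # NUMA node layout and per-node memory (needs the numactl package)
    numactl --hardware

    # Watch per-core load while tasks run, to spot threads being bounced around
    # (mpstat comes from the sysstat package)
    mpstat -P ALL 2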


Happy fast crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 2008642
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2008663 - Posted: 21 Aug 2019, 19:23:35 UTC - in response to Message 2008583.  
Last modified: 21 Aug 2019, 19:25:49 UTC

I did find a single-slot GTX 1070 called a Katana. There are a couple on eBay but not cheap enough.


Considering additional cards. A single RTX 2080 Super or dual RTX 2070s? If I get a single 2080, there is a decent chance a second one would follow at some point. I can add two additional two-slot cards or four single-slot cards to my machine without using riser cards or anything like that. Plenty of power for either option. Cards have to be "Turbo" style, with the exhaust exiting out the back.

Eric

As far as I know there are no single-slot cards of any performance value to SETI. There are plenty of dual-slot blower-style cards. If you have the moolah now to purchase either choice, I would get the two RTX 2070 Supers. If you upgrade one in the future to an RTX 2080, that gives you a leg up on another cruncher build, with the card ready to go.

Those were an "urban" myth. They were never produced for distribution as far as I know. I would be very suspicious of those eBay listings. If they are actually physically available, then they would have to be the developmental products of which there were actually very few made. I would expect them to command very high prices because of their rarity.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2008663
Profile Eric Claussen

Joined: 31 Jan 00
Posts: 22
Credit: 2,319,283
RAC: 0
United States
Message 2008673 - Posted: 21 Aug 2019, 20:10:15 UTC - in response to Message 2008663.  

I did find a single-slot GTX 1070 called a Katana. There are a couple on eBay but not cheap enough.


Considering additional cards. A single RTX 2080 Super or dual RTX 2070s? ...
Eric

Those were an "urban" myth. They were never produced for distribution as far as I know. I would be very suspicious of those eBay listings. If they are actually physically available, then they would have to be the developmental products of which there were actually very few made. I would expect them to command very high prices because of their rarity.


Pictures are on eBay, but both are overseas (Australia and Europe). Too much for a 1070 anyway.

I see a 2080 Super with a blower fan for $719 new in box. Not seeing any good deals on 2070 Supers. I can get a refurbished 2070 for about $400.

I'm going to let the RAC stabilize before pulling the trigger on anything. It should be in the 150k range, or a bit higher if I can keep it running consistently.
ID: 2008673
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2008960 - Posted: 23 Aug 2019, 17:56:34 UTC - in response to Message 2008641.  

^^^^^^
See what I mean.... He's back.
All I did was state that no one else has reported Keith's problem, and remind him that my builds don't have those hacks if he wants to test with one of them.
I would suggest my build for 19.04; it has the Finish File Fix and seems to work with 18.04. Just remember to lock the coproc file before starting the Manager with the different version of BOINC. Or build your own 7.16.1 without the hacks, and lock the mentioned file.


Your hack doesn't allow the crunching of CPU work, so how could anyone else running it have that problem? Like I said, apples to oranges.

As far as I'm aware, you haven't made available any version of your own other than 7.14.2, which is still what is included in the AIO package on CA. If you have a new version of the client that somehow works with CPU work while implementing your XML file hack, I'd like to test it.
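For readers who haven't seen the "lock the coproc file" trick referred to in the quote above: it is normally done by making coproc_info.xml immutable so the client cannot rewrite it on startup. A minimal sketch, assuming a repository install with the data directory at /var/lib/boinc-client (the path and service name vary by distro):

    # Stop the client before touching its files
    sudo systemctl stop boinc-client

    # Make the detected-GPU list immutable so the client cannot regenerate it
    sudo chattr +i /var/lib/boinc-client/coproc_info.xml

    sudo systemctl start boinc-client

    # To undo the lock later
    sudo chattr -i /var/lib/boinc-client/coproc_info.xml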
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2008960
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2008968 - Posted: 23 Aug 2019, 18:37:58 UTC

I know everyone is free to share his code with anyone he wants, but just a warning:

Whether you run CPU tasks or not, releasing the "XML file hack" without any controls will hit the servers hard.

my $0.02
ID: 2008968
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2008973 - Posted: 23 Aug 2019, 19:08:27 UTC - in response to Message 2008641.  

^^^^^^
See what I mean.... He's back.
All I did was state that no one else has reported Keith's problem, and remind him that my builds don't have those hacks if he wants to test with one of them.
I would suggest my build for 19.04; it has the Finish File Fix and seems to work with 18.04. Just remember to lock the coproc file before starting the Manager with the different version of BOINC. Or build your own 7.16.1 without the hacks, and lock the mentioned file.

Richard Haselgrove reported the same problem I did. Did you ever go read the issue at GitHub? Have you read through the changes the bugfix introduced into the codebase in work_fetch and the tons of round-robin scheduling routines added?
Did you read Juha's comment on the merge?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2008973
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2009003 - Posted: 23 Aug 2019, 23:39:23 UTC - in response to Message 2008642.  
Last modified: 23 Aug 2019, 23:59:13 UTC

...All I can say is that if you have an Intel processor, either method works and everything runs fine. If, on the other hand, you have an AMD processor, you will still be cussing the brain-dead Linux AMD CPU thread scheduler and looking for compromises. ...
So you're saying this only happens with AMD CPUs? And with just 7.16.1? Sounds like a BOINC problem to me; probably best to stay with 7.14.2 for a while longer.

I did say tests with CPUs would need to be done. I'm not having any trouble at all with my GPU-only machines, and... I don't use AMD CPUs anyway. Got a link to those problems?
ID: 2009003
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009010 - Posted: 24 Aug 2019, 0:37:22 UTC - in response to Message 2009003.  

No, I am saying that this happens with BOINC 7.15.0 or later with my bugfix merged. The thread scheduler plays nicely with affinity on Intel CPUs, or at least on the one example that I run. Richard only runs Intel processors and had problems with max_concurrent also, so it's not an AMD thing. The problem with AMD processors of FX or later is that AMD walks the load around the cores in microcode, and that plays havoc with the Linux thread scheduler. There has been a lot of news on that front on the tech sites; it has to do with how the IOMMU nodes are mapped. The Windows thread scheduler was doing especially badly when the Threadripper CPUs arrived. Eventually the thread scheduler will be smart enough to handle AMD processors too. It's just that they are new to the party, and the thread scheduler hasn't needed any major updates for Intel processors in a long time since the status quo worked well enough.

Eventually the upstream update to the stock Completely Fair Scheduler that has fixes for AMD processors will make it into the mainstream distros.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009010
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009011 - Posted: 24 Aug 2019, 0:41:49 UTC

Since it is not apparent that you have reviewed my earlier comment about the bug, I will post Juha's comment about the merge. In case you don't know, Juha is our resident BOINC code wrangler.

Even though I merged this, I don't consider the job done yet. As previously discussed, in my opinion the work fetch code now needs improving.

If work fetch is blocked as long as there are max_concurrent or more tasks in the buffer, then before work fetch resumes there has to be an idle device.
Those who want to use max_concurrent will now have to choose between using max_concurrent and having work buffered. My gut feeling is that people will choose the work buffer, and that means they won't be able to upgrade BOINC beyond 7.14 as long as work fetch works the way it does now.

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009011
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2009017 - Posted: 24 Aug 2019, 1:22:45 UTC - in response to Message 2009011.  

Now you are saying it has to do with the max_concurrent change? Did it happen before then? If not, then it is surely a BOINC problem with the way you want to use max_concurrent, not an AMD problem. I'm pretty sure Intel also moves loads around to equalize temps; it's not just an AMD thing. The funny thing is, I'm using project_max_concurrent on all my machines running GPU tasks and not having any trouble. Have you tried running the standard, non-hacked BOINC 7.16.1 to see if you have the same problem? I'm not using a hacked client on any of my machines. See what you get with the standard BOINC build; that's sort of what I posted before: try a standard build even if it means you can't download 'extra tasks'.
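For reference, project_max_concurrent and the per-application max_concurrent being argued about here live in an app_config.xml in the project directory. A minimal sketch, assuming the usual Linux data directory and the setiathome_v8 application name; the numbers are placeholders, not recommendations:

    # Write an app_config.xml into the SETI@home project directory
    cat <<'EOF' | sudo tee /var/lib/boinc-client/projects/setiathome.berkeley.edu/app_config.xml
    <app_config>
      <project_max_concurrent>6</project_max_concurrent>
      <app>
        <name>setiathome_v8</name>
        <max_concurrent>4</max_concurrent>
      </app>
    </app_config>
    EOF

    # Tell the running client to re-read cc_config.xml and app_config.xml
    # (run from the BOINC data directory, or pass --host/--passwd as needed)
    boinccmd --read_cc_config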
ID: 2009017
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009022 - Posted: 24 Aug 2019, 2:06:52 UTC - in response to Message 2009017.  

I tested on the non-hacked 7.15.0 build for the bugfix. Look at my BOINC simulation: it fails with a max_concurrent statement. I assume everything would be the same with the non-hacked 7.16.1 build, since no changes have been made to the work_fetch.cpp module since the 7.15.0 build; the module is the same in 7.16.1.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009022
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009023 - Posted: 24 Aug 2019, 2:10:08 UTC - in response to Message 2009022.  

As I replied earlier, the definition and use of max_concurrent have changed since my bugfix. The definition now hews to Dr. Anderson's original intent for any release beyond 7.14.2. We have not been able to change his mind on the typical, historical use of max_concurrent. The other developers are not happy with the current state, but it is unlikely to change back to what it was before my fix.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009023
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2009026 - Posted: 24 Aug 2019, 2:26:35 UTC - in response to Message 2009023.  

What I'm trying to determine is: do you have any trouble when using BOINC 7.16.1 the way the developer intends? That means no hacked client, and not using max_concurrent.
How does that work?
ID: 2009026
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009029 - Posted: 24 Aug 2019, 3:00:44 UTC - in response to Message 2009026.  

What I'm trying to determine is: do you have any trouble when using BOINC 7.16.1 the way the developer intends? That means no hacked client, and not using max_concurrent.
How does that work?

I don't know. I haven't tried to use the 7.16.1 client without the hack. As I stated, nothing has changed in the code of the 7.16.1 client compared to the unhacked 7.15.0 client, so I would expect the same results. I can see you will only accept an answer involving a test with an unhacked client; I'm not willing to do that midstream in the contest. Why don't you run your own simulation with the 7.16.1 client, with your kludge on coproc_info.xml and with your max_concurrent statement, in the BOINC Client Emulator? See whether you drop work, have unused devices, and don't request work.
https://boinc.berkeley.edu/sim_web.php
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009029
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2009033 - Posted: 24 Aug 2019, 3:24:22 UTC - in response to Message 2009029.  
Last modified: 24 Aug 2019, 3:28:07 UTC

I'm not having any BOINC problems, I don't use AMD CPUs, which you claim are the devices having problems, and I still haven't heard of anyone else having your problems. You need to be the one doing the testing. You would think that if others were having your problem, they would have chimed in by now. Anyone else using stock BOINC with AMD CPUs having this problem? If you want to wait a few days, fine. BTW, the hacked client produces the same kludge you speak of; I just offered it in hopes you might be more inclined to do the testing if you could download 'extra tasks' without using a hacked client. Otherwise I wouldn't have mentioned it. It's been around for years, and as you know, I've never mentioned it before.
ID: 2009033
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009041 - Posted: 24 Aug 2019, 6:10:30 UTC - in response to Message 2009033.  

Other than doing it just for the sake of knowing whether the unhacked client has the problem with max_concurrent, I am not interested, since we already know that CPU tasks won't run with your method, and I will always run CPU tasks. So let's drop the issue. You have your method and I have my method. Both work. No point in debating the matter further.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009041
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2009046 - Posted: 24 Aug 2019, 6:44:59 UTC - in response to Message 2009041.  
Last modified: 24 Aug 2019, 7:09:56 UTC

You're missing the point entirely. The point is that if there is some problem with the stock 7.16.1 and AMD CPUs, I'd like to know about it before placing 7.16.1 in the BOINC-All-In-One. It has absolutely NOTHING to do with something you keep calling 'mine'. I just noticed that the hacked client does the same thing as a method that has been known about for years around here, and since they are about the same, I decided to mention it.
I'm not sure what you think the point is; to me, it's knowing whether there is some problem with the stock 7.16.1. Is there a problem, or is it just a problem with the 'new' max_concurrent? Your first post made it sound as though there was something to be concerned about:
Neither way works the way it should. It would be best to set CPU usage to 100%, but you will end up with overcommitted CPU threads. And trying to use max_concurrent breaks things entirely. So the only option is to use the CPU % setting to reduce the number of CPU cores used. But the thread scheduler can't keep a task on the same thread and constantly moves it around, so you end up with both an overcommitted CPU and poor cpu_time/run_time tracking to boot.
Is this different from 7.14.2?
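For context, the "cpu%" setting in the quoted text is the "Use at most N% of the CPUs" computing preference. Besides the Manager or the website, it can be set locally via global_prefs_override.xml; a minimal sketch, again assuming the usual Linux data directory, with 75% as a placeholder value:

    cat <<'EOF' | sudo tee /var/lib/boinc-client/global_prefs_override.xml
    <global_preferences>
      <max_ncpus_pct>75</max_ncpus_pct>
    </global_preferences>
    EOF

    # Ask the running client to pick up the local override
    boinccmd --read_global_prefs_override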
ID: 2009046
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2009052 - Posted: 24 Aug 2019, 7:14:43 UTC

But 7.16.1 is not the current "official" release; it is a release candidate. So why not go back to 7.14.x? Then you will have a known baseline rather than something that hasn't yet "suffered the tests of time".
(And don't forget that the 7.15.x branch was/is a development branch and thus may well have bugs and features all of its own.)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2009052
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009055 - Posted: 24 Aug 2019, 7:45:33 UTC - in response to Message 2009046.  

Is this different from 7.14.2?

I have four AMD systems and one Intel system. All of them run my gpupriority script, which sets affinity and nice level. They all worked on 7.8.3; then the script no longer worked correctly on 7.14.2.
The exception is one system, which happens to be the oldest Linux host and started out on 16.04. I can still run the script on that host and affinity works on it, even on 7.16.1. The script has always worked on the Intel system, no surprises, and it too started out as a 16.04 host. But the later AMD hosts just won't run the script correctly, and I have never figured out why: identical memory, CPUs and motherboards. If I try to lock affinity on the problem hosts, the CPU usage just plummets to 30% instead of the 60-65% it should be running at. Comment out the affinity lock and the CPU usage goes back to normal. Throwing max_concurrent statements into the mix on various client versions adds another wrinkle to the problem.
I just wish I could get the problem hosts back to running with affinity; the difference in performance is great. The one system with correctly working affinity processes more CPU tasks per day than the other hosts, even with a CPU clock deficit. Keeping a CPU task on the same core, instead of letting it wander all over, shortens the crunching time. It shows up in the host APR and the average time to crunch a CPU task, and BT's daily and weekly tallies show it clearly too. I just don't like to leave performance on the table, and I would like all of the clone systems to run exactly the same way. They don't now, and I just want to know why.
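The gpupriority script itself isn't posted in the thread, so as a rough illustration only: the general idea of pinning each GPU support process to a fixed core and raising its priority can be done with taskset and renice. A hypothetical sketch, assuming the GPU app's binary name contains "setiathome_x41p" and treating the core numbers as placeholders:

    #!/bin/bash
    # Hypothetical sketch, not Keith's actual gpupriority script.
    # Pin each running GPU app to its own core and raise its priority.
    CORES=(0 1 2)                           # one core per GPU task; placeholders
    i=0
    for pid in $(pgrep -f setiathome_x41p); do
        taskset -cp "${CORES[$i]}" "$pid"   # lock CPU affinity to one core
        renice -n -10 -p "$pid"             # negative nice values need root
        i=$(( (i + 1) % ${#CORES[@]} ))
    done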
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009055
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2009076 - Posted: 24 Aug 2019, 12:57:10 UTC - in response to Message 2009055.  

In your first post it sounded as though you were having problems with an over-committed CPU on a "normal" system. Now it sounds as though you only have the problem when using CPU affinity. Is that the case? I'm just concerned about the average user, who usually has never even heard of affinity. If it works normally, with a correctable CPU load, when not using affinity, then all is well. It does sound strange that it works differently on different installs, but most people probably won't notice that problem.
ID: 2009076