OpenCL NV MultiBeam v8 SoG edition for Windows

Harri Liljeroos
Joined: 29 May 99
Posts: 3991
Credit: 85,281,665
RAC: 126
Finland
Message 1794507 - Posted: 8 Jun 2016, 18:00:27 UTC - in response to Message 1794471.  

If "1" is never shown, what about the app's device capabilities listing? Does it sometimes show the other GPU selected?


Yes, sometimes it has used device 0 and shown it correctly on both lines of stderr.

Unfortunately I had to revert to the CUDA applications: too many driver and computer crashes while running SoG. I may try again after tuning becomes easier. Maybe these cards (GTX970 and GTX650 Ti) are too different to run SoG smoothly on the same computer.
ID: 1794507
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794540 - Posted: 8 Jun 2016, 20:38:52 UTC - in response to Message 1794507.  
Last modified: 8 Jun 2016, 20:39:59 UTC

Maybe these cards (GTX970 and GTX650 Ti) are too different to run SoG smoothly on the same computer.

There is a special treatment available that was developed specifically for the case of very different GPUs from the same vendor in a single host.
Look in the ReadMe for:


For device-specific settings in multi-GPU systems it's possible to override some of the command-line options via an application config file.

Name of this config file:
MultiBeam_<vendor>_config.xml where vendor can be ATi, NV or iGPU.
File structure:
<deviceN>
  <period_iterations_num>N</period_iterations_num>
  <spike_fft_thresh>N</spike_fft_thresh>
  <sbs>N</sbs>
  <oclfft_plan>
    <size>N</size>
    <global_radix>N</global_radix>
    <local_radix>N</local_radix>
    <workgroup_size>N</workgroup_size>
    <max_local_size>N</max_local_size>
    <localmem_banks>N</localmem_banks>
    <localmem_coalesce_width>N</localmem_coalesce_width>
  </oclfft_plan>
  <no_caching>
</deviceN>
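
As a rough illustration of that structure, a MultiBeam_NV_config.xml for a host with two dissimilar NVIDIA cards might look like the sketch below. It assumes that N in <deviceN> is the device index (0, 1, ...) and that sections you don't want to override (here <oclfft_plan> and <no_caching>) can simply be left out. The numeric values are placeholders, not tuning recommendations, and which card ends up as device 0 is also an assumption. The comment line is only for the reader; drop it if the app does not accept XML comments.

```xml
<!-- Hypothetical example: device numbering and every value here are assumptions -->
<device0>
  <period_iterations_num>20</period_iterations_num>
  <sbs>256</sbs>
</device0>
<device1>
  <period_iterations_num>80</period_iterations_num>
  <sbs>128</sbs>
</device1>
```

Check each option name and its allowed range against the ReadMe before using anything like this.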
ID: 1794540
Harri Liljeroos
Joined: 29 May 99
Posts: 3991
Credit: 85,281,665
RAC: 126
Finland
Message 1794635 - Posted: 9 Jun 2016, 5:29:05 UTC - in response to Message 1794540.  

Thank you for the information. I'll keep it in mind for the next time. For now I don't have time to experiment more.
ID: 1794635
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794862 - Posted: 9 Jun 2016, 23:56:41 UTC

. . Hi Raistmer,

. . This is probably nothing; it is the only error so far in probably over 500 WUs, but here it is.

http://setiathome.berkeley.edu/result.php?resultid=4974924416

. . . Very little information in the output but I thought it might be better to add it to the database :)
ID: 1794862
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794874 - Posted: 10 Jun 2016, 0:19:15 UTC - in response to Message 1794862.  

ID: 1794874
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794970 - Posted: 10 Jun 2016, 6:58:51 UTC

0xC0000018
STATUS_CONFLICTING_ADDRESSES
{Conflicting Address Range} The specified address range conflicts with the address space.
ID: 1794970
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794981 - Posted: 10 Jun 2016, 7:28:14 UTC - in response to Message 1794970.  
Last modified: 10 Jun 2016, 7:28:28 UTC

0xC0000018
STATUS_CONFLICTING_ADDRESSES
{Conflicting Address Range} The specified address range conflicts with the address space.


If it is not repeatable, then it is probably a genuine bitflip (e.g. from cosmic rays or radioactive carbon in the processor/RAM). Workstation-grade components with ECC memory reduce the probability of that. We've been referring to that as "Eddys in the spacetime continuum", "Eddie? Who's Eddie?", "No, not who's Eddie, what's Eddie?", "What? What's Eddie doing in the spacetime continuum?"
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794981
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1794984 - Posted: 10 Jun 2016, 7:35:27 UTC

Just got two bad work units, apparently with missing header information.
4294967290 (0xfffffffa) Unknown exit code

These work units:
Task 4975612198
Task 4975612087
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1794984
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794985 - Posted: 10 Jun 2016, 7:39:15 UTC - in response to Message 1794984.  

Just got two bad work units, apparently with missing header information.
4294967290 (0xfffffffa) Unknown exit code

These work units:
Task 4975612198
Task 4975612087

Both wingmates completed successfully, which suggests that the raw datafile had headers intact.
ID: 1794985
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1794986 - Posted: 10 Jun 2016, 7:42:56 UTC - in response to Message 1794985.  
Last modified: 10 Jun 2016, 7:43:28 UTC

So does that mean that computer mangled the work units just when it grabbed them for processing?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1794986
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794987 - Posted: 10 Jun 2016, 7:45:49 UTC - in response to Message 1794986.  

So does that mean that computer mangled the work units just when it grabbed them for processing?


Many possible layers between the server and client CPU, from download through reading from disk.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794987
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1794989 - Posted: 10 Jun 2016, 7:52:57 UTC - in response to Message 1794987.  

So does that mean that computer mangled the work units just when it grabbed them for processing?


Many possible layers between the server and client CPU, from download through reading from disk.

There's an MD5 check after download, isn't there? And/or a size check?
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1794989
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794990 - Posted: 10 Jun 2016, 7:53:26 UTC - in response to Message 1794987.  

So does that mean that computer mangled the work units just when it grabbed them for processing?

Many possible layers between the server and client CPU, from download through reading from disk.

And we have seen that error message before, in other applications including CUDA, with no conclusive evidence that the data file has suffered any corruption at all.

It seemed (IIRC) to be more prevalent on task restarts than initial runs. I think that the code generating that error message dates from the original Berkeley CPU code: checking that for trigger points might give us a better handle on what's really happening under the hood.
ID: 1794990
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794991 - Posted: 10 Jun 2016, 7:57:36 UTC - in response to Message 1794990.  
Last modified: 10 Jun 2016, 7:58:35 UTC

So does that mean that computer mangled the work units just when it grabbed them for processing?

Many possible layers between the server and client CPU, from download through reading from disk.

And we have seen that error message before, in other applications including CUDA, with no conclusive evidence that the data file has suffered any corruption at all.

It seemed (IIRC) to be more prevalent on task restarts than initial runs. I think that the code generating that error message dates from the original Berkeley CPU code: checking that for trigger points might give us a better handle on what's really happening under the hood.


Anything more frequent than about once every 3 months on a given host would indicate a configuration, system, or indeed client or application issue. Anything less frequent than that on sub-workstation-grade components suggests noise (radiation).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794991
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1794992 - Posted: 10 Jun 2016, 7:58:08 UTC - in response to Message 1794987.  

I actually think it was the BOINC shutdown that froze on exit and then blue-screened the computer that did it. The strange thing is that I always wait for a quiescent period in BOINC activity before I initiate a shutdown. That means no work units are close to finishing, all recently completed work units have successfully uploaded and BOINC is not close to asking for network communication. Only when all those conditions are met do I shut down the Manager and close the client. I can only conclude that BOINC was reading those tasks when the computer blue-screened.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1794992
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794993 - Posted: 10 Jun 2016, 7:59:30 UTC - in response to Message 1794992.  
Last modified: 10 Jun 2016, 8:00:22 UTC

I actually think it was the BOINC shutdown that froze on exit and then blue-screened the computer that did it. The strange thing is that I always wait for a quiescent period in BOINC activity before I initiate a shutdown. That means no work units are close to finishing, all recently completed work units have successfully uploaded and BOINC is not close to asking for network communication. Only when all those conditions are met do I shut down the Manager and close the client. I can only conclude that BOINC was reading those tasks when the computer blue-screened.



I'd class that as possibly reproducible (rather than Eddy/Eddie). Can you try that? (It could take substantial hammering :) )
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794993
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794994 - Posted: 10 Jun 2016, 8:08:20 UTC - in response to Message 1794993.  
Last modified: 10 Jun 2016, 8:08:52 UTC

At least the error message gives us a file name and line number:

!swi.data_type || !found || !swi.nsamples
File: ..\seti_header.cpp
Line: 216

  //  Allow old style headers to be parsed correctly.
  // jeffc - need this?
  //swi.fft_len=2048;
  //swi.ifft_len=8;

  do {
    fgets(buf, 256, f);
  } while (!feof(f) && !xml_match_tag(buf,"<workunit_header")) ;

Looks like we've dropped through to some legacy code - perhaps we should have branched to a more modern path higher up?
ID: 1794994
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794995 - Posted: 10 Jun 2016, 8:12:30 UTC

Actually, the error message is on line 232 of the current file. Are we using an outdated version of seti_header.cpp?
ID: 1794995
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1794996 - Posted: 10 Jun 2016, 8:13:14 UTC - in response to Message 1794993.  
Last modified: 10 Jun 2016, 8:16:44 UTC

I believe that type of failure and exit status is the first I've experienced. I am running the latest beta BOINC Manager 7.6.29(x64) which I believe has had some code changed recently to fix Manager exits compared to the last stable release 7.6.22(x64). Richard probably could say just what the code jockeys played with in the latest beta.

[Edit] Looks like my copy of the beta is not the latest now. We're up to 7.6.33(x64)
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1794996
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794997 - Posted: 10 Jun 2016, 8:20:22 UTC - in response to Message 1794996.  

I believe that type of failure and exit status is the first I've experienced. I am running the latest beta BOINC Manager 7.6.29(x64) which I believe has had some code changed recently to fix Manager exits compared to the last stable release 7.6.22(x64). Richard probably could say just what the code jockeys played with in the latest beta.

[Edit] Looks like my copy of the beta is not the latest now. We're up to 7.6.33(x64)


Yeah, not a 'normal' situation IMO. Would need to be reproducible on demand to localise better.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794997
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.