Discussion of Invalid Host Messaging


TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1324689 - Posted: 4 Jan 2013, 20:28:48 UTC - in response to Message 1324587.

I thought the server was supposed to throttle a client with too many errors, is that not true?

It is true, but effective only when nothing but errors are returned for an extended time. There are delays caused by pending tasks from before the problem, and each error only reduces quota by one. For "SETI@home Enhanced 6.10 windows_intelx86 (cuda_fermi)", Thndr's host is down to "Max tasks per day 2" but for GPUs that is multiplied by a project setting which is probably still 8 here. So for that application version it's down to 16 per day, and that's enough less than the limit of 100 in progress to minimize the damage.
Joe
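The arithmetic in Joe's reply can be sketched like this. It only illustrates the numbers quoted above and is not the scheduler's actual code: the function name is made up, and the GPU multiplier of 8 is Joe's guess ("probably still 8").

```python
def effective_daily_quota(max_tasks_per_day: int, gpu_multiplier: int = 8) -> int:
    """Daily task allowance for one GPU application version (illustrative only).

    gpu_multiplier is the project setting Joe believes is "probably still 8".
    """
    return max_tasks_per_day * gpu_multiplier


# Per-host "in progress" limit mentioned in the reply.
IN_PROGRESS_LIMIT = 100

# Thndr's host is down to "Max tasks per day 2" for the cuda_fermi app version.
quota = effective_daily_quota(2)
print(quota, quota < IN_PROGRESS_LIMIT)
```

With a quota of 2 the host can still fetch 16 tasks per day for that application version, which is why the throttling only limits the damage rather than stopping it.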


It still seems odd that there are users with a thousand inconclusives.

As an example, I've got an inconclusive with this user, who has no tasks in progress, but the application details show he can take 100 WU? http://setiathome.berkeley.edu/show_host_detail.php?hostid=6726942

I don't mean to imply any ill will towards anyone working on sorting out their gpu problem. Maybe the project should throttle the GPUs faster and automatically notify you if you get a lot of errors.



11/14/2012 5:28:45 PM | | suspend work if non-BOINC CPU load exceeds 25 %.



This setting can produce errors, because it stops and starts the app.
Better to set it to 0 (zero) and use the CPU and GPU setting "run when computer is idle", IMHO.

The same applies to the default settings of the AstroPulse stock app.
DATA_CHUNK_UNROLL at default:2 DATA_CHUNK_UNROLL at default:2


Depending on the GPU used, these values should be adjusted accordingly.
For instance, start with UNROLL=8; UNROLL=16 is usually the limit.
Info: BOINC provided device ID used
Used GPU device parameters are:
Number of compute units: 4
Single buffer allocation size: 256MB
max WG size: 1024
FERMI path used: yes

UNROLL=4 should work better, IMHO.
(If it's still possible to change these values?!).

I'm still receiving 0xc0000005 errors with my AMD 6850 with an unroll of 12. I seem to be getting more recently. I'm also receiving that error with my ATI 4670. I believe this could be a memory error, maybe associated with BOINC, as I recently received the same 0xc0000005 error with my old Dell not using AstroPulse. At the time the Dell had that error I had 'rearranged' the memory and it wasn't seated properly. The error seems to be happening with my 6850 after the task is finished and the last line of the report is being written. Working on the assumption that the 0xc0000005 error may be caused by BOINC, I recently changed my BOINC version to 7.0.38 after seeing the first change line referring to memory. Unfortunately, I immediately fell victim to the FLOP 'feature' of 7.0.38 and suffered a handful of errors before I added the correct FLOP entries in my App_info file. Anyway, I'm now running 7.0.38 and testing to see if I still receive the 0xc0000005 error with AstroPulse. All I need now is a large number of AstroPulse tasks to test on my 6850...

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1325538 - Posted: 7 Jan 2013, 17:19:44 UTC - in response to Message 1324689.

So, not a single AstroPulse "Computation error" since updating to BOINC 7.0.38. Interesting. I had been receiving around one, sometimes two, a day with 7.0.28. I see the AstroPulse error was vaporized, but I did receive 4 others with the nVidia card before adding the FLOP entry. You can see when I updated to 7.0.38 here: Task 2780311703. It's become somewhat of a cliffhanger now: will it pass or throw the Error? The suspense is growing with every passing hour: All AstroPulse v6 tasks for computer 6797524.
I suppose I could just change the Unroll to 2 and see what happens, but, that would remove the drama...

This could save a large amount of Computer Time if updating BOINC solved the problem :-)

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1325721 - Posted: 8 Jan 2013, 7:04:50 UTC - in response to Message 1325538.

*** I suppose I could just change the Unroll to 2 and see what happens ***

I've completed 6 APs in a row with the Unroll setting at 2. If I would have done that back here ^Something isn't right^ around 5 of those 6 would have ended in a "Computation error".
Here are the first 3 of the 6;
Task 2784334429
Task 2784338147
Task 2784338745

I'm ready to declare Victory.

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3232
Credit: 31,585,971
RAC: 33
Netherlands
Message 1326061 - Posted: 9 Jan 2013, 14:16:13 UTC - in response to Message 1325721.
Last modified: 9 Jan 2013, 14:23:18 UTC

*** I suppose I could just change the Unroll to 2 and see what happens ***

I've completed 6 APs in a row with the Unroll setting at 2. If I would have done that back here ^Something isn't right^ around 5 of those 6 would have ended in a "Computation error".
Here are the first 3 of the 6;
Task 2784334429
Task 2784338147
Task 2784338745

I'm ready to declare Victory.


You could adjust the Fetch_Block and Thread_Block values. I see that you use a 1:3 ratio, with
Fetch_Block 2048 and Thread_Block 6144. Have you tried other values, for instance Thread_Block 10240 with Fetch_Block 5120, or 8192 with 4096 (a 1:2 ratio)?
Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL set to:2
FFA thread block override value:6144
FFA thread fetchblock override value:2048


With UNROLL=2, Thread_Block and Fetch_Block can be bigger!
Which one is the most effective also depends on which GPU you're using.
I see that you use a 9800GT and a BARTS GPU: quite different architectures, and
thus Regsize.
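For anyone comparing settings between hosts, the values can be read straight out of the stderr text quoted above. The following is a rough sketch, not part of the AstroPulse app: the helper names are made up, and the patterns assume the log wording stays exactly as shown in the quoted output.

```python
import re
from math import gcd

# Stderr text as quoted in this post (wording assumed to be stable).
STDERR = (
    "Running on device number: 0 "
    "DATA_CHUNK_UNROLL at default:2 "
    "DATA_CHUNK_UNROLL set to:2 "
    "FFA thread block override value:6144 "
    "FFA thread fetchblock override value:2048"
)

# Hypothetical names mapped to the log phrases they are read from.
PATTERNS = {
    "unroll": r"DATA_CHUNK_UNROLL set to:(\d+)",
    "thread_block": r"FFA thread block override value:(\d+)",
    "fetch_block": r"FFA thread fetchblock override value:(\d+)",
}


def parse_settings(text):
    """Return the settings found in the stderr text as integers."""
    settings = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            settings[name] = int(match.group(1))
    return settings


def fetch_to_thread_ratio(fetch_block, thread_block):
    """Reduce Fetch_Block:Thread_Block to lowest terms, e.g. 2048:6144 gives (1, 3)."""
    divisor = gcd(fetch_block, thread_block)
    return fetch_block // divisor, thread_block // divisor


settings = parse_settings(STDERR)
print(settings)
print(fetch_to_thread_ratio(settings["fetch_block"], settings["thread_block"]))
```

The same ratio helper gives (1, 2) for both alternatives suggested above, 5120 with 10240 and 4096 with 8192.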
____________


Knight Who Says Ni N!, OUT numbered.................

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1326114 - Posted: 9 Jan 2013, 17:37:28 UTC - in response to Message 1326061.

You could adjust the Fetch_Block and Thread_Block values. I see that you use a 1:3 ratio, with
Fetch_Block 2048 and Thread_Block 6144. Have you tried other values, for instance Thread_Block 10240 with Fetch_Block 5120, or 8192 with 4096 (a 1:2 ratio)?
Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL set to:2
FFA thread block override value:6144
FFA thread fetchblock override value:2048


With UNROLL=2, Thread_Block and Fetch_Block can be bigger!
Which one is the most effective also depends on which GPU you're using.
I see that you use a 9800GT and a BARTS GPU: quite different architectures, and
thus Regsize.

Look at my results from last night, you will see different -ffa_block & -ffa_block_fetch numbers, All AstroPulse v6 tasks for computer 6797524. Since declaring Victory I have been attempting to up the average GPU usage back to around 90%. BOINC 7.0.38 seems to have lowered the average to around 80%. From my experience the -ffa_block & -ffa_block_fetch numbers don't make that much of a difference above the 6144 & 1536 setting I was using for a long time. The Unroll numbers do make a noticeable difference. From what I've read, the optimum Unroll number is equal to your Compute Units, which for the 6850 is 12. I only work Multibeam on the 8800, and try to keep the 6850 on AstroPulses. They use different Apps, different settings. There are a large number of people receiving the "Access Violation (0xc0000005) at address 0x0040xxxx read attempt to address 0x04A3xxxx" Error, all you have to do is look around and you will find plenty. I find them by just looking at the results from my AstroPulse Wingmen.

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23672
Credit: 32,377,354
RAC: 24,412
Germany
Message 1326116 - Posted: 9 Jan 2013, 17:49:36 UTC

GPU usage has nothing to do with the BOINC version.
Blanking is responsible for the GPU usage, because its calculation is done by the CPU.

____________

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1326125 - Posted: 9 Jan 2013, 18:20:51 UTC - in response to Message 1326116.

GPU usage has nothing to do with the BOINC version.
Blanking is responsible for the GPU usage, because its calculation is done by the CPU.

Sorry Mike. My experience has shown that different BOINC versions do make a difference. I'm well aware of the Blanking slowing down GPU usage as I've watched the process for quite a while now. I just watched one crawl by at around 40% GPU usage.

BTW, remember the problem with running 2 MBs with BOINC 7.0.36? It was solved by going back to BOINC 7.0.28. Well, the problem is back with 7.0.38. Not only that, NOW I'm having problems with running 2 APs with 7.0.38. It's fine until I hit 2 of those Blanked APs at the same time, then I receive a hang. But....But.... I didn't have that problem with APs with BOINC 7.0.28 OR 7.0.36. BOINC versions do make a difference on my MacPro running Win XPsp3 with an AMD 6850. How many of those do you have around here? A MacPro, running Win XPsp3, with a 6850?

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23672
Credit: 32,377,354
RAC: 24,412
Germany
Message 1326128 - Posted: 9 Jan 2013, 18:27:58 UTC

Do you keep a CPU core free?

It's more the mixed setup, I would say.
Running 2 different GPUs always uses a lot of resources.
I'm fully aware of the 6850; it has issues running multiple instances.

____________

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1326134 - Posted: 9 Jan 2013, 19:00:06 UTC - in response to Message 1326128.

A while back, I went to using a CPU setting of 60% for Multiprocessors. That gave me about the same as I had using a different setting, 2 CPUs for 603 Tasks and 2 CPUs for the GPUs. I've never had a problem with having 2 CPUs free for the 2 GPUs even with running Multiple Instances on 1 GPU. I'm running the wonderful System Information Viewer continuously, and it has a CPU Process graph for each Process. When the Hang occurs, the CPU process for one AP maxes out in the Red. My guess is BOINC 7.0.38 wants me to sacrifice another CPU for the GPUs. Again, I didn't have that problem in 7.0.28 or 7.0.36. I'm really not interested in running 2 APs at the same time, I was just testing to see if I got the (0xc0000005) error. I didn't receive the (0xc0000005) error, I got something else. I'm happy with running 1 AP at a time, at around 90% GPU usage. If you look back at my results when using 7.0.28, you will notice I was getting some fast times. Those times were with the GPU running around 90% with most tasks. Since updating to 7.0.38, the most I've seen is around 80% usage on the fastest APs I've run so far. I should have run across quite a few at 90%, I haven't....So Far.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1326253 - Posted: 10 Jan 2013, 1:41:08 UTC - in response to Message 1326134.
Last modified: 10 Jan 2013, 2:39:55 UTC

BTW, I still haven't received the dreaded "Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x00399F64" Error on my MacPro since updating to 7.0.38. Later tonight I will be updating the machine with the ATI 4670 to 7.0.38. I'm not sure what's going on with that machine, it's had a terrible week. It usually doesn't give any problems. Now it's given the "Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x00399F64" Error, Followed by an Invalid, and then more questionable results. We'll see what happens after the update.

When it rains....
Check this out. I just looked at my last completed AP and look at what my Wingman got;
Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x0052901C
You can't make this stuff up....

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1327159 - Posted: 12 Jan 2013, 19:25:44 UTC

After 9 days, another Error. Well, it's better than it used to be.

Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x003AA2DC

When the other results are in, I'll bet they validate the results that are currently labeled "Invalid". Something is causing that Error after the results are written...
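A quick way to compare these errors across hosts is to pull the addresses out of the message and count how often each faulting address turns up. This is only a sketch: the regular expression assumes the wording stays exactly as in the three messages quoted in this thread, the function name is made up, and masked addresses such as 0x0040xxxx will not match.

```python
import re
from collections import Counter

# Matches the error wording as it appears in the messages quoted in this thread.
ERROR_RE = re.compile(
    r"Access Violation \((0x[0-9A-Fa-f]+)\) at address (0x[0-9A-Fa-f]+) "
    r"read attempt to address (0x[0-9A-Fa-f]+)"
)

# The three complete error lines posted so far.
LINES = [
    "Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x00399F64",
    "Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x0052901C",
    "Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x003AA2DC",
]


def parse_violation(line):
    """Split one error line into exception code, faulting address and target address."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code, fault_address, target_address = match.groups()
    return {
        "code": code,
        "fault_address": fault_address,
        "target_address": target_address,
    }


parsed = [p for p in map(parse_violation, LINES) if p is not None]
fault_counts = Counter(p["fault_address"] for p in parsed)
print(fault_counts)
```

All three reports share the faulting address 0x0040A1FA while the address being read differs each time, which fits the idea that the same piece of code fails on different data.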

trader
Volunteer tester
Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1350299 - Posted: 24 Mar 2013, 22:52:13 UTC - in response to Message 1327159.

If you are getting errors using the stock app and are running Windows x64, PM me and I might be able to help.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8439
Credit: 47,998,952
RAC: 61,682
United Kingdom
Message 1350322 - Posted: 24 Mar 2013, 23:53:07 UTC - in response to Message 1350299.

i might be able to help

With a 10-week old problem?

Do share.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1350361 - Posted: 25 Mar 2013, 2:22:02 UTC
Last modified: 25 Mar 2013, 2:31:18 UTC

Maybe someone could send a PM to Cody Sharp and explain to him how to add the command line entry to his ap_cmdline_6.04_windows_intelx86__opencl_ati.txt file. He just showed up at the top of one of my Workgroups again. I guess his file might have a different name since he's running Win 7 64bit. If he just added one of these lines to that file he could probably avoid most of those Errors. I keep thinking about it every time I see one of his Errors, but, there are so many of those affected Hosts...

Maybe SETI could make a post on the News page about the problem and explain how to add one of the lines;
Command Line Parameters,
High end cards (more than 12 compute units)
-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -hp

Mid range cards (less than 12 compute units)
-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp

Entry level GPU (less than 6 compute units)
-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -hp

It might save some bandwidth...and give people sending PMs a link to reference.
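The three tiers above could be turned into a small lookup, sketched here. The function name is hypothetical, and the list does not say what to do with exactly 12 compute units, so treating 12 as mid range is an assumption on my part.

```python
# Parameter lines copied from the list above.
SUGGESTED = {
    "high": "-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -hp",
    "mid": "-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp",
    "entry": "-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -hp",
}


def suggested_cmdline(compute_units: int) -> str:
    """Pick one of the suggested lines from the number of GPU compute units."""
    if compute_units < 1:
        raise ValueError("compute_units must be a positive integer")
    if compute_units > 12:
        return SUGGESTED["high"]   # high end: more than 12 compute units
    if compute_units < 6:
        return SUGGESTED["entry"]  # entry level: less than 6 compute units
    # 6 to 12 compute units. The list only says "less than 12" for mid range,
    # so including exactly 12 here is an assumption.
    return SUGGESTED["mid"]


print(suggested_cmdline(4))
print(suggested_cmdline(12))
print(suggested_cmdline(20))
```

The compute unit count is reported in the task's stderr output as "Number of compute units".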

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23672
Credit: 32,377,354
RAC: 24,412
Germany
Message 1350451 - Posted: 25 Mar 2013, 8:52:22 UTC

The app has the lowest values set as default.
You only need to do that if you want to go faster.

____________

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1198
Credit: 44,272,722
RAC: 115,356
United States
Message 1350508 - Posted: 25 Mar 2013, 14:52:11 UTC - in response to Message 1350451.
Last modified: 25 Mar 2013, 15:15:52 UTC

It also causes something else, related to the Error. What happens if you set the settings too high in XP? *Out of Memory* It causes the App to use more memory. But, You should know that though...

It sure helped me back here: "Then try adding -unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 to your ap_cmdline_6.04_windows_intelx86__opencl_ati.txt file." Send someone having a lot of these Errors a message, see if it helps them.

Oh, I'm about out of APs again...
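For anyone who would rather script the change than edit the file by hand, here is a minimal sketch. The file name is the one given above (it may differ on 64-bit Windows, as noted earlier in the thread). The helper name, the backup step and the trailing newline are my own assumptions, and the project directory has to be supplied by the user because its location depends on the BOINC installation.

```python
import tempfile
from pathlib import Path

# File name as given in this thread; a 64-bit install may use a different name.
CMDLINE_FILE = "ap_cmdline_6.04_windows_intelx86__opencl_ati.txt"


def write_cmdline(project_dir, params, filename=CMDLINE_FILE):
    """Write the parameter line into the cmdline file, backing up any old copy.

    project_dir is the SETI@home project folder inside the BOINC data
    directory; its exact location is not assumed here.
    """
    path = Path(project_dir) / filename
    if path.exists():
        # Keep the previous settings next to the file as <name>.bak.
        backup = path.with_name(path.name + ".bak")
        backup.write_text(path.read_text())
    # A single line with a trailing newline (the newline is an assumption).
    path.write_text(params.strip() + "\n")
    return path


# Demonstration against a temporary folder rather than a real BOINC directory.
with tempfile.TemporaryDirectory() as demo_dir:
    written = write_cmdline(
        demo_dir, "-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536"
    )
    demo_name = written.name
    demo_contents = written.read_text()

print(demo_name)
print(demo_contents, end="")
```

Exiting BOINC before changing the file and restarting it afterwards is the cautious route, since I am not sure when the app re-reads it.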

trader
Volunteer tester
Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1350638 - Posted: 25 Mar 2013, 18:54:54 UTC - in response to Message 1350322.

i might be able to help

With a 10-week old problem?

Do share.


An answer via an objective interpolation of your definition of the word share when applied to the meager number of posts I have precludes me from providing you with the answer that you desire that will not violate the august moderator's interpretation of what a flame/hate mail is, and I actually care about what he thinks.

William
Project donor
Volunteer tester
Joined: 14 Feb 13
Posts: 1580
Credit: 9,460,812
RAC: 7,119
Message 1350640 - Posted: 25 Mar 2013, 18:59:03 UTC - in response to Message 1350638.

An answer via an objective interpolation of your definition of the word share when applied to the meager number of posts I have precludes me from providing you with the answer that you desire that will not violate the august moderator's interpretation of what a flame/hate mail is, and I actually care about what he thinks.

Which 'he' would that be now, in the last sentence? *puzzled look*
____________
A person who won't read has no advantage over one who can't read. (Mark Twain)

andybutt
Project donor
Volunteer tester
Joined: 18 Mar 03
Posts: 251
Credit: 112,593,413
RAC: 99,640
United Kingdom
Message 1350644 - Posted: 25 Mar 2013, 19:01:55 UTC - in response to Message 1350638.

We would still like to hear your input on the problem, whatever Richard said to offend you.

Andy
____________

Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 7110
Credit: 1,224,628
RAC: 1,244
United Kingdom
Message 1353958 - Posted: 6 Apr 2013, 1:58:08 UTC

Bump.
____________
In an alternate universe, it was a ZX81 that asked for clothes, boots and motorcycle.

Client error 418: I'm a teapot

Tropical Goldfish Fish 15: Squeaky bras 'R us


Copyright © 2014 University of California