Working as Expected (Jul 13 2009)


Profile Bill Walker
Joined: 4 Sep 99
Posts: 3403
Credit: 2,150,760
RAC: 2,144
Canada
Message 918449 - Posted: 16 Jul 2009, 13:52:00 UTC - in response to Message 918445.

Wouldn't this mean all your pending uploads would get backed off by the same delay time? Then you again get x number of jobs trying to upload at the same time. I think the present system, where each failed upload gets a pseudo-random delay time, spreads out the downstream retries more effectively.


YES, and NO... BOINC has an inbuilt setting which only allows x number of simultaneous transfers (default = 2 per project)


So after the delay BOINC would try x uploads, then when they fail, x more, and so on. That sounds exactly like what happens right now when a frustrated user hits "Retry Now" for all his umpteen pending uploads. Not an improvement, in my opinion.
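
For anyone curious, the per-file scheme amounts to something like this toy C++ sketch of randomized exponential backoff (the names and constants here are invented, not the actual BOINC source):

    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>

    // Each file that fails rolls its OWN jittered delay, so a thousand
    // stuck results drift apart instead of retrying in lockstep.
    double next_backoff(int n_failures) {
        double cap = std::min(4.0 * 3600.0,                     // never more than 4 hours
                              60.0 * (1 << std::min(n_failures, 10)));
        double r = (double)std::rand() / RAND_MAX;              // uniform in [0, 1]
        return cap * (0.5 + 0.5 * r);                           // 50-100% of the cap
    }

    int main() {
        for (int i = 1; i <= 5; i++)
            std::printf("failure %d -> wait about %.0f s\n", i, next_backoff(i));
    }

Put one clock per server instead of one per file and all that useful jitter collapses into a single synchronized wave.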
____________

Profile Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,199,112
RAC: 408
Australia
Message 918451 - Posted: 16 Jul 2009, 13:58:38 UTC - in response to Message 918445.

Wouldn't this mean all your pending uploads would get backed off by the same delay time? Then you again get x number of jobs trying to upload at the same time. I think the present system, where each failed upload gets a pseudo-random delay time, spreads out the downstream retries more effectively.


YES, and NO... BOINC has an inbuilt setting which only allows x number of simultaneous transfers (default = 2 per project)


Just to clarify... (using the default setting of two)

Yes ... ALL would be ready to attempt upload at the same time.

IF the first 2 attempting upload failed, they would all back off again.

Or IF the first two succeeded, two more would immediately attempt upload...
continuing until all have uploaded (or one fails, which would initiate a new backoff).
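
In rough C++ terms, the behaviour I'm describing would look something like this (a sketch only, with made-up names and stubs, not the real client code):

    #include <deque>

    enum XferResult { XFER_OK, XFER_RUNNING, XFER_FAILED };
    struct Upload { int retries; /* file name, etc. */ };

    // Stand-ins for the real client's transfer machinery:
    XferResult try_upload(Upload&)         { return XFER_RUNNING; } // stub
    void back_off_all(std::deque<Upload>&) {}                       // stub

    const int MAX_XFERS = 2;   // default: 2 simultaneous transfers per project

    // One pass of the transfer scheduler: start uploads until the limit is
    // reached; the first failure puts the WHOLE queue into a new backoff.
    void poll_uploads(std::deque<Upload>& queue) {
        int in_flight = 0;
        for (Upload& u : queue) {
            if (in_flight == MAX_XFERS) break;
            switch (try_upload(u)) {
                case XFER_OK:      break;                        // next one starts at once
                case XFER_RUNNING: ++in_flight; break;
                case XFER_FAILED:  back_off_all(queue); return;  // everyone waits
            }
        }
    }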
____________
Flying high with Team Sicituradastra.

Dave
Joined: 13 Jul 09
Posts: 1
Credit: 5,218
RAC: 0
United States
Message 918452 - Posted: 16 Jul 2009, 14:16:17 UTC

I have a simple mind and I am not following this discussion too well ;-\
What is the estimate for when I will be able to upload completed work units?
Or should I suspend processing until uploads are functioning again?

Many thanks,
Dave G.
"Per Ardua, ad Astra"[/img]

Profile Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,199,112
RAC: 408
Australia
Message 918453 - Posted: 16 Jul 2009, 14:21:30 UTC - in response to Message 918452.

I have a simple mind and I am not following this discussion too well ;-\
What is the estimate for when I will be able to upload completed work units?
Or should I suspend processing until uploads are functioning again?

Many thanks,
Dave G.
"Per Ardua, ad Astra"[/img]


Welcome to the message boards.

Uploads are getting through, although it's still a bit patchy.

No need to suspend processing; things should be improving with time.
____________
Flying high with Team Sicituradastra.

lobozmarcin
Volunteer tester
Joined: 19 Jul 02
Posts: 1
Credit: 1,421
RAC: 0
Poland
Message 918455 - Posted: 16 Jul 2009, 14:39:10 UTC

I have a problem:

2009-07-16 15:27:18|SETI@home|Backing off 3 hr 39 min 10 sec on upload of 01dc08ad.14412.20113.8.8.66_1_0
2009-07-16 15:27:19||Internet access OK - project servers may be temporarily down.
2009-07-16 15:27:36|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 0 completed tasks
2009-07-16 15:27:41|SETI@home|Scheduler request completed: got 0 new tasks


I've had this same problem for 3 days.
____________

Profile Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,199,112
RAC: 408
Australia
Message 918459 - Posted: 16 Jul 2009, 14:49:46 UTC - in response to Message 918455.

I have a problem:

2009-07-16 15:27:18|SETI@home|Backing off 3 hr 39 min 10 sec on upload of 01dc08ad.14412.20113.8.8.66_1_0
2009-07-16 15:27:19||Internet access OK - project servers may be temporarily down.
2009-07-16 15:27:36|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 0 completed tasks
2009-07-16 15:27:41|SETI@home|Scheduler request completed: got 0 new tasks


I've had this same problem for 3 days.


Welcome to the message boards.

Your work is slowly getting through; 16 tasks have uploaded and been reported today (UTC time).

After a few more have uploaded you will get some new work.

____________
Flying high with Team Sicituradastra.

Profile Ageless
Joined: 9 Jun 99
Posts: 12327
Credit: 2,632,842
RAC: 1,201
Netherlands
Message 918460 - Posted: 16 Jul 2009, 14:53:13 UTC - in response to Message 918422.
Last modified: 16 Jul 2009, 14:53:52 UTC

But I accept your point about needing to check the CPU overhead of the unzip process on Bruno when the zips arrive.

Why would they need to be decompressed on Bruno? As zips they take up less space on disk as well as in throughput. I would expect that once you really need the results, you can import them on whatever server you're using to look at them and decompress them there. Or is that too simple?
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

CryptokiD
Joined: 2 Dec 00
Posts: 134
Credit: 2,814,936
RAC: 0
United States
Message 918468 - Posted: 16 Jul 2009, 15:30:58 UTC
Last modified: 16 Jul 2009, 15:34:54 UTC

You all see those short-time work units? They take only a few minutes for the CPU or GPU to complete. Why not stack those shorties into a big zipped file and send one single file to a user? It would reduce the number of simultaneous connections to the server AND client if you had 40 or 50 work units in 1 large compressed file. When the client is done with the work units, it can compress them back into one file and send it back to the server.

If a client got bored and only managed to crunch half the work units, it could still compress the half it finished into one file and send that. When the file gets to the server, it would decompress the file into the individual work units, count and file away the ones that made it back, and reissue work units for the ones the client did not complete or errored out on.

This would greatly reduce the amount of hammering the SETI servers see from clients begging to upload or download. Instead of seeing thousands of little files coming and going, we would see hundreds of large files, which means fewer inbound and outbound connections. The 100 Mbit pipe would still get saturated to capacity, but at least the number of connections would decrease.

You could even go so far as to limit the number of downloads an individual client is allowed per day. Set it at, for example, 2: only twice a day can a client request a new compressed stack of work units, and only if it has sent the previous ones back already. The BOINC app only lets you download up to 100 WUs a day when you first join. Why not compress those 100 into 1 file, send it out, and hope you get some of it back a week later? And just like the current BOINC app, if a client shows that it can handle 100 a day, then the number could gradually increase.

Again, it wouldn't solve the bandwidth issue, but it would greatly reduce the number of connection attempts, which are in and of themselves bandwidth hogs.

On the computers on my account alone, uploads of completed work units are being attempted every few seconds. This could be reduced to 2x per computer per day.
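
Something like this toy C++ sketch is all I'm suggesting (it shells out to the ordinary zip tool, and upload_file() is a made-up placeholder, nothing that exists in BOINC):

    #include <cstdlib>
    #include <string>
    #include <vector>

    bool upload_file(const std::string&) { return true; }   // placeholder

    // Bundle N finished results into ONE archive and upload that:
    // one connection to the server instead of N.
    bool upload_batch(const std::vector<std::string>& results) {
        std::string cmd = "zip -q results_batch.zip";
        for (const std::string& f : results)
            cmd += " " + f;
        if (std::system(cmd.c_str()) != 0) return false;     // zip failed
        return upload_file("results_batch.zip");
    }

The server side just unzips, files away what came back, and reissues the rest.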

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4443
Credit: 119,144,296
RAC: 140,264
United States
Message 918474 - Posted: 16 Jul 2009, 16:10:33 UTC - in response to Message 918441.

One thought I have had.....

BUT it would require a change to the Boinc client software.

I'll throw it in the ring anyway

It seems a lot of the problem is the continual hammering of the upload server with attempts to upload by each result individually.

Why not get BOINC to apply the backoff to ALL results attempting to upload to that SAME server that caused the initial backoff?

This would mean having a backoff clock for each upload server, instead of for each result.

This would mean just one or two (whatever your # of simultaneous transfers setting is) results would make the attempt, then the rest of the results waiting (up to 1000s in some cases) would be backed off as well, giving the servers a breather.

Not being a programmer, I'm not sure how difficult this would be to implement (it doesn't seem like it would be, to me), and the benefits in reduced wasted bandwidth should be substantial.

Please feel free to comment.


We were just talking about that last night in the panic thread here. Seems that something like that might be coming. :D
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8634
Credit: 51,603,022
RAC: 48,820
United Kingdom
Message 918475 - Posted: 16 Jul 2009, 16:14:21 UTC - in response to Message 918460.

But I accept your point about needing to check the CPU overhead of the unzip process on Bruno when the zips arrive.

Why would they need to be decompressed on Bruno? As zips they take up less space on disk as well as in throughput. I would expect that once you really need the results, you can import them on whatever server you're using to look at them and decompress them there. Or is that too simple?

The advantage of decompressing them on Bruno is that they end up in exactly the same place as they would have done under the existing upload handler: no change is required to all the cross-mounted complexity of the SETI file system. And the files are available immediately and individually: you don't have to teach the validator how to go fossicking around in any number of zip archives when it needs a file.

And that also makes the whole system reversible: if something goes wrong with the remote concentrator, use DNS to point all of us back to Bruno. Gets sticky again, of course, but lets the project limp along until the concentrator is revived.

Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 2981
Credit: 5,089,751
RAC: 1,705
Canada
Message 918484 - Posted: 16 Jul 2009, 17:01:45 UTC - in response to Message 918441.
Last modified: 16 Jul 2009, 17:08:10 UTC

One thought I have had.....

BUT it would require a change to the Boinc client software.

I'll throw it in the ring anyway

It seems a lot of the problem is the continual hammering of the upload server with attempts to upload by each result individually.

Why not get BOINC to apply the backoff to ALL results attempting to upload to that SAME server that caused the initial backoff?

This would mean having a backoff clock for each upload server, instead of for each result.

This would mean just one or two (whatever your # of simultaneous transfers setting is) results would make the attempt, then the rest of the results waiting (up to 1000s in some cases) would be backed off as well, giving the servers a breather.

Not being a programmer, I'm not sure how difficult this would be to implement (it doesn't seem like it would be, to me), and the benefits in reduced wasted bandwidth should be substantial.

Please feel free to comment.

This idea has already been checked in by the BOINC developer. It will probably be incorporated in the next version released for testing. As I understand it, this was tried a couple of years ago with some negative effects that will need to be looked at again.

EDIT: Oops, I should have read the top of the thread before replying.

Profile Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,199,112
RAC: 408
Australia
Message 918485 - Posted: 16 Jul 2009, 17:03:05 UTC - in response to Message 918474.

We were just talking about that last night in the panic thread here. Seems that something like that might be coming. :D



Thanks HAL9000, I had not seen that thread yet. Looks like it may be good news. :)

____________
Flying high with Team Sicituradastra.

Anthony Liggins
Joined: 23 Aug 99
Posts: 14
Credit: 454,120
RAC: 0
United Kingdom
Message 918488 - Posted: 16 Jul 2009, 17:11:57 UTC - in response to Message 917472.

Since CUDA was introduced into the project, the BOINC servers have been taking on an ever-increasing load. As people populate their spare PCIe slots with extra GPUs, adding an extra 112 or more cores each time they do so, your bandwidth woes will only keep increasing.

SETI@home has become a victim of its own success where CUDA is concerned. The best thing to do here is to limit the amount each GPU can download each day through the web interface; cutting it by one third or one half will free up a good portion of bandwidth. This will also decrease the load on the backend, as you will not need to create so many multibeam WUs. Increasing the chirp rate will affect the slower CPUs far more than GPUs. :-(

I have been browsing stats and looking at computers attached to BOINC, and I have noticed fellow participants who have between 1,500 and 5,000 WUs downloaded onto their PCs. I would consider this somewhat excessive, which is why I am making this suggestion.

Once implemented, you should notice a difference within 24 hours; then hopefully people will not feel so frustrated when trying to upload their finished WUs. That would limit the number of people going red in the face and blowing off steam on this forum, well, maybe until the next managed emergency comes along. Participants need to remember that science does not hurt anyone if it is running late.

Anthony.

A 10 year veteran of Seti@home :-)
____________

Profile Ageless
Joined: 9 Jun 99
Posts: 12327
Credit: 2,632,842
RAC: 1,201
Netherlands
Message 918489 - Posted: 16 Jul 2009, 17:13:14 UTC - in response to Message 918475.

you don't have to teach the validator how to go fossicking around in any number of zip archives when it needs a file.

Install Windows XP or above to do the looking into that archive. ;-)

Just kidding!
I forgot about the validator needing to be able to check the contents. Smacks head.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Profile ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 918490 - Posted: 16 Jul 2009, 17:21:04 UTC

Matt mentioned that they are starting to run low on work. I wonder how many of the old tapes are available to run Astropulse on. Obviously regular S@H has been run on the old tapes (1999 up to the start of Astropulse), but there are years of tapes to run Astropulse on. Is it because of the RFI and other factors previously mentioned that they aren't available?
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2290
Credit: 8,813,816
RAC: 4,080
United States
Message 918493 - Posted: 16 Jul 2009, 17:28:58 UTC - in response to Message 918490.

Matt mentioned that they are starting to run low on work. I wonder how many of the old tapes are available to run Astropulse on. Obviously regular S@H has been run on the old tapes (1999 up to the start of Astropulse), but there are years of tapes to run Astropulse on. Is it because of the RFI and other factors previously mentioned that they aren't available?

Yes, it does have to do with the RFI/radar. A while back, the recorder was set up so that one of the 14 channels of data holds the "chirping" of the radar, and the remaining channels have the data that gets cut into WUs. That's what I think I remember reading a long time ago.

Actually, what I remember reading is that we were only using 12 channels at the time, so there were two free channels left, so one was used for the radar chirping, and the other was still available for future use. No idea where to even try to find that reference now.

But at any rate, before that 13th channel was used for radar chirping, there is no way of knowing where the chirps actually are, and that's where the software radar blanker that Matt has been working on comes into play. Once he gets that up and running, it can pre-process the older tapes, find where it thinks the radar is, and fill that 13th channel with the chirps so the splitters can do what they normally do.
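
Just to illustrate the flavour of the idea (and only the flavour: the real blanker has to match the radar's known timing pattern, and the simple threshold here is invented):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Toy "software radar blanker": flag samples where the signal is
    // suspiciously loud, producing the same kind of yes/no stream the
    // hardware now records alongside the data channels.
    std::vector<bool> flag_radar(const std::vector<float>& samples, float threshold) {
        std::vector<bool> chirp(samples.size(), false);
        for (std::size_t i = 0; i < samples.size(); ++i)
            chirp[i] = std::fabs(samples[i]) > threshold;   // probable radar hit
        return chirp;
    }

The splitters could then skip the flagged stretches on the old tapes just as they do on the new recordings.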
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Nicolas
Joined: 30 Mar 05
Posts: 160
Credit: 10,335
RAC: 0
Argentina
Message 918523 - Posted: 16 Jul 2009, 19:56:50 UTC - in response to Message 918399.

It would need a change to the transitioner/validator logic and timing, to avoid those 'validate error' results we get when the report arrives before the upload: but with BOINC doing delayed/batched reporting anyway, the overall effect wouldn't be big. It would finally give me an excuse to ditch v5.10.13 (no point in early reporting!). All it really needs is that if the validator can't find the file it needs at the first attempt, it goes into backoff/retry (like much of the rest of BOINC) instead of immediate error.

It's possible for the scheduler to ignore reports if the file wasn't uploaded yet; the client would just keep them queued for a while longer and try reporting them later. This can be done for individual workunits (there is a separate 'ack' for each, which the client must receive before it gets rid of the task locally).
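
A sketch of what I mean, with invented names (this is not the actual scheduler code):

    #include <string>
    #include <vector>

    struct Report { std::string result_name; };

    // Stand-in: has this result's output file actually arrived yet?
    bool file_present(const std::string&) { return false; }   // stub

    // Only ack results whose files are here; the client keeps the rest
    // queued locally and simply reports them again next time.
    std::vector<std::string> ack_reports(const std::vector<Report>& reports) {
        std::vector<std::string> acks;
        for (const Report& r : reports)
            if (file_present(r.result_name))
                acks.push_back(r.result_name);   // client may now forget the task
        return acks;                             // everything else gets retried
    }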
____________

Contribute to the Wiki!

Nicolas
Joined: 30 Mar 05
Posts: 160
Credit: 10,335
RAC: 0
Argentina
Message 918525 - Posted: 16 Jul 2009, 20:05:07 UTC - in response to Message 918441.

Why not get BOINC to apply the backoff to ALL results attempting to upload to that SAME server that caused the initial backoff?

Already implemented, see http://boinc.berkeley.edu/trac/changeset/18593.

There hasn't yet been a new version released since then, though.
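
Reduced to a sketch, the idea is one backoff clock per server rather than one per file (this is the concept, not the code in that changeset):

    #include <map>
    #include <string>

    // One backoff clock per upload server instead of one per file.
    std::map<std::string, double> backoff_until;   // server URL -> earliest retry time

    bool may_try(const std::string& url, double now) {
        auto it = backoff_until.find(url);
        return it == backoff_until.end() || now >= it->second;
    }

    void on_failure(const std::string& url, double now, double delay) {
        backoff_until[url] = now + delay;   // every pending file for this
    }                                       // server waits out the same clock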

____________

Contribute to the Wiki!

Profile Ageless
Joined: 9 Jun 99
Posts: 12327
Credit: 2,632,842
RAC: 1,201
Netherlands
Message 918528 - Posted: 16 Jul 2009, 20:14:45 UTC - in response to Message 918525.

You know, some people had pointed that out already in this same thread... ;-)
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918557 - Posted: 16 Jul 2009, 22:33:07 UTC - in response to Message 918399.

Would a set of remote upload servers as "data aggregators" work?

It is an interesting idea.

The biggest single issue as I see it:

Work is uploaded, and the moment the upload completes it is available for processing on the upload server.

Then, the result is reported. At this point, it is marked in the database as received, and subject to validation.

The validator doesn't have to check to see if the result is in local storage, because it is in local storage by definition.

This change means you have a new state: reported but not in local storage.

BOINC would have to know about that, and have some way of dealing with it (rescanning the database and checking to see whether the result has actually arrived), probably by having the "unzip" process on the upload server do the reporting.
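
In other words, the result's life cycle grows something like an extra state (invented names; the real BOINC states differ):

    // Server-side view of a result, with the state the aggregator
    // scheme would add:
    enum ResultState {
        RESULT_UNSENT,
        RESULT_IN_PROGRESS,
        RESULT_REPORTED_AWAITING_FILE,  // NEW: reported, but the file is
                                        // still at the remote aggregator
        RESULT_UPLOADED,                // file in local storage; validate away
        RESULT_VALIDATED
    };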

There is also a chance that the result gets lost between the off-site server and the "true" upload server.

I like the idea of doing just one, near Berkeley.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

... but, a better idea (related to the thread which I haven't worked my way through) might be to zip all of the pending uploads into one file. All the client really needs to know is what is in the zip -- then let that go to Bruno.

The downside is that you have to push all of the work through in one session, and the bigger the .zip file the more bytes/packets you have to push through in a row....

____________
