Guess what's wrong with uploading...

ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 137832 - Posted: 17 Jul 2005, 18:36:46 UTC

Here on the boards, we have many experts of various types. Now here is your chance to tell Berkeley what they don't know and what to fix!

I'll start off with a few of my guesses.

We already have the clues that the upload/download server is 100% CPU bound. Also, the Cogent link is not bandwidth saturated...

So here goes:

Too many files accumulated in the upload directory causing the filesystem to choke;

Database query problems causing excessive server delays;

DOS attack with junk WUs;

Bad network card or a bad router causing confusion;

Too many CPU processes causing the server to thrash;

Unexpectedly high disk fragmentation;

Very high throughput of DOWNLOADING WUs for starving, half-crazed users desperate to load up to The Max;

...

I'll leave a few options for the other experts to have a chance :)

So who gets the prize for best guess?

Good luck,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 137832 · Report as offensive
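Several of the guesses above (too many files in the upload directory, a full disk, a thrashing CPU) are things an administrator could check in a few seconds. A minimal sketch in Python, assuming a Unix-like host; the function names are made up for illustration, and nothing here reflects how the Berkeley servers are actually laid out or monitored:

```python
import os
import shutil


def count_entries(directory):
    """Count directory entries without building a full list in memory."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def quick_health_check(upload_dir):
    """Gather a few numbers relevant to the guesses in this thread."""
    usage = shutil.disk_usage(upload_dir)
    try:
        load_1min = os.getloadavg()[0]   # not available on every platform
    except (AttributeError, OSError):
        load_1min = None
    return {
        "entries": count_entries(upload_dir),          # "too many files"?
        "disk_used_fraction": usage.used / usage.total,  # "HDD full"?
        "load_1min": load_1min,                        # "server thrashing"?
        "cpu_count": os.cpu_count(),
    }
```

Calling `quick_health_check("/path/to/upload")` (a hypothetical path) returns a small dictionary; a 1-minute load average far above `cpu_count` or a disk fraction near 1.0 would support the corresponding guess.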
IT_Eagle03
Joined: 22 Nov 99
Posts: 5
Credit: 154,363
RAC: 0
United States
Message 137834 - Posted: 17 Jul 2005, 18:39:01 UTC

The SETI chipmunk died. :(
ID: 137834 · Report as offensive
CJOrtega
Joined: 15 May 99
Posts: 186
Credit: 1,126,273
RAC: 0
United States
Message 137844 - Posted: 17 Jul 2005, 18:48:39 UTC

Too many people hitting the retry/update button.

ID: 137844 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 137846 - Posted: 17 Jul 2005, 18:49:58 UTC

Matt set his beer on the UL/DL server and someone knocked it over (alcohol abuse).
ID: 137846 · Report as offensive
KWSN - MajorKong
Volunteer tester
Joined: 5 Jan 00
Posts: 2892
Credit: 1,499,890
RAC: 0
United States
Message 137851 - Posted: 17 Jul 2005, 18:53:58 UTC - in response to Message 137844.  

Too many people hitting the retry/update button.



I agree with CJOrtega. Too many people tap-dancing on the update button. The upload/download server can't catch up as long as people are doing this, in my opinion. Quite the Prisoner's Dilemma.
https://youtu.be/iY57ErBkFFE

#Texit

Don't blame me, I voted for Johnson(L) in 2016.

Truth is dangerous... especially when it challenges those in power.
ID: 137851 · Report as offensive
Kevin N. Shapley
Volunteer tester
Joined: 1 Jan 00
Posts: 100
Credit: 2,539,295
RAC: 0
United States
Message 137853 - Posted: 17 Jul 2005, 18:57:41 UTC




PacBell. Oops, I mean SBC ;)



-
Oderint dum metuant
ID: 137853 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 137855 - Posted: 17 Jul 2005, 18:59:04 UTC - in response to Message 137846.  

Matt set his beer on the UL/DL server...

Or rather Matt is enjoying a beer while watching "top" over a remote link while the CPU is maxed out and the load average goes exponential as top gobbles more and more CPU trying to calculate the load!

OK, another guess: The server is choked with multiple remote links or nfs mounts getting polled.

Or worse still, they are using Linux and have left "fam" enabled and the kernel has gone into meltdown trying to service fam's requests for what files have changed...!

Mmmmm, more beer here I think,
Cheers,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 137855 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 137859 - Posted: 17 Jul 2005, 19:06:34 UTC - in response to Message 137855.  

Matt set his beer on the UL/DL server...

Or rather Matt is enjoying a beer while watching "top" over a remote link while the CPU is maxed out and the load average goes exponential as top gobbles more and more CPU trying to calculate the load!...

Or better yet, he's started up Xorg server to view the system stats graphically and X has burped into its 100% CPU usage mode...!

;)
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 137859 · Report as offensive
Ned Slider
Joined: 12 Oct 01
Posts: 668
Credit: 4,375,315
RAC: 0
United Kingdom
Message 137861 - Posted: 17 Jul 2005, 19:12:04 UTC

Someone thought they'd try out a Win X64 installation over the weekend. Windows and servers don't mix :D


*** My Guide to Compiling Optimised BOINC and SETI Clients ***
*** Download Optimised BOINC and SETI Clients for Linux Here ***
ID: 137861 · Report as offensive
Iztok s52d (and friends)
Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 137867 - Posted: 17 Jul 2005, 19:20:18 UTC - in response to Message 137861.  

Hi!

While monitoring uploads, I quite often saw a result uploaded but not ACKed.
So it went back into the retry queue.

Now... if I do not get a connection, that is fine. But if I do get one, it should
do the job. Maybe some timer is involved?

Anyhow, if successful uploads outpace the work we do, then the queues will
slowly shrink. If not, then we might discover some limits on the client side.

Let us hope they find the magic parameter tomorrow, and we are back to normal
soon.

BR
Iztok


ID: 137867 · Report as offensive
Tigher
Volunteer tester
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 137872 - Posted: 17 Jul 2005, 19:27:13 UTC - in response to Message 137867.  

Hi!

While monitoring uploads, I quite often saw a result uploaded but not ACKed.
So it went back into the retry queue.

Now... if I do not get a connection, that is fine. But if I do get one, it should
do the job. Maybe some timer is involved?

Anyhow, if successful uploads outpace the work we do, then the queues will
slowly shrink. If not, then we might discover some limits on the client side.

Let us hope they find the magic parameter tomorrow, and we are back to normal
soon.

BR
Iztok



Yes, I saw this too. Connection accepted, ACKed and established. Data sent to the server giving the file size, and ACKed, but then a TCP reset from the server. Strange behaviour for sure.


ID: 137872 · Report as offensive
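The distinction drawn in the two posts above (never getting a connection versus getting one and then being reset) is visible to a client as different socket errors. A minimal sketch in Python of how a client could tell them apart; the function names and category labels are invented for illustration, and this is not the BOINC client's own transfer code:

```python
import socket


def classify_upload_failure(exc):
    """Map a socket-level exception to a coarse failure category."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"   # the connection was never established
    if isinstance(exc, (ConnectionResetError, BrokenPipeError)):
        return "reset"     # connected, then the peer dropped the connection
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"   # no answer within the allowed time
    return "other"


def attempt_upload(host, port, payload, timeout=30.0):
    """Send payload over TCP; return 'ok', 'closed', or a failure category."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(payload)
            reply = conn.recv(1)   # wait for a reply byte, a close, or a reset
        return "ok" if reply else "closed"
    except OSError as exc:
        return classify_upload_failure(exc)
```

The behaviour described above, where the data is acknowledged and the server then sends a reset, would show up as `"reset"` rather than `"refused"`, which suggests the server gave up after accepting the connection rather than turning the client away at the door.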
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 137874 - Posted: 17 Jul 2005, 19:33:09 UTC - in response to Message 137872.  
Last modified: 17 Jul 2005, 19:33:48 UTC

Yes, I saw this too. Connection accepted, ACKed and established. Data sent to the server giving the file size, and ACKed, but then a TCP reset from the server. Strange behaviour for sure.

Interesting.

A resource limit hit at their end? Too many open connections or a fs timeout?...

OK, is their HDD full?!

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 137874 · Report as offensive
Tigher
Volunteer tester
Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 137877 - Posted: 17 Jul 2005, 19:36:29 UTC - in response to Message 137874.  

Yes, I saw this too. Connection accepted, ACKed and established. Data sent to the server giving the file size, and ACKed, but then a TCP reset from the server. Strange behaviour for sure.

Interesting.

A resource limit hit at their end? Too many open connections or a fs timeout?...

OK, is their HDD full?!

Regards,
Martin


Well, perhaps. I think I saw in another thread that the server was unable to find files it needed. So... full... corruption... who knows, but it seems serious to me.
Ian

ID: 137877 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 137881 - Posted: 17 Jul 2005, 19:40:00 UTC

In case you missed it, if you RetryNow an upload and then cancel it in midstream, the upload is actually accepted and posted to your account. Kinda weird, but it seems to work.

I suppose if everyone did this, then the disk might overflow or otherwise crash the upload system.
May this Farce be with You
ID: 137881 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 137887 - Posted: 17 Jul 2005, 19:45:48 UTC - in response to Message 137881.  
Last modified: 17 Jul 2005, 19:46:45 UTC

In case you missed it, if you RetryNow an upload and then cancel it in midstream, the upload is actually accepted and posted to your account. Kinda weird, but it seems to work.

I hope you don't mean "abort" the WU?

The 'Abort' trick has been tested elsewhere (see Misfit's posts). Your WU is dumped and you get zero credit and zero science. DO NOT ABORT YOUR WUs!

Just be patient and let the WUs get returned as and when they can. I'm getting a steady trickle returned OK after a few (automatic) attempts each.

Good luck,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 137887 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20267
Credit: 7,508,002
RAC: 20
United Kingdom
Message 137900 - Posted: 17 Jul 2005, 20:06:37 UTC - in response to Message 137877.  
Last modified: 17 Jul 2005, 20:08:23 UTC

Yes, I saw this too. Connection accepted, ACKed and established. Data sent to the server giving the file size, and ACKed, but then a TCP reset from the server. Strange behaviour for sure.
...
A resource limit hit at their end? Too many open connections or a fs timeout?...

OK, is their HDD full?!

Well, perhaps. I think I saw in another thread that the server was unable to find files it needed. So... full... corruption... who knows...

Whatever it is, a lot of the bandwidth is being lost to repeated upload attempts:
2005-07-17 19:19:37 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:20:52 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:20:52 [SETI@home] Backing off 1 minutes and 0 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:21:52 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:23:43 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:23:43 [SETI@home] Backing off 1 minutes and 0 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:24:43 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:25:58 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:25:58 [SETI@home] Backing off 1 minutes and 0 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:26:58 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:29:15 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:29:15 [SETI@home] Backing off 1 minutes and 0 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:30:15 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:31:30 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:31:30 [SETI@home] Backing off 2 minutes and 1 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:33:31 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:34:46 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:34:46 [SETI@home] Backing off 1 minutes and 18 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:36:04 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:37:19 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:37:19 [SETI@home] Backing off 8 minutes and 31 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
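To put numbers on how much time each failed attempt burns, the log can be summarised mechanically. A minimal sketch in Python, assuming the client log format shown above and a log that covers a single file (as this one does); the regular expressions and the function name are my own, not part of BOINC:

```python
import re
from datetime import datetime

LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]+\] (.*)$")
BACKOFF_RE = re.compile(r"Backing off (\d+) minutes and (\d+) seconds")

# First two attempts from the log above, used as sample input.
SAMPLE_LOG = """\
2005-07-17 19:19:37 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:20:52 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:20:52 [SETI@home] Backing off 1 minutes and 0 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:21:52 [SETI@home] Started upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:23:43 [SETI@home] Temporarily failed upload of 07fe05aa.532.7760.47148.134_4_0
2005-07-17 19:23:43 [SETI@home] Backing off 1 minutes and 0 seconds on transfer of file 07fe05aa.532.7760.47148.134_4_0
"""


def summarize(log_text):
    """Return (seconds spent per failed attempt, backoff seconds per retry)."""
    attempts, backoffs = [], []
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Started upload"):
            started = stamp
        elif message.startswith("Temporarily failed upload") and started:
            attempts.append(int((stamp - started).total_seconds()))
            started = None
        else:
            backoff = BACKOFF_RE.match(message)
            if backoff:
                backoffs.append(int(backoff.group(1)) * 60 + int(backoff.group(2)))
    return attempts, backoffs
```

Run over the full log above, this gives failed attempts lasting 75, 111, 75, 137, 75, 75 and 75 seconds, with backoffs of 60, 60, 60, 60, 121, 78 and 511 seconds. Five of the seven attempts fail after exactly 75 seconds, which might point to a fixed timeout somewhere, though that is only a guess.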

Mmmm, calling all Experts, any other ideas?

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 137900 · Report as offensive
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 137901 - Posted: 17 Jul 2005, 20:12:03 UTC

I've done what a few others have done, and just disabled BOINC network access, and I'll leave it that way till probably Monday afternoon, when hopefully (crossing fingers) everything is working or semi-working again. I've got plenty of WUs for my computer to crunch, plus another project for it to work on, with no need to connect to anything.

And like someone else said, hitting the retry/update buttons only throws more data at the servers, and since they are overloaded already, why overload it more? Don't make sense to me. Oh well. Keep on crunchin everyone!!

Jeremy
ID: 137901 · Report as offensive
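The point about retry/update clicks is the usual argument for letting the client back off on its own. The intervals in the client log quoted earlier in the thread (1 minute several times, then 2:01, 1:18 and 8:31) look like a randomised backoff with a one-minute floor. A minimal sketch of that general technique in Python; the constants and the formula are assumptions chosen to resemble the log, not BOINC's actual algorithm:

```python
import random

MIN_DELAY = 60.0          # seconds; assumed floor, matching the 1-minute backoffs
MAX_DELAY = 4 * 3600.0    # seconds; assumed ceiling, purely illustrative


def next_backoff(failures, rng=random):
    """Pick a random delay whose upper bound doubles with each failure."""
    exponent = min(failures, 16)   # cap the exponent to keep numbers sane
    ceiling = min(MAX_DELAY, MIN_DELAY * (2 ** exponent))
    # Randomising spreads clients out so they do not all retry together.
    return max(MIN_DELAY, rng.uniform(0.0, ceiling))
```

A manual retry throws away the wait that a scheme like this would have chosen, so many users clicking at once pushes the retries back together at exactly the moment the server can least afford it.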
tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 137912 - Posted: 17 Jul 2005, 20:49:44 UTC


I had two WUs upload today (due on July 28th) but am still unable to get units due on the 26th and 27th to upload. Well, at least I still have time until the deadline. I'm not beating at the server with repeated retries but rather am leaving the connection enabled while online. They upload if and when they want to. Currently the completed unit with the longest time to report is scheduled to retry first. I wonder why they're uploading in such an odd order.
ID: 137912 · Report as offensive
StokeyBob
Joined: 31 Aug 03
Posts: 848
Credit: 2,218,691
RAC: 0
United States
Message 137935 - Posted: 17 Jul 2005, 21:54:58 UTC

If the upload and download would just spend enough time to finish the job it started we wouldn't need a system that needs to retry over and over.

Why does it download just enough information to give your machine a work unit number and then quit? Then when you go back later it is going to have to look up that work unit to send you the information. That is just making more work for itself.

This can be very stressful on some of us. You never know when you may have to reinstall your operating system.
ID: 137935 · Report as offensive
JERFilm
Joined: 20 Apr 02
Posts: 4
Credit: 4,131,391
RAC: 0
United States
Message 137937 - Posted: 17 Jul 2005, 21:59:37 UTC

There must be some priority given to downloading new work units. I seem to get downloads but have as many as 11 uploads sitting around waiting. It seems so strange, since they take about 3 seconds each to do; the eleven of them would take less time to upload than one WU downloaded.....hmmmmm......
ID: 137937 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.