Sharing Large Data Files
- By Terry Calhoun, Joe St Sauver
- 10/29/03
Joe's back! While away, he had an aaaHHHAAA!!! moment, similar to his first
browsing experience, and thinks he may just have spotted the "Internet2
Killer App." Read on to see why.
------------------------------------------
by Joe St. Sauver
University of Oregon Computing Center
This year's Fall
Internet2 Member Meeting took place in Indianapolis, from October 12th to
17th. Besides being a nice opportunity to learn what's been going on in the
Internet2 community (while also providing a chance to hash out issues with colleagues
from other I2 schools face-to-face over a beer and a bowl of Cincinnati-style
3-, 4-, or 5-way chili), the I2 Member Meeting included the widely overlooked
announcement of what may very well be the long-awaited "Internet2 killer
app," a program from the University of Tennessee Knoxville Computer Science
Department that goes by the somewhat odd name of LoRS, part of the LoCI
project.
Watching the LoRS demo at the Indy I2 meeting gave me the same sort of "aaaHHHAAA!!!"
moment that I recall from when I first saw someone use an early version of Netscape
to access a simple Web page: clearly, here was something that's going to profoundly
change the way we do things online.
If you happened to attend the LoRS session like I did, then you had
the chance to see an application that satisfies a fundamental need, much in
the way that e-mail or the Web does. The need that LoRS satisfies is the need
to be able to efficiently distribute large files, files that are too big to
conveniently send by e-mail, files too big to conveniently download via the
Web. You know the sort of files I'm talking about - large multi-gigabyte (or
even multi-terabyte) experimental physics datasets, or CD-sized Linux ISOs,
or those wonderful multi-hundred megabyte PowerPoint marketing presentations
we all so love.
Yes, I know: people currently do move large (if not huge) files all the time
via ftp, or chopped into digestible chunks via e-mail, or via Web pages.
Unfortunately, when folks move files using traditional tools, they don't tend
to get very good network throughput, even over well-engineered, high-capacity,
lightly loaded networks like Internet2's Abilene. (For example, the median throughput
on Abilene for bulk file transfers is still less than 2.5Mbps. See Table 1 in
the I2 weekly NetFlow report.) One reason you're not seeing experimental physicists with fast
Ethernet connections routinely saturating 100Mbps links is no mystery: it is
simply a manifestation of our old friend, the TCP bandwidth-delay product, and
its negative impact on untuned single-threaded network application throughput.
(For a nice discussion of this, see: http://www.psc.edu/networking/perf_tune.html.)
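To put rough numbers on the bandwidth-delay product, here is a minimal Python sketch. The 70 ms round-trip time and the 64 KB window are illustrative assumptions of mine, not measurements from Abilene; the only thing the sketch relies on is the standard rule that a single TCP stream can have at most one window of data in flight per round trip.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def bandwidth_delay_product_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a link of the given speed full."""
    return link_bps * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.070         # assumed 70 ms round-trip time (illustrative)
    window = 64 * 1024  # assumed 64 KB window, i.e. no window scaling

    ceiling_mbps = max_tcp_throughput_bps(window, rtt) / 1e6
    needed_kb = bandwidth_delay_product_bytes(100e6, rtt) / 1024

    print(f"Throughput ceiling with a 64 KB window: {ceiling_mbps:.1f} Mbps")
    print(f"Window needed to fill 100 Mbps: {needed_kb:.0f} KB")
```

Under those assumptions the ceiling works out to about 7.5 Mbps, and filling a 100Mbps path would take a window of roughly 854 KB, which is why an untuned single-stream transfer comes nowhere near saturating the link.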
Serving large files from a single location, or even from a comparatively small
set of distributed mirrors, also doesn't scale very well. Ask anyone who hosts
a Linux distribution mirror what they see when a Linux distributor kicks a new
release out the door!
To move large files efficiently, you really want to get the data distributed
around the network, with redundant copies topologically close to those who need
it. Staging the data close to those who need it means that observed throughput
*will* typically go up and network hot spots *can* usually be avoided. A free
side benefit of distributing large files this way is that distribution of files
via a series of geographically dispersed nodes grants the content of those files
a degree of resistance to network denial of service attacks, or to data loss
due to simple hardware failure.
All this is wonderful, so far as it goes, but if you are like me, you probably
have a well-developed skeptical side:
"What does it cost to use LoRS?";
"Are there running production-quality applications, on the platforms I
need?"; and
"Can I trust my data to some bunch of random LoRS nodes?"
For once, the answers are all pretty good:
You can use LoRS for free, in part because the LoCI project has received
federal funding as well as support from many volunteers who host storage nodes;
LoRS is available in graphical and command line form for Windows PCs and Mac OS
X boxes, and as source to build on Solaris, Linux and other Unix systems;
and
LoRS uses encryption to protect your data from storage depot operators (and
storage depot operators from your data); protection against arbitrary loss of
a node, or of data on a node, comes from the fact that you will typically make redundant
copies of the same data (stored in chunks) on a variety of different storage
nodes, so that if one node goes down you can get a copy of the missing chunks
from a different node.
All this sounds pretty cool, doesn't it?
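The redundancy idea is easy to illustrate with a toy model. The Python sketch below is not LoRS code and uses none of its API; the function names, the depot names, and the in-memory dictionaries standing in for storage depots are all hypothetical, and encryption is left out entirely. It only shows why keeping each chunk on two different nodes lets you survive the loss of any one node.

```python
import random


def place_chunks(data: bytes, chunk_size: int, depots: list[str],
                 copies: int, rng: random.Random):
    """Split data into chunks and store each chunk on `copies` distinct depots."""
    storage = {name: {} for name in depots}  # depot name -> {chunk index: bytes}
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for index, chunk in enumerate(chunks):
        for name in rng.sample(depots, copies):  # sample() picks distinct depots
            storage[name][index] = chunk
    return storage, len(chunks)


def fetch(storage, chunk_count: int, down=frozenset()) -> bytes:
    """Reassemble the file, ignoring any depots listed in `down`."""
    parts = []
    for index in range(chunk_count):
        for name, held in storage.items():
            if name not in down and index in held:
                parts.append(held[index])
                break
        else:
            # No reachable depot holds this chunk.
            raise LookupError(f"chunk {index} is unavailable")
    return b"".join(parts)


if __name__ == "__main__":
    payload = b"large dataset " * 1000
    depots = ["depot-a", "depot-b", "depot-c", "depot-d"]  # hypothetical names
    storage, count = place_chunks(payload, 1024, depots, copies=2,
                                  rng=random.Random(42))
    # Any single depot can disappear and the file still comes back intact.
    for lost in depots:
        assert fetch(storage, count, down={lost}) == payload
    print(f"{count} chunks recovered with any one depot down")
```

With two copies per chunk, losing two depots at once can still lose data; more copies buy more resilience at the price of more storage.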
Having said that, I won't kid you: installing LoRS may take a little noodling
around depending on what you've already got installed (you'll need to download
and install Tcl/Tk and Perl, for example, if you don't already have them installed
on your system); it is not yet a download-click-and-go operation (but neither
does it typically require expert knowledge to do an install). I'd put it at
the level where it helps to be technically minded, but you don't need to be
a hard core geek (or even a minor league geek) to make it work.
As you begin working with LoRS, you'll need to absorb some new concepts, such
as:
The idea of an exNode (an XML-formatted file with pointers to the chunks of your
dataset that have been written to storage depots by LoRS); and
The fact that soft (gratis, as-available) storage allocations made by LoRS are
of limited duration (a day by default), sort of like a giant temp or scratch
disk on some Unix system. If you need to, you can request that your storage
allocation be refreshed for additional time, subject to space availability and
provided your storage allocations haven't already expired.
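One way to picture those two concepts together is as a small data structure. The Python sketch below is a conceptual model only: the class and field names are my own invention, it does not reflect the real exNode XML schema or any LoRS call, and it ignores the space-availability condition. It simply shows a file described by pointers to chunks, each with an expiry time, where only allocations that are still alive can be refreshed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Mapping:
    """One pointer from a byte range of the file to an allocation on a depot."""
    offset: int
    length: int
    depot: str          # hypothetical depot host name
    expires: datetime   # when the soft allocation lapses


@dataclass
class ExNode:
    """Toy stand-in for an exNode: a file name plus pointers to its chunks."""
    filename: str
    mappings: list[Mapping] = field(default_factory=list)

    def live_mappings(self, now: datetime) -> list[Mapping]:
        """Return the mappings whose allocations have not yet expired."""
        return [m for m in self.mappings if m.expires > now]

    def refresh(self, now: datetime, extra: timedelta = timedelta(days=1)) -> int:
        """Extend every unexpired allocation by `extra`; return how many were extended."""
        refreshed = 0
        for m in self.mappings:
            if m.expires > now:  # an expired allocation cannot be revived
                m.expires += extra
                refreshed += 1
        return refreshed


if __name__ == "__main__":
    now = datetime(2003, 10, 29, 12, 0)
    node = ExNode("dataset.iso", [
        Mapping(0, 1024, "depot-a.example.org", now + timedelta(hours=2)),
        Mapping(1024, 1024, "depot-b.example.org", now - timedelta(hours=2)),
    ])
    print(node.refresh(now), "of", len(node.mappings), "allocations refreshed")
```

The practical lesson is the same one the manual teaches: refresh before the lease runs out, because once an allocation has expired there is nothing left to extend.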
Getting Started with LoRS
If you'd like to try LoRS, and I'd encourage you to do so, you should begin
by checking out the LoRS section of the LoCI Web site. Download the documentation
(yes, do RTFM, because in this case the manual is quite good), then download
and install the LoRS software. Try uploading and downloading a sample file or
two. See what you think - I suspect you'll be as impressed as I was.
After you've worked with LoRS for a bit, assuming you believe it to be useful,
consider putting up a shared storage depot at your local site. Pricewatch.com
(and comparable computer-part-price-checking Web sites) show 250GB EIDE drives
for under $200, which makes it pretty darn cheap to assemble a few terabytes
to donate to the collective effort.
Of course, before offering any potentially network-intensive service of this
sort, be sure to discuss your plans with your local networking folks to ensure
that any bandwidth-related issues receive proper consideration, and that you've
got campus support for the new application you're deploying.
Well . . . It's new! It's cool! It solves a big need! It's free (for now)!
And it has a good instruction manual. Woo-hoo! I'll take Joe's opinion that
you don't even have to be "a minor league geek" to play around with
this new stuff with the proverbial grain of salt, but I'll bet quite a few of
you will be "LoRSing around" with it soon.