Sharing Large Data Files

Joe's back! While away, he had an aaaHHHAAA!!! moment, similar to his first browsing experience, and thinks he may just have spotted the "Internet2 Killer App." Read on to see why.

------------------------------------------

by Joe St. Sauver
University of Oregon Computing Center


This year's Fall Internet2 Member Meeting took place in Indianapolis, from October 12th to 17th. Besides being a nice opportunity to learn what's been going on in the Internet2 community (while also providing a chance to hash out issues with colleagues from other I2 schools face-to-face over a beer and a bowl of Cincinnati-style 3-, 4-, or 5-way chili), the I2 Member Meeting included the widely overlooked announcement of what may very well be the long-awaited "Internet2 killer app," a program from the University of Tennessee Knoxville Computer Science Department that goes by the somewhat odd name of LoRS, part of the LoCI project.

Watching the LoRS demo at the Indy I2 meeting gave me the same sort of "aaaHHHAAA!!!" moment that I recall from when I first saw someone use an early version of Netscape to access a simple Web page: clearly, here was something that's going to profoundly change the way we do things online.

If you happened to attend the LoRS session as I did, then you had the chance to see an application that satisfies a fundamental need, much in the way that e-mail or the Web does. The need that LoRS satisfies is the need to be able to efficiently distribute large files, files that are too big to conveniently send by e-mail, files too big to conveniently download via the Web. You know the sort of files I'm talking about - large multi-gigabyte (or even multi-terabyte) experimental physics datasets, or CD-sized Linux ISOs, or those wonderful multi-hundred megabyte PowerPoint marketing presentations we all so love.

Yes, I know: people currently do move large (if not huge) files all the time via ftp, or chopped into digestible chunks via e-mail, or via Web pages.

Unfortunately, when folks move files using traditional tools, they don't tend to get very good network throughput, even over well-engineered, high-capacity, lightly loaded networks like Internet2's Abilene. (For example, the median throughput on Abilene for bulk file transfers is still less than 2.5 Mbps. See Table 1 in the I2 weekly NetFlow report.) The reason you're not seeing experimental physicists with fast Ethernet connections routinely saturating 100 Mbps links is no mystery: it is simply a manifestation of our old friend, the TCP bandwidth-delay product, and its negative impact on untuned single-threaded network application throughput. (For a nice discussion of this, see: http://www.psc.edu/networking/perf_tune.html.)
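To put rough numbers on the bandwidth-delay product, here is a back-of-the-envelope sketch in Python. The 64 KB window and the 70 ms round-trip time are illustrative assumptions, not measurements from Abilene, and the function names are invented for this example.

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Ceiling on single-stream TCP throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


def window_needed_bytes(link_mbps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link full."""
    return link_mbps * 1_000_000 * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.070          # assumed 70 ms cross-country round trip
    window = 64 * 1024   # assumed untuned 64 KB TCP window
    print(f"{max_throughput_mbps(window, rtt):.1f} Mbps ceiling for one stream")
    print(f"{window_needed_bytes(100, rtt) / 1024:.0f} KB window to fill 100 Mbps")
```

Under those assumptions a single untuned stream tops out at about 7.5 Mbps, and filling a 100 Mbps path would take a window of roughly 854 KB, which is why one untuned connection leaves most of a fast link idle.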

Serving large files from a single location, or even from a comparatively small set of distributed mirrors, also doesn't scale very well. Ask anyone who hosts a Linux distribution mirror what they see when a Linux distributor kicks a new release out the door!

To move large files efficiently, you really want to get the data distributed around the network, with redundant copies topologically close to those who need it. Staging the data close to those who need it means that observed throughput *will* typically go up and network hot spots *can* usually be avoided. A free side benefit of spreading large files across a series of geographically dispersed nodes is that their content gains a degree of resistance to network denial of service attacks, and to data loss due to simple hardware failure.

All this is wonderful, so far as it goes, but if you are like me, you probably have a well-developed skeptical side:

  • "What does it cost to use LoRS?";
  • "Are there running production-quality applications, on the platforms I need?"; and
  • "Can I trust my data to some bunch of random LoRS nodes?"

    For once, the answers are all pretty good:

  • You can use LoRS for free, in part because the LoCI project has received federal funding as well as support from many volunteers who host storage nodes;
  • LoRS is available in graphical and command-line form for Windows PCs and Mac OS X boxes, and as source to build on Solaris, Linux and other Unix systems; and
  • LoRS uses encryption to protect your data from storage depot operators (and storage depot operators from your data); protection against arbitrary loss of a node, or of data on a node, comes from the fact that you will typically make redundant copies of the same data (stored in chunks) on a variety of different storage nodes, so that if one node goes down you can get a copy of the missing chunks from a different node.

    All this sounds pretty cool, doesn't it?

    Having said that, I won't kid you: installing LoRS may take a little noodling around depending on what you've already got installed (you'll need to download and install Tcl/Tk and Perl, for example, if you don't already have them on your system); it is not yet a download-click-and-go operation (but neither does it typically require expert knowledge to do an install). I'd put it at the level where it helps to be technically minded, but you don't need to be a hard-core geek (or even a minor league geek) to make it work.

    As you begin working with LoRS, you'll need to absorb some new concepts, such as:

  • The idea of an exNode (an XML-formatted file with pointers to the chunks of your dataset that have been written to storage depots by LoRS); or the fact that . . .
  • Soft (gratis, as-available) storage allocations which are made by LoRS are of limited duration (a day by default), sort of like a giant temp or scratch disk on some Unix system. If you need to, you can request that your storage allocation be refreshed for additional time, subject to space availability and provided your storage allocations haven't already expired.
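    To make the exNode idea and the redundant-chunk idea a little more concrete, here is a toy sketch in Python. It is not the real LoRS software or the real exNode schema: the element names (exnode, mapping, replica), the upload and download functions, and the in-memory dictionaries standing in for storage depots are all hypothetical, and encryption and allocation expiry are left out entirely.

```python
import xml.etree.ElementTree as ET

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def upload(data: bytes, depots: dict, copies: int = 2) -> str:
    """Split data into chunks, store each chunk on `copies` depots,
    and return an XML document of pointers (a stand-in for an exNode)."""
    names = sorted(depots)
    root = ET.Element("exnode", length=str(len(data)))
    for index, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk = data[offset:offset + CHUNK_SIZE]
        mapping = ET.SubElement(root, "mapping", offset=str(offset))
        for k in range(copies):
            # Rotate through the depots so copies of one chunk land on
            # different depots (assumes copies <= number of depots).
            depot = names[(index + k) % len(names)]
            depots[depot][offset] = chunk
            ET.SubElement(mapping, "replica", depot=depot)
    return ET.tostring(root, encoding="unicode")


def download(exnode_xml: str, depots: dict) -> bytes:
    """Rebuild the file from the pointers, trying the next replica
    whenever a depot (or the chunk on it) is no longer available."""
    root = ET.fromstring(exnode_xml)
    out = bytearray()
    for mapping in root.findall("mapping"):
        offset = int(mapping.get("offset"))
        for replica in mapping.findall("replica"):
            store = depots.get(replica.get("depot"))
            if store is not None and offset in store:
                out += store[offset]
                break
        else:
            raise IOError(f"no surviving replica for offset {offset}")
    return bytes(out)


if __name__ == "__main__":
    depots = {"depot-a": {}, "depot-b": {}, "depot-c": {}}
    exnode = upload(b"large physics dataset", depots)
    del depots["depot-b"]            # simulate one depot going down
    print(download(exnode, depots))  # the file still reassembles
```

    The point of the sketch is only that the file you keep locally is a small set of pointers, while the data itself lives in chunks on several depots, so losing any one depot does not lose the file.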

    Getting Started with LoRS

    If you'd like to try LoRS, and I'd encourage you to do so, you should begin by checking out the LoRS section of the LoCI Web site. Download the documentation (yes, do RTFM, because in this case the manual is quite good), then download and install the LoRS software. Try uploading and downloading a sample file or two. See what you think - I suspect you'll be as impressed as I was.

    After you've worked with LoRS for a bit, assuming you believe it to be useful, consider putting up a shared storage depot at your local site. Pricewatch.com (and comparable computer-part-price-checking Web sites) show 250GB EIDE drives for under $200, which makes it pretty darn cheap to assemble a few terabytes to donate to the collective effort.

    Of course, before offering any potentially network-intensive service of this sort, be sure to discuss your plans with your local networking folks to ensure that any bandwidth-related issues receive proper consideration, and that you've got campus support for the new application you're deploying.

    Well . . . It's new! It's cool! It solves a big need! It's free (for now)! And it has a good instruction manual. Woo-hoo! I'll take Joe's opinion that you don't even have to be "a minor league geek" to play around with this new stuff with the proverbial grain of salt, but I'll bet quite a few of you will be "LoRSing around" with it soon.
