Fedorazon How to Guides

From DigiRepWiki


The Fedorazon Project


Setting up an instance (what you are getting yourself into)

It takes about twenty minutes to launch a repository instance (an Amazon Machine Image[1]), and that includes signing up for an Amazon account (with a credit card) to get your private pass keys. Yes, that's right: you can have a repository up and running in twenty minutes if you simply copy the screencast below, and faster still if you know what you are doing. And I'm serious when I say that *anyone*, including your Grandma, could turn this thing on.

  • The beauty of starting a repository instance on AWS is that it doesn't care whether you want to use EPrints, DSpace or Fedora for your repository; in fact you can run all three at the same time if you like. The set-up process is exactly the same for all three: you just need to know the AMI catalogue number [2] of the software you want to use. You can also start up different versions of Fedora (version 3.0 versus 3.0.1) if you find the need. And by utilising EBS/S3 you can start up a new version of Fedora (say 3.1 beta) while still running the older version, with each reading from the same S3 store. This way you can launch multiple test instances of any one of the repositories without having to take down your old version. Or say you are feeling a bit nostalgic and want to launch an old version just because you can (or rather because the old version had a tool or API that the new one doesn't). All of this and more is possible using AWS.
  • Be forewarned: it costs around $80 per month if you leave the repository running on EC2[3] all the time (about $1000 per year). That is actually cheap in comparison to what your local IT department is paying to maintain your local server (cost analysis figures coming soon). Then again, if you just want to have a quick play, EC2 charges are incurred per hour of usage, so playing with the thing for an hour will cost you 10 cents. Really cheap, especially for the powerful server you are getting [4].
  • If you are looking to use one of these instances for your official (or rather "Enterprise") repository rather than just having a play, then you should consider using Amazon's Elastic Block Store (EBS[5]). This will enable the system to create snapshots of the database running on EC2. However, this will cost you more, as you will also pay for storing the data along with transfer fees (again, cost analysis coming).
  • The final cost you could incur comes from storing the data in another location, known as "replication", which is a smart step if you want your repository to be preserved in the long term. AWS offers S3[6], a cheap storage option where data can be backed up. See the cost analysis reports [7].
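To make the running costs above concrete, here is a rough sketch of the arithmetic. It assumes the only charge is the 10 cents per hour quoted above (an illustrative figure from this guide, not a current AWS price) and ignores EBS, S3 and transfer fees; the function name is made up for this example.

```python
# Illustrative figures only: the hourly rate is the "10 cents" quoted in this
# guide, not a current AWS price. Storage and transfer fees are not included.
HOURLY_RATE_USD = 0.10
HOURS_PER_DAY = 24


def running_cost(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Return the EC2 charge in dollars for a number of instance-hours."""
    return round(hours * rate, 2)


one_hour = running_cost(1)
one_month = running_cost(HOURS_PER_DAY * 30)   # a 30-day month, always on
one_year = running_cost(HOURS_PER_DAY * 365)   # a full year, always on

print(f"one hour:  ${one_hour:.2f}")
print(f"one month: ${one_month:.2f}")
print(f"one year:  ${one_year:.2f}")
```

At that rate the compute charge alone works out to $72 for a 30-day month and $876 for a year, a little under the rounded figures quoted above.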

Step-by-step setting up a Repository on AWS (EC2, EBS and S3)

We hope to have screencasts and step-by-step instructions for all three repositories including launching them and controlling them via a local computer (e.g. your laptop) be that Windows, Apple or Linux (xubuntu).

Using Windows as your command console

What you'll need:

  • either an XP or Vista Windows operating system
  • internet connection
  • an Amazon.com username and password
  • a credit card
  • an SSH client (don't worry about this one we'll show you how to get and use this)

Steps for Set-up:

This is draft documentation, we'll be improving this in days to come...

  1. Watch this screencast [8]
  2. Follow along with the above screencast using this AWS documentation [9]
  3. Use this documentation for specific Fedora AMI set-up [10]
    1. Fedora Public AMI for EC2 is ami-7f38dd16
    2. EPrints Public AMI for EC2 is ami-a7a84cce
    3. DSpace Public AMI coming soon...
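For readers who would rather script the launch than type it by hand, the following is a minimal sketch only, not the method shown in the screencast. It assumes the third-party boto3 library and configured AWS credentials, neither of which this guide covers. The key pair name "my-keypair" and the instance type "m1.small" are placeholders invented for illustration, and the AMI catalogue numbers are the ones listed above, which may no longer be published.

```python
# AMI catalogue numbers as listed in the steps above. They date from when this
# guide was written and may no longer be published.
PUBLIC_AMIS = {
    "fedora": "ami-7f38dd16",
    "eprints": "ami-a7a84cce",
}


def build_launch_request(repository: str, key_name: str,
                         instance_type: str = "m1.small") -> dict:
    """Build the keyword arguments for a single-instance launch request.

    key_name and instance_type are assumptions for illustration; use the key
    pair you created and whatever instance size suits you.
    """
    try:
        image_id = PUBLIC_AMIS[repository.lower()]
    except KeyError:
        raise ValueError(f"no public AMI listed for {repository!r}") from None
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch(repository: str, key_name: str) -> str:
    """Launch one instance and return its instance ID (needs AWS credentials)."""
    import boto3  # third-party AWS SDK; imported here so the rest runs without it

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(**build_launch_request(repository, key_name))
    return response["Instances"][0]["InstanceId"]


# Building the request does not contact AWS, so it is safe to try out.
request = build_launch_request("Fedora", key_name="my-keypair")
print(request["ImageId"])
```

Only launch() talks to Amazon (and starts the hourly charges); build_launch_request() just looks up the catalogue number, so you can check what would be sent before spending anything.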

Using Apple as your command console

What you'll need


Steps for Set-up:

  1. screencast and detailed instructions coming soon
  2. to hold you over, go here [11]

Using XUbuntu (Linux) as your command console

Do you have an old computer you don't use anymore? Why not use it as the control console for your repository? Set up XUbuntu on it (you'll be surprised how well it works) and you'll be using an old computer to control a very powerful server!


What is AWS, AMI, EC2, EBS and S3?

Acronym Soup Explained (AWS's AMI, EC2, EBS and S3):

  • AWS = Amazon Web Services is an additional department within Amazon.com (yes the bookseller and web store extraordinaire) that offers various computational services for hire, including the following. For a full listing of services on offer from AWS go here [12]
    • EC2 = Elastic Compute Cloud is basically servers for hire; these particular servers are for hosting web software or applications. A comparable offering from another company is Google's App Engine [13].
      • AMI = An Amazon Machine Image is the pre-configured server environment that runs the selected software on EC2. For example, for the Fedora Commons repository software we are using Tomcat and MySQL. In short, you really don't need to care about the database and operating system (if you really do, give us a shout, glad to discuss); you should be focusing on which repository software is best for your task, or even on those things called "users".
    • EBS = Elastic Block Store is a service that enables the back-up of your EC2 software to a storage layer. EBS is for making sure that copies of EC2 are kept somewhere else in case EC2 goes bonk!
    • S3 = Simple Storage Service is where the information from EC2 is stored separately (this is where EBS puts the back-ups). S3 is like the basement archives in your library. FAQ on S3 [14]
      • Additional services are available too, such as Elastic IP (EIP), which gives an instance a fixed public IP address.
      • Please note: Amazon is bringing out new services all the time that provide solutions we didn't even know we needed yet, so look on their site for up-to-the-minute news [15].

If you require further explanation please do read on...

Amazon Web Services for Dummies

Since these three "services" (EC2, EBS and S3) run in the so-called "Cloud" (another word for 'Web'), they are sometimes hard for people in the physical, object-based world to understand. Accordingly, let's pretend that each of the three (EC2, EBS and S3) is a server (a "box") sitting at Amazon's offices (keep in mind that Amazon actually has "offices", or rather Tier 3 server facilities, all over the world).

Imagine that these three boxes are stacked upon one another. The EC2 box is on top, with the EBS box underneath it and the S3 box at the bottom of the stack.

The job of the box on top (EC2) is to host the software; in our case it is running the repository software (EPrints, DSpace or Fedora). When users want to access resources from the "repository" (and keep in mind I am now referring to the 'repository' as the entire stack of boxes hosting it, also known as "the instance"), they are requesting an identifier from EC2.

The job of EC2 is to provide identifiers, in collaboration with the software it is hosting, to the user. The more users access ("hit") the EC2 box, the busier the box becomes and the slower it gets at delivering resources (the same as with any server). However, this is not a problem for Amazon's EC2, as you can log on to your EC2 account from anywhere in the world and turn on more Amazon EC2 servers to deliver those resources; when the demand decreases you can turn those additional EC2 instances off (keep in mind that every EC2 box you turn on costs about $80 per month to run).

The job of EBS (box no.2) is to support EC2 in the hard work that it does by automatically taking back-ups of EC2 (box no.1). In this way, if EC2 gets too busy and falls over (aka goes bonk), the information that EC2 was dealing with is saved, so EC2 can be restarted and reload what it was working on via EBS.

The job of S3 (box no.3) is to act as the storage where EBS (box no.2) puts its back-ups, as well as storage for anything else EC2 might spit out. In our case S3 could also act as a long-term store for repository resources.
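The three-box picture can also be sketched as a toy program. This models only the analogy, not how AWS is actually built: the class names, method names and example identifiers are all invented for illustration.

```python
class S3Box:
    """Box no.3: long-term storage, here just a list of saved snapshots."""

    def __init__(self):
        self.snapshots = []


class EBSBox:
    """Box no.2: takes back-ups of EC2's state and puts them in S3."""

    def __init__(self, store: S3Box):
        self.store = store

    def snapshot(self, state: dict) -> None:
        self.store.snapshots.append(dict(state))  # copy, so later changes don't leak in

    def latest(self) -> dict:
        return dict(self.store.snapshots[-1]) if self.store.snapshots else {}


class EC2Box:
    """Box no.1: runs the repository software and holds its working state."""

    def __init__(self, backup: EBSBox):
        self.backup = backup
        self.state = {}

    def deposit(self, identifier: str, resource: str) -> None:
        self.state[identifier] = resource

    def crash(self) -> None:
        self.state = {}  # "goes bonk": everything it was working on is lost

    def restart(self) -> None:
        self.state = self.backup.latest()  # reload from the last back-up


s3 = S3Box()
ebs = EBSBox(s3)
ec2 = EC2Box(ebs)

ec2.deposit("demo:1", "thesis.pdf")
ebs.snapshot(ec2.state)                # back-up taken
ec2.deposit("demo:2", "dataset.csv")   # deposited after the back-up
ec2.crash()
ec2.restart()
print(ec2.state)
```

After the restart only the first deposit comes back, because the second one arrived after the last back-up. How often back-ups are taken therefore decides how much work a "bonk" can cost you.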

By componentising the services within a repository instance (the stack of boxes), we gain more flexibility to grow over time as well as more options for how they can be used. For example, by separating out the storage (S3: box no.3) from the computing (EC2: box no.1), further preservation strategies can be put in place: EC2 can store its resources with S3 (making them available on spinning disk, where the user can access them on the web immediately), and in addition S3 could replicate its store of objects to a tape drive (via LOCKSS?) that sits on the campus of the host institution, assuring that preservation of the objects is kept within the control of the institution.