Diving deep into Amazon Web Services
- 14 August, 2008 08:43
Amazon's Web Services (AWS) are based on a simple concept: Amazon has built a globe-spanning hardware and software infrastructure that supports the company's Internet business, so why not modularize components of that infrastructure and rent them? It is akin to a large construction company in the business of building interstate highways hiring out its equipment and expertise for jobs such as putting in a side road, paving a supermarket parking lot, repairing a culvert, or just digging a backyard swimming pool.
Hooking your apps into Amazon Web Services
More specifically, AWS makes various chunks of Amazon's business machinery accessible and usable via REST or SOAP-based Web service calls. Those chunks can be virtual computer systems with 2GHz processors and 2GB of RAM, storage systems capable of holding terabytes of data, databases, payment management systems, order tracking systems, virtual storefront systems, combinations of all the above, and more. And when I say "usable," I really mean "rentable." You pay only for the services (and their accompanying resources) that you use.
This is a key point. You can employ an army of virtual machines, store terabytes of data, or establish an Internet-wide message queue, and you will only pay Amazon for the resources you consume. So if your business needs a cluster of CPUs and several hundred gigabytes of storage to be available, say, every Wednesday for weekly processing, you don't have to keep a room full of servers sitting idly around six days a week. You can use AWS. Therefore, AWS is particularly attractive for business systems with intermittent or transient processing needs.
Nor are the costs unreasonable. For example, storage of 100GB for a month will cost you US$15 (at 15 cents per gigabyte per month), not counting 10 cents per gigabyte transferred in. (The Amazon Web Services site provides an online AWS Simple Monthly Calculator, for tallying your monthly costs of using any combination of offered services.)
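The arithmetic behind that figure is easy to check. The sketch below uses only the two rates quoted above (15 cents per gigabyte per month for storage, 10 cents per gigabyte transferred in); the function name is made up for illustration, and it is no substitute for Amazon's own calculator, which covers many more charges.

```python
# Rates quoted in the article, in US cents. Real AWS prices change over time.
STORAGE_CENTS_PER_GB_MONTH = 15
TRANSFER_IN_CENTS_PER_GB = 10


def monthly_storage_cost_cents(stored_gb, transferred_in_gb=0):
    """Return one month's cost in cents for storage plus inbound transfer."""
    storage = stored_gb * STORAGE_CENTS_PER_GB_MONTH
    transfer = transferred_in_gb * TRANSFER_IN_CENTS_PER_GB
    return storage + transfer


if __name__ == "__main__":
    # 100GB stored for a month, ignoring transfer: 1500 cents (US$15).
    print(monthly_storage_cost_cents(100))
    # The same 100GB, counting the cost of uploading it once.
    print(monthly_storage_cost_cents(100, transferred_in_gb=100))
```

Working in whole cents keeps the sums exact, which matters once many small charges are added together.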
As hinted above, the kinds of services range from hardware (albeit virtual) to processes. The services fall into three categories: infrastructure services, e-commerce services, and Web information services.
Investigating infrastructure services
The infrastructure services are composed of the Elastic Compute Cloud (EC2), which supplies virtual machines on demand; Simple Storage Service (S3), a persistent storage system; the Simple Database (SimpleDB), which implements a remotely accessible database; and Amazon's Simple Queue Service (SQS), a message queue service and the agent for binding distributed applications formed from the combination of EC2, S3, and SimpleDB.
These services provide virtually limitless compute, storage, and communication facilities. They're ideally suited for what might be called "intermittent" applications: those that require substantial compute or storage facilities on an irregular basis (for example, an application that wakes up Friday evening to process data gathered during the week). An application that requires worldwide connectivity -- say, a system that processes graphics files and makes the results available to clients across the Internet -- can also make good use of infrastructure services. Finally, these services act as excellent proof-of-concept laboratories for large-scale distributed applications. A development house seeking to demonstrate the feasibility of a proposed enterprise-wide application can implement a prototype using the infrastructure services, and avoid hardware costs that, if the prototype is deemed unworkable, would be a net loss.
Elastic Compute Cloud (EC2): Imagine a vast room filled with server systems, all networked together. Sitting at your single workstation, you create a virtual machine image that defines a 1.2GHz processor running Linux with 1.7GB of RAM and a 160GB hard disk, pre-loaded with software you have crafted specifically to number-crunch a large matrix of mined data. You deploy this image to an outside service, which manages those servers. At some future point, a boatload of matrices arrives from your data-mining operations. You instruct the service to instantiate 50 of your virtual machines, and turn each loose on one of the data matrices. Within a few seconds, 50 of those 1.2GHz processors are active and chomping on your data. They finish, deposit their results at a pre-specified storage site, and disappear.
That's EC2 in a nutshell. It's nothing less than a boundless collection of virtual computers that a user can call into existence to perform some processing task. "Boundless," however, does not mean "infinite"; rather, there is no specific upper limit -- other than your wallet. Amazon's documentation states that you can commission "hundreds, or even thousands" of virtual machines simultaneously.
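The fan-out pattern in that scenario can be sketched locally. The code below is only an analogy: it makes no EC2 calls, a thread pool stands in for the 50 virtual machines, a dictionary stands in for the pre-specified storage site, and the names crunch and run_batch are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def crunch(matrix):
    """Stand-in for the pre-loaded number-crunching software: sum a matrix."""
    return sum(sum(row) for row in matrix)


def run_batch(matrices, instance_count=50):
    """Give each matrix to its own worker and collect the results."""
    results_store = {}  # stand-in for the pre-specified storage site
    with ThreadPoolExecutor(max_workers=instance_count) as pool:
        # pool.map preserves input order, so index i matches matrix i.
        for index, result in enumerate(pool.map(crunch, matrices)):
            results_store[f"result-{index}"] = result
    # Leaving the with-block shuts the workers down: they "disappear".
    return results_store


if __name__ == "__main__":
    matrices = [[[i, i], [i, i]] for i in range(50)]
    store = run_batch(matrices)
    print(len(store), store["result-3"])
```

The point of the pattern is that the workers exist only for the duration of the batch, which is what makes pay-per-use pricing attractive for this kind of job.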
Because systems in EC2 are virtual, Amazon provides a range of hardware capabilities. At the low end, you can call for a 1.2GHz Opteron-class machine with 1.7GB of RAM. At the high end (at the time of this writing), you can request a 64-bit multicore system with 15GB of RAM. These specifications are approximations. Virtual machines that you instantiate are rated in EC2 Compute Units (ECUs), which Amazon defines as being equivalent to a 1.0GHz to 1.2GHz 2007 Opteron processor. (The company suggests you do your own benchmarking to determine which instance is best for your particular application.)
An Amazon Machine Image (AMI) consists of an operating system and whatever applications you want pre-loaded when the virtual machine is started. Currently, only Linux is available as an EC2 instance's OS, though this is hardly a limitation. There are quite a few distributions in Amazon's catalog of prebuilt AMIs. Perusing the list, I found ready-to-use AMIs for Ubuntu, OpenSolaris, CentOS, Fedora, and many others -- all told, more than 100 AMIs ready to go. You can build your own AMI using a free Amazon-provided SDK, but the process is lengthy. It is far easier to select a prebuilt AMI from the catalog, and customize it as necessary. Indeed, many available AMIs include software for specific applications; you may well find one that already has much of what you need.
Simple Storage Service (S3): Amazon's Simple Storage Service (S3) is effectively a large disk drive in the ether. Strictly speaking, that's 90 percent of everything you need to know about it. It has no directories and no file names -- just a big place where you can store and fetch unstructured data in gobs as small as 1 byte or as big as 5GB.
What I call a "gob," S3 calls an "object," and in place of "directory," S3 says "bucket." So when you store a 200KB JPEG on S3, you're putting a 200KB object in a bucket. A given AWS account can own up to 100 buckets. A bucket can hold an unlimited number of gobs, and it can be configured to reside either in the United States or Europe. Presumably, this provides users a comforting feeling of locality, because buckets are available anywhere on the Internet that Amazon is accessible. Cost differences between the two are tiny; a bucket in Europe will run you something like two-thousandths of a cent more per 1,000 requests than in the United States.
Digging a bit deeper, you can think of an object as a three-in-one entity: key, value, and metadata. The key is the object's name, value is its content, and metadata is an array of key/value pairs carrying information about the object. (Access permissions are also associated with an object, but are treated as separate from object storage.) Names matter most at the bucket level: a bucket's name can be between 3 and 255 characters, and the main constraint that Amazon places on it is that it not confuse URL parsing. Thus, a bucket with a name of "192.168.12.12" is a bad idea.
Whereas the architecture of S3 is effectively a flat file system, S3's APIs permit a clever programmer to build apparent subdirectories within a bucket. The hierarchies have to be encoded in the object names, which is less than ideal; however, it's an artifact that code could simply mask. So, if you want one directory of animals and another of vegetables, you might have object keys such as "animal-cat," "animal-dog," "vegetable-beet," and "vegetable-carrot." Using the prefix parameter of the List operation, you can restrict retrieved object keys to only animals or only vegetables. More complicated data structures should be kept in Amazon's Simple Database.
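To make the prefix trick concrete, here is a small in-memory sketch. It does not talk to S3: the bucket is a plain dictionary, and list_keys is a hypothetical helper that mimics only the filtering effect of the List operation's prefix parameter.

```python
def list_keys(bucket, prefix=""):
    """Return the keys in a flat bucket that start with the given prefix."""
    return sorted(key for key in bucket if key.startswith(prefix))


# A flat "bucket": keys map straight to object contents, with no directories.
bucket = {
    "animal-cat": b"meow",
    "animal-dog": b"woof",
    "vegetable-beet": b"red",
    "vegetable-carrot": b"orange",
}

if __name__ == "__main__":
    print(list_keys(bucket, prefix="animal-"))
    print(list_keys(bucket, prefix="vegetable-"))
```

Code that wraps calls like this can present "animal" and "vegetable" as if they were folders, which is the masking described above.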
Amazon Simple Database Service (SimpleDB): While Amazon S3 is designed for large, unstructured blocks of data, SimpleDB is built for complex, structured data. As with the other services, the name says it all. SimpleDB implements a database that sits behind a lightweight, easily mastered query language that nonetheless supports most of the database operations (searching, fetching, inserting, and deleting) you'll likely need. In keeping SimpleDB simple, Amazon has followed the principle that the best APIs are those with minimal entry points: I count seven for SimpleDB.
A SimpleDB database is not exactly like a relational database of the Oracle or MySQL sort. (Amazon's documentation points out that, if you do need a full-blown relational database, you are free to run a MySQL server on an AMI in the elastic compute cloud.) A SimpleDB database (a "domain" in SimpleDB parlance) is composed of items, and items are composed of attributes. An attribute is a name/value pair. At a minimum, an item must have an ItemName attribute, which serves as the item's unique identifier. When you issue a query, the result is a collection of ItemName values -- to fetch the actual content of the item (the attributes), you perform a Get operation using those values as input.
As simple as it is, SimpleDB packs surprising capabilities. A SimpleDB database can grow up to 10GB and house up to 250 million attributes. You can define up to 256 attributes for a given item, and there is no requirement that all the items in a domain have the same attributes. In addition, a given attribute can have multiple values, so a customer database could store multiple aliases in a single CustomerName attribute.
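The domain/item/attribute model is easier to see in code. What follows is a toy, in-memory imitation of the data model only; the class and method names (Domain, put, query, get) are invented for illustration and are not the SimpleDB API, and none of the size limits are enforced.

```python
class Domain:
    """A toy domain: items hold attributes, and attributes may be multi-valued."""

    def __init__(self):
        # item name -> {attribute name -> set of values}
        self._items = {}

    def put(self, item_name, attribute, value):
        """Add one value to an attribute, creating the item if needed."""
        attributes = self._items.setdefault(item_name, {})
        attributes.setdefault(attribute, set()).add(value)

    def query(self, attribute, value):
        """Return only the names of matching items, as a SimpleDB query does."""
        return sorted(
            name
            for name, attributes in self._items.items()
            if value in attributes.get(attribute, ())
        )

    def get(self, item_name):
        """Fetch an item's attributes using a name returned by query()."""
        return {
            attribute: sorted(values)
            for attribute, values in self._items[item_name].items()
        }


if __name__ == "__main__":
    customers = Domain()
    customers.put("cust-001", "CustomerName", "Robert Smith")
    customers.put("cust-001", "CustomerName", "Bob Smith")  # an alias
    customers.put("cust-002", "CustomerName", "Alice Jones")
    customers.put("cust-002", "City", "Seattle")  # items need not share attributes
    for name in customers.query("CustomerName", "Bob Smith"):
        print(name, customers.get(name))
```

Note the two-step shape: the query yields item names, and a separate fetch retrieves the attributes.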
Finally, SimpleDB is designed to support "real-time" (fast turnaround) queries. To ensure quick query response, all attributes are indexed automatically as items are placed in the database. Also, Amazon's documentation indicates that a query should take no more than 5 seconds to complete; otherwise, it will likely time out. Amazon does this to ensure that a query receives a quick response, even if a query is malformed to the degree that it would hamper the calling application.
Amazon Simple Queue Service (SQS): Amazon SQS is a message queuing service in the vein of JMS or MQSeries -- only simpler. SQS's most impressive characteristic is its ubiquity. A blurb from Amazon's documentation reads: "Any computer on the Internet can add or read messages without any installed software or special firewall configurations." The most likely participants in SQS message transactions are, of course, instantiated AMIs in the EC2.
As with other Amazon Web services, SQS earns its name: Messages are text-only, and must be less than 8KB in length. You can build a working queue with only four functions: CreateQueue, SendMessage, ReceiveMessage, and DeleteMessage. (There are other convenience functions; ListQueues, for example, will list an account's existing queues.)
SQS queues are designed primarily to support workflows among distributed computer systems, and as such, concurrency management and fail-over are implicit. When a client reads a message from a queue, that message is not deleted; it is simply locked in such a fashion that it becomes invisible to other clients. In that way, if the message represents a specific task to perform as part of a workflow, two clients cannot read the same message and, thereby, duplicate effort. However, if the message is not deleted before a specified timeout, the lock is released. The intent, then, is for the original reader of the message to delete it when the specified work is complete. If the original reader is unable to complete the work (perhaps on account of a system crash), the timeout expires, the message "reappears" in the queue, and a different client can read the message and undertake the specified work.
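The lock-and-timeout behavior can be imitated in a few lines. This is a local simulation, not an SQS client: the method names merely echo the operations mentioned above, the 30-second timeout is an arbitrary assumption, and the clock is passed in explicitly so the example runs without waiting.

```python
import itertools


class ToyQueue:
    """In-memory imitation of a queue with a visibility timeout."""

    MAX_BYTES = 8 * 1024  # the article's limit: messages must be under 8KB

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time it becomes visible]
        self._ids = itertools.count(1)

    def send_message(self, body, now):
        if len(body.encode("utf-8")) >= self.MAX_BYTES:
            raise ValueError("message must be smaller than 8KB")
        message_id = next(self._ids)
        self._messages[message_id] = [body, now]
        return message_id

    def receive_message(self, now):
        """Return (id, body) of a visible message and hide it, or None."""
        for message_id, entry in self._messages.items():
            body, visible_at = entry
            if visible_at <= now:
                # Lock the message: it stays invisible until the timeout.
                entry[1] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete_message(self, message_id):
        """Called by the reader once the work is finished."""
        self._messages.pop(message_id, None)


if __name__ == "__main__":
    queue = ToyQueue(visibility_timeout=30.0)
    queue.send_message("process batch 7", now=0.0)
    print(queue.receive_message(now=1.0))   # first reader gets the task
    print(queue.receive_message(now=5.0))   # second reader sees nothing
    print(queue.receive_message(now=31.0))  # first reader crashed; task reappears
```

Deleting only after the work succeeds is what gives the queue its implicit fail-over.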
Entering E-Commerce Service
Some of the facilities in Amazon's E-Commerce Service turn you (or, rather, your Web site) into a reseller for Amazon's merchandise. Others turn Amazon into a reseller for your merchandise. Still others let you employ the same payment authentication and collection system that Amazon uses. In short, the Amazon E-Commerce Service enables an organization to tap into the e-business facilities that Amazon uses 24/7 to operate its own affairs.
Amazon Flexible Payment Service (FPS): Amazon's FPS lets users tap into the company's existing payments collection infrastructure (for a fee, of course). The idea of FPS is particularly attractive when you read that it will "take on the complexity of managing security and fraud protection" so that you don't have to.
Two aspects of FPS are especially interesting. First, it supports micropayments, those that involve cents -- or even fractional cents. This is useful when business activities involve piles and piles of transactions, each having little monetary value, but the sum of which has measurable value. Imagine selling bubblegum for 10 cents. That doesn't seem like much -- unless you're selling, say, 100,000 pieces a month. Amazon FPS lets you aggregate micropayments into a single transaction, thus eliminating the problem of transaction costs swamping whatever profits the transactions involve.
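A little arithmetic shows why aggregation matters. The price and volume below come from the bubblegum example; the flat 5-cent fee per transaction is purely an assumption chosen for illustration and is not an Amazon FPS rate, and the function name is likewise invented.

```python
def total_fees_cents(sale_count, fee_per_transaction_cents, sales_per_transaction=1):
    """Fees paid when sales are grouped into transactions of a given size."""
    # Ceiling division: a partly filled final batch still costs one transaction.
    transactions = -(-sale_count // sales_per_transaction)
    return transactions * fee_per_transaction_cents


PRICE_CENTS = 10            # bubblegum price from the article
SALES_PER_MONTH = 100_000   # monthly volume from the article
HYPOTHETICAL_FEE_CENTS = 5  # assumed flat fee per transaction, NOT a real rate

revenue_cents = PRICE_CENTS * SALES_PER_MONTH

if __name__ == "__main__":
    print("revenue:", revenue_cents)
    print("fees, one transaction per sale:",
          total_fees_cents(SALES_PER_MONTH, HYPOTHETICAL_FEE_CENTS))
    print("fees, 100 sales per transaction:",
          total_fees_cents(SALES_PER_MONTH, HYPOTHETICAL_FEE_CENTS, 100))
```

Under that assumed fee, charging per sale would consume half the revenue, while batching 100 sales per transaction cuts the fees by a factor of 100.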
FPS's other interesting aspect is its support for "middleman" operations. That is, you can facilitate a transaction in which you participate neither as a sender (buyer) nor as a recipient (seller). You can, however, take a cut of the action.
There are two ways to employ FPS in your Web application: using an Amazon-supplied "widget" (of which there are two), or hard-coding an interface. The two available widgets are Pay Now and Marketplace (both designed to be easily added to a Web site's UI).
Amazon has automated the creation of Pay Now widgets. Connect to the online Pay Now Widgets Implementation Guide, and it walks you through the process of building a widget by prompting for various parameters (for example, the destination URL after the payment has been placed), then generates the HTML that you cut and paste into your Web site's code. The Marketplace widget lets you act as a third party between buyer and seller. In essence, it turns you into an instant reseller. You can use a Marketplace widget to let sellers do business on your Web site and pay you for the privilege.
The hard-coded approach is more difficult, but more flexible, as it enables any application that can communicate with a Web service to tap into FPS. You have to express the parameters and processes for payment transactions in a specialized mini-language called Gatekeeper. Once you've done that, you install those instructions into the Amazon FPS, which returns a token that is essentially a handle to the Gatekeeper code. Future transactions that employ that token are shepherded by your Gatekeeper program. Details for this process can be found in the online Amazon FPS Technical Documentation.
Amazon DevPay: Suppose you've written an amazing application that runs in Amazon EC2. You're convinced that people would be willing to pay you to use your application. Enter the Amazon DevPay Service.
Amazon DevPay is built on the same payment management infrastructure as Amazon FPS. But DevPay -- as its name attests -- is designed specifically to let developers charge for the use of their EC2- or S3-based applications.
Interaction with DevPay takes place via tokens (unique identifiers). One token identifies your application; the other identifies a specific user allowed to employ your application. The first, the product token, is generated by Amazon when you register your product with DevPay. That token, combined with a user's activation key (created when the user signs up with AWS), is used during product installation to generate credentials that include the second token, the user token. Your product embeds these tokens in service calls it makes to AWS, and in that way, DevPay tracks your application's usage by a given customer.
When you register your application with DevPay, you establish how your application is priced. Users can be billed on a metered (pay for what they use) basis, they can be charged monthly, or they can pay a one-time up-front fee. Of course, you have to be careful how you structure your billing. While your clients pay you for the use of your application, you must pay Amazon for the use of its services. So, at the very least, you have to make sure that your customers pay you more than you pay Amazon. Unfortunately, Amazon does not provide a sandbox for testing your application's integration with DevPay, so you have to do your testing with real money. Fortunately, the cost of Amazon services is low enough that this is not a substantial problem.
Amazon Associates and Amazon Fulfillment Web Service (FWS): Anyone who has clicked through a site to order something from Amazon has used Amazon Associates: It's the service that lets you sell Amazon stuff from your Web site. You get a percentage -- a referral fee -- for each sale. There is not much more to be said about Amazon Associates.
A more interesting Amazon e-commerce service, however, is a remarkable kind of inverse of Amazon Associates: Amazon Fulfillment Web Service. With FWS, instead of your selling Amazon stuff, Amazon sells your stuff. Not only that, but Amazon will also warehouse, package, and ship your stuff.
FWS is actually two Web services: inbound and outbound. You use the inbound service to inform Amazon of incoming shipments bound for its warehouse. When a customer orders one of your products, you use the outbound service to inform Amazon of the sale. Based on the details of the order, Amazon packages and ships the product, and even provides tracking information that you and your customer can use to monitor the shipment's status.
Of course, there are warehousing and handling fees involved, but it's a compelling model. A small company, unable to afford warehousing and shipping costs, can "virtualize" those components with Amazon FWS, until that company is large enough to provide them for itself. And any developer interested in exploring the mechanics of the inbound and outbound services will be happy to discover that Amazon has provided "scratchpad" applications -- tools that let you exercise simulations of the services.
Mechanical Turk: Amazon's Mechanical Turk is a peculiar service. (It is difficult to categorize; I have listed it with the other e-commerce services.) Its name comes from the famous 18th-century robotic chess player invented by Wolfgang von Kempelen. The robot, however, was no robot; inside the machine was a human chess player who operated the mechanism, unbeknownst to the human opponent. The idea of Mechanical Turk, then, is an automated front end, behind whose machinery hides a human.
Only, in this case, it's not just one human; there are lots. Whereas EC2 provides an elastic cloud of computers, Mechanical Turk provides an elastic cloud of humans. But this analogy goes only so far; the computers in EC2 are virtual, the humans of Mechanical Turk are not.
Here's how it works. Suppose you have a big pile of identical tasks that must be performed by humans. Perhaps you have a large quantity of text files that must be translated from one language to another. In the world of Mechanical Turk, you are a requester; you submit your tasks to the Mechanical Turk service, which places them on a kind of global bulletin board. Using that same service, workers log onto this bulletin board, select tasks, perform them, and post the results back to the service. Later you return to the Mechanical Turk, review the posted results, select those that are acceptable, and release funds to pay the workers. In short, the Mechanical Turk service is a middleman between employers and employees.
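The requester/worker cycle can be modeled in miniature. The sketch below is a local toy, not the Mechanical Turk API: BulletinBoard, Task, and every method name are hypothetical, and "payment" is just a running tally in cents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    description: str
    reward_cents: int
    worker: Optional[str] = None
    result: Optional[str] = None
    approved: bool = False


class BulletinBoard:
    """Toy middleman between a requester and workers."""

    def __init__(self):
        self.tasks = []
        self.earnings_cents = {}  # worker name -> total paid out

    def post(self, description, reward_cents):
        """Requester submits a task; returns its id."""
        self.tasks.append(Task(description, reward_cents))
        return len(self.tasks) - 1

    def open_tasks(self):
        """Ids of tasks that no worker has completed yet."""
        return [i for i, task in enumerate(self.tasks) if task.result is None]

    def submit(self, task_id, worker, result):
        """Worker posts a result back to the board."""
        task = self.tasks[task_id]
        task.worker, task.result = worker, result

    def approve(self, task_id):
        """Requester accepts a result, which releases the payment."""
        task = self.tasks[task_id]
        if task.result is None or task.approved:
            raise ValueError("no unapproved result for this task")
        task.approved = True
        paid = self.earnings_cents.get(task.worker, 0)
        self.earnings_cents[task.worker] = paid + task.reward_cents


if __name__ == "__main__":
    board = BulletinBoard()
    task_id = board.post("Translate file-001.txt into French", reward_cents=3)
    board.submit(task_id, worker="worker-a", result="Bonjour ...")
    board.approve(task_id)
    print(board.earnings_cents)
```

Payment is tied to approval rather than submission, which is why the requester's review step sits between the two.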
When I first read Mechanical Turk's description, I thought it was a great idea. It may yet be, but if my perusal of the tasks that are available is any indication, this is not a way to make any appreciable amount of money. Most of the HITs ("Human Intelligence Tasks," each referring to a unit of work) posted paid mere pennies, and reading some of the descriptions gave me the uneasy feeling that workers would be used as human spam-bots.
It is possible that, in the future, Mechanical Turk will become a marketplace of decent work for reasonable money. For now, though, I am confident that I can make more money in less time -- and do more good -- by mowing the old lady's lawn next door.
Wading into Web Information Services
Amazon's Web Information Services are essentially query interfaces into extensive databases generated by a mixture of Web crawlers and Web traffic monitors. Data-mining organizations can tap into the crawler-produced data to sift through information that is as wide-ranging as the Web itself. The utility of Web traffic data is self-evident to any company or individual interested in user visitation trends to their sites -- as well as to related or competing sites.
Alexa Web Search: Amazon's Alexa Web Search is the result of a partnership between Amazon and Alexa, and it lets you query the information gathered by Alexa's Web crawler bots. The quantity of information available is difficult to gauge; Alexa has been crawling the Web for over a decade, and the Internet is in nonstop growth. Alexa's site says that, while its bots are working constantly, it takes about two months for a complete cycle through the Internet.
When Alexa adds a new Web site document to its database, it indexes about 50 attributes associated with that document. Attributes include the document's language, its Open Document Category, various parsed components of the URL, geographic location of the hosting server, and more. Also available is the document's text, the first 20KB of which is text-indexed. All this is available for searching.
Naturally, searches on such a large database can take time. The Alexa Web Search service is architected so that when you issue a search, the service returns a request ID. You use this ID to track the status of your search's progress. When the search is complete, results are stored in a (possibly gigantic) text file. The text file can be downloaded and "mined" locally.
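That submit, poll, and download sequence is a common pattern for long-running requests. The sketch below runs against a fake service defined in the same file; FakeSearchService, its method names, and the status strings are all invented for illustration and do not reflect the actual Alexa Web Search interface.

```python
class FakeSearchService:
    """Stand-in for a slow search service that finishes after a few polls."""

    def __init__(self, polls_until_done=3):
        self._polls_until_done = polls_until_done
        self._remaining = {}  # request id -> polls left before completion
        self._next_id = 1

    def start_search(self, query):
        """Accept a query and hand back a request ID immediately."""
        request_id = f"req-{self._next_id}"
        self._next_id += 1
        self._remaining[request_id] = self._polls_until_done
        return request_id

    def status(self, request_id):
        if self._remaining[request_id] > 0:
            self._remaining[request_id] -= 1
            return "IN_PROGRESS"
        return "COMPLETE"

    def download(self, request_id):
        """Return the finished results as one block of text."""
        return f"results for {request_id}\n"


def run_search(service, query, max_polls=10):
    """Start a search, poll its status by request ID, then fetch the results."""
    request_id = service.start_search(query)
    for _ in range(max_polls):
        if service.status(request_id) == "COMPLETE":
            return service.download(request_id)
        # A real client would pause here before asking again.
    raise TimeoutError(f"{request_id} did not finish within {max_polls} polls")


if __name__ == "__main__":
    print(run_search(FakeSearchService(), "pages about bubblegum"))
```

Capping the number of polls keeps a stalled search from hanging the caller indefinitely.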
The accuracy of Alexa's data is unclear. The Alexa Web site states that the "traffic data are based on the set of toolbars that use Alexa's data, which may not be representative of the global internet population." Meanwhile, an Amazon Web services representative informed me that Amazon "aggregate[s] data from multiple sources to give you a better indication of Web site popularity." In any case, the ability to scour the text content of whole swaths of the Internet makes the Alexa Web service a profitable vein for Web data spelunkers.
Ready for the big time?
Amazon's Web Services are at once exciting and troubling. The infrastructure services adopt a sort of "mercenary" model of hardware and software horsepower; in theory, you can employ as large an army of computing power as your pocketbook can withstand. All the services offer universal availability -- if your network connection can reach Amazon, it can reach AWS. These are two powerful isotopes for fueling large-scale, on-demand, software services.
On the other hand, some of the important components are still in beta. SimpleDB, in fact, was in limited beta and not accepting new users at the time of this writing. The description of "beta" is off-putting, as it implies an architecture whose foundation has not yet solidified. And this implication became hard reality when, in June, Amazon's S3 suffered a temporary power outage that affected such high-profile users as the New York Times, whose archives were crippled.
Furthermore, the long-term security of the entire AWS remains to be seen. We can only take Amazon's word that its systems guarantee isolation of one user's applications from another's. Put simply, AWS is only going to work if its users' trust in it is complete. A security breach of any sort would likely be a mortal wound.
Programmers and architects of distributed systems will find the infrastructure pages on the AWS site to be nothing short of a playground. You can spend hours perusing the documentation, tutorials, examples, and references to community-supplied tools and libraries.
The "cloud" services -- EC2, S3, SQS, and SimpleDB -- are certainly compelling. Real applications are being built atop these virtual technologies. Examples can be found at the Amazon Web Services Elastic Compute Cloud resources page.
Some of the AWS components are of questionable utility. In particular, Mechanical Turk seems to carry a built-in incentive to price tasks below what they would otherwise be worth. However, even the Turk might be a case of a technology ahead of its time. As the ability to conduct business over the Net continues to improve, perhaps Mechanical Turk will also.
Whether the notion of Amazon's "rentable infrastructure" catches on is unknown. Its failure (should it fail) will not be for lack of information and tools. I will be eagerly prowling the AWS Web site and AWS-relevant blogs to see what creations arise from the enticing techno-tinker-toy set that AWS represents.