'Tis the season to begin ramping up online shopping activity, and for retailers that means doing all they can to ensure their websites are up, highly available and able to handle peak capacity. Looming in many IT managers' minds is the cautionary tale of Target, whose website crashed twice after it was inundated by an unprecedented number of online shoppers when the retailer began selling clothing and accessories from high-end Italian fashion company Missoni.
"We are working around the clock to ensure that our site is operating efficiently and delivering an exceptional guest experience that's reflective of Target's brand," a Target spokesperson said in an email, though the spokesperson declined to give specifics on the measures the company has taken.
One company's hardship is often another company's gain, and those that face well-publicized failures tend to become de facto role models, retail industry watchers say. Take what happened to Best Buy in 2005: Its website experienced what some have called a catastrophic holiday failure, and customers were unable to make online purchases. That same year, competitor Circuit City saw a huge spike in traffic, says Dave Karow, senior product manager of Web performance and testing at Keynote, a firm that monitors and tests mobile and Internet performance.
"There's nothing like falling flat on your face to give you the conviction to do the right thing going forward. That was an extremely effective wakeup call for Best Buy," he says, adding that the retailer now conducts several load tests throughout the year.
Test early to make sure there's enough capacity and that loads are balanced correctly.
Make sure traffic predictions are vetted by enough internal stakeholders so you're not guessing what your peak might be.
Check everything from application servers to your network firewall, all the way down to the speed of your Internet connection, and check more than twice.
Have contingency plans in place in case you exceed your traffic expectations. One way is to switch off functionality that consumes a lot of processing power or bandwidth, such as dynamically displaying customized information for each visitor.
If you're going to take your site down for required maintenance, make sure there's another way for people to get to it.
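The contingency tip about switching off processing-heavy features can be sketched as a simple load-shedding rule. This is a minimal Python illustration, not any retailer's actual method: the feature names, the 80% threshold and the idea of comparing live traffic against the capacity proven in load tests are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    essential: bool  # essential features are never switched off


# Hypothetical feature list; names are illustrative only.
FEATURES = [
    Feature("checkout", essential=True),
    Feature("search", essential=True),
    Feature("personalized_recommendations", essential=False),
    Feature("recently_viewed_items", essential=False),
]


def enabled_features(current_rps: float, tested_capacity_rps: float,
                     shed_threshold: float = 0.8) -> list[str]:
    """Names of the features to serve at the current request rate.

    Once traffic reaches `shed_threshold` of the capacity proven in load tests,
    non-essential (and typically resource-heavy) features are dropped.
    """
    under_pressure = current_rps >= shed_threshold * tested_capacity_rps
    return [f.name for f in FEATURES if f.essential or not under_pressure]


print(enabled_features(500, 1000))  # all four features
print(enabled_features(900, 1000))  # ['checkout', 'search']
```

Keeping checkout on the essential list means the site sheds personalization before it risks losing transactions.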
Web retailers should be shooting for 99.5% availability, otherwise "they're not cutting it," Karow maintains. "Ninety-nine percent is not acceptable because if you achieve that, you're still one percent unavailable." That has a significant impact since it means more than one percent of potential transactions didn't occur -- and likely won't going forward, he says.
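Karow's point is easier to see in hours. A year has 8,760 hours, so 99% availability still allows about 87.6 hours of downtime a year, while 99.5% cuts that to about 43.8 hours. The short Python sketch below just does that arithmetic; it assumes a 365-day year and treats every hour as equally valuable, which understates the damage if an outage lands on a peak shopping day.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; ignores leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a site is down at a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.5):
    print(f"{target}% available -> {annual_downtime_hours(target):.1f} hours down per year")
```

It prints 87.6 hours for 99% and 43.8 hours for 99.5%.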
This holiday season, more than ever, Web retailers need to be prepared for the onslaught, since a growing number of consumers will be using mobile devices to shop. A report recently released by mobile ad network InMobi claims an estimated 60 million mobile users are planning to use their devices to shop during the Black Friday/Cyber Monday holiday weekend, with over 21 million intending to make purchases from those devices.
Prepare, test and review
Online shoe retailer Zappos conducts load testing early in the fall to ensure its site stays up and highly available during the holiday season, says Kris Ongbongan, senior manager, technical operations and systems engineering. Every year the team follows the same procedure, he says, and it starts with estimating load.
"We have our finance and planning departments give us sales predictions and we take a multiple of that to see what traffic we can absorb and test to that," typically beginning in September, Ongbongan says. That gives them enough time to make changes and add any necessary infrastructure.
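Ongbongan's approach of multiplying a sales forecast into a traffic target can be expressed as back-of-the-envelope arithmetic. In this hypothetical Python sketch, the conversion rate, pages per session and safety multiple are illustrative assumptions, not Zappos figures; the point is only how an order forecast turns into a requests-per-second number to test against.

```python
def load_test_target_rps(peak_orders_per_hour: float,
                         conversion_rate: float,
                         pages_per_session: float,
                         safety_multiple: float) -> float:
    """Turn a peak-hour order forecast into a page-requests-per-second target."""
    # Only a fraction of visits end in a purchase, so scale orders up to sessions.
    sessions_per_hour = peak_orders_per_hour / conversion_rate
    # Each session views several pages.
    pages_per_hour = sessions_per_hour * pages_per_session
    # Convert to per-second load, then apply the headroom multiple to test against.
    return pages_per_hour / 3600 * safety_multiple


# Hypothetical inputs: 3,600 orders in the peak hour, 2% conversion,
# 10 pages per session, and a 3x multiple on top of the forecast.
target = load_test_target_rps(3600, 0.02, 10, 3)
print(f"Load-test target: {target:,.0f} page requests per second")
```

With those inputs the target is 1,500 page requests per second: 180,000 sessions an hour, 1.8 million page views an hour, or 500 a second, tripled.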
Retailers should go through their transaction volume testing and validation in the September/October timeframe and then code lock their systems until about January 15th, suggests Michael Ebert, a partner in IT Advisory Services at KPMG. During that period, "retailers typically freeze their systems ... and don't do updates unless absolutely necessary to avoid performance issues," he says.
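A code lock like the one Ebert describes is often enforced with a simple gate in the deployment pipeline. The Python sketch below is hypothetical: the November 1 start date, the inclusive January 15 end date and the emergency override are assumptions chosen to match the timeframes mentioned in this article, and a real pipeline would wire the check into its own tooling.

```python
from datetime import date

# Assumed freeze window: starts November 1 and runs through January 15.
FREEZE_START = (11, 1)  # (month, day)
FREEZE_END = (1, 15)    # inclusive; the window wraps past New Year's Day


def in_change_freeze(day: date) -> bool:
    """True if `day` falls inside the holiday code lock."""
    month_day = (day.month, day.day)
    # The window crosses the year boundary, so a date is frozen if it is on or
    # after the start OR on or before the end (tuples compare month, then day).
    return month_day >= FREEZE_START or month_day <= FREEZE_END


def may_deploy(day: date, emergency: bool = False) -> bool:
    """Routine changes ship only outside the freeze; emergencies can always ship."""
    return emergency or not in_change_freeze(day)


print(may_deploy(date(2011, 11, 25)))                  # False: Black Friday
print(may_deploy(date(2011, 11, 25), emergency=True))  # True
print(may_deploy(date(2012, 1, 16)))                   # True: freeze lifted
```

Because the window crosses New Year's, the check uses "on or after the start, or on or before the end"; a single `start <= day <= end` comparison would never be true for this window.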
Another practice the very large Internet retailers tend to employ is having distributed networks in order to route traffic to make sure transactions are balanced around the U.S., Ebert says. That way, if one site gets too busy the customer will automatically be routed to another. "So make sure you have multiple points of your Internet presence around the U.S." A data center "may be slow to respond, but at least I'm up and running," he adds. "There's always a percentage of business you never regain if someone leaves the site."
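Ebert's "slow but up" routing rule can be sketched as follows. In practice this decision is usually made by DNS-based global load balancing or a CDN rather than application code; the Python below is a simplified, hypothetical model with made-up data center names and an assumed 85% "too busy" threshold.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    healthy: bool        # passing health checks
    utilization: float   # fraction of capacity in use, 0.0 to 1.0


def route(preferred: str, sites: dict[str, DataCenter],
          busy_threshold: float = 0.85) -> str:
    """Pick a data center for a customer whose nearest site is `preferred`."""
    home = sites.get(preferred)
    # Stay on the nearest site while it is healthy and has headroom.
    if home is not None and home.healthy and home.utilization < busy_threshold:
        return home.name
    healthy = [s for s in sites.values() if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy data center available")
    # Otherwise send the customer to the least-loaded healthy site: it may be
    # slow, but the customer still reaches a working store.
    return min(healthy, key=lambda s: s.utilization).name


# Hypothetical footprint: the East Coast site is overloaded, the West Coast site is down.
SITES = {
    "us-east": DataCenter("us-east", healthy=True, utilization=0.95),
    "us-central": DataCenter("us-central", healthy=True, utilization=0.60),
    "us-west": DataCenter("us-west", healthy=False, utilization=0.10),
}
print(route("us-east", SITES))  # us-central
```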
Another metric retailers need to watch is latency: how long it takes a page to load and a payment transaction to complete. "I expect we'll see some latency concerns" or other problems during the check-out step this holiday season, predicts Greg Girard, program director, IDC Retail Insights. That's because of throughput bottlenecks at the gateway to the credit card processing network, he says.
"The micro-economic problem is that it costs money to maintain capacity that you utilize only at the peak time, which is only very infrequently during the year. It's an economical tradeoff you have to make," Girard adds.
Over-provisioning via the cloud
For a lot of smaller online retailers, it's hard to justify the return on investment for increasing the capacity they need to handle 12 hours of peak usage on one day of the year, says Girard. "That's where cloud comes into play, and we're seeing some retailers adopt cloud strategies. That's really going to progress going forward." Retailers will be able to get additional peak capacity at an incremental cost by moving to the cloud, he says.
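Girard's return-on-investment argument is simple arithmetic. The Python sketch below compares owning extra servers all year with renting equivalent capacity only for the 12-hour peak he describes; the server count and both prices are invented for illustration, and real cloud pricing, data transfer charges and setup effort would all change the result.

```python
def owned_peak_cost(extra_servers: int, annual_cost_per_server: float) -> float:
    """Owned capacity is paid for all year, even if it is busy only at peak."""
    return extra_servers * annual_cost_per_server


def cloud_peak_cost(extra_servers: int, hourly_rate: float, peak_hours: float) -> float:
    """Rented capacity is paid for only during the hours it runs."""
    return extra_servers * hourly_rate * peak_hours


# Hypothetical figures: 20 extra servers, $3,000 per owned server per year
# (hardware, power and space), $0.50 per cloud server-hour, 12 peak hours.
owned = owned_peak_cost(20, 3000)
cloud = cloud_peak_cost(20, 0.50, 12)
print(f"Owned: ${owned:,.0f}  Cloud: ${cloud:,.0f}")
```

Per hour, the owned server in this example is actually cheaper (about $0.34 versus $0.50); renting wins only because the extra capacity is needed for so few hours, which is the tradeoff Girard describes.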
Zappos' Ongbongan says they handle all network functions internally and do not use cloud providers. "We have instrumentation around every transaction point on the website, from search pages to product detail pages to checkout," he says, "so we can look at each individually to see if there's any slowness or problems in any of those areas."
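Per-transaction-point instrumentation of the kind Ongbongan describes can be approximated with a timer around each step. This Python sketch is a hypothetical stand-in for a real monitoring system, not Zappos' tooling: it records how long each named step (search, product detail, checkout) takes and flags any whose 95th-percentile time exceeds a threshold you choose.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Durations in seconds, keyed by transaction point ("search", "checkout", ...).
timings: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed(transaction_point: str):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[transaction_point].append(time.perf_counter() - start)


def p95(samples: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without float rounding
    return ordered[rank - 1]


def slow_points(threshold_s: float) -> list[str]:
    """Transaction points whose p95 latency is above the threshold."""
    return sorted(name for name, samples in timings.items()
                  if samples and p95(samples) > threshold_s)


# Usage: wrap each transaction point in the request-handling code.
with timed("search"):
    time.sleep(0.01)  # placeholder for running a product search
print(slow_points(threshold_s=0.5))
```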
More bulletproofing tips
Make sure your e-commerce site is secure, particularly against DDoS attacks.
Freeze all maintenance and any non-critical code changes in the November/December timeframe.
Make sure every component has a risk-mitigation strategy so there is a plan in place if something on the network goes down.
Communicate with marketing and other relevant business units to make sure you understand their promotion and other plans.
Consider a move to the cloud to handle seasonal peak capacity.
But no matter how prepared you are, problems can still occur, especially when you outsource to third-party vendors. "Nothing is fully bulletproof, so really what [online retailers] need to try and achieve is fault tolerance," says Mike Gualtieri, a principal at Forrester Research. He recalls a retailer he worked with whose external credit card processing service went down one year on Cyber Monday, leaving the company unable to process orders.
"Their e-commerce system is in-house, so they had planned for volumes -- searching and shopping the site -- but they have a service level agreement with a credit card processing service that said, 'We can handle that volume.' So they did all the right things for their own systems and planned for the [increased] volume on Cyber Monday, but were held hostage by this particular provider," Gualtieri says.
He says he recommended that the retailer re-architect its site so that if the payment processor went down again, the company could still collect the order and payment information and process payments later. That's particularly useful for small retailers, he says, who may not be able to invest in technologies like an online shopping cart and have to rely on third parties for that functionality.
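The re-architecture Gualtieri recommends is essentially a store-and-forward pattern: accept the order even when the processor is down, queue the payment and retry later. The Python sketch below is a hypothetical illustration, not the retailer's actual design. The `charge` callable stands in for whatever payment-processor client a site uses, and the sketch assumes the site holds a tokenized card reference rather than a raw card number while a payment waits.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Order:
    order_id: str
    amount_cents: int
    payment_token: str  # assumed tokenized card reference, not a raw card number


class PaymentGatewayDown(Exception):
    """Raised by the (hypothetical) processor client when it can't be reached."""


class OrderIntake:
    def __init__(self, charge: Callable[[Order], None]):
        self._charge = charge            # stand-in for a payment-processor client
        self.pending: deque[Order] = deque()
        self.charged: list[str] = []

    def place(self, order: Order) -> str:
        """Take the order even if the processor is down."""
        try:
            self._charge(order)
        except PaymentGatewayDown:
            # Keep the sale: queue the payment and charge it after recovery.
            self.pending.append(order)
            return "accepted, payment pending"
        self.charged.append(order.order_id)
        return "charged"

    def retry_pending(self) -> int:
        """Try each queued order once; return how many were charged."""
        charged_now = 0
        for _ in range(len(self.pending)):
            order = self.pending.popleft()
            try:
                self._charge(order)
            except PaymentGatewayDown:
                self.pending.append(order)  # still down; keep it queued
                continue
            self.charged.append(order.order_id)
            charged_now += 1
        return charged_now


# Simulate an outage followed by recovery.
gateway_up = False


def fake_charge(order: Order) -> None:
    if not gateway_up:
        raise PaymentGatewayDown()


intake = OrderIntake(fake_charge)
print(intake.place(Order("A1001", 4999, "tok_demo")))  # accepted, payment pending
gateway_up = True
print(intake.retry_pending())  # 1
```

A production version would also need idempotency keys, so that retrying a request that timed out cannot charge a customer twice, and a way to notify customers whose deferred payment is later declined.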
Regardless of their size, Gualtieri says, retailers need to examine every component of their systems and assign each a confidence level between one and five. "Every online retailer should look at their entire ecommerce architecture and all the components they use: shopping cart, product search, account registration -- whatever they have -- and rate their confidence level."
"Don't assume that everything will go right," Gualtieri says. "Assign a confidence level and don't fret too much, but have a mitigation strategy and backup plan."
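Gualtieri's rating exercise can be kept as a simple inventory. In this hypothetical Python sketch, each component gets the one-to-five confidence score he describes plus an optional backup plan, and the review lists components that still lack a mitigation, least-confident first; the component names come from his list, but the scores and plans are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    confidence: int             # 1 = least confident, 5 = most confident
    mitigation: Optional[str]   # backup plan, or None if there isn't one yet


def missing_backup_plans(components: list[Component]) -> list[str]:
    """Names of components with no mitigation, least-confident first."""
    for c in components:
        if not 1 <= c.confidence <= 5:
            raise ValueError(f"{c.name}: confidence must be between 1 and 5")
    gaps = [c for c in components if not c.mitigation]
    return [c.name for c in sorted(gaps, key=lambda c: c.confidence)]


# Illustrative inventory, not real ratings.
inventory = [
    Component("shopping cart", 4, "fail over to secondary app servers"),
    Component("product search", 3, None),
    Component("account registration", 2, None),
    Component("payment processing", 2, "queue orders and charge later"),
]
print(missing_backup_plans(inventory))
```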
Optimize for traffic
Among the lessons Karmaloop learned during the 2010 holiday season was that its content delivery network configuration was not optimized for the traffic it would experience on Cyber Monday, says Joseph Finsterwald, CTO at the online retailer of alternative street fashion for men and women. "We worked with our CDN vendor Akamai to come up with a configuration that was a better fit for us," he says. The company also discovered problems with parallel processes on the network and synchronization issues when serving up Web pages, which it corrected by rewriting code.
Revenues are growing 50% to 70% year over year, Finsterwald says, so Karmaloop is using Keynote's LoadPro Web load-testing services to ensure its site is not strained. Because its CDN was not optimized to handle that level of traffic in past years, the site experienced "frequent" network outages, he says, although he declined to provide specifics.
"It gives you peace of mind that we can come up with a reasonable facsimile under peak load," Finsterwald says. "Load testing is an inelegant science; you're trying to simulate user traffic, but you're integrating a lot of third-party components." If a test is run on a quiet day, a third-party provider may be able to scale to handle it, but all bets may be off when that provider is handling traffic for multiple clients at once.
This year, when conducting load testing, Karmaloop pushed test traffic high enough to trigger problems so that its vendors could address them proactively. "We saw performance degradation with some of our vendors," says Finsterwald, "so we're following up with them to make sure they're doing what they need to do."
Keynote's Karow concurs. "Load testing done right has to be a very close representation of what real users are going to do, so it takes real thinking about what people do and the various systems involved and are you stressing those systems?"
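One way to make simulated traffic resemble real shoppers, as Karow advises, is to drive the load test from a weighted mix of user actions with pauses ("think time") between them. This Python sketch only generates such session scripts; the action names, weights and think-time range are hypothetical and would normally come from a site's own analytics, and a separate load-testing tool would replay the scripts against the site.

```python
import random

# Hypothetical share of user actions; real weights should come from site analytics.
ACTION_MIX = {
    "browse_category": 0.55,
    "search": 0.30,
    "add_to_cart": 0.10,
    "checkout": 0.05,
}


def simulated_session(rng: random.Random, steps: int = 8,
                      think_time_s: tuple[float, float] = (2.0, 10.0)) -> list[tuple[str, float]]:
    """Return (action, pause-in-seconds) pairs for one simulated shopper."""
    actions = list(ACTION_MIX)
    weights = list(ACTION_MIX.values())
    session = []
    for _ in range(steps):
        action = rng.choices(actions, weights=weights, k=1)[0]
        pause = rng.uniform(*think_time_s)  # real users pause between clicks
        session.append((action, pause))
    return session


rng = random.Random(2011)  # fixed seed so a test run can be reproduced
for action, pause in simulated_session(rng):
    print(f"{action:<16} then wait {pause:.1f}s")
```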
Talk to your stakeholders
Also critical to the success of keeping systems up and highly available is making sure everyone is on the same page. "Everybody needs to be involved in the planning and predictive process," says Zappos' Ongbongan. At Zappos, that means everyone from brand marketing to financial planning to warehouse staff is involved in planning for peaks in site traffic.
One thing his group learned from talking with other departments was that Zappos' peak traffic typically occurs in mid-December, rather than right after Thanksgiving or right before Christmas.
Forrester's Gualtieri says it's definitely a problem when a marketing group doesn't let IT know about plans that might cause site traffic to spike. He says he worked with a large Midwestern insurance company that spent a couple of million dollars on its first TV ad during a football game. When the ad aired, the company's site went down "almost instantly," because the marketing department hadn't told IT it was running the ad. "So IT had no idea they were going to expect 500 times the normal amount of traffic," he says, and the company ended up wasting its money on the ad.
Despite all the proactive measures retailers may be taking, Gualtieri predicts there will still be "some high-profile outages" this holiday season. "One, two or several will happen. I also think a lot will happen that you'll never hear about ... I don't think this problem is going to go away."
Although companies are becoming savvier about bulletproofing their sites, crashes will inevitably occur due to continuous changes made to enhance the online shopping experience, he says. "You can't just put a site up and have it be static; there are lots of moving parts and it creates complexity, and there's fallout."
Esther Shein is a freelance writer and editor.