From Harmony to Hadoop, Apache has been a powerful contributor to the open source ecosystem
15 high-impact Apache projects
The Apache Software Foundation has been home to numerous important open source software projects from its inception in 1999. Successes range from Geronimo to Tomcat to Hadoop, the distributed computing system that now serves as a linchpin of the big data realm.
While Apache does not maintain comprehensive statistics on downloads, the Apache HTTP Server, for example, powers nearly 500 million websites, and OpenOffice, which came into Apache’s hands only recently, has been downloaded millions of times.
Apache also offers one of the more popular permissive open source licenses.
Here are 15 Apache projects that have been critical over the years, not only to the open source movement but to the technology world at large.
The Cassandra database serves as a “scalable system of record” in the big data world, says Jonathan Ellis, vice president of the Cassandra project. Apache received the project from Facebook, which open-sourced Cassandra in 2008. Whereas Hadoop undertakes data analysis, Cassandra provides a data store for applications, often highly scalable ones on the Web. Netflix, for example, runs many Cassandra clusters, Ellis says.
Cassandra offers fault tolerance on commodity hardware or cloud infrastructure and can be replicated across multiple data centers. Slated for July, Cassandra 2.0 will include support for CAS (compare-and-set) capabilities, which combine read and update into a single operation; trigger support, for defining actions in response to updates made to different tables; and further reductions in request latency.
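To make the compare-and-set idea concrete, here is a minimal Java sketch. It does not use Cassandra or its API; it assumes a plain in-memory map as a stand-in for a table, and the class name, method name, and seat-booking data are hypothetical. It only illustrates how a check and a write can be folded into one atomic step, so that a second writer working from a stale value is rejected.

```java
import java.util.concurrent.ConcurrentHashMap;

public class CompareAndSetSketch {
    // Hypothetical in-memory stand-in for a table keyed by row id (not Cassandra's API).
    private static final ConcurrentHashMap<String, String> rows = new ConcurrentHashMap<>();

    // Writes `updated` only if the row still holds `expected`.
    // The comparison and the write happen as a single atomic operation.
    static boolean compareAndSet(String key, String expected, String updated) {
        return rows.replace(key, expected, updated);
    }

    public static void main(String[] args) {
        rows.put("seat-12A", "free");

        // The first writer sees the expected value and wins.
        boolean first = compareAndSet("seat-12A", "free", "alice");
        // The second writer still expects "free", so its update is refused.
        boolean second = compareAndSet("seat-12A", "free", "bob");

        System.out.println(first + " " + second + " " + rows.get("seat-12A"));
        // prints "true false alice"
    }
}
```

Without the combined operation, both writers could read "free" and then both write, with the later write silently overwriting the earlier one.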
The project originated as PhoneGap and was developed by Nitobi, which was acquired by Adobe, says Brian Leroux, Cordova vice president at Apache and an Adobe product manager. Source code was donated to Apache.
“It allows [us] to synchronize any instance of CouchDB with any other. Each copy of the data can be worked on independently, and changes can be synchronized back to all other members of the group. Naturally occurring conflicts can be dealt with programmatically.”
Donated by Adobe, Flex is an application framework that has leveraged Adobe’s Flash rich Internet plug-in technology. Developers can build applications for iOS, Android, and BlackBerry Tablet OS, as well as desktop and browser applications. Apache is working on extending Flex to support HTML5, says Alex Harui, vice president of Apache Flex. But any HTML5-related improvements might carry a different name.
“We want to run in as many places as we can,” Harui says, in explaining Apache’s HTML5 ambitions for Flex. The upcoming version 4.10 of Flex, however, is expected to offer just incremental improvements. Having Flex at Apache “allows the people in the community who have a real stake in the Flex technology to actually contribute to its development,” Harui notes.
This server runtime integrates open source projects, including Tomcat, MyFaces, and OpenJPA, to produce Java/OSGi server runtimes. The most popular distribution is a Java EE 6 application server runtime.
“Apache Geronimo is a modular, compose-able, open source server runtime,” says Kevan Miller, chair of the Geronimo program management committee. “The next logical major release would be a Java EE 7 [version]. There haven't been any concrete discussions on a Java EE 7 release. We'll need to see more progress on the Java EE 7 specifications before that happens.” The project started in the Apache Incubator in 2003 and graduated as a Top Level Project the following year.
This project is all the rage these days and is synonymous with “big data,” in which enterprises and Web properties sift through reams of data to surface insights about customers and users. Hadoop provides an operating system for distributed computing.
“If you want to run computations on hundreds of thousands of computers instead of just on one computer, Hadoop lets you do that,” says Doug Cutting, a primary contributor to Hadoop for several years. Hadoop arose from the Nutch Web software project in 2006, Cutting says. Companies like Cloudera, where Cutting is employed, and Hortonworks are building businesses around Hadoop. Future improvements will include boosts for security and scalability.
Since retired, this modular Java runtime was one of Apache’s most controversial projects, sparking a dispute between Apache and Sun that carried over to Oracle’s stewardship of Java.
“The main goal of Harmony was to create a free and open source implementation of the Java runtime environment,” says Apache participant Jim Jagielski. “The project was retired due to Sun and then Oracle's refusal to grant Apache the required TCKs [Technology Compatibility Kits] to validate Harmony as Java-compliant, despite promises, guarantees, and signed contracts to do so.”
A field-of-use restriction imposed by Sun prevented Harmony’s use on mobile platforms, which Sun claimed would hurt Java ME sales. Harmony, though, forced Oracle to accept OpenJDK and is a core component of Google Android, Jagielski adds.
This project, aka “httpd,” features an HTTP server. “In many ways, Apache httpd is still the cornerstone of the Apache Software Foundation,” says Jagielski, who has been a committer to the project since 1995. “It would not be an overstatement to credit Apache httpd with the popularity, usefulness, and ubiquity of the Web. Having a ‘free,’ open source, and fully compliant reference implementation allowed the Web to become as universal and pervasive as it has.”
The latest version, httpd 2.4.4, offers improved performance and suitability for cloud environments. “This includes dynamic reconfiguration of reverse-proxy setups, faster and more memory efficient request processing, support for asynchronous I/O, and a suite of new modules for in-process and on-the-fly content processing.”
Lucene provides a text search engine library written in Java. “Lucene users are people who need to add search to their apps,” says Simon Willnauer, a core committer on Lucene since 2006. Lucene is used at Twitter, he notes, and began in 1997, when many companies were working on search.
Lucene 4.0 was released this past October, serving as a rewrite and supporting users’ own codecs for determining how data structures are encoded. This enables specialized use cases, Willnauer says. Lucene 4.1 was released in January, featuring disk space savings and performance improvements. Version 4.2, due in a few months, is expected to feature a refactoring of doc values capabilities for searching documents.
This software project management and comprehension tool is used to manage builds, reporting, and documentation. It has emphasized Java development.
“The main [benefit of Maven] has been a much faster way to get people up and running on a project,” says Brett Porter, who has been involved with Maven’s development for 10 years and is CTO at devops automation vendor Maestrodev, which supports Maven.
Dependency management for Java projects is also critical to Maven, linking different software projects together. It can integrate with tools such as the Jenkins software build system. Improvements to Maven are planned to boost plug-in and logging capabilities.
Turned over by Oracle to Apache in 2011, the OpenOffice application suite had been a Sun Microsystems project. It had floundered at Oracle, with the company clashing with members of the OpenOffice.org community.
The suite features six personal productivity applications: word processor, spreadsheet, presentation graphics, drawing, equation editor, and database. Apache released two versions in 2012, adding vector graphics capabilities, additional language support, performance improvements, and bug fixes. Version 4.0 is due in April, says Andrea Pescetti, vice president of Apache OpenOffice. It will feature a modernized GUI, interoperability improvements for Microsoft Word files, better accessibility for disabled persons, and performance improvements. The 3.4 release of OpenOffice has been downloaded more than 35 million times since May 2012.
Pig is used to analyze large data sets, featuring parallelization and a high-level language for data analysis algorithms. Developers can use Pig instead of writing Java code when using Hadoop.
“You can think of Pig as an abstraction layer on top of Hadoop,” says Daniel Dai, a committer on the project. Pig is so-named because of its ability to eat everything data-wise, Dai says. “It consumes all kinds of data.”
Users can build their own functions for special-purpose processing. The forthcoming upgrade, Pig 0.11.0, will feature performance enhancements and two new operators: cube, for calculating aggregates across multiple dimensions, and rank, for ranking. Pig developers would like Pig to eventually be independent of Hadoop, but right now it is Hadoop-dependent, Dai says.
Struts is a framework for building Java Web apps. It began as a subproject of Apache Jakarta and was spun out in 2005.
“The Apache Struts project offers framework solutions to build so-called action-based Java Web applications, in contrast to component-based solutions like JSF or Apache Wicket,” says Rene Gielen, vice president of Apache Struts.
Version 1 was the de-facto standard for building Java-based Web applications before the rise of JavaServer Faces, Gielen says. Struts 2 “is a lightweight, elegant, and highly decoupled action-based Web framework being built on the basic principles introduced by Struts 1, but without sharing a single line of code with its predecessor.” A major redesign is anticipated for Struts 3 in the near future.
Subversion was founded by CollabNet in 2000. The version control system currently vies with Git for developer mindshare, but Greg Stein, vice president of Subversion, does not see it as a duel.
“There is no battle. Version control systems are tools, and development groups will choose the tool that works best for them. It makes sense to have many options.”
“The centralized repository, simple setup, access control, massive repository sizes, and a wide variety of clients is hugely favored by many businesses. Subversion is the most popular version control system in businesses by a huge margin,” Stein says.
The forthcoming Version 1.8 will offer client improvements related to moving files and directories. It will also offer improved merging and inheritable and server-defined properties.
This implementation of Java Servlet and JavaServer Pages technologies has been part of Apache since 1999. Tomcat is effectively a Java application server, and it has spawned such commercial products as Tcat Server from Mulesoft and VMware vFabric tc Server. There is also Apache TomEE, which is essentially the Java EE 6 Web Profile version of Tomcat. Plans for Tomcat 8 include support for the Servlet 3.1 specification.
“The big new feature there is support for non-blocking I/O, which should enable highly scalable apps to be written more easily,” says Mark Thomas, a longtime participant in Apache’s development and release manager for Tomcat 4 and 7. WebSocket communications support, meanwhile, should help make applications more scalable by handling more messages.