Positioned as a "mini language" for writing semantic extractors, Fathom is already in production with Firefox's Activity Stream web traffic tracker, where it picks out page descriptions, images, and other items, said Mozilla's Erik Rose. Still in an early stage of development, Fathom "enables Firefox to understand the structure and content of a web page," he said. The framework could be implemented in browsers, browser extensions, and server-side software.
Rose presented scenarios in which Firefox could understand pages the same way a person does. For example, the browser could recognize and follow a log-in link, provide hotkeys to dismiss popovers, hide superfluous navigation or header sections on small screens, and determine what to print without needing print stylesheets.
These scenarios, he said, assume the browser can identify the meaningful parts of a page. Echoing the much-touted semantic web, Rose cited previous attempts in this vein, such as semantic tags, the Resource Description Framework, and microformats.
Fathom, meanwhile, is a data-flow language like Prolog. It extracts meaning from web pages, identifying parts like address forms, Previous/Next buttons, and the main textual content. DOM nodes are scored and extracted based on user-specified conditions, and a system of types and annotations expresses dependencies between scoring steps and keeps state under control. Existing sets of scoring rules can be extended without editing them directly, so third-party refinements can be mixed in.
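
To make the scoring idea concrete, here is a minimal sketch in Python. It does not use Fathom's actual API. The `Node` and `Rule` classes, the `best` function, the "next_link" type name, and the multiplicative scoring scheme are all assumptions made for illustration. The sketch scores every node in a small stand-in tree against a list of rules, then shows a hypothetical third-party rule being mixed in without touching the base set.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class Node:
    """Stand-in for a DOM element (hypothetical, not a real DOM API)."""
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


@dataclass(frozen=True)
class Rule:
    """If `condition` holds for a node, multiply its score for `node_type`."""
    node_type: str
    condition: Callable[[Node], bool]
    multiplier: float


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def best(root: Node, rules: list, node_type: str) -> Node:
    """Return the highest-scoring node for `node_type`.

    Assumption: every node starts at 1.0 and matching rules multiply in.
    """
    def score(node: Node) -> float:
        total = 1.0
        for rule in rules:
            if rule.node_type == node_type and rule.condition(node):
                total *= rule.multiplier
        return total

    return max(walk(root), key=score)


# Base rules: favor clickable elements whose text starts with "next".
BASE_RULES = [
    Rule("next_link", lambda n: n.tag in ("a", "button"), 2.0),
    Rule("next_link", lambda n: n.text.lower().startswith("next"), 3.0),
]

# A third-party refinement, added by concatenation rather than by editing
# BASE_RULES: strongly favor links marked rel="next".
EXTENDED_RULES = BASE_RULES + [
    Rule("next_link", lambda n: n.attrs.get("rel") == "next", 5.0),
]

page = Node("body", children=[
    Node("a", "Home", {"href": "/"}),
    Node("a", "Next big sale!", {"href": "/promo"}),
    Node("span", "Next steps"),
    Node("a", "Older posts", {"href": "/page/2", "rel": "next"}),
])

base_pick = best(page, BASE_RULES, "next_link")
extended_pick = best(page, EXTENDED_RULES, "next_link")
print(base_pick.attrs["href"], extended_pick.attrs["href"])
```

With only the base rules, the promotional link wins because its text starts with "Next". Once the extra rule is mixed in, the pagination link marked rel="next" scores highest, and the base rule list itself is never modified.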