10 March 2010

CSS Selector to XPath-query

When scraping HTML pages it can be useful to select elements with CSS selectors. PHP's DOM extension only understands XPath, so I wrote a small function that converts a CSS selector into an XPath query that can be passed to DOMXPath.

The function can be found below:

<?php
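/**
 * Convert a CSS selector into an XPath query.
 *
 * The returned query starts with "//", so it matches anywhere in the document.
 * Comma-separated selector groups and attribute values that contain spaces are not handled.
 *
 * @param    string $selector    The CSS selector
 * @return    string
 */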
function buildXPathQuery($selector)
{
    // redefine
    $selector = (string) $selector;

    // the CSS-selector patterns, applied one after another in this order
    $cssSelector = array(    // .warning, #myid, [foo] or :first-child without an element name: prepend the universal selector *
                            '/(^|[\s>+])(?=[#.:\[])/',
                            // E F: Matches any F element that is a descendant of an E element
                            // (lookarounds, so that one-letter names such as "p a" are not swallowed)
                            '/(?<=[\w\]*])\s+(?=[\w*])/',
                            // E > F: Matches any F element that is a child of an element E
                            '/\s*>\s*/',
                            // E:first-child: Matches element E when E is the first child of its parent
                            '/(\w+|\*):first-child/',
                            // E + F: Matches any F element immediately preceded by an element E
                            '/\s*\+\s*/',
                            // E[foo]: Matches any E element with the "foo" attribute set (whatever the value)
                            // (the name must start with a letter, so the *[1] steps built above are left alone)
                            '/(\w+|\*)?\[([a-zA-Z_][\w\-]*)\]/',
                            // E[foo="warning"]: Matches any E element whose "foo" attribute value is exactly equal to "warning"
                            '/(\w+|\*)?\[([a-zA-Z_][\w\-]*)="([^"]*)"\]/',
                            // div.warning: HTML only. The same as DIV[class~="warning"]
                            '/(\w+|\*)?\.([\w\-]+)/',
                            // E#myid: Matches any E element with id-attribute equal to "myid"
                            '/(\w+|\*)?#([\w\-]+)/'
                        );

    // the XPath equivalents, in the same order as the patterns above
    $xPathQuery = array(    '\1*',
                            '//',
                            '/',
                            '*[1]/self::\1',
                            '/following-sibling::*[1]/self::',
                            '\1[ @\2 ]',
                            '\1[ @\2 = "\3" ]',
                            '\1[ contains( concat( " ", @class, " " ), concat( " ", "\2", " " ) ) ]',
                            '\1[ @id = "\2" ]'
                        );

    // return
    return (string) '//'. preg_replace($cssSelector, $xPathQuery, $selector);
}
?>
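
To show where such a query ends up, here is a minimal sketch of running one through DOMXPath. The markup in $html is made up for the illustration, and $query is written out by hand: it is the string the function builds for the selector div.warning. The padding with spaces around @class is what makes the test match whole class names, so "warnings" is not mistaken for "warning".

```php
<?php
// Sample markup, invented for this illustration.
$html = '<html><body>'
      . '<div class="note warning">Disk almost full</div>'
      . '<div class="note warnings">Plural class, must not match</div>'
      . '<p class="warning">Not a div</p>'
      . '</body></html>';

// Collect libxml parse warnings internally instead of printing them;
// scraped pages are rarely valid HTML.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);

// The query for the CSS selector "div.warning", written out by hand.
$query = '//div[ contains( concat( " ", @class, " " ), concat( " ", "warning", " " ) ) ]';

// Gather the text of every matching element.
$matches = array();
foreach ($xpath->query($query) as $node) {
    $matches[] = $node->nodeValue;
}

print_r($matches);
```

Only the first div should end up in $matches: the second one has the class "warnings", and the paragraph is not a div.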

In a post that will be published in the near future you'll see why I needed it.

CSS Selector to XPath-query was written by Tijs Verkoyen.