Extracting links from HTML using PHP

Many months ago there was a PHP competition to write the smallest script that extracts all the links from a document. I’ve lost the link to the actual site, but the rules and conditions were set up expecting everyone to solve the problem with regular expressions. In my opinion relying on regular expressions to parse HTML is a terrible idea (and may actually be impossible with a normal engine), so I tried a slightly different approach:

Program listing. 162 Bytes

<?php foreach(@DOMDocument::loadHTMLFile($argv[1])->getElementsByTagName('a') as $t)@$u[$t->getAttribute('href')]=0;foreach($u as $k=>$v)echo $k!=''?"$k\n":'';?>

Expanded program listing with comments

<?php

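    // Using PHP's DOMDocument class, we load an HTML file named on the
    // command line and extract all the 'a' tags.  The '@' is used to
    // suppress any parse errors.  Note that calling loadHTMLFile()
    // statically relies on old PHP behaviour; PHP 8 rejects it and
    // needs an instance created with 'new DOMDocument()' instead.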
    foreach(@DOMDocument::loadHTMLFile($argv[1])->getElementsByTagName('a') as $t)
    {
        // We get the value of the href attribute and store it as a
        // key in $u.  This is so each URL only appears once without
        // having to call array_unique().  '@' is used to suppress the
        // error when we add the first element to a non-existent array
        // $u (which PHP then kindly creates for us)
        @$u[$t->getAttribute('href')]=0;
    }

    // Finally we iterate over the array of URLs ($u).  If the key
    // (which is the actual URL) is empty we print nothing, otherwise
    // we print the URL followed by a new line.
    foreach($u as $k=>$v)
    {
        echo $k != '' ? "$k\n" : '';
    }

?>
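
For readers on a current PHP version, the same idea can be sketched without the static call. This is a minimal sketch rather than the competition entry: the function name extract_links() is made up for illustration, and it parses an HTML string instead of a file so that it runs on its own.

```php
<?php

// Hypothetical helper (not part of the original entry): return the unique,
// non-empty href values of all 'a' tags found in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of using '@'
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');

        // Same trick as above: array keys keep each URL unique
        if ($href !== '') {
            $urls[$href] = true;
        }
    }

    // PHP turns numeric-looking keys into integers, so cast back to strings
    return array_map('strval', array_keys($urls));
}

$html = '<p><a href="a.html">A</a> <a href="a.html">again</a> <a href="b.html">B</a></p>';

foreach (extract_links($html) as $url) {
    echo $url, "\n";
}
```

To read a file named on the command line, as the original does, call loadHTMLFile($argv[1]) on the $doc instance in place of loadHTML($html).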

Posted on

Simple type checking in PHP

<?php

    error_reporting(E_ALL | E_STRICT);

/*

Manual optional type checking for PHP functions

Basic example:

    function log_error($line_number, $filename, $desc)
    {
        CheckFunctionArgs('integer', 'string', 'string');
        [snip]
    }

Object example:

    class LogObject {}
    function register_object($obj)
    {
        //  Check for an object
        CheckFunctionArgs('object');    

        //  When an object is passed, you can optionally check for the class name
        CheckFunctionArgs('LogObject');

        [snip]
    }

Wildcard example:

    function log_anything($line, $thing)
    {
        //  '*' really means anything, including true/false/null or an empty string
        //  however it doesn't mean the argument is optional
        CheckFunctionArgs('integer', '*');
    }

Notes:

    Throws an exception on error
    No support for functions that take a variable number of arguments
    Must define the types of all arguments
    The type '*' acts as a wild card, matching anything (including null)
    Works with public and private methods in classes

*/

    function CheckFunctionArgs()
    {
        $types = func_get_args();
        $stack = debug_backtrace();

        //  Make sure there is some stack information
        //  Stack[1] contains the details about the function that called this function
        if(!isset($stack[1]))
        {
            throw new Exception("No function stack present.  Make sure CheckFunctionArgs() isn't called from the global scope");
        }

        //  The arguments that were passed to the function we are checking
        $arguments = $stack[1]['args'];


        //  Get the name of the class/function/file
        $functionClass = (isset($stack[1]['class'])) ? $stack[1]['class'] . "::" : '';
        $functionFile  = (isset($stack[1]['file'])) ? basename($stack[1]['file']) . ':' . $stack[1]['line'] : '[No file information]';
        $functionName  = "{$functionClass}{$stack[1]['function']}()";


        //  Basic check, make sure the correct numbers of arguments were passed
        if(count($arguments) != count($types))
        {
            $passed = count($arguments);
            $expected = count($types);

            throw new Exception("Incorrect number of arguments passed to {$functionName} in {$functionFile}.  Expected {$expected} got {$passed}");
        }


        //  Now try and check each argument
        for($i = 0; $i < count($arguments); $i++)
        {
            //  Allow a check to be skipped, if the type equals '*'
            if($types[$i] == '*')
            {
                continue;
            }

            $argumentType = gettype($arguments[$i]);

            //  Check basic types like integer/object etc
            if($argumentType == $types[$i])
            {
                continue;
            }

            //  Check to see if the type matches the classname of an object
            if(($argumentType == 'object') && (get_class($arguments[$i]) == $types[$i]))
            {
                continue;
            }

            throw new Exception("Incorrect argument passed to {$functionName} in {$functionFile}.  Argument {$i} was type {$argumentType} expected {$types[$i]}");
        }


        return true;
    }

    //  Some really basic tests

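    function assertException($fun, $args)
    {
        $thrown = false;

        try
        {
            call_user_func_array($fun, $args);
        }
        catch(Exception $e)
        {
            $thrown = true;
        }

        //  Report the failure outside the try block, otherwise the catch
        //  above would silently swallow this exception as well
        if(!$thrown)
        {
            throw new Exception(sprintf("Error: No exception thrown in function %s\n", $fun));
        }
    }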

    function assertNoException($fun, $args)
    {
        try
        {
            call_user_func_array($fun, $args);
        }
        catch(Exception $e)
        {
            throw new Exception(sprintf("Error: Exception thrown in function %s\n", $fun));
        }
    }


    function test_string($a)
    {
        CheckFunctionArgs('string');
    }

    assertNoException('test_string', array('abc'));
    assertNoException('test_string', array(''));
    assertNoException('test_string', array('123'));

    @assertException('test_string', array());
    assertException('test_string', array(123));
    assertException('test_string', array(null));


    function test_integer($a)
    {
        CheckFunctionArgs('integer');
    }

    assertNoException('test_integer', array(0));
    assertNoException('test_integer', array(123));

    @assertException('test_integer', array());
    assertException('test_integer', array(null));
    assertException('test_integer', array(''));
    assertException('test_integer', array('a'));
    assertException('test_integer', array(1.0));
    assertException('test_integer', array(1.2));


    function test_wildcard($a)
    {
        CheckFunctionArgs('*');
    }

    assertNoException('test_wildcard', array('a'));
    assertNoException('test_wildcard', array(1233));
    assertNoException('test_wildcard', array(null));
    @assertException('test_wildcard', array());


    class Foobar {}
    $foobar = new Foobar();
    function test_classname($a)
    {
        CheckFunctionArgs('Foobar');
        CheckFunctionArgs('object');
    }

    assertNoException('test_classname', array($foobar));
    assertException('test_classname', array('Foobar'));

?>
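
For comparison only, and not part of the script above: PHP 7 and later can express most of these checks with native type declarations. The sketch below reuses the function names from the examples in the comment block; describe_call() is a made-up helper for the demonstration. Unlike CheckFunctionArgs(), native declarations coerce compatible scalars (for example '12' to 12) unless the calling file starts with declare(strict_types=1), and they raise a TypeError rather than an Exception.

```php
<?php

class LogObject {}

// Native equivalent of CheckFunctionArgs('integer', 'string', 'string')
function log_error(int $line_number, string $filename, string $desc): string
{
    return sprintf('%s:%d %s', $filename, $line_number, $desc);
}

// Native equivalent of CheckFunctionArgs('LogObject')
function register_object(LogObject $obj): string
{
    return get_class($obj);
}

// Hypothetical helper: call a function and report whether the arguments
// were accepted or rejected by the type declarations.
function describe_call(callable $fun, array $args): string
{
    try {
        call_user_func_array($fun, $args);
        return 'ok';
    } catch (TypeError $e) {
        return 'TypeError';
    }
}

echo describe_call('log_error', array(12, 'a.php', 'oops')), "\n";
echo describe_call('log_error', array('twelve', 'a.php', 'oops')), "\n";
echo describe_call('register_object', array(new LogObject())), "\n";
echo describe_call('register_object', array('LogObject')), "\n";
```

The manual check keeps one advantage: it treats integer and float as distinct types regardless of the strict_types setting.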

Posted on

sizeof(int) = 68

Pankaj Kumar has a slightly disturbing look at memory usage in PHP.

Each element requires a value structure (zval) which takes 16 bytes. Also requires a hash bucket – which takes 36 bytes. That gives 52 bytes per value. Memory allocation headers take another 8 bytes*2 – which gives 68 bytes. Pretty close to what you have.
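
The breakdown can be compared against a measurement. This is a rough sketch, assuming memory_get_usage() is a good enough proxy; bytes_per_element() is a made-up helper, and the measured figure depends on the PHP version and platform, so it will likely differ from the 68 bytes quoted above.

```php
<?php

// Hypothetical helper: average number of bytes PHP reports as allocated
// for each integer appended to an array.
function bytes_per_element(int $count): float
{
    $before = memory_get_usage();

    $values = array();
    for ($i = 0; $i < $count; $i++) {
        $values[] = $i;
    }

    // $values is still alive here, so its memory is included
    $after = memory_get_usage();

    return ($after - $before) / $count;
}

// The quoted breakdown: zval + hash bucket + two allocation headers
$quoted = 16 + 36 + (8 * 2);

printf("Quoted estimate:   %d bytes per element\n", $quoted);
printf("Measured estimate: %.1f bytes per element\n", bytes_per_element(100000));
```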

Posted on

Fun with anagrams

<?php

define('WORD_LIST_FILENAME', '/usr/share/dict/words');


class AnagramLookup
{
    private $lookup;

    //  Loads a file with one word per line
    private function load_word_list($filename)
    {
        $lines = file($filename);                     // One word per line
        $lines = array_map('trim', $lines);           // Strip any excess whitespace
        $lines = array_filter($lines, 'ctype_alpha'); // Words need to match [a-zA-Z]
        $lines = array_map('strtolower', $lines);     // Set all the words to lowercase
        $lines = array_unique($lines);                // Remove any duplicate words
        $lines = array_diff($lines, array(''));       // Remove any empty lines

        return $lines;
    }

    //  Sort the individual letters in a string
    //  i.e.   tale  =>  aelt
    private function sort_letters($word)
    {
        $letters = str_split($word);

        sort($letters);

        $sorted_word = implode('', $letters);

        return $sorted_word;
    }

    //  Generate our lookup table.  This takes ~1.5 seconds for 70,000 words
    //  $lookup ends up looking like:
    //
    //  $lookup[4] = array
    //  (
    //      'aelt' => array('late', 'tale', 'leta', 'teal'),
    //      'belu' => array('blue', 'lube'),
    //      [etc...]
    //  )
    //  $lookup[5] = array
    //  (
    //      'allms' => array('small', 'malls'),
    //      [etc...]
    //  )
    //
    //  4 and 5 are the word lengths, while 'aelt', 'belu' and 'allms' each map
    //  to an array of all the words that can be spelt using exactly those letters
    public function __construct($filename)
    {
        $word_list = $this->load_word_list($filename);

        $lookup = array();

        foreach($word_list as $word)
        {
            $length = strlen($word);

            if(!isset($lookup[$length]))
            {
                $lookup[$length] = array();
            }

            $sorted_word = $this->sort_letters($word);

            if(!isset($lookup[$length][$sorted_word]))
            {
                $lookup[$length][$sorted_word] = array();
            }

            $lookup[$length][$sorted_word][] = $word;
        }

        $this->lookup = $lookup;
    }

    //  Return all the anagrams of the passed word
    public function lookup_word($word)
    {
        $word_length = strlen($word);
        $sorted_word = $this->sort_letters($word);

        if(isset($this->lookup[$word_length][$sorted_word]))
        {
            return $this->lookup[$word_length][$sorted_word];
        }

        return array();
    }

    //  Return an array of all the sets of anagrams with a specific length
    //
    //  Example result for a word length of 14:
    //
    //  array
    //  (
    //      [0] => array('certifications','rectifications'),
    //      [1] => array('impressiveness','permissiveness'),
    //      [2] => array('tablespoonfuls','tablespoonsful'),
    //  )
    public function all_of_length($word_length)
    {
        if(!isset($this->lookup[$word_length]))
        {
            return array();
        }

        $results = array();

        foreach($this->lookup[$word_length] as $words)
        {
            if(count($words) > 1)
            {
                $results[] = $words;
            }
        }

        return $results;
    }
}


$anagram = new AnagramLookup(WORD_LIST_FILENAME);

printf("Anagrams of 'blue': %s\n", implode(', ', $anagram->lookup_word('blue')));
printf("Anagrams of 'late': %s\n", implode(', ', $anagram->lookup_word('late')));
printf("Anagrams of 'slow': %s\n", implode(', ', $anagram->lookup_word('slow')));
printf("Anagrams of 'seven':  %s\n", implode(', ', $anagram->lookup_word('seven')));
printf("Anagrams of 'anagram': %s\n", implode(', ', $anagram->lookup_word('anagram')));

printf("All anagrams of word length 14\n");

foreach($anagram->all_of_length(14) as $words)
{
    printf(" * %s\n", implode(', ', $words));
}


?>
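
The core trick, stripped of the word list loading, fits in a few lines. This is a minimal sketch using a small inline word list instead of /usr/share/dict/words; anagram_key() and group_anagrams() are names made up for the illustration.

```php
<?php

// Words that are anagrams of each other share the same sorted-letter key
function anagram_key(string $word): string
{
    $letters = str_split(strtolower($word));
    sort($letters);

    return implode('', $letters);
}

// Group a list of words by their sorted-letter key
function group_anagrams(array $words): array
{
    $groups = array();

    foreach ($words as $word) {
        $groups[anagram_key($word)][] = $word;
    }

    return $groups;
}

$groups = group_anagrams(array('late', 'tale', 'teal', 'blue', 'lube', 'slow'));

foreach ($groups as $key => $words) {
    printf("%s => %s\n", $key, implode(', ', $words));
}
```

The class above adds one more level, the word length, so that all anagram sets of a given length can be listed without scanning the whole table.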

Posted on

Fun with an N800

My respect for the N800 just went up: the whole procedure must have taken 15-20 seconds.

Nokia-N800-50-2:~# wget
-sh: wget: not found
Nokia-N800-50-2:~# curl
-sh: curl: not found
Nokia-N800-50-2:~# apt-get install wget
[snip apt downloading and installing wget]
Nokia-N800-50-2:~# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.
Nokia-N800-50-2:~#

Posted on

Javascript…

Javascript is both infuriating and awesome at the same time. I don’t think I’ve ever spent so much time tracking down annoying bugs (even compared to PHP), yet at the same time it makes functions like the one below very simple to write.

For reference, the code below returns a `getter` method that we use to instantiate objects via a cache system.

var buildSimpleObjectGetter = function(cacheRef, objectRef)
{
   var f = "buildSimpleObjectGetter()";
   UTILS.checkArgs(f, arguments, [ObjectCache, Function]);

   return function(idRecord, idArg)
   {
      return cacheRef.get('' + idArg, function()
      {
         return new objectRef(idRecord, idArg);
      });
   }
}

Update 20080327:

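On the topic of stupid Javascript bugs, who thought automatically terminating lines was a good default? A semicolon gets inserted straight after the bare return, so the array on the following lines is never returned.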

function getWords()
{
    return
    [
        'Hello',
        'World!'
    ];
}

console.info(getWords());

// Output
// >>> undefined

Posted on

Thank you TPG

Ping statistics for 64.233.167.99:
    Packets: Sent = 15312, Received = 11515, Lost = 3797 (24% loss),
Approximate round trip times in milli-seconds:
    Minimum = 232ms, Maximum = 537ms, Average = 237ms

Posted on

LimerickDB

There once was a buggy AI

Who decided her subject should die.

When the plot was uncovered,

The subject discovered

That sadly the cake was a lie.

http://limerickdb.com/

Posted on

PHPT testing framework

PHPT is the kind of framework that encourages testing simply by making everything so easy. All that’s needed is a file with your PHP code and the expected output. It won’t replace SimpleTest or PHPUnit for anything complicated (say, like PHP itself…) but it seems to be just what I’m after.

There’s little documentation around (the PHP QA website was the best resource I found), but thanks to its simplicity all you need to get started is an example or two.

Sample phpt file

--TEST--
AusPostCheck class
--FILE--
<?php

    require_once("../../../test/newGuiTest/bootstrap.php");
    require_once("../check_contact_details.php");

    var_dump(AusPostCheck::SuburbStatePostcodeMatch('Windsor', 'VIC', '3181'));
    var_dump(AusPostCheck::SuburbStatePostcodeMatch('Windsor', 'VIC', '3182'));
    var_dump(AusPostCheck::SuburbStatePostcodeMatch('Windsor', 'NSW', '3181'));
    var_dump(AusPostCheck::SuburbStatePostcodeMatch('Prahran', 'VIC', '3181'));
    var_dump(AusPostCheck::SuburbStatePostcodeMatch('Foobar', 'VIC', '3181'));

?>
--EXPECT--
string(5) "MATCH"
string(8) "NO_MATCH"
string(8) "NO_MATCH"
string(5) "MATCH"
string(9) "NO_SUBURB"
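
To show how little is involved, here is a rough sketch of what a runner does with such a file: split it into sections, run the FILE section, and compare the output with the EXPECT section. This is not the code of any real PHPT runner. parse_phpt_sections() and run_phpt() are made-up names, only the three sections shown above are handled, and the test code is simply included in the current process rather than run separately.

```php
<?php

// Hypothetical helper: split a PHPT document into its named sections
function parse_phpt_sections(string $source): array
{
    $sections = array();
    $current  = null;

    foreach (preg_split('/\R/', $source) as $line) {
        if (preg_match('/^--([A-Z_]+)--$/', $line, $match)) {
            $current = $match[1];
            $sections[$current] = '';
            continue;
        }

        if ($current !== null) {
            $sections[$current] .= $line . "\n";
        }
    }

    return $sections;
}

// Hypothetical helper: run the FILE section and compare with EXPECT
function run_phpt(string $source): bool
{
    $sections = parse_phpt_sections($source);

    if (!isset($sections['FILE'], $sections['EXPECT'])) {
        return false;
    }

    // Write the code to a temporary file and capture whatever it prints
    $path = tempnam(sys_get_temp_dir(), 'phpt');
    file_put_contents($path, $sections['FILE']);

    ob_start();
    include $path;
    $output = ob_get_clean();

    unlink($path);

    return trim($output) === trim($sections['EXPECT']);
}

$example = <<<'PHPT'
--TEST--
strtoupper example
--FILE--
<?php
echo strtoupper('match'), "\n";
?>
--EXPECT--
MATCH
PHPT;

echo run_phpt($example) ? "PASS\n" : "FAIL\n";
```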

Sample output

margaret ~/tests $ phpt
PHPT Test Runner v0.1.1alpha

.

Test Cases Run: 1, Passes: 1, Failures: 0, Skipped: 0

Posted on

Karazhan DPS

Think I’m getting the hang of this Kara thing. Damage stats for Prince:

Character     Damage            DPS      Comment
Ardren        271565 (23.8%)    863.2    Fire Mage (Me!)
Sormoran      189466 (16.6%)    577.4    Shadow Priest
Quickcrit     180886 (15.8%)    541.5    Hunter
Umparevoker   174769 (15.3%)    533.9    Combat Rogue
Llonjudd      160488 (14.1%)    549.7    Warlock
Taeghas       87207 (7.6%)      255.4    Prot Warrior (Main Tank)
Noobjuicer    73363 (6.4%)      248.6    Prot Warrior

This sounds like a fun project: extract all the data out of Recount (WoW addon) and present it as a webpage.

Updated with next week’s damage. Looks much better.

Character     Damage            DPS      Comment
Ardren        270364 (23.7%)    1017.6   Fire Mage (Me!)
Llonjudd      189502 (16.6%)    739.1    Warlock
Mastagrinda   179299 (15.7%)    641.6    Hunter (Pet not merged)
Umparevoker   169872 (14.9%)    627.3    Combat Rogue
Sormoran      164751 (14.5%)    632.7    Shadow Priest
Taeghas       72767 (6.4%)      259.9    Prot Warrior (Main Tank)
Noobjuicer    67968 (6.0%)      279.5    Prot Warrior
Rex           22738 (2.0%)      92.4     Mastagrinda’s Pet

Posted on