Problem
I’d like to make a page that lists all of the images on my website, along with their titles and alt text.
I’ve already written a program to locate and load all HTML files, but I’m stumped as to how to extract the src, title, and alt from the img tags in that HTML.
I suppose this could be done with regex, but since the order of the attributes can vary, and I need all of them, I’m not sure how to do it gracefully (I could do it the hard way, but that’s painful).
Asked by Sam
Solution #1
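$url = "http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed HTML
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
    // title and alt can be read the same way with getAttribute()
    echo $tag->getAttribute('src');
}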
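The same DOMDocument approach can return all three attributes the question asks for. The following is only a sketch: the function name extract_images and the sample markup are made up for illustration, and libxml_use_internal_errors() is used here in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper (not part of the original answer):
// collect src, title and alt for every <img> in an HTML string.
function extract_images(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $tag) {
        // getAttribute() returns an empty string when the attribute is absent.
        $images[] = [
            'src'   => $tag->getAttribute('src'),
            'title' => $tag->getAttribute('title'),
            'alt'   => $tag->getAttribute('alt'),
        ];
    }
    return $images;
}

// Made-up sample markup; note that the attribute order differs between the two tags.
$sample = '<html><body><p><img src="a.png" alt="first" title="One">'
        . '<img alt="second" src="b.png"></p></body></html>';
print_r(extract_images($sample));
```

Because the parser exposes attributes by name, the order in which they appear in the markup does not matter.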
Answered by karim
Solution #2
Using regexps to tackle a problem like this is a bad idea, as it will almost certainly result in unmaintainable and unreliable code. Use an HTML parser instead.
If you go with regexps anyway, it’s preferable to divide the procedure into two parts: first get all of the img tags, then extract the attributes from each one.
I’m assuming your document isn’t strict XHTML, so you can’t use an XML parser. For example, with the source code of this web page:
/* preg_match_all match the regexp in all the $html string and output everything as
an array in $result. "i" option is used to make it case insensitive */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
Then, using a loop, we get all of the image tag attributes:
$img = array();
// $result[0] holds the full matches, i.e. the img tags themselves
foreach ($result[0] as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Because regexps are CPU-intensive, you might want to cache this page. If you don’t have a cache system, you can build a simple one with ob_start and by saving the output to, and loading it from, a text file.
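A minimal sketch of such a cache, assuming a writable temp directory. The helper name cached_output, the cache file name, and the one-hour lifetime are all hypothetical choices, not something the answer specifies.

```php
<?php
// Hypothetical helper: run $render once, store whatever it prints in $cacheFile,
// and reuse the stored text until it is older than $ttl seconds.
function cached_output(string $cacheFile, int $ttl, callable $render): string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }
    ob_start();                 // start buffering everything that gets echoed
    $render();                  // the expensive work, e.g. the regexp extraction
    $output = ob_get_clean();   // fetch the buffer and stop buffering
    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Assumed cache location and lifetime.
$cacheFile = sys_get_temp_dir() . '/img-list-cache.txt';
echo cached_output($cacheFile, 3600, function () {
    echo "the generated image list would be printed here\n";
});
```

On the first request the callback runs and its output is written to disk; later requests within the lifetime only read the file.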
To begin, we use preg_match_all, a function that finds every string matching the pattern and stores the matches in the array passed as its third parameter.
The regexps:
<img[^>]+>
We apply it to the whole HTML page. It reads as: every string that starts with "<img", contains one or more non-">" characters, and ends with ">".
(alt|title|src)=("[^"]*")
We apply it to each img tag in turn. It reads as: every string that starts with "alt", "title", or "src", followed by "=", a ", a run of characters that are not ", and a closing ". The parentheses isolate the sub-strings they enclose.
Finally, whenever you deal with regexps, it helps to have good tools for testing them quickly. An online regexp tester lets you check whether your regexp is correct.
EDIT: Here’s my response to the first comment.
True, I didn’t consider the (hopefully few) people who use single quotes.
If you only use ', simply replace every " with '.
If you mix both, you should first slap yourself :-), then try using ("|') in place of " and [^"'] in place of [^"].
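One way to accept either quote style is to capture the opening quote and require the same character to close the value. This is a sketch of that idea rather than the pattern from the answer: the helper name img_attributes is made up, and unquoted values such as height=32 are still not handled.

```php
<?php
// Hypothetical helper: return the alt, title and src of a single <img> tag,
// whichever quote character each attribute uses.
function img_attributes(string $imgTag): array
{
    // (?<![\w-]) keeps names such as data-src from matching as src.
    // (["\']) captures the opening quote and \2 demands the same quote at the end,
    // so a value may contain the other quote character.
    $pattern = '/(?<![\w-])(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';
    preg_match_all($pattern, $imgTag, $matches, PREG_SET_ORDER);

    $attributes = [];
    foreach ($matches as $match) {
        $attributes[strtolower($match[1])] = $match[3];
    }
    return $attributes;
}

// Made-up tag mixing both quote styles.
print_r(img_attributes('<img src=\'/img/a.png\' alt="it\'s a logo" />'));
```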
Answered by e-satis
Solution #3
To give you an idea of how to use PHP’s XML capabilities for this purpose, consider the following:
$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
I used the DOMDocument::loadHTML() method because it can handle HTML syntax and does not require an XHTML input document. The conversion to a SimpleXMLElement isn’t strictly necessary; it just makes XPath and its results easier to work with.
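For comparison, here is a sketch of the same query without the SimpleXML conversion, using DOMXPath directly on the document. The variable $rows and the [@src] filter are additions for illustration only.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

// DOMXPath runs XPath queries against the DOMDocument itself.
$xpath = new DOMXPath($doc);

$rows = [];
// [@src] limits the result to img elements that actually carry a src attribute.
foreach ($xpath->query('//img[@src]') as $img) {
    $rows[] = $img->getAttribute('src') . ' ' . $img->getAttribute('alt') . ' ' . $img->getAttribute('title');
}
echo implode("\n", $rows), "\n";
```

The trade-off is slightly more verbose attribute access (getAttribute() instead of array syntax) in exchange for skipping the conversion step.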
Answered by Stefan Gehrig
Solution #4
If your input is XHTML, SimpleXML is all you need.
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>
Output:
object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(3) {
["src"]=>
string(22) "/image/fluffybunny.jpg"
["title"]=>
string(16) "Harvey the bunny"
["alt"]=>
string(26) "a cute little fluffy bunny"
}
}
Answered by DreamWerx
Solution #5
I did that with preg_match.
In my scenario, I had a string from WordPress that contained precisely one <img> tag (and no other markup), and I was trying to extract the src attribute so I could run it through timthumb.
// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);
// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);
You could simply use $pattern = '/title="([^"]*)"/'; to grab the title or $pattern = '/alt="([^"]*)"/'; to grab the alt. Unfortunately, my regex skills aren’t strong enough to grab all three (alt/title/src) in a single pass.
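A single pass is possible with preg_match_all and an alternation over the attribute names. The sketch below assumes, as in the scenario above, a string holding exactly one img tag with double-quoted attributes; the sample tag is made up.

```php
<?php
// Assumed input: one <img> tag with double-quoted attributes.
$image = '<img width="150" src="/uploads/bunny.jpg" alt="a bunny" title="Harvey" />';

// One pass: group 1 collects the attribute names, group 2 their values.
// (?<![\w-]) keeps names such as data-src from matching as src.
preg_match_all('/(?<![\w-])(src|alt|title)="([^"]*)"/i', $image, $matches);

// Pair each name with its value so that $attrs['src'] and friends are available.
$attrs = array_combine(array_map('strtolower', $matches[1]), $matches[2]);

print_r($attrs);
```

If the string held more than one img tag, later attributes would overwrite earlier ones in $attrs, so this shortcut only fits the one-tag case.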
Answered by WNRosenberg
Post is based on https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php