Here’s a small snippet of code that I use very frequently when parsing webpages or other content for specific items. For example, suppose a webpage contains the data you need in this form:
<html><body><h1>ABC</h1>.... <!-- A lot list of code --><div id="myNewsItem">This is my news, and I am interested in extracting this out</div>.... <!-- and the HTML code continues on --></body></html>
I would like to extract the data inside the DIV tag with the id “myNewsItem”.
Here’s the PHP function to do the extraction:
function SimMyExtract($string, $openingTag, $closingTag)
{
    $string = trim($string);
    $start = strpos($string, $openingTag);
    if ($start === false) {
        return false; // opening tag not found
    }
    $start += strlen($openingTag);
    // look for the closing tag only after the opening tag
    $end = strpos($string, $closingTag, $start);
    if ($end === false) {
        return false; // closing tag not found
    }
    return substr($string, $start, $end - $start);
}
Usage for above example:
SimMyExtract( $content, '<div id="myNewsItem">', '</div>' );
You can call it repeatedly (recursively or in a loop) to extract items from a list of similar tags (i.e. when the same tag is used a number of times on the same page). To offer more power, I use it in conjunction with regular expressions. I will spare you the details of RegEx here, but it is extremely powerful, and I love the way RegEx is implemented in PHP (both the Perl-compatible PREG functions and the POSIX EREG functions)…
For instance the same function could be reduced to:
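// ereg() was removed in PHP 7.0, so this uses the PREG equivalent
preg_match('#' . preg_quote($openingTag, '#') . '(.*?)' . preg_quote($closingTag, '#') . '#s', $content, $result);
return isset($result[1]) ? $result[1] : false;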
The point is that RegEx can capture many occurrences in one pass and extract them all, but you need to master it first. As an interesting exercise, try extracting all URLs (the contents of the HREF attributes) from a webpage.
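As a rough sketch of that exercise, the following uses preg_match_all() to collect every href value in a single pass. The function name extractHrefs and the sample markup are made up for illustration, and the pattern assumes quoted attribute values inside <a> tags; real-world HTML may need something more forgiving.

```php
<?php
// Hypothetical helper (not from the original post): collects every href value.
function extractHrefs(string $html): array
{
    // Match href="..." or href='...' inside <a> tags, case-insensitively.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $html, $matches)) {
        return []; // no links found, or the pattern failed
    }
    return $matches[2]; // the second capture group holds the URL text
}

// Made-up sample markup for illustration only.
$sample = '<p><a href="http://example.com/a">A</a> and <a class="x" href=\'/b\'>B</a></p>';
print_r(extractHrefs($sample));
```

The backreference \1 makes sure the closing quote is the same kind as the opening one.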
You could simplify this even more by not passing the closing tag at all:
$closingTag = preg_replace("/\\<\\s*(\\S*)[^>]*\\>/", "</\\1>", $openingTag);
It would seem that this is not going to work with nested tags … as in the case of
<html>
<body>
<h1>ABC</h1>
…. <!-- A lot list of code -->
<div id="myNewsItem">This is my news,
<div class="pullquote">the rain in spain</div>
and I am
interested in extracting this out</div>
…. <!-- and the HTML code continues on -->
</body>
</html>
It would appear that the regex will extract
This is my news,
<div class="pullquote">the rain in spain</div>
and miss
and I am
interested in extracting this out
Am I mistaken? If not, is there a way to make this work as intended?
Perhaps. But when you have a list of nodes that you want to traverse and extract, it’s better to use a DOM parser with XPath queries (domXML in PHP 4, or DOMDocument and DOMXPath in PHP 5 and later).
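A minimal sketch of that DOM approach, assuming PHP 5 or later with the DOM extension enabled. It uses DOMDocument and DOMXPath rather than the old domXML functions, and the markup is the nested example from the question above, condensed onto a few lines.

```php
<?php
// Sample markup adapted from the nested-tag example in the question.
$html = '<html><body><h1>ABC</h1>'
      . '<div id="myNewsItem">This is my news, '
      . '<div class="pullquote">the rain in spain</div> '
      . 'and I am interested in extracting this out</div>'
      . '</body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="myNewsItem"]');

$newsText = null;
if ($nodes !== false && $nodes->length > 0) {
    // textContent includes the text of nested elements such as the pullquote.
    $newsText = trim($nodes->item(0)->textContent);
}
echo $newsText, "\n";
```

Because the parser tracks which closing tag belongs to which opening tag, the nested pullquote no longer cuts the result short.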
Please tell me how you used it recursively. I am using it in that manner, but it returns only one result. I need all of the data that appears between the tags each time they occur.
Please help.
Below is my code:
<?php
function SimMyExtract($string, $openingTag, $closingTag)
{
$string = trim($string);
$start = intval(strpos($string,$openingTag)
+ strlen($openingTag));
$end = intval(strpos($string,$closingTag));
if($start == 0 || $end ==0)
return false; // not found
$mytext = substr($string,$start, $end - $start);
return $mytext;
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://www.lonare.com");
curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout after 30 seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$result = curl_exec ($ch);
curl_close ($ch);
$text = strip_tags($result);
$text = str_replace("Today,", "<div class=\"mydata\">", $text);
$text = str_replace("FML#", " </div> ", $text);
$text1 = SimMyExtract($text, '<div class="mydata">', '</div>');
echo $text1."<br>";
?>
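One way to get every occurrence rather than only the first is to keep searching from where the previous match ended. The sketch below is an assumption about what "recursively" could mean here, not the original author's code; SimMyExtractAll is a hypothetical name and the sample string is made up.

```php
<?php
// Hypothetical helper: returns every substring found between the two tags.
function SimMyExtractAll($string, $openingTag, $closingTag)
{
    $results = [];
    if ($openingTag === '' || $closingTag === '') {
        return $results; // empty delimiters cannot be searched for
    }
    $offset = 0;
    // Resume each search from the end of the previous match.
    while (($start = strpos($string, $openingTag, $offset)) !== false) {
        $start += strlen($openingTag);
        $end = strpos($string, $closingTag, $start);
        if ($end === false) {
            break; // opening tag without a closing tag, stop here
        }
        $results[] = substr($string, $start, $end - $start);
        $offset = $end + strlen($closingTag);
    }
    return $results;
}

// Made-up sample input for illustration only.
$text = '<div class="mydata">first</div> x <div class="mydata">second</div>';
print_r(SimMyExtractAll($text, '<div class="mydata">', '</div>'));
```

The third argument to strpos() is what moves the search forward; without it, every call would find the first occurrence again.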
I can’t seem to make this piece of code work…
This is my PHP file:
<?PHP
function SimMyExtract($string, $openingTag, $closingTag)
{
$string = trim($string);
$start = intval(strpos($string,$openingTag) + strlen($openingTag));
$end = intval(strpos($string,$closingTag));
if($start == 0 || $end ==0)
return false; // not found
$mytext = substr($string,$start, $end - $start);
return $mytext;
}
?>
<html>
<body>
<h1>ABC</h1>
…. <!-- A lot list of code -->
<div id="myNewsItem">This is my news, and I am
interested in extracting this out</div>
…. <!-- and the HTML code continues on -->
<?PHP
$exdata = SimMyExtract($text, '<div id="myNewsItem">', '</div>');
echo $exdata;
?>
</body>
</html>
Any ideas?
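One possible cause, assuming the file is exactly as posted: $text is never assigned, so the function has nothing to search. The markup is sent straight to the browser instead of being held in a variable. The sketch below stores the HTML in a string first (a heredoc here, purely as an illustration) and uses a version of the function that checks strpos() against false.

```php
<?php
function SimMyExtract($string, $openingTag, $closingTag)
{
    $start = strpos($string, $openingTag);
    if ($start === false) {
        return false; // opening tag not found
    }
    $start += strlen($openingTag);
    $end = strpos($string, $closingTag, $start);
    if ($end === false) {
        return false; // closing tag not found
    }
    return substr($string, $start, $end - $start);
}

// Assumption: the page markup is held in a variable before extraction.
$text = <<<HTML
<html>
<body>
<h1>ABC</h1>
<div id="myNewsItem">This is my news, and I am
interested in extracting this out</div>
</body>
</html>
HTML;

$exdata = SimMyExtract($text, '<div id="myNewsItem">', '</div>');
echo $exdata, "\n";
```

Note also that the quotes around the tag arguments must be plain straight quotes; curly quotes pasted from a webpage are not string delimiters in PHP.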