Web Crawler in Perl
Posted at 12:57 AM
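Because of a school assignment, I had to pick up web crawling in Perl in a very short time. After finishing it with an API, I found out the teacher wanted us to write the parser ourselves (damn, I should not have skipped class), so I wrote it a second time with regular expressions. I probably will not need this again for a while, so I am writing it down for my future self or anyone else who might need it.
Readme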
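This program crawls the movie showtime pages of three theaters in Tainan City, downloads the data, and parses it. Because of the assignment requirements it skips some information (whether a showing is 3D, Max, and so on) and only grabs the movie titles and showtimes. It is a very simple little program.
Fetch the HTML -> parse out the information I want -> output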
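The three steps above can be sketched as three small subroutines. This is only a rough outline, not the code I handed in: the names fetch_html, parse_showtimes and format_showtimes are made up for illustration, and the `<div class="movie">` / `<h2>` markup that the parser expects is a simplified, hypothetical layout rather than the real page.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Stage 1: fetch the HTML for one URL. LWP::UserAgent is loaded lazily so
# the parsing stage can be tried without the module or network access.
sub fetch_html {
    my ($url) = @_;
    require LWP::UserAgent;
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    die $response->status_line, "\n" unless $response->is_success;
    return $response->decoded_content;
}

# Stage 2: turn HTML into a list of { title => ..., times => [...] } records.
# The markup matched here is a hypothetical, simplified layout.
sub parse_showtimes {
    my ($html) = @_;
    my @movies;
    while ($html =~ m!<div class="movie">(.*?)</div>!gs) {
        my $block   = $1;
        my ($title) = $block =~ m!<h2>(.*?)</h2>!s;
        my @times   = $block =~ m!(\d\d:\d\d)!g;
        push @movies, { title => $title // '', times => \@times };
    }
    return @movies;
}

# Stage 3: build the output, one title line followed by its showtimes.
sub format_showtimes {
    my (@movies) = @_;
    return join '',
        map { join("\n", $_->{title}, @{ $_->{times} }) . "\n" } @movies;
}

# Run the whole pipeline for every URL given on the command line.
for my $url (@ARGV) {
    print format_showtimes(parse_showtimes(fetch_html($url)));
}
```

Keeping the stages separate means the parser can be tried on a saved HTML string without hitting the site each time.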
Sample code using regular expressions
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# One shared user agent with a 10-second timeout.
my $browser = LWP::UserAgent->new();
$browser->timeout(10);

crawler('http://www.atmovies.com.tw/showtime/t06607/a06/');
crawler('http://www.atmovies.com.tw/showtime/t06608/a06/');
crawler('http://www.atmovies.com.tw/showtime/t06609/a06/');

# Fetch one theater page and print each movie title followed by its showtimes.
sub crawler {
    my ($URL) = @_;
    my $request  = HTTP::Request->new(GET => $URL);
    my $response = $browser->request($request);
    if ($response->is_error()) {
        # Report the failure and skip this page instead of parsing an error body.
        printf "%s\n", $response->status_line;
        return;
    }
    my $data = $response->content();

    # $1 is the block holding the title link, $3 the block holding the showtimes.
    while ($data =~ m!<ul id="theaterShowtimeTable">(.*?)<ul>(.*?)</ul>(.*?)</ul>!gs) {
        my ($item, $otherparse) = ($1, $3);

        # The movie title is the text of the first link in the block.
        if ($item =~ m!<a href=[^>]*>(.*?)</a>!) {
            print "$1\n";
        }

        # Every HH:MM pattern in the remaining block is a showtime.
        while ($otherparse =~ m!(\d\d:\d\d)!g) {
            print "$1\n";
        }
    }
}
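The outer pattern in the regex version depends on two things: the /s modifier, which lets "." match newlines, and the non-greedy ".*?", which stops at the nearest closing tag. A small experiment on a made-up two-list fragment (not the real page) shows what happens without each of them:

```perl
use strict;
use warnings;

# Made-up fragment: two lists, each spanning several lines.
my $html = "<ul>\n<li>10:30</li>\n</ul>\n<ul>\n<li>13:05</li>\n</ul>\n";

# Without /s, "." cannot cross a newline, so nothing matches.
my @no_s = $html =~ m!<ul>(.*?)</ul>!g;

# With /s and a greedy ".*", a single match swallows both lists.
my @greedy = $html =~ m!<ul>(.*)</ul>!gs;

# With /s and a non-greedy ".*?", each list is matched separately.
my @lazy = $html =~ m!<ul>(.*?)</ul>!gs;

printf "no /s: %d, greedy: %d, non-greedy: %d\n",
    scalar @no_s, scalar @greedy, scalar @lazy;
# prints "no /s: 0, greedy: 1, non-greedy: 2"
```

This is why the showtime pattern uses ".*?" together with the "s" flag: it gives one match per movie block instead of one giant match per page.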
Sample code using TreeBuilder
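use strict;
use warnings;
use HTML::TreeBuilder;

# Emit UTF-8 so the Chinese movie titles print correctly.
binmode(STDOUT, ':encoding(utf8)');
binmode(STDERR, ':encoding(utf8)');

my $URL  = 'http://www.atmovies.com.tw/showtime/t06607/a06/';
my $tree = HTML::TreeBuilder->new_from_url($URL);

# Find every element whose id is theaterShowtimeTable.
my @items = $tree->look_down('id', 'theaterShowtimeTable')
    or die("no items\n");

for my $item (@items) {
    my @movies = $item->look_down('_tag', 'li')
        or die("no movies\n");
    my $count = 0;
    for my $movie (@movies) {
        # Treat a missing class attribute as an empty string to avoid warnings.
        my $class = $movie->attr('class') // '';
        # Skip the <li> elements at positions 1 to 3 (0-based) and the
        # theaterElse / filmVersion rows; print everything else.
        if (($count < 1 || $count > 3)
            && $class ne 'theaterElse' && $class ne 'filmVersion') {
            print $movie->as_text, "\n";
        }
        $count++;
    }
}

# Free the parse tree once we are done with it.
$tree->delete;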
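The class filtering can also be pushed into look_down itself by passing a code reference as an extra criterion. The sketch below runs on an inline HTML string so it works offline. The fragment is invented for illustration: the class name filmTitle and the row contents are assumptions, and only theaterShowtimeTable, theaterElse and filmVersion come from the code above. The helper name showtime_rows is also made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented fragment that only loosely imitates the real page.
my $html = <<'HTML';
<ul id="theaterShowtimeTable">
  <li class="filmTitle"><a href="/movie/example/">Example Movie</a></li>
  <li class="filmVersion">Digital</li>
  <li>10:30</li>
  <li>13:05</li>
  <li class="theaterElse">More theaters</li>
</ul>
HTML

# Return the text of every <li> that is not a theaterElse or filmVersion row.
sub showtime_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows = $tree->look_down(
        '_tag', 'li',
        sub {
            # A missing class attribute counts as an empty string.
            my $class = $_[0]->attr('class') // '';
            return $class ne 'theaterElse' && $class ne 'filmVersion';
        },
    );
    my @text = map { $_->as_text } @rows;
    $tree->delete;    # free the parse tree
    return @text;
}

print "$_\n" for showtime_rows($html);
```

With this fragment the loop should print the title followed by the two showtimes. Whether the same filter is enough on the real page depends on its actual markup, which I have not re-checked here.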