Thursday, 18 August 2016

XAPIAN & WEB CRAWLER

Prerequisites:

  • Ubuntu 64-bit
  • Apache 2
  • PHP 5
  • Xapian

 Installing:

Type in a terminal:

sudo apt-get update
sudo apt-get install xapian-tools xapian-omega
sudo apt-get install python-xapian
sudo apt-get install libxapian-dev
sudo apt-get install apache2 php5 libapache2-mod-php5

sudo apt-get install php5-xapian
sudo apt-get install php5-curl php5-gd
sudo a2enmod rewrite   # enable URL rewriting

sudo nano /etc/php5/apache2/php.ini
# scroll down to the "Module Settings" section and add this line:
extension=xapian.so

;;;;;;;;;;;;;;;;;;;
; Module Settings ;
;;;;;;;;;;;;;;;;;;;


Save (Ctrl+O) and exit (Ctrl+X)

sudo service apache2 restart

 Web-Server Test

Create a PHP page at /var/www/html/index.php or /var/www/index.php, depending on your Apache version:
<?php
    echo phpinfo();
?>
Point a browser at http://localhost.
The phpinfo() page should list the xapian extension among the loaded modules.
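This check can also be scripted. A minimal sketch, assuming curl is installed and Apache is serving the page above on localhost; it counts mentions of xapian in the phpinfo() output:

```shell
# Grep the phpinfo() page for the xapian extension (assumes the
# index.php created above and Apache listening on localhost).
URL="http://localhost/"
if command -v curl >/dev/null 2>&1; then
    hits=$(curl -s "$URL" | grep -ic xapian)
    echo "xapian mentions on $URL: ${hits:-0}"
else
    echo "curl not installed; open $URL in a browser instead"
fi
```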
 



Project Folder: $HOME/xac & /var/www/xac ----> $HOME/xac

Pick a working folder in your home directory, e.g. $HOME/xac, then make a symbolic link to it in the web root folder.

Type in a terminal:

mkdir $HOME/xac
cd $HOME/xac
mkdir database
mkdir pages
cd /var/www/
sudo chown $USER:$USER html   # if html is the web root folder
cd html
ln -s $HOME/xac xac
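The steps above can be sketched as one script. WEBROOT is an assumption here, not part of the original commands; use /var/www/html on Apache 2.4 and /var/www on older versions:

```shell
# Create the working tree and link it into the web root.
# WEBROOT is a stand-in; adjust it to your Apache layout.
WEBROOT="${WEBROOT:-/var/www/html}"
mkdir -p "$HOME/xac/database" "$HOME/xac/pages"
# The link needs write access to the web root (chown'ed above).
[ -w "$WEBROOT" ] && ln -sfn "$HOME/xac" "$WEBROOT/xac"
ls -ld "$HOME/xac/database" "$HOME/xac/pages"
```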



Time to Crawl 

We use wget to crawl the pages and websites we want.

Create a script and add the following content.

Type in a terminal:

cd $HOME/xac
nano w-crawl.sh # and add the following content
(Check glyphs, dashes and quotes; they might be altered by the blog.)


#!/bin/bash
# Crawl every URL listed in sites.txt into its own folder under pages/.
[[ ! -d "pages" ]] && mkdir -p pages
while read -r full_url
do
    # Split the URL into protocol, host and path.
    proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
    url="${full_url/$proto/}"
    host="$(echo "${url/*@/}" | cut -d/ -f1)"   # drop any user@ prefix
    path="$(echo "$url" | grep / | cut -d/ -f2-)"
    dir="$proto$host"
    fold=$(echo "$dir" | sed 's/http:\/\///g')
    www=$(echo "$fold" | sed 's/www\.//g')
    folder=$(echo "$fold" | sed 's/\./_/g')     # e.g. www.example.com -> www_example_com
    [[ ! -d "pages/$folder" ]] && mkdir "pages/$folder"
    pushd "pages/$folder"
        # Record the resolved IP, protocol and URL for the indexing step.
        nslookup "$www" | grep Address | grep -v 127 > IP.TXT
        echo "$proto" > PROTO.TXT
        echo "$url" > URL.TXT
        # Throttled, depth-3 recursive crawl; skip binary/media files.
        wget --limit-rate=100k -r -l 3 -R gif,mp*,png,jpg,jpeg,GIF,PNG,JPG,JPEG,ogg,js,rss,xml,feed,.tar.gz,.zip,rar,.rar,.php -t 1 "$full_url"
    popd
done < "./sites.txt"
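To see what the URL-splitting pipeline in the script produces, here is the same sequence of steps traced on a single sample URL (example.com is a stand-in; no network access involved):

```shell
# Trace the URL-splitting logic from w-crawl.sh on one sample URL.
full_url="http://www.example.com/blog/post.html"
proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
url="${full_url/$proto/}"                     # strip the protocol
host="$(echo "$url" | cut -d/ -f1)"           # part before the first /
fold=$(echo "$proto$host" | sed 's/http:\/\///g')
folder=$(echo "$fold" | sed 's/\./_/g')       # dots -> underscores
echo "$proto"    # http://
echo "$host"     # www.example.com
echo "$folder"   # www_example_com
```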


Create an input file for the script. The file contains the URLs of the sites to crawl, one per line.
nano sites.txt



http://www.enjoydecor.com
http://www.coinscode.com
http://www.do-not-crawl-any-search-engines.com

Save (Ctrl+O Ctrl+X)

chmod +x ./w-crawl.sh
./w-crawl.sh 
# wait...


Xapian: Index the Data

The crawl script above generates a folder for each domain under the pages/ directory.
We create a script to index the pages subtree and build the Xapian database.

nano x-index.sh # and add the following content
(Check glyphs, dashes and quotes; they might be altered by the blog.)
#!/bin/bash
PAGES=pages
DATABASE=$PWD/database          # absolute path: we cd into pages/ below
[[ ! -d $DATABASE ]] && mkdir -p "$DATABASE"
pushd $PAGES
    for site in *;do
        url=$(cat "$site/URL.TXT")
        st=$(echo "$url" | sed 's/www\.//g')   # base URL without the www. prefix
        echo "omindex -p --db $DATABASE --url $st $site"
        # -p keeps documents from previous runs that are not duplicates
        omindex -p --db "$DATABASE" --url "$st" "$site"
    done
popd
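Once the index is built, it can be queried from the shell before wiring up any PHP: the quest tool from xapian-tools runs a plain query against a database directory. A sketch, assuming the database/ directory created above and "decor" as a term that exists in the crawled pages:

```shell
# Query the Xapian database from the command line with quest
# (ships with xapian-tools). DB and the query term are assumptions.
DB="database"
QUERY="decor"
if command -v quest >/dev/null 2>&1 && [ -d "$DB" ]; then
    quest -d "$DB" "$QUERY"
else
    echo "quest missing or $DB not built yet; run x-index.sh first"
fi
```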




PHP Search Page

Copy xapian.php into the current folder.

Type in a terminal:
find /usr/share -name xapian.php
/usr/share/php/xapian.php
cp /usr/share/php/xapian.php ./

Create the search page. The form action and the pagination links reference search.php, so after saving index.php also link it:  ln -s index.php search.php

nano index.php # copy the following content (check glyphs, dashes and quotes; they might be altered by the blog)

<HTML>
<HEAD>
<TITLE>
</TITLE>
</HEAD>
<BODY>
<form action="search.php" method="get">
<?php
    include "xapian.php";

    $ss="";
    if(isset($_GET["w"])) $ss=$_GET["w"];
    echo "<input type='text' name='w' id='w' size='30' value='".htmlspecialchars($ss)."' />";
    echo "<input type='submit' name='but' value='Search...'>";


    function config_prep($expression)
    {
        $n_words = count(explode(" ",$expression)); // word count (currently unused)
        if (!get_magic_quotes_gpc())
        {
            $ss = addslashes($expression);
        }
        else
        {
            $ss = $expression;
        }
        $ss = str_replace('_','\_',$ss); // avoid '_' in the query
        $ss = str_replace('%','\%',$ss); // avoid '%' in the query
        $ss = str_replace('\"',' ',$ss); // avoid '"' in the query
        $ss = strtolower($ss);
        $ss = trim(preg_replace('/ +/',' ',$ss)); // collapse repeated blanks
        return $ss;
    }


    function main()
    {
        $_PAGE=5;
        $DATABASE="database";
        $PAGES="";
       
        $n=isset($_GET["n"]) ? $_GET["n"] : $_PAGE; // per page, n-count
        $f=isset($_GET["f"]) ? $_GET["f"] : 0; // from
        $ww=isset($_GET["w"]) ? $_GET["w"] : ""; //word
        $rezc = 0;
        $np = $f;
        $tp = $f;
        $w = config_prep($ww);
        if(strlen($w))
        {
            $warr = explode(" ",$w);
        try {
                $database = new XapianDatabase($DATABASE);
                $enquire = new XapianEnquire($database);
                $query_string = $w;
                $qp = new XapianQueryParser();
                $qp->set_default_op(XapianQuery::OP_AND);
                $stemmer = new XapianStem("english");
                $qp->set_stemmer($stemmer);
                $qp->set_database($database);
                $qp->set_stemming_strategy(XapianQueryParser::STEM_ALL);
                $query = $qp->parse_query($query_string);
                $enquire->set_query($query);
                $matches = $enquire->get_mset($f*$_PAGE, $_PAGE);
                $i      = $matches->begin();
                $rezc   = $matches->get_matches_estimated();
                $paging = $_PAGE;
                $nex    =  $paging + $f*$_PAGE;
                if($rezc)
                {
                    $sr = ($f*$_PAGE)+1;
                    echo "Searched: '{$query_string}' got {$rezc} records. Showing: {$sr}-{$nex}";
                    while (!$i->equals($matches->end()))
                    {
                        $thisdoc = $i->get_document();
                        $data = ($thisdoc->get_data());
                        $n = $i->get_rank() + 1;
                        $doca = explode("\n",$data);
                        $urloutp = strpos($doca[0],"www");
                        $link = substr($doca[0],$urloutp);
                        $urla = parse_url("http://".$link);
                        $ln   = $urla["path"];
                        $file = pathinfo($ln)["basename"];
                        $urla   = str_replace("index.html","",$urla);
                        $webfod = str_replace(".","_",$urla["host"]);
                        $link="";
                        $content="";
                        echo "<div><a href='{$urla["scheme"]}://{$urla["host"]}{$urla["path"]}'><b>{$urla["host"]}</b>{$ln}</a></div>";
                        $data = $doca[1];
                        // strip ANSI escape sequences and stray control characters
                        $data = preg_replace('/\x1b(\[|\(|\))[;?0-9]*[0-9A-Za-z]/', "", $data);
                        $data = preg_replace('/[\x03\x1a]/', "", $data);
                        foreach($warr as $word)
                        {
                            $data=str_ireplace($word,"<font color='blue'><b>". $word.'</b></font>',$data);
                        }
                        echo($data);
                        echo("<div class=rightd>{$n}/{$i->get_percent()}/{$i->get_docid()}</div>");
                        echo ("<hr>");
                        $i->next();
                    }
                }
                $database->close();
            }
            catch (Exception $e)
            {
                echo $e->getMessage() . "\n";
                exit(1);
            }
        }
        $pages = ceil($rezc/$_PAGE);
        $thispage = ceil(($f*$_PAGE)/$_PAGE)+1;

        $navlink="";//<div align='left'>This page:{$thispage} Pages:{$pages}  Records:{$rezc}</div>";
        $encoded_str=urlencode($ww);
        if($f>0)
        {
            $pf=$f-1;
            $navlink.="<a href=search.php?w={$encoded_str}&f={$pf}><img src='iprev.png'></a>";
        }
        else
        {
            $navlink.="<a href=search.php?w={$encoded_str}&f={$f}><img src='istop.png'></a>";
        }
        $navlink.="---{$thispage}/{$pages} ({$rezc})---" ;
        if($pages != $thispage)
        {
            $pf=$f+1;
            $navlink.="<a href=search.php?w={$encoded_str}&f={$pf}><img src='inext.png'></a>";
        }
        else
        {
            $navlink.="<a href=search.php?w={$encoded_str}&f={$f}><img src='istop.png'></a>";
        }

        echo $navlink;
        return $f;
    }


    $ff=main();

?>

</form>
</BODY>
</HTML>
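A quick smoke test of the finished page from the shell. The xac path and the query word "decor" are assumptions carried over from the earlier steps:

```shell
# Fetch the search page with a sample query (assumes the xac
# symlink in the web root and an indexed term such as "decor").
URL="http://localhost/xac/index.php?w=decor"
if command -v curl >/dev/null 2>&1; then
    curl -s "$URL" | grep -o "Searched:[^<]*" || echo "no results (is the index built?)"
else
    echo "curl not installed"
fi
```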


The expected result: a search box with paginated results, the matched terms highlighted in blue.
