Thursday, 18 August 2016



  • Ubuntu 64-bit
  • Apache 2
  • PHP 5
  • Xapian


Type in a terminal:

sudo apt-get update
sudo apt-get install xapian-tools xapian-omega
sudo apt-get install python-xapian
sudo apt-get install libxapian-dev
sudo apt-get install apache2 php5 libapache2-mod-php5

sudo apt-get install php5-xapian
sudo apt-get install php5-curl
sudo apt-get install php5-gd
sudo a2enmod rewrite   # enable the URL rewrite module

sudo nano /etc/php5/apache2/php.ini
# scroll down to the "Module Settings" section and add the xapian extension:

; Module Settings ;
extension=xapian.so

sudo service apache2 restart

Web-Server Test

Create a PHP page at /var/www/html/index.php or /var/www/index.php, depending on your Apache version, with the content:
    <?php phpinfo(); ?>
Point the browser to http://localhost
The page should list the PHP configuration; check that a xapian section appears in it.

Project Folder  $HOME/xac, linked as /var/www/html/xac ----> $HOME/xac

Pick a working folder in your home directory, e.g. $HOME/xac, then make a symbolic link to it in the web root folder.

Type in a terminal:

mkdir $HOME/xac
cd $HOME/xac
mkdir database
mkdir pages
cd /var/www/
sudo chown $USER:$USER html   # if html is the default web root folder
cd html
ln -s $HOME/xac xac
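The resulting layout can be sanity-checked with throwaway directories; the mktemp paths below are stand-ins for /var/www/html and $HOME, used only for illustration:

```shell
# stand-ins for the real paths, for illustration only
WEBROOT=$(mktemp -d)          # plays the role of /var/www/html
HOMEDIR=$(mktemp -d)          # plays the role of $HOME

mkdir -p "$HOMEDIR/xac/database" "$HOMEDIR/xac/pages"
ln -s "$HOMEDIR/xac" "$WEBROOT/xac"

# the web server now reaches the project through the symlink
ls "$WEBROOT/xac"
```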

Time to Crawl 

We use wget to crawl the pages and websites we want to index.

Create a shell script with the following content.

Type in a terminal:

cd $HOME/xac
nano crawl.sh # add the following content (any file name works; crawl.sh is used here)
(Check for glyphs, dashes and quotes, might be altered by the blog)

#!/bin/bash
[[ ! -d "pages" ]] && mkdir -p pages
while read -r full_url; do
    # split the URL into protocol, host and path
    proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
    url="${full_url/$proto/}"
    host="$(echo "$url" | cut -d/ -f1)"
    path="$(echo "$url" | grep / | cut -d/ -f2-)"
    www="$(echo "$host" | sed 's/www\.//g')"
    folder="$(echo "$host" | sed 's/\./_/g')"
    [[ ! -d "pages/$folder" ]] && mkdir "pages/$folder"
    pushd "pages/$folder"
        # record the resolved IP, protocol and URL for the indexing step
        nslookup "$www" | grep Address | grep -v 127 > IP.TXT
        echo "$proto" > PROTO.TXT
        echo "$url" > URL.TXT
        # mirror up to 3 levels deep, skipping media, archives and scripts
        wget --limit-rate=100k -r -l 3 -R gif,mp*,png,jpg,jpeg,GIF,PNG,JPG,JPEG,ogg,js,rss,xml,feed,tar.gz,zip,rar,php -t 1 "$full_url"
    popd
done < "./sites.txt"
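The protocol/host/path splitting used by the script can be checked in isolation; a quick sanity run with a made-up URL:

```shell
# hypothetical URL, just to exercise the parsing pipeline
full_url="http://www.example.com/docs/page.html"

proto="$(echo "$full_url" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
url="${full_url/$proto/}"
host="$(echo "$url" | cut -d/ -f1)"
path="$(echo "$url" | grep / | cut -d/ -f2-)"
www="$(echo "$host" | sed 's/www\.//g')"
folder="$(echo "$host" | sed 's/\./_/g')"

echo "proto=$proto host=$host path=$path www=$www folder=$folder"
```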

Create an input file for the script. The file contains the URLs of the sites to crawl, one per line.
nano sites.txt

Save (Ctrl+O Ctrl+X)
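For illustration, sites.txt can also be created non-interactively; the URLs below are hypothetical placeholders:

```shell
# hypothetical site list, one full URL per line
cat > sites.txt <<'EOF'
http://www.example.com/
http://www.example.org/docs/
EOF

cat sites.txt
```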

chmod +x ./crawl.sh   # the crawl script created above
./crawl.sh
# wait...

Xapian Index the data

The crawl script above generates one folder per domain under the pages/ directory.
We create a script that indexes the pages subtree and builds the Xapian database.

nano index_pages.sh # add the following content (any file name works)
(Check for glyphs, dashes and quotes, might be altered by the blog)
#!/bin/bash
DATABASE=$HOME/xac/database   # created earlier
PAGES=$HOME/xac/pages
[[ ! -d "$DATABASE" ]] && mkdir -p "$DATABASE"
pushd "$PAGES"
    for site in *; do
        url=$(cat "$site/URL.TXT")
        st=$(echo "$url" | sed 's/www\.//g')
        echo "omindex -p --db $DATABASE --url $st $site"
        omindex -p --db "$DATABASE" --url "$st" "$site"
    done
popd
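The loop's bookkeeping can be dry-run against a fake pages/ tree; the echo prints the omindex invocation without executing it (the paths and URL here are made up):

```shell
# build a throwaway pages/ tree like the crawler would
PAGES=$(mktemp -d)/pages
DATABASE=/tmp/xac_database   # placeholder path, not actually created
mkdir -p "$PAGES/www_example_com"
echo "www.example.com/" > "$PAGES/www_example_com/URL.TXT"

cd "$PAGES"
for site in *; do
    url=$(cat "$site/URL.TXT")
    st=$(echo "$url" | sed 's/www\.//g')
    CMD="omindex -p --db $DATABASE --url $st $site"
    echo "$CMD"   # dry run: the command the real script would execute
done
```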

PHP Search page

Copy xapian.php (the PHP bindings wrapper installed by php5-xapian) into the current folder.

Type in a terminal:
find /usr/share -name xapian.php
cp /usr/share/php/xapian.php ./

Create the index.php file. The form and the navigation links in the code point to search.php, so also copy the finished file to search.php (or symlink it).

nano index.php # copy following content (Check for glyphs, dashes and quotes, might be altered by the blog)


<?php
include "xapian.php";

// Assumed settings: the database folder built by omindex (relative to this
// script, i.e. $HOME/xac/database) and the number of results per page.
$DATABASE = "database";
$_PAGE = 10;
?>
<form action="search.php" method="get">
<?php
    $ss = "";
    if (isset($_GET["w"])) $ss = $_GET["w"];
    echo "<input type='text' name='w' id='w' size='30' value='{$ss}' />";
    echo "<input type='submit' name='but' value='Search...'>";
?>
</form>
<?php
    function config_prep($expression)
    {
        if (get_magic_quotes_gpc())
            $ss = stripslashes($expression);
        else
            $ss = $expression;
        $ss = str_replace('_','\_',$ss); // avoid '_' in the query
        $ss = str_replace('%','\%',$ss); // avoid '%' in the query
        $ss = str_replace('"',' ',$ss);  // avoid '"' in the query
        $ss = strtolower($ss);
        $ss = trim(preg_replace('/ +/'," ",$ss)); // no more than 1 blank
        return $ss;
    }

    function main()
    {
        global $DATABASE, $_PAGE;
        $f  = isset($_GET["f"]) ? (int)$_GET["f"] : 0;  // from (page offset)
        $ww = isset($_GET["w"]) ? $_GET["w"] : "";      // word
        $rezc = 0;
        $w = config_prep($ww);
        $warr = explode(" ", $w);
        try {
            $database = new XapianDatabase($DATABASE);
            $enquire = new XapianEnquire($database);
            $query_string = $w;
            $qp = new XapianQueryParser();
            $stemmer = new XapianStem("english");
            $qp->set_stemmer($stemmer);
            $qp->set_database($database);
            $query = $qp->parse_query($query_string);
            $enquire->set_query($query);
            $matches = $enquire->get_mset($f*$_PAGE, $_PAGE);
            $i    = $matches->begin();
            $rezc = $matches->get_matches_estimated();
            $nex  = $_PAGE + $f*$_PAGE;
            $sr   = ($f*$_PAGE)+1;
            echo "Searched:'{$query_string}' got:{$rezc} records. Showing: {$sr}-{$nex}";
            while (!$i->equals($matches->end())) {
                $thisdoc = $i->get_document();
                $data = $thisdoc->get_data();
                $n = $i->get_rank() + 1;
                // omindex stores the URL on the first line of the document data
                $doca = explode("\n", $data);
                $urloutp = strpos($doca[0], "www");
                $link = substr($doca[0], $urloutp);
                $urla = parse_url("http://".$link);
                $ln = $urla["path"];
                echo "<div><a href='{$urla["scheme"]}://{$urla["host"]}{$urla["path"]}'><b>{$urla["host"]}</b>{$ln}</a></div>";
                // the second line holds the sample text; strip escape/control chars
                $data = $doca[1];
                $data = preg_replace('/\x1b(\[|\(|\))[;?0-9]*[0-9A-Za-z]/', "", $data);
                $data = preg_replace('/[\x03\x1a]/', "", $data);
                // highlight the searched words in the snippet
                foreach ($warr as $word)
                    $data = str_ireplace($word, "<font color='blue'><b>".$word."</b></font>", $data);
                echo "<div>{$data}</div>";
                echo "<div class=rightd>{$n}/{$i->get_percent()}/{$i->get_docid()}</div>";
                echo "<hr>";
                $i->next();
            }
        } catch (Exception $e) {
            echo $e->getMessage() . "\n";
        }
        $pages = ($rezc > 0) ? ceil($rezc/$_PAGE) : 1;
        $thispage = $f + 1;
        $encoded_str = urlencode($ww);

        $navlink = "";//"<div align='left'>This page:{$thispage} Pages:{$pages}  Records:{$rezc}</div>";
        if ($f > 0) {
            $pf = $f - 1;
            $navlink .= "<a href=search.php?w={$encoded_str}&f={$pf}><img src='iprev.png'></a>";
            $navlink .= "<a href=search.php?w={$encoded_str}&f=0><img src='istop.png'></a>";
        }
        $navlink .= "---{$thispage}/{$pages} ({$rezc})---";
        if ($pages != $thispage) {
            $pf = $f + 1;
            $navlink .= "<a href=search.php?w={$encoded_str}&f={$pf}><img src='inext.png'></a>";
        }
        echo $navlink;
        return $f;
    }

    main();
?>



The expected result

(screenshot of the search page with highlighted results)
