CLUEWEB API

ClueWeb Part B was indexed in Elasticsearch 5.3 with stemming and stopword removal. Retrieval uses the default ranking algorithm, BM25. The Elasticsearch instance is deployed on the Microsoft Azure platform and is supported by a Microsoft Azure Research Award. Many thanks to Dr. Guido Zuccon and his team for setting up and indexing the collection.

To query the API, you can send a curl request or write a script. Some examples are below.

To get the top ten results for the query "clueless": 
curl -XGET '40.68.209.241:9200/clueweb12_docs/_search?q=clueless&pretty'

To change the number of results returned, add the "size" parameter; here we request 20 results:
curl -XGET '40.68.209.241:9200/clueweb12_docs/_search?q=clueless&pretty&size=20'

You can also include "from" to specify an offset into the ranked list, e.g. to fetch the next page of results.
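
As a rough sketch of paging with "size" and "from", the helper below builds the request URL for a given page. The function name clueweb_page_url and the 1-based page numbering are illustrative assumptions, not part of the API. It also assumes a single-word query that needs no URL encoding.

```shell
# Base URL of the search endpoint, taken from the examples above.
CLUEWEB_SEARCH='40.68.209.241:9200/clueweb12_docs/_search'

# Hypothetical helper: print the URL for one page of results.
#   $1 = query term (assumed to need no URL encoding)
#   $2 = page number, starting at 1
#   $3 = results per page (defaults to 10)
clueweb_page_url() {
    query=$1
    page=$2
    size=${3:-10}
    # "from" is the zero-based offset of the first result on the page.
    from=$(( (page - 1) * size ))
    printf '%s?q=%s&pretty&size=%s&from=%s\n' "$CLUEWEB_SEARCH" "$query" "$size" "$from"
}

# Example: fetch the second page of ten results (ranks 11 to 20).
# curl -XGET "$(clueweb_page_url clueless 2)"
```

By default Elasticsearch rejects requests where "from" plus "size" exceeds 10,000 (the index.max_result_window setting). Whether this instance changes that limit is not stated here.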

Each result contains the fields "_index", "_type", "_id", "_score", and "_source"; "_source" in turn contains "body", "title", and "spam_rank". The "_score" field is the BM25 score, "_id" is the ClueWeb TREC id, and "spam_rank" is the spam score from the Waterloo Spam Rankings (http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/). The spam scale runs from 0 (most spammy) to 99 (least spammy), so documents with a spam_rank close to zero are the spammiest.
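
The spam scores can also be used to filter results on the server side. The sketch below wraps the query in a "bool" query with a "range" filter, so that only documents with a spam_rank at or above a threshold are returned. It assumes that spam_rank is indexed as a numeric field, which is not confirmed here. The function name clueweb_nonspam_body and the default threshold of 70 are illustrative choices.

```shell
# Hypothetical helper: print a request body that matches $1 in "body"
# and keeps only documents whose spam_rank is >= $2 (default 70,
# an illustrative threshold).
# Assumes spam_rank is indexed as a number and that $1 contains no
# characters that need JSON escaping.
clueweb_nonspam_body() {
    term=$1
    min_rank=${2:-70}
    cat <<EOF
{
    "query": {
        "bool": {
            "must": { "match": { "body": "$term" } },
            "filter": { "range": { "spam_rank": { "gte": $min_rank } } }
        }
    }
}
EOF
}

# Example request:
# curl -XGET '40.68.209.241:9200/clueweb12_docs/_search?pretty' \
#      -H 'Content-Type: application/json' \
#      -d "$(clueweb_nonspam_body clueless 70)"
```

Because the condition sits in the "filter" clause, it only removes documents and should not change the BM25 "_score" of the ones that remain.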

To get snippets (called highlights) back for each result, you can issue the following request.

curl -XGET '40.68.209.241:9200/clueweb12_docs/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "query" : {
        "match": { "body": "clueless" }
    },
    "highlight" : {
        "fields" : {
            "body" : {"type" : "plain"}
        }
    }
}'
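
If only short snippets are needed, the response can be trimmed further. The sketch below is one possible variation on the highlight request: "_source" limits which stored fields come back for each hit, while "fragment_size" (in characters) and "number_of_fragments" control the snippets. These are standard Elasticsearch options. The function name clueweb_snippet_body and the values 150 and 2 are illustrative assumptions.

```shell
# Hypothetical helper: print a highlight request body for query term $1.
# Returns only "title" and "spam_rank" from "_source" (omitting the full
# "body") and asks for at most 2 snippets of about 150 characters each.
# Assumes $1 contains no characters that need JSON escaping.
clueweb_snippet_body() {
    term=$1
    cat <<EOF
{
    "_source": ["title", "spam_rank"],
    "query": {
        "match": { "body": "$term" }
    },
    "highlight": {
        "fields": {
            "body": { "type": "plain", "fragment_size": 150, "number_of_fragments": 2 }
        }
    }
}
EOF
}

# Example request:
# curl -XGET '40.68.209.241:9200/clueweb12_docs/_search?pretty' \
#      -H 'Content-Type: application/json' \
#      -d "$(clueweb_snippet_body clueless)"
```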

For more parameters that you can use, see the Elasticsearch Search API documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html).