Dataset

Evaluation Results can be found here.

In the NTCIR-12 IMine-2 task, we will provide the following datasets for both the Query Understanding and Vertical Incorporating subtasks:

Topics

The topics for the IMine-2 task can be downloaded here.

100 queries will be provided as the topics of the NTCIR-12 IMine-2 task. The same query topics will be adopted in both Query Understanding and Vertical Incorporating subtasks for all languages. These topics are sampled from the median-frequency queries collected from AOL, Sogou and Bing search logs. Five types of queries, namely, ambiguous, faceted, very clear, task-oriented, and vertical-oriented, are included in the query topic set. About one-third of topics are shared among different languages for possible future cross-language research purposes.

The details of the five query types are as follows:

  • Ambiguous: The concepts/objects behind the query are ambiguous (e.g., “jaguar” -> car, animal, etc.)
  • Faceted: The information needs behind the query include many facets or aspects (e.g., “harry potter” -> movie, book, Wikipedia, etc.)
  • Very clear: The information need behind the query is very clear, so a single document can usually satisfy it (e.g., “Tanaka laboratory homepage”)
  • Task-oriented: The search intent behind the query relates to the searcher’s goal (e.g., “lose weight” -> exercise, healthy food, medicine, etc.)
  • Vertical-oriented: The search intent behind the query strongly indicates a specific vertical (e.g., “iPhone photo” -> Image vertical)

Document Collection

For the Vertical Incorporating subtask, we provide the following document collection as the resource of organic (Web) documents.
Participants who submit runs to the Vertical Incorporating subtask are required to submit at least one run that uses this dataset.

  • IMine-2 Web Corpus (for English, Chinese subtasks):
    The collection contains the top 500 results of the Bing Search API, which we crawled during July-August 2015.
    The collection contains the HTML files of the documents as well as their title, URL, and summary returned by the Bing Search API. For more detail, please see the readme file in the archive. Please note that participants in the Vertical Incorporating subtask are required to submit at least one run that uses this corpus.
    Participants are allowed to use the collection not only for the VI subtask but also for the QU subtask to predict possible intents. The IMine-2 Web Corpus can be downloaded here.

In addition, participants can use the following two document collections as resources of organic (Web) documents.

  • SogouT (for Chinese subtask):
    The collection contains about 130M Chinese pages together with the corresponding link graph. The size is roughly 5TB uncompressed. The data was crawled and released in November 2008. Further information regarding this collection can be found on the page http://www.sogou.com/labs/dl/t-e.html. You can contact chenjing@sogou-inc.com directly to obtain the data set. We plan to provide more recent data.
  • ClueWeb12-B13 (for English subtask):
    For the English Document Ranking subtask, the ClueWeb12-B13 data set will be adopted. It includes 52M English Web pages crawled during 2012 and a search interface provided by the Lemur project. We thank Prof. Jamie Callan and his team for providing the collection, which dramatically reduces the effort required of participants. Further information regarding the collection can be found at http://lemurproject.org/clueweb12/.

Note that participants are NOT allowed to mix different collections in one run. For example, if a run contains documents from the IMine-2 Web Corpus, it should not contain documents from SogouT.
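
The one-collection-per-run rule can be checked before submission. The following is a minimal sketch, not an official validation tool. It assumes a run has already been parsed into (run name, document ID) pairs, and that the source collection can be guessed from a document ID prefix. The "clueweb12-" prefix follows the ClueWeb12 ID style; the "sogou-" prefix and the fallback to the IMine-2 Web Corpus are hypothetical and must be adapted to the real ID formats.

```python
from collections import defaultdict


def guess_collection(doc_id):
    """Guess the source collection of a document ID.

    Assumption: the prefixes below are for illustration only.
    "sogou-" and the fallback branch are hypothetical conventions.
    """
    if doc_id.startswith("clueweb12-"):
        return "ClueWeb12-B13"
    if doc_id.startswith("sogou-"):
        return "SogouT"
    return "IMine-2 Web Corpus"


def collections_per_run(run_rows, collection_of=guess_collection):
    """Map each run name to the set of collections its documents come from.

    run_rows is an iterable of (run_name, doc_id) pairs (assumed layout).
    """
    used = defaultdict(set)
    for run_name, doc_id in run_rows:
        used[run_name].add(collection_of(doc_id))
    return dict(used)


def mixed_runs(run_rows, collection_of=guess_collection):
    """Return the sorted names of runs that use more than one collection."""
    per_run = collections_per_run(run_rows, collection_of)
    return sorted(name for name, cols in per_run.items() if len(cols) > 1)


if __name__ == "__main__":
    # Hypothetical rows: runA mixes two collections, runB does not.
    rows = [
        ("runA", "clueweb12-0000tw-00-00000"),
        ("runA", "sogou-000123"),
        ("runB", "doc-0001"),
        ("runB", "doc-0002"),
    ]
    for name in mixed_runs(rows):
        print(f"{name} mixes collections and would violate the rule")
```

Any run reported by `mixed_runs` should be split or regenerated so that all of its documents come from a single collection.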

Query log/suggestions/recommendations

The following data is provided so that participants can predict/mine intents for a given query. We also encourage participants to use other external resources for their runs on both the QU and VI subtasks.

  • Web Search Related Query Data from Yahoo! JAPAN (for Japanese subtask):
    This dataset is generated from the query log of Yahoo! Japan Search from July 2009 to June 2013 (related page). The user agreement form (memorandum) for the dataset is now available here. The download URL of the dataset will be provided after the agreement form has been sent to NII.
  • SogouQ (for Chinese subtask):
    The SogouQ search user behavior data collection is available to participants as an additional resource. The collection contains queries and click-through data collected and sampled in November 2008 (consistent with SogouT). A new version of SogouQ, which is a sample of data collected in 2012, is also available now. Further information regarding the data can be found on the page http://www.sogou.com/labs/dl/q.html.
  • Query suggestions/completions of several commercial search engines (for Chinese, English, Japanese subtasks):
    A list of query suggestions/completions collected from popular commercial search engines such as Google, Yahoo!, Bing, and Baidu is provided as possible subtopic candidates. download

Download

If you don’t know the ID and password for the download page, please contact us (imine2-organizers at dl.kuis.kyoto-u.ac.jp).

Evaluation Results

Preliminary Evaluation Results for Query Understanding subtask:

Preliminary Assessment Results for Query Understanding subtask:

Preliminary Evaluation Results for Vertical Incorporating subtask:

Preliminary Assessment Results for Vertical Incorporating subtask: