# Query Understanding Subtask (QU) (Chinese, English, Japanese)

The Query Understanding subtask is defined as follows: given a query, the participant is required to generate a diversified ranked list of not more than 10 subtopics with their relevant vertical intents. In the QU subtask, a subtopic of a given query is viewed as a search intent that specializes and/or disambiguates the original query. The participants are expected to (1) rank important subtopics higher, (2) cover as many intents of a given query as possible, and (3) predict a relevant vertical for each subtopic.

This subtask corresponds to the Subtopic Mining (SM) subtask in IMine-1, INTENT2 and INTENT. The difference from the previous SM subtask is that participants are also required to identify the relevant vertical for each subtopic. In other words, for a given query, the participants have to identify its important subtopics and which vertical should be presented for the subtopic. The options of verticals can be found here.

For example, for the query “iPhone 6”, a possible result list is:

[Topic ID] [Subtopic] [Vertical Intent] [Score] [Run Name]
IMINE2-E-001 iPhone 6 apple Web 0.98 KUIDL-Q-E-1Q
IMINE2-E-001 iPhone 6 sales News 0.90 KUIDL-Q-E-1Q
IMINE2-E-001 iPhone 6 photo Image 0.88 KUIDL-Q-E-1Q
IMINE2-E-001 iPhone 6 review Web 0.78 KUIDL-Q-E-1Q

where [Topic ID] is a topic ID, [Subtopic] is a string that the system generates as a subtopic, [Vertical Intent] is the estimated vertical relevant to the subtopic, and [Score] is the estimated importance of the subtopic. For [Vertical Intent], the system must pick one vertical out of the six verticals defined for each language (see here for the vertical types for each language). For example, for the English QU subtask, a vertical intent should be “Web”, “Image”, “News”, “QA”, “Encyclopedia”, or “Shopping”. Note that we do not use the [Score] values in our evaluation; only the order of subtopics and their vertical intents is used, and the ranks of the subtopics are determined solely by their order of appearance in the submission file. Also note that participants may leave the [Vertical Intent] field blank if they want to focus only on subtopic mining.
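The run format above can be checked mechanically before submission. Below is a minimal sketch of such a check, assuming whitespace-separated fields as in the example (the authoritative format is whatever the organizers specify); the function name and the restriction to the English verticals are illustrative assumptions.

```python
# Allowed verticals for the English QU subtask, as listed in the text above.
ENGLISH_VERTICALS = {"Web", "Image", "News", "QA", "Encyclopedia", "Shopping"}

def parse_qu_line(line):
    """Split one run line into (topic_id, subtopic, vertical, score, run_name).

    The subtopic may contain spaces, so we split the three trailing fields
    off from the right first, then separate the topic ID from the subtopic.
    (A blank [Vertical Intent] field is ambiguous under pure whitespace
    separation and is not handled by this sketch.)
    """
    topic_and_subtopic, vertical, score, run_name = line.rsplit(None, 3)
    topic_id, subtopic = topic_and_subtopic.split(None, 1)
    if vertical not in ENGLISH_VERTICALS:
        raise ValueError("unknown vertical: " + vertical)
    return topic_id, subtopic, vertical, float(score), run_name
```

For instance, parsing the first example line above yields topic `IMINE2-E-001`, subtopic `iPhone 6 apple`, vertical `Web`, and score `0.98`.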

Participants are expected to generate a result list by analyzing the document collections, user behavior data, and query suggestions/completions provided by the organizers (see Dataset). Participants are also encouraged to use external resources to tackle the task.

# Vertical Incorporating Subtask (VI) (Chinese, English)

In the Vertical Incorporating subtask, given a query and the document collection, the system is required to return a diversified ranked list of not more than 100 results. The objectives of the ranking are to (1) rank documents relevant to important intents higher, (2) rank vertical results (defined as virtual documents in IMine-2) relevant to important intents higher, and (3) cover as many intents as possible. Since some queries have very clear intents, participants are encouraged to diversify the results selectively.

This subtask corresponds to the Document Ranking (DR) subtask in IMine, INTENT2 and INTENT. The difference from the previous DR subtask is that the participants should decide whether the result list should contain certain types of vertical results. For this purpose, the participants can include virtual documents as well as organic documents in their ranking.
A virtual document is a special document that represents a search result generated from a vertical. More specifically, for the English subtask, the participants can use the following virtual documents: Vertical-Image, Vertical-News, Vertical-QA, Vertical-Encyclopedia, Vertical-Shopping. For the Chinese subtask, the participants can use the following virtual documents: Vertical-Image, Vertical-News, Vertical-Download, Vertical-Encyclopedia, Vertical-Shopping. A virtual document of a vertical is assumed to be an ideal search result generated by that vertical, and it is relevant if and only if its vertical is relevant to one of the intents behind the query. Using the document collections and the virtual documents, the participants have to decide which virtual documents should be ranked higher while keeping the ranking diversified.
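One simple (and deliberately naive) way to act on this decision is to place a virtual document ahead of the organic results whenever the estimated probability of its vertical exceeds a threshold. The sketch below illustrates this; the probability estimates and the 0.3 threshold are assumptions for illustration, not part of the task definition, and a real system would interleave rather than stack all virtual documents at the top.

```python
def incorporate_verticals(organic_ranking, vertical_probs, threshold=0.3):
    """Prepend Vertical-* virtual documents whose estimated vertical
    probability for the query is at least `threshold`, highest first.

    organic_ranking: ranked list of organic document IDs
    vertical_probs:  {vertical name: estimated relevance probability}
    """
    virtuals = sorted(
        (p, "Vertical-" + v)
        for v, p in vertical_probs.items()
        if v != "Web" and p >= threshold  # organic results already cover "Web"
    )
    virtuals.reverse()  # highest-probability vertical first
    return [doc for _, doc in virtuals] + organic_ranking
```

For example, with estimated probabilities {Image: 0.6, News: 0.4, QA: 0.1}, the sketch yields Vertical-Image, then Vertical-News, followed by the organic ranking.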

For example, a possible result list for the Vertical Incorporating subtask is:

[Topic ID] [Doc ID] [Score] [Run Name]
IMINE2-E-001 IMINE2-E-001-013.html 0.78 KUIDL-V-E-1M
IMINE2-E-001 Vertical-News 0.7 KUIDL-V-E-1M
IMINE2-E-001 Vertical-Image 0.6 KUIDL-V-E-1M
IMine-2-E-001 IMINE2-E-001-113.html 0.5 KUIDL-V-E-1M

where [Topic ID] is a topic ID, [Doc ID] is either a document ID in the document collection or a virtual document ID, and [Score] is the estimated importance of the document. Note that we do not use the [Score] values in the evaluation; only the order of documents is used, and the ranks of the documents are determined solely by their order of appearance in the submission file.

As the organic (Web) document collection, we provide a data collection (the IMine-2 Web Corpus) for both the English and Chinese subtasks. The collection consists of the top 500 results from the Bing Search API. Participants in the VI subtask are required to submit at least one run that uses this collection.

Also, participants are allowed to use SogouT for Chinese runs and ClueWeb12-B13 for English runs.
In order to help participants who are not able to construct their own retrieval platforms for SogouT, we plan to provide a non-diversified baseline Chinese VI run based on our own retrieval system. Retrieval results can also be obtained through the ClueWeb12 search interface.
For more details about the document collections, please see the dataset page.

## Subtopics, Verticals, Vertical Intents

We describe several concepts that are important in IMine-2.

### Subtopics

In the Query Understanding subtask, participants are required to return a ranked list of subtopics, not a ranked list of document IDs.
A subtopic of a given query is a query that specializes and/or disambiguates the search intent of the original query.
If a string returned in response to the query does neither, it is considered irrelevant.

e.g.
original query: “jaguar” (ambiguous)
subtopic: “jaguar car brand” (disambiguate)
incorrect: “jaguar jaguar” (does not disambiguate; does not specialize)

e.g.
original query: “harry potter” (underspecified)
subtopic: “harry potter movie” (specialize)
incorrect: “harry potter hp” (does not specialize; does not disambiguate)

In the NTCIR IMine task, the organizers cluster subtopics into intents and then evaluate the participant runs based on those intents. Examples of intents for a query can be found in the IMine-1, INTENT-2, and INTENT-1 task data.

### Verticals

Nowadays, many commercial Web search engines merge several types of search results to generate a SERP (search engine results page) in response to a user’s query. For example, the results for the query “flower” may now contain image results and encyclopedia results as well as the usual Web search results. We refer to such “types” of search results as verticals: for example, “image”, “movie”, “audio”, “finance”, and “news” can each be a vertical. Please also refer to the TREC Federated Web Search track for the notion of a vertical.

In IMine-2, we select six verticals for each of the Chinese, English, and Japanese topics. More specifically, we consider the following verticals:

###### Chinese
• Web
• Image
• News
• Download
• Encyclopedia
• Shopping

###### English
• Web
• Image
• News
• QA
• Encyclopedia
• Shopping

###### Japanese
• Web
• Image
• News
• QA
• Encyclopedia
• Shopping

Here is the typical representation of each vertical in a SERP:

| Vertical | Typical representation |
| --- | --- |
| Image | Online images/photos |
| News | News articles |
| Encyclopedia | Encyclopedia/dictionary entries |
| Download | Software |
| QA | |
| Shopping | Products in e-commerce sites |

### Vertical Intents

Relevant verticals depend on the intents behind a query. For a user who searches for “iPhone 6 photo,” for example, the image vertical might be much more relevant than the usual Web search results. A vertical intent can be defined as a preference over verticals for a given intent. In the Query Understanding subtask, the participants are required to identify a relevant vertical intent for each subtopic.

# Evaluation

Below, we describe preliminary versions of the evaluation metrics for the Query Understanding and Vertical Incorporating subtasks to help participants design their algorithms.

In the QU subtask, the quality of the participants’ outputs (= runs) is evaluated based on both the diversity of intents and the accuracy of vertical intent prediction.

After the runs are submitted by the participants, we will obtain a set of intents $I_{q}$, the intent probabilities $p(i|q)$, and the vertical probabilities $p(v|i)$ through the following annotation procedure:

1. We pool the subtopics submitted by the participants.
2. The assessors manually cluster subtopics into intents.
3. The assessors vote for intents that they think are important. Note that for queries about news-related topics, importance will be judged as of August 17th, 2015, the date on which the documents were crawled.
4. For each intent, the assessors vote for verticals that they think are important for the intent. We plan to follow a procedure similar to the one used in FedWeb13: the assessors will make pairwise preference judgments between two verticals.
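The pairwise preferences from step 4 must be aggregated into the vertical probabilities $p(v|i)$. The exact aggregation is up to the organizers; the sketch below shows one simple possibility (normalized win counts), purely as an illustration of how such preferences could become a distribution.

```python
from collections import Counter

def vertical_probs(preferences, verticals):
    """Turn pairwise preferences into a distribution over verticals.

    preferences: list of (winner, loser) vertical pairs from the assessors
    verticals:   the verticals defined for the language

    Returns {vertical: win count / total wins}; uniform if no judgments.
    This win-count scheme is an illustrative assumption, not the
    organizers' actual method.
    """
    wins = Counter(winner for winner, _loser in preferences)
    total = sum(wins[v] for v in verticals)
    if total == 0:
        return {v: 1.0 / len(verticals) for v in verticals}
    return {v: wins[v] / total for v in verticals}
```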

The diversity of intents is measured by D#-nDCG, proposed by Sakai et al.
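As a reminder of how D#-nDCG works: following Sakai and Song (2011), it linearly combines intent recall (I-rec, the fraction of intents covered by the ranked list) with D-nDCG, an nDCG computed over global gains $GG(d) = \sum_{i} p(i|q)\, rel_{i}(d)$. The sketch below illustrates this; the equal weighting (`gamma=0.5`) and the ideal-list construction over all judged documents are common choices but assumptions here.

```python
import math

def d_sharp_ndcg(ranking, intent_probs, rels, gamma=0.5):
    """Sketch of D#-nDCG = gamma * I-rec + (1 - gamma) * D-nDCG.

    ranking:      ranked list of doc/subtopic IDs
    intent_probs: {intent: p(i|q)}
    rels:         {(doc, intent): graded relevance}
    """
    def gg(doc):  # global gain of a document
        return sum(p * rels.get((doc, i), 0) for i, p in intent_probs.items())

    # I-rec: fraction of intents covered by at least one relevant item
    covered = {i for doc in ranking for i in intent_probs
               if rels.get((doc, i), 0) > 0}
    i_rec = len(covered) / len(intent_probs)

    # D-nDCG: discounted cumulative global gain over the run,
    # normalized by an ideal ranking of all judged documents
    def dcg(docs):
        return sum(gg(d) / math.log2(r + 2) for r, d in enumerate(docs))

    judged = {doc for doc, _ in rels}
    ideal = sorted(judged, key=gg, reverse=True)[:len(ranking)]
    d_ndcg = dcg(ranking) / dcg(ideal) if dcg(ideal) > 0 else 0.0
    return gamma * i_rec + (1 - gamma) * d_ndcg
```

A run that covers every intent and matches the ideal ordering scores 1.0 under this sketch.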

As for the accuracy of vertical intent prediction, we will use the $V\text{-}score$, which is defined as:

$$V\text{-}score = \frac{\sum_{k=1}^{n} \mathrm{correct}(v_{k})}{n}$$

where $n$ is the number of subtopics included in a run for a query, and $v_{k}$ is the predicted vertical intent for the $k$-th subtopic. If vertical intent $v_{k}$ corresponds to the most important (i.e., the highest-$p(v|i)$) vertical for intent $i$, then $\mathrm{correct}(v_{k})$ is $1$; otherwise it is $0$.

The primary evaluation metric for the Query Understanding subtask is the $QU\text{-}score$, which is a linear combination of D#-nDCG and the $V\text{-}score$:

$$QU\text{-}score = \lambda \cdot D\#\text{-}nDCG + (1-\lambda) \cdot V\text{-}score$$

For the Vertical Incorporating subtask, we mainly use the D#-measure to assess whether the system can generate a diversified ranking. The following annotation procedure will be used to compute the relevance $rel_{i}(d)$ of document $d$ to intent $i$:

1. We pool the documents submitted by the participants.
2. For each intent, assessors judge the relevance of documents on a three-point scale (irrelevant, relevant, and highly relevant).

In the D#-measure, the quality of document $d$ is computed from its global gain $GG(d)$:

$$GG(d) = \sum_{i \in I_{q}}p(i|q)g_{i}(d)$$

Since we consider the importance of verticals, we compute the gain $g_{i}(d)$ of document $d$ for intent $i$ as follows:

$$g_{i}(d) = \sum_{v \in V} \delta_{v}(d) \, p(v|i) \, rel_{i}(d)$$

where $V$ is the set of verticals defined for each language. If the vertical type of document $d$ is $v$, then $\delta_{v}(d)$ is 1; otherwise it is 0. Note that the vertical type of non-virtual documents (i.e., those from SogouT or ClueWeb12) is regarded as “Web”. $rel_{i}(d)$ is the relevance of document $d$ to intent $i$, obtained by the annotation procedure above; its range is { 0 (irrelevant), 1 (relevant), 2 (highly relevant) }. Note that virtual documents are assumed to be highly relevant.
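The gain and global-gain formulas above can be sketched as follows. The input dictionaries stand in for the annotation results and are hypothetical; note that because $\delta_{v}(d)$ selects exactly one vertical per document, the sum over $V$ reduces to a single term.

```python
def gain(doc_vertical, p_v_given_i, rel):
    """g_i(d) = p(v|i) * rel_i(d), where v is the document's vertical
    (the delta term picks out that single vertical from the sum)."""
    return p_v_given_i.get(doc_vertical, 0.0) * rel

def global_gain(doc_vertical, intent_probs, p_v, rels):
    """GG(d) = sum over intents i of p(i|q) * g_i(d).

    intent_probs: {intent: p(i|q)}
    p_v:          {intent: {vertical: p(v|i)}}
    rels:         {intent: rel_i(d)} for this document
    """
    return sum(
        intent_probs[i] * gain(doc_vertical, p_v[i], rels.get(i, 0))
        for i in intent_probs
    )
```

For example, a Web document that is highly relevant (rel = 2) to an intent with $p(i|q) = 0.6$ and $p(\text{Web}|i) = 0.7$, and relevant (rel = 1) to a second intent with $p(i|q) = 0.4$ and $p(\text{Web}|i) = 0.2$, has a global gain of $0.6 \cdot 0.7 \cdot 2 + 0.4 \cdot 0.2 \cdot 1 = 0.92$.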

# References

Sakai, T. and Song, R.: Evaluating Diversified Search Results Using Per-Intent Graded Relevance. ACM SIGIR 2011, 2011.

Zhou, K., Demeester, T., Nguyen, D., Hiemstra, D. and Trieschnigg, D.: Aligning Vertical Collection Relevance with User Intent. ACM International Conference on Information and Knowledge Management (CIKM 2014), 2014.