Creating summaries of events of interest from a multitude of unstructured data is a challenging task commonly faced by intelligence analysts seeking increased situational awareness. This document proposes a framework called Storyboarding that leverages unstructured text and images to explain events as sets of subevents. The framework first generates a textual context for each human face detected in images and then builds a chain of coherent documents in which every two consecutive documents share a common theme as well as a context. Storyboarding helps analysts quickly narrow down a large number of possibilities to a few significant ones for further investigation. Empirical studies on Wikipedia documents, images, and news articles show that Storyboarding is able to provide deeper insights into events of interest.
Due to the rapid growth of inexpensive storage devices and improvements in the semiconductor industry, the amount of data in our world has been exploding. Increasing volume and detail of information, along with the rise of multimedia, social media, and the Internet of Things, will fuel exponential data growth for the foreseeable future. While publicly available data feeds are increasing exponentially and provide a massive source of intelligence, ironically this plethora of information is what keeps succinct details latent during the analysis of an event of interest. Traditional search engines help narrow down the scope of analysis to a specific event, but it is not easy for an analyst to study a massive number of relevant documents and gain a good understanding of how the subevents compose the complex set of interactions between the actors involved in a major event. In this document, we describe a framework called Storyboarding that summarizes an event as chains of coherent documents and relevant image objects using publicly available data.
Summarization of an event involves identifying relevant entities (e.g., persons, organizations, and locations), discovering nonobvious connections between the entities to construct a coherent story – a task sometimes referred to in the literature as "connecting the dots" – and providing relevant information (e.g., images) to explain the connections better. The "connecting the dots" problem has been of great interest to researchers in a variety of domains and has been studied in diverse contexts, for example, in entity networks [14], image collections [13], social networks [9], and document collections [15]. Our approach provides a mechanism to build a story between two news articles published at two different times using a sequence of intermediate published articles. The sequence provides a summary of a subevent that explains a major event in part. Our approach complements other summarization techniques [23, 35] by forming a chain of articles and relevant images to provide a better understanding of the story. Conventional summarization techniques that directly use the textual contents of a cluster of relevant documents as text snippets to explain an event can further explain the stories generated by our framework.

Figure 1 (caption): Subevent: a story of five news articles linking the Boston bombers to the Waltham triple murder, including the trial phase.
The purpose of automatic summarization of an event is to give an analyst a good overview and a clear visualization of the event without overburdening her/him with too many details. Our storyboarding approach provides summarization of events using both textual and imagery information. The relevant event corpus need not contain any image within its documents. The Storyboarding framework first builds a knowledge base, which later helps in providing an image context for each story. A straightforward way of incorporating images during summarization of events is to generate the story using only a similarity network of news articles, as done in Storytelling [14], and then attach images to enrich the presentation of the story. This straightforward approach does not take the contents of the images into account, even though the images are likely to carry vital information for the story-building process. In our Storyboarding framework, we leverage image objects (e.g., faces) as added pieces of information alongside the text content. As a result, objects within images (e.g., faces) play an important role along with the actual text content in the story-building process. As an example of our Storyboarding output, Figure 1 provides a subevent explaining an aspect of the Boston Marathon Bombing tragedy. Figure 1 is further explained later in Section 6.5.
In summary, the main contributions of this document are:
The literature review is divided into the following areas. Situational awareness and intelligence analysis: Research work into software tools for intelligence analysis and situational awareness can be grouped into model-guided systems [3], topic-guided methods [24], graph-based analytic tools [11, 7], and collaborative systems [5]. Some other techniques are geared toward visual analytics, such as INSPIRE [25], NetLens [17], and Palantir [18], where the reasoning processes require complex interaction between the analysts and the tools. Our work falls in the category of graph-based analytic tools. We support exploration of a similarity graph of documents (without materializing it) to help marshal summary items, explain connections, and form an explanation.
Event Summarization: Detection and summarization of events have been studied in a variety of domains like computer systems, intelligence analysis, and surveillance systems. Since the definition of an event varies largely among domains, a range of techniques have been proposed in the literature, including frequent itemset mining [6], episode mining [20], graph search [36], and temporal event clustering [12], to name a few. There is a recent trend of increasing interest in summarizing viral events using users' interactions on social media. The graph-based model proposed by Xu and Lu [36] has some resemblance to our framework in that it provides a visual summary of an event using images from various social media. However, the Storyboarding framework summarizes events as stories of several subevents using both the textual and visual contents of the documents.
Association between image and text: Solutions to many real-life problems involve mapping text to images, e.g., recommending text advertisements for image search [37], classifying images in the bioscience literature [27], providing text for Web images [26], and filling news articles with relevant images [22]. Some of these techniques rely on training data from existing knowledge bases [22, 37]. The work of Li et al. [22] on mutual summarization of web images and text has some resemblance to Storyboarding, since it provides narratives to explain an event. However, the scope of the work of Li et al. is limited to explaining one image by selecting text from an article, or providing a set of related images when there is a text narrative. Storyboarding's capabilities are not limited to one image or one document; rather, it relates components at the feature level.
Network exploration: The core algorithm of Storyboarding solves a "connecting the dots" problem, which has appeared in the literature before in a variety of guises: entity networks [10, 14], image collections [13], social networks [9], and document collections [8, 19, 28]. While some of these efforts can be adapted to our problem context, the notion of Storyboarding emphasizes amalgamating heterogeneous information (text and image features) in the summarization process, whereas the above efforts typically require a stronger connecting thread between nodes of a homogeneous network.
The Storyboarding problem involves two different datasets: a knowledge base $\kappa$ and a collection of news articles $\mathcal{A}$. The knowledge base $\kappa = \{\mathcal{D}, \mathcal{E}^{\kappa}, \mathcal{I}, \mathcal{F}, R_{DE}, R_{DI}, R_{IF}\}$ is a 4-partite graph whose four sets of vertices are (1) $\mathcal{D} = \{d_1, d_2, \dots, d_n\}$, a collection of $n$ documents, (2) $\mathcal{E}^{\kappa} = \{e_1, e_2, \dots, e_{|\mathcal{E}^{\kappa}|}\}$, a set of entities (e.g., names of people, organizations, and locations), (3) $\mathcal{I} = \{i_1, i_2, \dots, i_{|\mathcal{I}|}\}$, a set of images, and (4) $\mathcal{F} = \{f_1, f_2, \dots, f_{|\mathcal{F}|}\}$, a set of faces extracted from the images. $R_{DE}$ is the set of edges $\{\{d_q, e_r\} : d_q \in \mathcal{D}, e_r \in \mathcal{E}^{\kappa}\}$ that represents entities within each document. Similarly, $R_{DI} = \{\{d_q, i_r\} : d_q \in \mathcal{D}, i_r \in \mathcal{I}\}$ and $R_{IF} = \{\{i_q, f_r\} : i_q \in \mathcal{I}, f_r \in \mathcal{F}\}$ are the sets of edges representing images within documents and faces of people within the images, respectively.
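As a concrete illustration, the 4-partite structure can be held in memory with plain Python containers. The class and identifiers below are hypothetical stand-ins, not part of the framework itself:

```python
from dataclasses import dataclass, field

# A minimal in-memory sketch of the 4-partite knowledge base K.
# Vertex sets are sets of string identifiers; each relation
# R_DE, R_DI, R_IF is a set of (source, target) pairs.
@dataclass
class KnowledgeBase:
    documents: set = field(default_factory=set)   # D
    entities: set = field(default_factory=set)    # E^K
    images: set = field(default_factory=set)      # I
    faces: set = field(default_factory=set)       # F
    r_de: set = field(default_factory=set)        # document -> entity edges
    r_di: set = field(default_factory=set)        # document -> image edges
    r_if: set = field(default_factory=set)        # image -> face edges

    def add_document(self, d, entities=(), images=()):
        self.documents.add(d)
        for e in entities:
            self.entities.add(e)
            self.r_de.add((d, e))
        for i in images:
            self.images.add(i)
            self.r_di.add((d, i))

    def faces_in_document(self, d):
        """Faces reachable from document d through its images (R_DI, then R_IF)."""
        imgs = {i for (dd, i) in self.r_di if dd == d}
        return {f for (i, f) in self.r_if if i in imgs}

kb = KnowledgeBase()
kb.add_document("doc1", entities=["Alice"], images=["img1"])
kb.faces.add("face1")
kb.r_if.add(("img1", "face1"))
print(kb.faces_in_document("doc1"))  # {'face1'}
```

The two-hop traversal in `faces_in_document` mirrors how the later context-generation stage reaches faces from documents.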
The dataset of news articles is a bipartite graph $\{\mathcal{A}, \mathcal{E}^{\mathcal{A}}, R_{\mathcal{A}\mathcal{E}^{\mathcal{A}}}\}$, where $\mathcal{A} = \{a_1, a_2, \dots, a_m\}$ is a set of vertices representing news articles and $\mathcal{E}^{\mathcal{A}} = \{\epsilon_1, \epsilon_2, \dots, \epsilon_{|\mathcal{E}^{\mathcal{A}}|}\}$ is the set of entities within the news articles. An edge $\{a_q, \epsilon_r\} \in R_{\mathcal{A}\mathcal{E}^{\mathcal{A}}}$ exists if entity $\epsilon_r$ occurs in article $a_q$. Each news article $a_q$ has a timestamp, denoted $Timestamp(a_q)$, that refers to its publication date.
Given two news articles $\{a_s, a_t\} \subset \mathcal{A}$ such that $Timestamp(a_s) < Timestamp(a_t)$, the Storyboarding framework outputs a summary represented as a sequence of $\{$article, faceset$\}$ pairs, $\mathcal{Z} = \{z_1, z_2, \dots, z_q\}$. Each pair $z_i$ includes an article from $\mathcal{A}$ and a set of faces from $\mathcal{F}$ that are relevant to the context of the article. Let $A(z_i)$ be the article associated with $z_i$ and $F(z_i)$ be the relevant set of faces. In sequence $\mathcal{Z}$, $a_s = A(z_1)$ and $a_t = A(z_q)$. In Storyboarding, the sequence $\mathcal{Z}$ must maintain the following three properties:
The Storyboarding framework has three operational stages: (1) construction of the knowledge base $\kappa$, represented as a 4-partite graph in Section 3, (2) building a model for context generation using the knowledge base, and (3) generating summaries given event data. Of these three stages, the first two can be viewed as a startup cost that is amortized over all generated summaries. The following subsections describe these three stages.
The 4-partite knowledge base graph $\kappa = \{\mathcal{D}, \mathcal{E}^{\kappa}, \mathcal{I}, \mathcal{F}, R_{DE}, R_{DI}, R_{IF}\}$ is constructed using the images and textual content of Wikipedia. In this document, we focus our analysis on topics related to politics and terrorism. To generate this focused dataset, we trained a classifier to label Wikipedia documents as interesting or uninteresting, using the 20-Newsgroups dataset and the Global Terrorism Database (GTD) [32] for training. The entire GTD and three groups of the 20-Newsgroups dataset (talk.politics.misc, talk.politics.guns, talk.politics.mideast) were considered interesting, and the remaining 20-Newsgroups groups were considered uninteresting. The classifier identified $118,290$ out of $1.5$ million Wikipedia pages as interesting and hence relevant to politics and terrorism. Document set $\mathcal{D}$ is composed of the text content of these $118,290$ Wikipedia pages. Similarly, all the relevant images form $\mathcal{I}$. $\mathcal{E}^{\kappa}$ is formed by extracting named entities from each document in $\mathcal{D}$, and $\mathcal{F}$ is constructed by detecting human faces in the images of $\mathcal{I}$. $R_{DE}$, $R_{DI}$, and $R_{IF}$ are the corresponding sets of edges representing the occurrence of entities in documents, images in documents, and faces in images, respectively.
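The interesting/uninteresting classifier can be sketched as a standard text-classification pipeline. The toy training texts below merely stand in for GTD records (label 1) and non-political 20-Newsgroups posts (label 0); the actual classifier and features used by the framework may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for GTD records (interesting, 1) and
# non-political 20-Newsgroups posts (uninteresting, 0).
train_texts = [
    "bombing attack claimed by militant group in the capital",
    "election debate between political parties over gun policy",
    "new graphics card benchmark results for gaming rigs",
    "recipe tips for baking sourdough bread at home",
]
train_labels = [1, 1, 0, 0]

# TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Classify candidate Wikipedia-like pages as interesting (1) or not (0).
pages = ["terrorist attack on the parliament", "chocolate cake recipe"]
print(clf.predict(pages))
```

With the full GTD and 20-Newsgroups corpora substituted for the toy texts, the same pipeline shape would produce the filtered page set described above.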
We used the Viola-Jones algorithm [34] with a trained cascade classification model to detect faces in Wikipedia images. The cascade classification model is an ensemble of weak classifiers built on classification-and-regression-tree strategies. The model uses rotated Haar-like features [34], which significantly improve the accuracy of the Viola-Jones algorithm. Even so, about $50\%$ of the faces detected by this state-of-the-art algorithm were incorrect when we applied it to Wikipedia images. We added an additional constraint, 'each face must include an eye pair', which reduced this error rate to $1\%$.
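The eye-pair constraint can be applied as a post-filter on the detector's bounding boxes. The sketch below assumes face and eye boxes in (x, y, w, h) form, as returned by a `detectMultiScale`-style detector; the helper names are ours:

```python
def contains(face, box):
    """True if box (x, y, w, h) lies fully inside the face rectangle."""
    fx, fy, fw, fh = face
    x, y, w, h = box
    return fx <= x and fy <= y and x + w <= fx + fw and y + h <= fy + fh

def filter_faces_by_eye_pair(faces, eyes):
    """Keep only face detections that contain at least one pair of eyes.

    faces, eyes: lists of (x, y, w, h) boxes, e.g. from OpenCV's
    Haar-cascade face and eye detectors.
    """
    kept = []
    for face in faces:
        n_eyes = sum(1 for eye in eyes if contains(face, eye))
        if n_eyes >= 2:          # require an eye pair
            kept.append(face)
    return kept

# A true face containing two eyes, and a false positive containing none.
faces = [(0, 0, 100, 100), (200, 200, 100, 100)]
eyes = [(20, 30, 15, 10), (60, 30, 15, 10)]
print(filter_faces_by_eye_pair(faces, eyes))  # [(0, 0, 100, 100)]
```

Detections with fewer than two contained eye boxes are discarded, which is one plausible way to realize the stated constraint.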
Extracting effective features from the detected faces is a challenging task, since the quality of the features later dominates the accuracy of the generated contexts (much as in face recognition). Popular feature extraction methods commonly used in face recognition, such as Eigenface [31], Fisherface [21], and Local Binary Patterns [1], perform reasonably well when most of the faces are front-facing. Since many of the faces in Wikipedia images are side-facing, we use a novel frontalization method to capture the positions of key facial points in a projected plane where the side-facing photo represents a front-posing face. This brings side faces into a common space where all faces are considered front-facing. Our observation is that our facial key point frontalization technique provides better features when combined with Eigenface.
Facial Key Points Detection: We detect five facial key points – two eye centers, the nose, and two mouth corners, as shown in Figure 2 – using a pretrained cascaded deep convolutional neural network. The network utilizes texture information over the entire face and geometric constraints among the key points to achieve high accuracy [30]. Since a face can take any pose in an image, we need to frontalize the facial key points to be able to extract the actual geometric properties of the face. Before frontalization, it is necessary to estimate the angles by which a face deviates from a front-facing position.
Angle Prediction: We use a Generalized Linear Model (GLM) to predict the vertical and side angles of a face. We generate a synthetic dataset using six 3D face models created from six face images covering various ethnicities (e.g., Caucasian, Scandinavian, Japanese). Since the faces in the knowledge base $\kappa$ are $100\times 100$ pixels, we imagine a $100\times 100\times 100$ cube for the 3D face models, with the center of the head at the origin $(0,0,0)$. The five facial key points of the faces are then marked on these models. For simplicity, we made three assumptions in placing the five key points on the models.
To create a synthetic training dataset of faces with known facial key points and known angles, we rotate the 3D models by varying angles around the $z$-axis and $x$-axis and project them back onto the 2D $xz$-plane. We rotate the models from $-45^{\circ}$ to $+45^{\circ}$ at $3^{\circ}$ intervals around the $z$-axis and from $-15^{\circ}$ to $+15^{\circ}$ at $5^{\circ}$ intervals around the $x$-axis. This produces a training set of 1302 instances. Each instance consists of 30 features, as described in the "Feature Generation" part later. The two class labels are the azimuth angle $az$ around the $z$-axis and the elevation angle $el$ around the $x$-axis. We train two GLM models using these two sets of class labels. After training, the models can predict the vertical and side angles for any five facial key points of a face, which enables us to apply frontalization by those predicted angles.
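The synthetic-pose generation can be sketched as follows. The key-point coordinates below are hypothetical, but the rotation grid reproduces the 217 poses per model (31 azimuth steps times 7 elevation steps) implied by the 1302-instance total over six models:

```python
import numpy as np

def rotate_z(points, deg):
    """Rotate Nx3 points around the z-axis by deg degrees."""
    t = np.deg2rad(deg)
    r = np.array([[np.cos(t), -np.sin(t), 0],
                  [np.sin(t),  np.cos(t), 0],
                  [0, 0, 1]])
    return points @ r.T

def rotate_x(points, deg):
    """Rotate Nx3 points around the x-axis by deg degrees."""
    t = np.deg2rad(deg)
    r = np.array([[1, 0, 0],
                  [0, np.cos(t), -np.sin(t)],
                  [0, np.sin(t),  np.cos(t)]])
    return points @ r.T

# Hypothetical 3D key points (two eyes, nose, two mouth corners)
# inside the 100x100x100 cube centered at the origin.
key_points = np.array([[-20, 40, 20], [20, 40, 20],
                       [0, 50, 0], [-15, 40, -20], [15, 40, -20]], float)

instances = []
for az in range(-45, 46, 3):          # azimuth around the z-axis
    for el in range(-15, 16, 5):      # elevation around the x-axis
        rotated = rotate_x(rotate_z(key_points, az), el)
        projected = rotated[:, [0, 2]]   # project onto the xz-plane
        instances.append((projected, az, el))

print(len(instances))  # 217 poses per model; six models give 1302
```

Each `(projected, az, el)` triple supplies one labeled training instance for the two GLMs.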
Frontalization: Using the GLMs, we predict the azimuth (side) angle $az$ and elevation angle $el$ of each detected face from its extracted facial key points. Let $\rho = \{\rho_1, \rho_2, \dots, \rho_5\}$ be the set of five facial key points extracted from a face. Algorithm 1 describes the steps for frontalizing the set of key points $\rho$. The algorithm uses a function $Rotate$ that rotates a 3D point around a particular axis by a certain angle. Figure 3 shows a sample of frontalization applied to five facial key points.
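A minimal sketch of the frontalization step follows, assuming each 2D key point is lifted to 3D with a constant depth before the inverse rotations are applied. Algorithm 1 itself is not shown in this excerpt, so the depth lift and function names are our assumptions:

```python
import numpy as np

def rotate(point, axis, deg):
    """Rotate a 3D point around the 'x' or 'z' axis by deg degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    if axis == 'z':
        r = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    elif axis == 'x':
        r = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    else:
        raise ValueError(axis)
    return r @ point

def frontalize(key_points_2d, az, el, depth=50.0):
    """Undo the predicted azimuth/elevation for five 2D key points.

    key_points_2d: 5x2 array of (x, z) image coordinates; each point is
    lifted to 3D with an assumed constant depth before the inverse
    rotations, then projected back onto the xz-plane.
    """
    frontal = []
    for x, z in key_points_2d:
        p = np.array([x, depth, z])        # assumed constant depth
        p = rotate(p, 'x', -el)            # undo elevation
        p = rotate(p, 'z', -az)            # undo azimuth
        frontal.append([p[0], p[2]])       # project back to the xz-plane
    return np.array(frontal)

pts = np.array([[10., 5.], [-10., 5.], [0., 0.], [8., -6.], [-8., -6.]])
print(frontalize(pts, az=12.0, el=5.0).shape)  # (5, 2)
```

For a face already front-facing ($az = el = 0$), the transform is the identity on the (x, z) coordinates.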
Feature Generation: An important criterion for a good feature set of a face is that the produced features vary little for a particular person across different lighting conditions, expressions, and occlusions. Our facial key point generation and frontalization methods aid in generating features with such properties. We generate two sets of features from the frontalized facial key points: the first set is composed of the pairwise relative distances between all five facial key points, and the second set comprises twenty angles, as shown in Figure 4. Together these make a feature vector of length thirty, where the first ten are the pairwise relative distances and the last twenty are the angles. In addition to these thirty features, we leverage pixel information of the faces, as used in Eigenface. Each face is stored in a 35$\times$35 image, giving 1225 pixels per face. We perform a max-min normalization of standardized pixel values to reduce the noise in the images due to color intensity variation. The thirty features extracted from the frontalized key points are concatenated with these normalized pixel values, forming a vector of length 1255 for each face. Finally, we apply principal component analysis to the concatenated feature vectors and select the number of principal components by plotting and observing the relationship between the relative magnitudes of the eigenvalues and the number of components.
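The geometric part of the feature vector can be sketched as follows. The twenty angle features are omitted here for brevity, so this toy vector has 10 pairwise distances plus 1225 normalized pixels; the normalization choice is an assumption:

```python
import numpy as np
from itertools import combinations

def pairwise_distance_features(key_points):
    """Ten pairwise distances between the five frontalized key points,
    divided by the largest distance so the features are scale-relative."""
    d = [np.linalg.norm(np.asarray(a) - np.asarray(b))
         for a, b in combinations(key_points, 2)]
    return np.array(d) / max(d)

def face_feature_vector(key_points, face_pixels):
    """Concatenate geometric features with max-min normalized pixels.

    face_pixels: 35x35 grayscale face crop (1225 values). The 20 angle
    features described in the text are omitted in this sketch.
    """
    px = face_pixels.astype(float).ravel()
    px = (px - px.min()) / (px.max() - px.min() + 1e-12)  # max-min normalize
    return np.concatenate([pairwise_distance_features(key_points), px])

points = [(-20, 20), (20, 20), (0, 0), (-15, -20), (15, -20)]
vec = face_feature_vector(points, np.random.randint(0, 256, (35, 35)))
print(vec.shape)  # (1235,)
```

With the twenty angle features included, the vector would reach the 1255 dimensions stated above before PCA is applied.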
We combine the outputs of several named entity extractors, including LingPipe [2], OpenNLP [4], and Stanford NER [29], to identify entities within the textual contents of the documents in $\mathcal{D}$. Although we extracted all standard entity types, including people, organizations, and locations, this document scopes the analysis down to person entities only, especially because the images are explained using detected human faces. We represent the named entities using vector space modeling of $\{\mathcal{D}, \mathcal{E}^{\kappa}, R_{DE}\}$, which is a subset of the knowledge base 4-partite graph $\kappa$. The weight of entity $e \in \mathcal{E}^{\kappa}$ in document $d \in \mathcal{D}$ is calculated as:
where $tf_{e,d}$ is the frequency of entity $e$ in document $d$, $df_e$ is the number of documents containing a connection with entity $e$, and $\mathcal{E}^{\kappa}_{d}$ is the set of entities that are connected to document $d$. Equation 1 is a variant of tf-idf modeling with cosine normalization. The Wikipedia corpus has descriptions of different sizes; in general, longer descriptions have higher term frequencies because many terms are repeated. The cosine normalization lessens the impact of description size on the modeling.
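One plausible instantiation of Equation 1, assuming the standard tf-idf weighting with cosine normalization over the entities of $d$ described here (the exact form used by the framework is not shown in this excerpt), is:

```latex
W(e,d) \;=\; \frac{tf_{e,d}\cdot \log\frac{n}{df_{e}}}
{\sqrt{\sum_{e' \in \mathcal{E}^{\kappa}_{d}} \left( tf_{e',d}\cdot \log\frac{n}{df_{e'}} \right)^{2}}}
```

The denominator is the Euclidean norm of the document's tf-idf entity vector, which is what makes long and short descriptions comparable.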
The context of a face $f \in \mathcal{F}$ is a ranked list of named entities ordered by relevance to $f$, expressed as a probability distribution over the set of entities $\mathcal{E}^{\kappa}$. The contexts of the faces are leveraged to generate chains of contextual documents in Section 4.3. Using Bayes' rule, the probability of an entity $e \in \mathcal{E}^{\kappa}$ given face $f$ can be expressed as
$$P(e \mid f) \propto P(e) \times P(f \mid e)$$  (2)
where $P(e)$ is the prior probability of $e$ and $P(f \mid e)$ is the likelihood. Let $f = \{f^1, f^2, \dots, f^L\}$ be the feature representation of face $f$ obtained by the method described in Section 4.1.1, where $L$ is the length of the feature vector. Assuming independence between the features, we can rewrite Equation 2 as
$$P(e \mid f) \propto P(e) \times \prod_{l=1}^{L} P(f^{l} \mid e)$$  (3)
The person entity $e$ can appear in multiple documents of $\mathcal{D}$. Let $\mathcal{D}^{e} \subset \mathcal{D}$ be the set of documents that contain entity $e$. Each document $d \in \mathcal{D}^{e}$ may in turn contain a number of faces (reachable through $R_{DI}$ and $R_{IF}$ of $\kappa$). Let $\mathcal{F}^{d} \subset \mathcal{F}$ be the set of faces in document $d$. The relationships between face features and entities in documents can be computed from the entity weights $W(e,d)$ and the face-feature weights in the documents. The likelihood of the $l$-th feature of a face given an entity $e$ can be computed as
$$P(f^{l} \mid e) = {\left( \frac{\sum_{d \in \mathcal{D}^{e}} P(f^{l} \mid d) \times W(e,d)}{\sum_{d \in \mathcal{D}^{e}} W(e,d)} \right)}^{f^{l}}$$  (4)
where $P(f^{l} \mid d)$, calculated by Equation 5, is the probability of face feature $f^{l}$ given document $d$.
$$P(f^{l} \mid d) = \frac{\sum_{\varphi \in \mathcal{F}^{d}} \varphi^{l}}{\sum_{\varphi \in \mathcal{F}^{d}} \sum_{l'=1}^{L} \varphi^{l'}}$$  (5)
where $\varphi^{l}$ is the $l$-th feature of a face $\varphi$. Substituting this expression for $P(f^{l} \mid e)$ into Equation 3, we get
$$P(e \mid f) \propto P(e) \times \prod_{l=1}^{L} {\left( \frac{\sum_{d \in \mathcal{D}^{e}} P(f^{l} \mid d) \times W(e,d)}{\sum_{d \in \mathcal{D}^{e}} W(e,d)} \right)}^{f^{l}}$$  (6)
Taking the logarithm on both sides of Equation 6, we obtain the following equation.
$$\log(P(e \mid f)) \propto \log(P(e)) - \sum_{l=1}^{L} \left( f^{l} \times \log\left( \sum_{d \in \mathcal{D}^{e}} W(e,d) \right) \right) + \sum_{l=1}^{L} \left( f^{l} \times \log\left( \sum_{d \in \mathcal{D}^{e}} P(f^{l} \mid d) \times W(e,d) \right) \right)$$  (7)

We generate the context of each face in $\mathcal{F}$ using Equation 7. In practice, we do not keep the full probability distribution; rather, we record at most the $L_C$ entities with the highest probabilities as the context of a face.
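Equation 7 can be evaluated directly in log space. The data structures below (priors, entity weights, per-document feature probabilities) are hypothetical toys, not the framework's actual storage format:

```python
import numpy as np

def face_context(f, entities, priors, W, P_fd, docs_of, L_C=5):
    """Rank entities for face feature vector f using the log-space score
    of Equation 7, keeping at most L_C top entities as the context.

    priors[e]: prior P(e); W[(e, d)]: entity weight in document d;
    P_fd[d]: length-L array of P(f^l | d); docs_of[e]: documents of e.
    """
    scores = {}
    for e in entities:
        w_sum = sum(W[(e, d)] for d in docs_of[e])
        mix = sum(P_fd[d] * W[(e, d)] for d in docs_of[e])  # per-feature
        log_p = (np.log(priors[e])
                 - np.sum(f * np.log(w_sum))
                 + np.sum(f * np.log(mix)))
        scores[e] = log_p
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:L_C]

# Toy example: one face feature favors 'alice' over 'bob'.
entities = ["alice", "bob"]
priors = {"alice": 0.5, "bob": 0.5}
docs_of = {"alice": ["d1"], "bob": ["d2"]}
W = {("alice", "d1"): 1.0, ("bob", "d2"): 1.0}
P_fd = {"d1": np.array([0.9, 0.1]), "d2": np.array([0.1, 0.9])}
f = np.array([1.0, 0.0])
print(face_context(f, entities, priors, W, P_fd, docs_of))  # ['alice', 'bob']
```

Keeping only the top $L_C$ entities per face, as the text describes, bounds the size of each stored context.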
The core storyboarding task, as described in Section 3, focuses on forming a chain of $\{$article, faceset$\}$ pairs using the news articles in $\mathcal{A}$ and the face contexts generated from knowledge base $\kappa$. This part requires extracting person entities from $\mathcal{A}$ to form $\mathcal{E}^{\mathcal{A}}$, discovering the faces relevant to each news article based on the similarity between the article and the contexts of images in the knowledge base, and designing a path-finding algorithm over a similarity network of news articles in which discovered paths are constrained by text coherence, context similarity, and the progression of time in the story.
$\mathcal{E}^{\mathcal{A}}$ for the news articles is formed in the same way we formed $\mathcal{E}^{\kappa}$ for the knowledge base (as described in Section 4.1.2). In reality, news articles may contain images, but many of those images are repeated from past coverage to provide visual context, and many articles contain no image at all. Since the knowledge base covers a broad range of topics, face images can be reproduced from the images of the knowledge base. In our set of news articles $\mathcal{A}$, we assume there are no images; instead, we reproduce relevant faces for each article of a story from the knowledge base. We order the faces for each news article $a \in \mathcal{A}$ based on the relevance between $a$ and all faces. We compute this relevance using a context matching score between a news article $a$ and the context $C(f) \subset \mathcal{E}^{\kappa}$ of a face $f$:
$$\beta(f,a) = \sum_{e \in \mathcal{E}^{\mathcal{A}}_{a} \cap C(f)} V(e,a) \times \left( |C(f)| - R(e,f) \right)$$  (8)
where $V$ is a function similar to $W$ defined in Equation 1 and $|C(f)| \le L_C$. The only difference is that the $V$ weights are computed over the news articles of $\mathcal{A}$ instead of the documents of the knowledge base $\mathcal{D}$. $R(e,f)$ is the rank of $e$ in the list of entities $C(f)$ of face $f$. That is, we have two kinds of information associated with each news article: the weighted list of entities extracted from the text of the article, and the most relevant faces (with their contexts) from the knowledge base. Both types are leveraged in a heuristic search algorithm to build a path of $\{$article, faceset$\}$ pairs between two articles $\{a_s, a_t\} \subset \mathcal{A}$. We use a variation of the A* search algorithm with the Soergel distance as the heuristic; the Soergel distance is an admissible heuristic for A* search. The Soergel distance between two articles $a$ and $a'$ is calculated using the following formula.
$$Dist(a,a') = \frac{\sum_{e \in \mathcal{E}^{\mathcal{A}}} \left| V(e,a) - V(e,a') \right|}{\sum_{e \in \mathcal{E}^{\mathcal{A}}} \max\left( V(e,a), V(e,a') \right)}$$  (9)
Equation 9 is also used to compute distance between the combined contexts of the relevant faces of two articles.
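The Soergel distance over weighted entity vectors can be computed directly, here assuming a sparse dict representation of the $V$ weights (the entity names are illustrative):

```python
def soergel_distance(v1, v2):
    """Soergel distance between two weighted entity vectors, given as
    dicts mapping entity -> weight. Ranges from 0 (identical vectors)
    to 1 (no shared support)."""
    entities = set(v1) | set(v2)
    num = sum(abs(v1.get(e, 0.0) - v2.get(e, 0.0)) for e in entities)
    den = sum(max(v1.get(e, 0.0), v2.get(e, 0.0)) for e in entities)
    return num / den if den else 0.0

# Illustrative entity-weight vectors of two articles.
a = {"Tsarnaev": 0.8, "Boston": 0.5}
b = {"Tsarnaev": 0.4, "Waltham": 0.6}
print(round(soergel_distance(a, b), 3))  # 0.789
```

The same function applies unchanged to the combined face contexts of two articles, as the text notes for Equation 9.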
Our heuristic algorithm maintains the following properties during exploration.
Notice that all these constraints can be applied during the candidate evaluation phase of any heuristic search algorithm. We generate the $b$ nearest neighbors based on the text content of the articles and rank the candidate articles for exploration based on their contexts; $b$ is also the branching factor of the heuristic search algorithm. Empirical studies with different $b$, $\theta$, and $\tau$ values are shown in Section 6.
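The constrained exploration can be sketched as a best-first search with branching factor $b$ and a time-progression prune. This is a simplified stand-in for the A*-style algorithm described above, with all names and the toy graph our own:

```python
import heapq

def storyboard_path(start, goal, neighbors, distance, timestamp, b=3):
    """Best-first search from start to goal over a similarity network.

    neighbors(a): candidate next articles; distance(a, a'): a
    Soergel-style heuristic distance to the goal; timestamp(a):
    publication date. Candidates that violate time progression are
    pruned, and only the b closest candidates are expanded.
    """
    frontier = [(distance(start, goal), start, [start])]
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Prune candidates that would move backward in time.
        cands = [a for a in neighbors(node) if timestamp(a) > timestamp(node)]
        cands.sort(key=lambda a: distance(a, goal))
        for a in cands[:b]:                  # branching factor b
            heapq.heappush(frontier, (distance(a, goal), a, path + [a]))
    return None

# Toy article graph with timestamps 1 < 2 < 3.
ts = {"s": 1, "m": 2, "g": 3}
nbrs = {"s": ["m"], "m": ["g"], "g": []}
print(storyboard_path("s", "g", lambda a: nbrs[a],
                      lambda a, b2: abs(ts[a] - ts[b2]),
                      lambda a: ts[a]))  # ['s', 'm', 'g']
```

The coherence and context-similarity constraints of the full algorithm would enter as additional filters in the candidate-evaluation step, alongside the timestamp check shown here.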
We estimate the scalability of the Storyboarding framework by analyzing the time complexities of the major components of the pipeline. The angle predictor for facial key point frontalization learns a linear model from $N_s$ synthetic observations of $L_s$ features and predicts the azimuth and elevation angles for each face in $\mathcal{F}$, incurring a total complexity of $\mathcal{O}(L_s^2 \times N_s + L_s \times |\mathcal{F}|)$. Frontalization of the five key points of a face using Algorithm 1 takes constant time. The feature generation step involves Eigenface computation on the feature matrix and takes $\mathcal{O}(L_{feature}^2 \times |\mathcal{F}| + L_{feature}^3)$ time, where $L_{feature}$ is the length of each feature vector. Generating contexts for all faces costs $\mathcal{O}(L \times |\mathcal{F}| \times N_d \times |\mathcal{E}^{\kappa}|)$, where $N_d$ is the average number of documents associated with each entity. The complexities described so far can be considered a one-time cost for a knowledge base. The constrained heuristic search to generate a story runs in $\mathcal{O}(b^{L_{sol}})$ time, where $L_{sol}$ is the length of the solution path. There is an additional cost of $\mathcal{O}(|\mathcal{A}| \times \log(|\mathcal{A}|))$ for generating nearest neighbors for all articles using a Ball tree, but this is a one-time cost for a news article dataset $\mathcal{A}$ prepared for a major event.
The speciﬁc questions we seek to answer in this section are:
We collected all the Wikipedia pages to build our knowledge base, and accumulated the news articles published in The New York Times on a particular event to generate the stories from. Table 1 summarizes the knowledge base ($\kappa $) and news article ($\mathcal{\mathcal{A}}$) data sets used in this work.
Total number of pages in Wikipedia  15,018,369 
Number of selected pages in $\kappa $  118,290 
Number of images in $\kappa $  524,928 
Number of faces in $\kappa $  100,928 
Number of person entities in $\kappa $  384,457 
Number of news articles in $\mathcal{\mathcal{A}}$  1,028 
Number of person entities in $\mathcal{\mathcal{A}}$  8,277 
As described in Section 4.1, the use of key facial points to generate features yields higher-quality face contexts later. Detecting an appropriate number of key points is challenging because of the different poses faces can take. We applied two methods for facial key point generation: Boosted Regression with Markov Networks (BoRMaN) [33] and a Deep Convolutional Network Cascade (CNN) method. The former discovers twenty facial key points and the latter finds five. Our observation is that the BoRMaN method performs well on faces taken in controlled environments, e.g., photos taken in laboratories. On faces detected from Wikipedia images, the BoRMaN method detected facial points properly for $70\%$ of the faces. For this experiment, we randomly picked 300 faces from our database and manually checked whether the detected points were in the vicinity of the expected pixels. On the same 300 faces, the accuracy of the CNN method in detecting five facial key points was $99\%$. Figure 5 shows two examples where the five key points are detected correctly by CNN but the twenty points detected by the BoRMaN method are cluttered in one region of the face.
Although Storyboarding targets a different problem than face recognition, we evaluate the proposed frontalization technique on face recognition to study its impact. In most of the face recognition literature, ground truth faces of each person are taken in different poses under the same environment (e.g., illumination). However, faces detected from the Wikipedia images are not annotated and are limited in number and poses. In addition to the original face, we included the mirror image of each face to ensure that we have at least two poses of the same face. The experiment in this subsection compares face recognition accuracies with and without frontalization and mirror faces. We used a subset of 293 face images of 9 persons from the Labeled Faces in the Wild (LFW) dataset [16], which has annotated faces and images that were not taken in a controlled environment. Figure 6 compares face recognition accuracies at different training and test ratios with different combinations of techniques. The figure shows that Eigenface with frontalization and mirror faces gives better accuracy than any other combination. The standard deviation of the accuracies was 2% to 5%.
Our frontalization technique relies on a generalized linear model based classifier to compute the azimuth and elevation angles of a face in an original image. To compare the linear model with a few alternative mechanisms, we created a synthetic face model and rotated it at different angles to create different poses. That is, the ground truth angles are known for all the poses, and an error can be computed for the predictions of those angles. Figure 7 compares the mean square errors of three methods for predicting rotation angles: Generalized Linear Model (GLM), Regression Tree, and Ensemble Regression Tree. Figure 7(a) is for sideways rotations and Figure 7(b) shows the errors for elevations of the faces. In both cases, GLM has lower errors than the other methods at most of the angles. In this experiment, the sideways angles were varied from $-45^{\circ}$ to $+45^{\circ}$ and the elevations from $-15^{\circ}$ to $+15^{\circ}$.
We evaluate the quality of the context of a face by comparing the context with Google image search results. Although Google search has a different objective than ours, we are interested in checking whether some of the entities in the context of a face can be discovered using Google image search by uploading the face image. Since the Google image search API limits the number of queries and the process is time consuming, we randomly picked 300 faces and computed how many of the terms in the context of each face were found on the first page of the search results returned by Google after uploading the face. We then calculated the number of entities each result has in common with our context of that face. Figure 8 shows the distribution of the percentage of faces for different numbers of entities that overlap between a context and the Google search result. As expected, the percentage of faces goes down when we look for more common entities. The plot shows that almost $75\%$ of the faces have at least five entities in common with the corresponding Google search result when we keep the context size at 80. These 80 entities are selected from over 384K total entities available in Wikipedia.
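The overlap statistic underlying Figure 8 can be sketched as a set intersection. The function names and the representation of contexts and search results as entity lists are our assumptions.

```python
def context_overlap(face_context, search_entities, context_size=80):
    """Count how many of the top-`context_size` context entities of a face
    also appear among the entities extracted from a search-result page."""
    top = set(list(face_context)[:context_size])
    return len(top & set(search_entities))

def fraction_with_overlap(contexts, search_results, min_overlap):
    """Fraction of faces whose context shares at least `min_overlap`
    entities with the corresponding search result (one point of Figure 8)."""
    hits = sum(
        context_overlap(c, s) >= min_overlap
        for c, s in zip(contexts, search_results)
    )
    return hits / len(contexts)
```

Sweeping `min_overlap` from 1 upward over the 300 sampled faces would yield the decreasing curve described above.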
In an additional experiment to evaluate the quality of the generated contexts against ground truth information, we sampled 21 faces from $\mathcal{F}$ and manually attached the most appropriate person entities to each of them with the help of human experts. We then compared these ground truth contexts with our automatically generated contexts. Figure 9 shows the accuracy of the generated contexts for four context generation mechanisms. Eigenface combined with both frontalization and mirroring provides the best average accuracy for different context sizes.
The two evaluations above are quantitative, but we believe that the qualitative improvement from the frontalization and mirroring approach is much higher and difficult to measure. To illustrate this, we provide sample contexts of two faces in Figure 10. Relevant entities (to the best of our knowledge) are highlighted in bold text in the figure. Eigenface with frontalization and mirror faces provides the best contexts. Later, in Section 6.6, we show that the use of contexts in storyboarding provides better stories. Hence, it is important to generate high quality contexts using frontalization and mirror faces so that the generated stories are more coherent.
Story | Explanation
NY286 (2014/04/15) $\to$ NY370 (2014/10/24) $\to$ NY334 (2014/11/12) $\to$ NY609 (2015/03/02) $\to$ NY677 (2015/04/10) | This story links the Boston bombers to the Waltham triple murder. The story includes the trial phase.
NY158 (2014/10/16) $\to$ NY379 (2014/10/16) $\to$ NY206 (2014/10/28) $\to$ NY280 (2014/10/28) | A former governor of Massachusetts testifies for Tsarnaev's friend Robel Phillipos. Phillipos was found guilty of making false statements to authorities.
NY435 (2014/05/02) $\to$ NY317 (2014/06/18) $\to$ NY340 (2014/08/14) $\to$ NY525 (2015/02/06) | Tsarnaev's lawyers appeal to move the trial location to Washington, delaying the trial date. The story shows that the request was denied.
NY121 (2014/04/22) $\to$ NY273 (2014/07/10) $\to$ NY177 (2014/08/20) $\to$ NY338 (2014/09/27) | This story highlights the trials of three friends of Tsarnaev. All three were accused of obstructing justice by lying and destroying evidence.
The New York Times returns 1028 articles for the query “Boston Marathon Bombing”. The Storyboarding framework discovered a number of subevents that provide a fine-grained mental model of the branches of the Boston Marathon Bombing tragedy and its aftermath. Table 2 lists a few of the stories discovered by the Storyboarding framework. The first story of Table 2 is illustrated in Figure 1 on a storyboard using term clouds and faces.
The story of Figure 1 describes a connection between the Boston bombers and the Waltham triple murder. The story moves forward until the penalty phase. The term cloud of the storyboard highlights a person named Todashev, a suspect in the triple murder, along with the bombers Tamerlan and Tsarnaev. Each related face list, automatically selected from the knowledge base by the Storyboarding framework, captures the faces of relevant people very well. For example, the first face of the second article of the story is Todashev, whose face is found repeatedly in the consecutive articles. The rest of the faces in the story are of the Boston bombers Tamerlan and Tsarnaev, police, and rescue crews.
The third subevent of Table 2 summarizes Tsarnaev's lawyers' appeal to move the trial location to Washington, which delayed the trial date. The fourth story describes the trials of three friends of Tsarnaev who were accused of obstructing justice, lying, and destroying evidence.
Story with context | Story without context
[NY435 $\to$ NY317 $\to$ NY175 $\to$ NY427 $\to$ NY642 $\to$ NY552 $\to$ NY525] |
The story describes the request for delay and change of trial location. | The intermediate article digresses from the focus and brings the investigation in Russia into consideration.
The story connects the triple murder with current cases. | The intermediate articles are about trial announcements.
The story focuses on the trial of a friend of Tsarnaev and the jury selection for the suspected friends. | One of the intermediate articles is off-topic and is about a citation of winning a video game at the trial of the accused Boston bomber's friend.
This story describes the former Governor's testimony for Tsarnaev's friend Robel Phillipos. | The testimony of the former Governor does not show up in the story without context.
While the inclusion of image context in the core heuristic path-finding part of the Storyboarding process imposes an additional constraint, the outcome of using this constraint becomes evident when we compare the stories side by side. For a fair comparison, we make sure that both methods use the same parameters (e.g., the same $\theta$, branching factor $b$, and start-end pairs). Table 3 shows four pairs of sample stories generated with and without face contexts.
Our observation is that Storyboarding generates more coherent stories when the context is used, which is to be expected because context overlap between consecutive articles reinforces the constraint that each story must be woven around a certain theme. For every start-end pair of articles in Table 3, the story with context is more coherent than the story without context. From the table, we observe that each story generated without the context contains some off-topic articles that may be relevant but break the flow of the theme. Sometimes the same story is generated with and without the use of context; those stories are not reported here.
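The role of $\theta$, $\tau$, and $b$ in the path search can be sketched as a greedy heuristic. This is our reading of the description, not the authors' algorithm: $\theta$ gates which neighbors are admissible, $\tau$ only orders exploration (preferring neighbors whose face contexts also overlap), and $b$ caps the neighbors considered at each junction; the tie-breaking by distance to the end article is an assumption.

```python
def build_story(start, end, articles, text_dist, ctx_dist,
                theta=0.7, tau=0.5, b=10, max_len=8):
    """Greedy heuristic path search from `start` to `end`.
    `text_dist(i, j)` and `ctx_dist(i, j)` are distances between
    articles' term vectors and face contexts (illustrative callbacks)."""
    story, current, visited = [start], start, {start}
    while current != end and len(story) < max_len:
        # admissible neighbors: within text distance theta, not yet used
        cands = [a for a in articles
                 if a not in visited and text_dist(current, a) <= theta]
        if end in cands:
            story.append(end)
            return story
        # tau orders exploration; keep at most b nearest candidates
        cands.sort(key=lambda a: (ctx_dist(current, a) > tau,
                                  text_dist(a, end)))
        cands = cands[:b]
        if not cands:
            return None  # dead end: no admissible continuation
        current = cands[0]
        story.append(current)
        visited.add(current)
    return story if story[-1] == end else None
```

Under this sketch, disabling the context constraint amounts to making `ctx_dist` return 0 everywhere, which is how the "without context" stories of Table 3 could be produced with otherwise identical parameters.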
Two important user-settable parameters in Storyboarding are the maximum allowable distance $\theta$ and the branching factor (number of nearest neighbors) $b$. Figure 11 shows the impact of $\theta$ and $b$ on the statistical significance, average length, and number of stories. To calculate the statistical significance ($p$-value), we randomly pick $b$ documents from the entire candidate pool and check whether the picked documents satisfy the distance threshold $\theta$, iterating the test 5,000 times. We repeat this process for every junction article of a discovered story. The overall $p$-value of a story is calculated by multiplying the $p$-values of every document of the story except the last one. Figure 11 (left) shows that the significance decreases (i.e., the $p$-value increases) with higher values of $\theta$ and $b$. This is an expected outcome since higher $\theta$ values imply a less stringent overlap of content between consecutive articles. A less stringent constraint may result in stories with loose connections between consecutive articles. A similar argument explains the plot in Figure 11 (middle): increasing $\theta$ and the branching factor $b$ leads to shorter stories with loosely connected neighbors. The curves for branching factors 20 and 35 are exceptions, giving even shorter stories than larger branching factors until $\theta=0.75$. This exception is explained by the right plot, where we see that there were not enough stories for those two branching factors until $\theta=0.75$. For other branching factors, the number of stories follows a similar upward trend with increasing $\theta$.
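The Monte Carlo significance test can be sketched as follows. One detail is our assumption: we treat a random draw as "satisfying" the threshold when at least one of the $b$ sampled documents falls within $\theta$, which matches the reported trend of $p$-values increasing with both $\theta$ and $b$.

```python
import numpy as np

def junction_pvalue(distances, theta, b, trials=5000, seed=0):
    """Monte Carlo p-value of one junction: probability that, among b
    documents drawn at random (without replacement) from the candidate
    pool, at least one lies within distance theta of the current article.
    `distances` holds each pool document's distance to the current article."""
    rng = np.random.default_rng(seed)
    distances = np.asarray(distances, dtype=float)
    hits = 0
    for _ in range(trials):
        sample = rng.choice(len(distances), size=b, replace=False)
        if np.any(distances[sample] <= theta):
            hits += 1
    return hits / trials

def story_pvalue(junction_pvalues):
    """Overall story significance: product of the junction p-values of
    every document of the story except the last one."""
    p = 1.0
    for pv in junction_pvalues:
        p *= pv
    return p
```

Larger $\theta$ makes more pool documents admissible and larger $b$ gives more chances to hit one, so both raise the junction $p$-value, reproducing the trend of Figure 11 (left).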
In this subsection we describe the runtime of the core Storyboarding heuristic algorithm for path finding with different parameters, i.e., distance threshold $\theta$, context threshold $\tau$, and branching factor $b$. Figure 12 (left) shows that increasing $b$ generally leads to higher execution time since more candidate neighbors are explored from each node. Results in this plot are shown for different values of $\theta$. As observed in Figure 12 (left), the curves may not increase monotonically because a higher $b$ value also lowers the average length of the stories, as shown in Figure 11 (middle). Figure 12 (middle) shows that the runtime increases with the maximum allowed distance $\theta$ but remains uninfluenced by the maximum allowed distance between face contexts, $\tau$. This is because $\tau$ only determines the order in which neighbors are explored during the search process. The rightmost plot also shows no impact of $\tau$ on the average runtime for varying $b$. The runtimes are averaged over all values of $\theta$.
Our Storyboarding framework complements conventional summarization techniques by weaving through textual and imagery data and presenting subevents as chains of documents. Our directions for future work are twofold. The first is to incorporate multiple news archives into the knowledge base, which will provide the ability to connect one major event with another. The second is to systematically study the trade-off between the standalone use of text content and the use of image context as additional information in the story-building process. This study will allow an analyst to explore different alternative stories smoothly by varying the textual and imagery coherence requirements.
[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. PAMI, 28(12):2037–2041, 2006.
[2] Alias-i. LingPipe 4.1.0. Accessed: May 8, 2015, http://alias-i.com/lingpipe/.
[3] R. Alonso and H. Li. Model-guided information discovery for intelligence analysis. In CIKM ’05, pages 269–270, 2005.
[4] Apache Software Foundation. OpenNLP. Accessed: May 8, 2015, http://opennlp.apache.org.
[5] E. Bier, S. Card, and J. Bodnar. Principles and tools for collaborative entity-based intelligence analysis. TVCG, 16(2):178–191, 2010.
[6] S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In VLDB ’98, volume 98, pages 606–617, 1998.
[7] G. Chin, O. A. Kuchar, P. D. Whitney, M. Powers, and K. E. Johnson. Graph-based comparisons of scenarios in intelligence analysis. In SMC ’04, volume 4, 2004.
[8] F. DasNeves, E. A. Fox, and X. Yu. Connecting topics in document collections with stepping stones and pathways. In CIKM ’05, pages 91–98, 2005.
[9] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In KDD ’04, pages 118–127, 2004.
[10] L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. Rex: explaining relationships between entity pairs. Proc. VLDB Endow., 5(3):241–252, 2011.
[11] J. Gersh, B. Lewis, J. Montemayor, C. Piatko, and R. Turner. Supporting insightbased information exploration in intelligence analysis. Commun. ACM, 49, 2006.
[12] J. Gung and J. Kalita. Summarization of historical articles using temporal event clustering. In HLT-NAACL ’12, pages 631–635, 2012.
[13] K. Heath, N. Gelfand, M. Ovsjanikov, M. Aanjaneya, and L. J. Guibas. Image webs: Computing and exploiting connectivity in image collections. In CVPR ’10, pages 3432–3439, 2010.
[14] M. S. Hossain, P. Butler, A. P. Boedihardjo, and N. Ramakrishnan. Storytelling in entity networks to support intelligence analysts. In KDD ’12, pages 1375–1383, 2012.
[15] M. S. Hossain, J. Gresock, Y. Edmonds, R. Helm, M. Potts, and N. Ramakrishnan. Connecting the dots between pubmed abstracts. PloS one, 7(1):e29509, 2012.
[16] G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller. Learning to align from scratch. In NIPS ’12, pages 764–772. 2012.
[17] H. Kang, C. Plaisant, B. Lee, and B. B. Bederson. NetLens: Iterative exploration of content-actor network data. Info. Vis., 6(1):18–31, 2007.
[18] H. Khurana, J. Basney, M. Bakht, M. Freemon, V. Welch, and R. Butler. Palantir: a framework for collaborative incident response and investigation. In IDtrust, 2009.
[19] D. Kumar, N. Ramakrishnan, R. F. Helm, and M. Potts. Algorithms for storytelling. In KDD ’06, 2006.
[20] S. Laxman, P. Sastry, and K. Unnikrishnan. Discovering frequent episodes and learning hidden Markov models: A formal connection. KDE, 17(11):1505–1517, 2005.
[21] H.-J. Lee, W.-S. Lee, and J.-H. Chung. Face recognition using Fisherface algorithm and elastic graph matching. In ICIP ’01, volume 1, pages 998–1001, 2001.
[22] P. Li, J. Ma, and S. Gao. Learning to summarize web image and text mutually. In ICMR ’12, page 28, 2012.
[23] P. Li, Y. Wang, W. Gao, and J. Jiang. Generating aspect-oriented multi-document summarization with event-aspect model. In EMNLP ’11, pages 1137–1146, 2011.
[24] A. S. Maiya, J. P. Thompson, F. Loaiza-Lemos, and R. M. Rolfe. Exploratory analysis of highly heterogeneous document collections. In KDD ’13, pages 1375–1383, 2013.
[25] PNNL. Pacific Northwest National Laboratory, IN-SPIRE visual document analysis. Last accessed: May 26, 2011, http://inspire.pnl.gov.
[26] G.-J. Qi, C. Aggarwal, and T. Huang. Towards semantic knowledge propagation from text corpus to web images. In WWW ’11, pages 297–306, 2011.
[27] B. Rafkind, M. Lee, S.-F. Chang, and H. Yu. Exploring text and image features to classify images in bioscience literature. In BioNLP ’06, pages 73–80, 2006.
[28] D. Shahaf and C. Guestrin. Connecting the dots between news articles. In KDD ’10, pages 623–632, 2010.
[29] Stanford Natural Language Processing Group. Stanford NER. Accessed: May 8, 2015, http://nlp.stanford.edu/software/CRF-NER.shtml.
[30] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR ’13, pages 3476–3483, 2013.
[31] M. Turk and A. Pentland. Eigenfaces for recognition. Cognitive Neuroscience, 3(1):71–86, 1991.
[32] University of Maryland. Global Terrorism Database. Accessed: May 8, 2015, http://www.start.umd.edu/gtd/.
[33] M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In CVPR ’10, pages 2729–2736, 2010.
[34] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR ’01, volume 1, pages I–511, 2001.
[35] M. Wu. Investigations on event-based summarization. In COLING ’06, pages 37–42, 2006.
[36] J. Xu and T.-C. Lu. Seeing the big picture from microblogs: Harnessing social signals for visual event summarization. In IUI ’15, pages 62–66, 2015.
[37] Q. Zhao, P. Mitra, and B. Chen. Temporal and information ﬂow based event detection from social text streams. In AAAI ’07, volume 7, pages 1501–1506, 2007.