Connecting the Dots using Contextual Information Hidden in Text and Images

(Complementary Material)

Md Abdul Kader1, Sheikh Motahar Naim1, Arnold P. Boedihardjo2, M. Shahriar Hossain1
1The University of Texas at El Paso, El Paso, TX 79968
2U. S. Army Corps of Engineers, Alexandria, VA 22315


ABSTRACT

Creation of summaries of events of interest from a multitude of unstructured data is a challenging task commonly faced by intelligence analysts seeking increased situational awareness. This document proposes a framework called Storyboarding that leverages unstructured text and images to explain events as sets of sub-events. The framework first generates a textual context for each human face detected in the images and then builds a chain of coherent documents in which every two consecutive documents of the chain share a common theme as well as a context. Storyboarding helps analysts quickly narrow down a large number of possibilities to a few significant ones for further investigation. Empirical studies on Wikipedia documents, images, and news articles show that Storyboarding is able to provide deeper insights into events of interest.

1 Introduction

Due to the rapid growth of inexpensive storage devices and improvements in the semiconductor industry, the amount of data in our world has been exploding. The increasing volume and detail of information, and the rise of multimedia, social media, and the Internet of Things, will fuel the exponential growth of data for the foreseeable future. While publicly available data feeds are increasing exponentially and provide a massive source of intelligence, ironically this plethora of information is what makes succinct details latent during the analysis of an event of interest. Traditional search engines help narrow down the scope of analysis to a specific event, but it is not an easy task for an analyst to study a massive amount of relevant documents and gain a good understanding of how the sub-events compose a complex set of interactions between the actors involved in a major event. In this document, we describe a framework called Storyboarding that provides summarizations of an event as chains of coherent documents and relevant image objects using publicly available data.

Summarization of an event involves identifying relevant entities (e.g., persons, organizations, and locations), discovering non-obvious connections between the entities to construct a coherent story (a task sometimes referred to in the literature as “connecting the dots”), and providing relevant information (e.g., images) to explain the connections better. The “connecting the dots” problem has been of great interest to researchers in a variety of domains and has been studied in diverse contexts, for example, in entity networks [14], image collections [13], social networks [9], and document collections [15]. Our approach provides a mechanism to build a story between two news articles published at two different times using a sequence of intermediate published articles. The sequence provides a summary of a sub-event that explains a major event in part. Our approach complements other summarization techniques [23, 35] by forming a chain of articles and relevant images to provide better understanding of the story. Conventional summarization techniques that directly use the textual contents of a cluster of relevant documents as text snippets to explain an event can further explain the stories generated by our framework.



Sub-event: The story contains five news articles associating the Boston bombers’ involvement in the Waltham triple murder. The story includes the trial phase.

Figure 1: Storyboarding explains a sub-event using Boston marathon bombing data.


The purpose of automatic summarization of an event is to give an analyst a good overview and a clear visualization of the event without overburdening her/him with too many details. Our storyboarding approach provides summarization of events using both textual and imagery information. The relevant event corpus need not contain any image within its documents. The storyboarding framework builds a knowledge base first, which later helps in providing an image context for each story. A straightforward way of incorporating images during summarization of events is to generate the story using only a similarity network of news articles, as done in Storytelling [14], and then attach images to enrich the presentation of the story. This straightforward approach does not take the contents of the images into account, whereas the images are likely to carry vital information for a story-building process. In our Storyboarding framework, we leverage image objects (e.g., faces) as added pieces of information to the text content. As a result, objects within images (e.g., faces) play an important role along with the actual text content in the story-building process. As an example of our Storyboarding output, Figure 1 provides a sub-event explaining an aspect of the Boston Marathon Bombing tragedy. Figure 1 is further explained later in Section 6.5.

In summary, the main contributions of this document are:

1.
Knowledge base: The Storyboarding framework connects textual information with facial features using a probabilistic technique that enables generation of a textual context for each of the faces extracted from the images of a large unlabeled multimodal dataset.
2.
Feature extraction: We have developed a novel frontalization mechanism that enhances feature extraction by complementing conventionally used techniques. Empirical studies show that features generated using our frontalization technique outperform features extracted using state-of-the-art methods in a face recognition task. Experiments also show that the use of frontalization provides better textual context for faces.
3.
Exploration of multimodal data: Storyboarding leverages multimodal data (e.g., text and images) to explain events as chains of documents that progress over time, giving an idea of how the story evolved. Storyboarding traverses a similarity network of news articles without materializing the network entirely and by constraining consecutive documents with a cohesion threshold, a context overlap requirement, and temporal ordering.

2 Related Research

The literature review is divided into the following areas. Situational awareness and intelligence analysis: Research on software tools for intelligence analysis and situational awareness can be grouped into model-guided systems [3], topic-guided methods [24], graph-based analytic tools [11, 7], and collaborative systems [5]. Other techniques are geared toward visual analytics, such as IN-SPIRE [25], NetLens [17], and Palantir [18], where the reasoning processes require complex interaction between the analysts and the tools. Our work falls in the category of graph-based analytic tools. We support exploration of a similarity graph of documents (without materializing it) to help marshal summary items, explain connections, and form an explanation.

Event Summarization: Detection and summarization of events have been studied in a variety of domains such as computer systems, intelligence analysis, and surveillance systems. Since the definition of an event varies largely among domains, a range of techniques have been proposed in the literature, including frequent itemset mining [6], episode mining [20], graph search [36], and temporal event clustering [12], to name a few. There is a recent trend of increasing interest in summarizing viral events using users’ interactions on social media. The graph-based model proposed by Xu and Lu [36], which provides a visual summary of an event using images from various social media, has some resemblance to our framework. However, the Storyboarding framework summarizes events as stories of several sub-events using both the textual and visual contents of the documents.

Association between image and text: Solutions to many real-life problems involve mapping text to images, e.g., recommending text advertisements for image search [37], classifying images in the bioscience literature [27], providing text for Web images [26], and filling news articles with relevant images [22]. Some of these techniques rely on training data built from existing knowledge bases [22, 37]. The work of Li et al. [22] on mutual summarization of web images and text has some resemblance to Storyboarding, since it provides narratives to explain an event. However, the scope of the work of Li et al. is limited to explaining one image by selecting text from an article, or providing a set of related images if there is a text narrative. Storyboarding’s capabilities are not limited to one image or one document; rather, it relates components at the feature level.

Network exploration: The core algorithm of Storyboarding solves a “connecting the dots” problem, which has appeared in the literature before in a variety of guises: entity networks [10, 14], image collections [13], social networks [9], and document collections [8, 19, 28]. While some of these efforts can be adapted toward our problem context, the notion of Storyboarding emphasizes amalgamating heterogeneous information (text and image features) in the summarization process, whereas the above efforts typically require a stronger connecting thread between nodes of a homogeneous network.

3 Problem Formulation

The Storyboarding problem involves two different datasets: a knowledge base κ and a collection of news articles 𝒜. The knowledge base κ = {𝒟, ℰ_κ, ℐ, ℱ, R_DE, R_DI, R_IF} is a 4-partite graph whose four sets of vertices are (1) 𝒟 = {d_1, d_2, ..., d_n}, a collection of n documents, (2) ℰ_κ = {e_1, e_2, ..., e_|ℰ_κ|}, a set of entities (e.g., names of people, organizations, and locations), (3) ℐ = {i_1, i_2, ..., i_|ℐ|}, a set of images, and (4) ℱ = {f_1, f_2, ..., f_|ℱ|}, a set of faces extracted from the images. R_DE is the set of edges {{d_q, e_r} : d_q ∈ 𝒟, e_r ∈ ℰ_κ} that represents entities within each document. Similarly, R_DI = {{d_q, i_r} : d_q ∈ 𝒟, i_r ∈ ℐ} and R_IF = {{i_q, f_r} : i_q ∈ ℐ, f_r ∈ ℱ} are the sets of edges representing images within documents and faces of people within the images, respectively. The dataset of news articles is a bipartite graph {𝒜, ℰ_𝒜, R_𝒜ℰ}, where 𝒜 = {a_1, a_2, ..., a_m} is a set of vertices representing news articles and ℰ_𝒜 = {e'_1, e'_2, ..., e'_|ℰ_𝒜|} is the set of entities within the news articles. An edge {a_q, e'_r} ∈ R_𝒜ℰ exists if entity e'_r occurs in article a_q. Each news article a_q has a time-stamp, denoted Timestamp(a_q), that refers to its publication date.

Given two news articles {a_s, a_t} ⊂ 𝒜 such that Timestamp(a_s) < Timestamp(a_t), the Storyboarding framework outputs a summary represented as a sequence of {article, face-set} pairs, 𝒵 = {z_1, z_2, ..., z_q}. Each {article, face-set} pair z_i includes an article from 𝒜 and a set of faces from ℱ that are relevant to the context of the article. Let A(z_i) be the article associated with z_i and F(z_i) the relevant set of faces. In sequence 𝒵, a_s = A(z_1) and a_t = A(z_q). In Storyboarding, the sequence 𝒵 must maintain the following three properties:

1.
For any two consecutive {article, face-set} pairs z_i and z_{i+1} in the sequence, if the respective articles are A(z_i) and A(z_{i+1}), then Timestamp(A(z_i)) ≤ Timestamp(A(z_{i+1})). This indicates that the story must progress in time from the start article to the end.
2.
Two consecutive articles A(z_i) and A(z_{i+1}) must have a distance less than a maximum allowable distance threshold θ. This enforces a certain coherence between the topics of consecutive documents of 𝒵.
3.
The distance between the contexts of two consecutive sets of faces F(z_i) and F(z_{i+1}) in 𝒵 must not exceed a maximum allowable context distance τ.

4 Methodology

The Storyboarding framework has three operational stages: (1) construction of the knowledge base κ, represented as a 4-partite graph in Section 3, (2) building a model for context generation using the knowledge base, and (3) generating summaries for a given event dataset. Among these three stages, the first two can be viewed as a startup cost that can be amortized over all generated summaries. The following subsections describe these three stages.

4.1 Knowledge Base, κ

The 4-partite knowledge base graph κ = {𝒟, ℰ_κ, ℐ, ℱ, R_DE, R_DI, R_IF} is constructed using the images and textual content of Wikipedia. In this document, we focus our analysis on topics related to politics and terrorism. To generate the data of focus, we trained a classifier to identify interesting versus uninteresting Wikipedia documents. We used the 20Newsgroups dataset and the Global Terrorism Database (GTD) [32] for training. The entire GTD and three groups from the 20Newsgroups dataset (talk.politics.misc, talk.politics.guns, talk.politics.mideast) were considered interesting, and the remaining groups of the 20Newsgroups dataset were considered uninteresting. The classifier identified 118,290 out of roughly 15 million Wikipedia pages (Table 1) as interesting and hence relevant to politics and terrorism. Document set 𝒟 is composed of the text content of these 118,290 Wikipedia pages. Similarly, all the relevant images form ℐ. ℰ_κ is formed by extracting named entities from each document in 𝒟. ℱ is constructed by detecting human faces in the images of ℐ. R_DE, R_DI, and R_IF are the relevant sets of edges representing the existence of entities in documents, images in documents, and faces in images, respectively.
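As a concrete illustration of this filtering step, the sketch below trains a relevance classifier under stated assumptions: the paper does not specify which classifier was used, so tf-idf features with logistic regression (scikit-learn) stand in here, and all variable names are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_relevance_filter(interesting_docs, uninteresting_docs):
    # interesting: GTD records plus the three politics newsgroups;
    # uninteresting: the remaining 20Newsgroups groups
    texts = list(interesting_docs) + list(uninteresting_docs)
    labels = [1] * len(interesting_docs) + [0] * len(uninteresting_docs)
    model = make_pipeline(TfidfVectorizer(stop_words="english"),
                          LogisticRegression(max_iter=1000))
    return model.fit(texts, labels)

# Applying the trained filter to Wikipedia page texts (wiki_pages is hypothetical):
# relevant_pages = [p for p in wiki_pages if model.predict([p])[0] == 1]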

4.1.1 Face Detection from ℐ

We used the Viola-Jones algorithm [34] with a trained cascade classification model to detect faces in Wikipedia images. The cascade classification model is an ensemble of weak classifiers built on top of classification and regression tree based strategies. The model uses rotated Haar-like features [34] that significantly improve the accuracy of the Viola-Jones algorithm. However, about 50% of the faces detected by this state-of-the-art algorithm were incorrect when we applied it to Wikipedia images. We added an additional constraint, ‘each face must include an eye pair’, which significantly reduced this error rate to 1%.
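A minimal sketch of this face-filtering step, assuming OpenCV's bundled Haar cascades; the specific cascade files and detection parameters below are illustrative and may differ from the trained model used in the framework.

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_valid_faces(image_path):
    # Return face bounding boxes that also contain a detectable eye pair.
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    valid = []
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        roi = gray[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
        if len(eyes) >= 2:      # constraint: each face must include an eye pair
            valid.append((x, y, w, h))
    return valid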

Extraction of effective features from the detected faces is a challenging task because the quality of the features later dominates the accuracy of the generated contexts (similar to face recognition). Popular feature extraction methods commonly used in face recognition, such as Eigenface [31], Fisherface [21], and Local Binary Patterns [1], perform reasonably well when most of the faces are front-facing. Since many of the faces in the Wikipedia images are side-facing, we use a novel frontalization method to capture the positions of some key facial points in a projected plane where the side-facing photo represents a front-posing face. This enables us to bring side faces to a common space where all faces are considered front-facing. Our observation is that our facial key point frontalization technique provides better features when combined with Eigenface.

Facial Key Points Detection: We detect five facial key points – the two eye centers, the nose, and the two mouth corners, as shown in Figure 2 – using a pre-trained cascaded deep convolutional neural network. The network utilizes texture information over the entire face and the geometric constraints among key points, achieving high accuracy [30]. Since a face can be in any pose in an image, we need to frontalize the facial key points to be able to extract the actual geometric properties of a face. Before the frontalization, it is necessary to estimate the angles by which a face deviates from a front-facing position.



Figure 2: Detected eyes, nose, and mouth corner points using the Cascaded Deep Convolutional Neural Network for a few Wikipedia images.


Angle Prediction: We use a Generalized Linear Model (GLM) to predict the vertical and side angles of a face. We generate a synthetic dataset using six 3D face models created from six face images covering various ethnicities (e.g., Caucasian, Scandinavian, Japanese). Since the dimension of faces in the knowledge base κ is 100 × 100 pixels, we imagine a cube of 100 × 100 × 100 for the 3D face models where the center of the head is at the origin (0, 0, 0). The five facial key points of the faces are then marked on these models. We made three assumptions for simplicity in placing the five key points in the model.

1.
Facial key points for the eye pair and mouth corners are in the same plane, which is +35 units ahead of and parallel to the xz-plane.
2.
The facial key point for the nose is in a plane 𝒫 parallel to the xz-plane and +15 units ahead of the plane of the mouth corners and eye pair.
3.
The nose point π_nose = (0, 50, 0) is fixed for all models, but the eye pair and mouth corner points are placed by maintaining the relative distances of those points.

To create a synthetic training dataset of faces with known facial key points and known angles, we rotate the 3D models by varying angles around the z-axis and x-axis and project them back onto the 2D xz-plane. We rotate the models from −45° to +45° at 3° intervals around the z-axis and from −15° to +15° at 5° intervals around the x-axis. This produces a training set of 1302 instances. Each instance consists of 30 features, as described in the “Feature Generation” part later. The two class labels are the azimuth angle az around the z-axis and the elevation angle el around the x-axis. We train two GLM models using these two sets of class labels. After training, the models can predict the vertical and side angles for any five facial key points of a face. This enables us to apply frontalization by those predicted angles.
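A sketch of generating this synthetic training set and fitting the two angle predictors, assuming NumPy and scikit-learn; face_models (six 5x3 arrays of model key points) and keypoint_features (the 30-feature extractor described under “Feature Generation”) are placeholders, and LinearRegression stands in for the GLM.

import numpy as np
from itertools import product
from sklearn.linear_model import LinearRegression

def rotation(axis, deg):
    # 3x3 rotation matrix around the z- or x-axis.
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    if axis == "z":
        return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])  # x-axis

def project_xz(points_3d):
    # Orthogonal projection of 3D key points onto the 2D xz-plane.
    return points_3d[:, [0, 2]]

def make_training_set(face_models, keypoint_features):
    # 31 azimuths x 7 elevations x 6 models = 1302 instances.
    X, y_az, y_el = [], [], []
    for model in face_models:
        for az, el in product(range(-45, 46, 3), range(-15, 16, 5)):
            rotated = model @ rotation("z", az).T @ rotation("x", el).T
            X.append(keypoint_features(project_xz(rotated)))
            y_az.append(az)
            y_el.append(el)
    return np.array(X), np.array(y_az), np.array(y_el)

# X, y_az, y_el = make_training_set(face_models, keypoint_features)
# glm_az = LinearRegression().fit(X, y_az)   # azimuth predictor
# glm_el = LinearRegression().fit(X, y_el)   # elevation predictor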



Figure 3: Frontalization of facial key points.


Frontalization: We predict the azimuth (side) angle az and the elevation angle el for each detected face by applying the GLMs to the extracted facial key points. Let ρ = {ρ_1, ρ_2, ..., ρ_5} be the set of five facial key points extracted from a face. Algorithm 1 describes the steps for frontalizing the set of key points ρ. The algorithm uses a function Rotate that rotates a 3D point around a particular axis by a certain angle. Figure 3 shows a sample of frontalization applied to five facial key points.


______________________________________________________________________________
Algorithm 1: Facial Key Point Frontalization
______________
Input: ρ, π_nose, 𝒫, az, el        Output: ρ^frontalized
1:  𝒫′ ← plane 𝒫 rotated by az around the z-axis, then by el around the x-axis
2:  π′_nose ← Rotate(Rotate(π_nose, axis = z, az), axis = x, el)
3:  for all ρ_k ∈ ρ do
4:      ρ′_k ← ρ_k + (π′_nose_x, π′_nose_z) − ρ_3
5:      L_k ← line perpendicular to the xz-plane passing through ρ′_k
6:      t_k ← intersection point of plane 𝒫′ and line L_k
7:      t′_k ← Rotate(Rotate(t_k, axis = x, −el), axis = z, −az)
8:      ρ_k^frontalized ← t′_k orthogonally projected onto the xz-plane
9:  end for
10: return ρ^frontalized
___________________________________________________________________________
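A sketch of Algorithm 1 in Python under stated assumptions: the detected key points are 2D (x, z) pixel coordinates with the nose as the third point, the eye/mouth plane 𝒫 is y = 35 with the nose at (0, 50, 0) as in the modeling assumptions above, and step 7 undoes the rotations with negated angles; all names are illustrative, not the authors' implementation.

import numpy as np

def rot(axis, deg):
    # 3x3 rotation matrix around the z- or x-axis.
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    if axis == "z":
        return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])  # x-axis

def frontalize(rho, az, el, nose_3d=(0.0, 50.0, 0.0), plane_y=35.0):
    # rho: 5x2 array of detected (x, z) key points; returns frontalized (x, z) points.
    rho = np.asarray(rho, dtype=float)
    R_fwd = rot("x", el) @ rot("z", az)            # rotate by az (z-axis), then el (x-axis)
    normal = R_fwd @ np.array([0.0, 1.0, 0.0])     # normal of the rotated plane P'
    p0 = R_fwd @ np.array([0.0, plane_y, 0.0])     # a point on P'
    nose_rot = R_fwd @ np.asarray(nose_3d)
    out = np.zeros_like(rho)
    for k, pt in enumerate(rho):
        # step 4: align the detected nose (rho[2]) with the rotated model nose
        x, z = pt + np.array([nose_rot[0], nose_rot[2]]) - rho[2]
        # steps 5-6: line through (x, 0, z) along the y-axis, intersected with P'
        origin = np.array([x, 0.0, z])
        direction = np.array([0.0, 1.0, 0.0])
        t = np.dot(p0 - origin, normal) / np.dot(direction, normal)
        hit = origin + t * direction
        # steps 7-8: undo the rotations and project back onto the xz-plane
        back = rot("z", -az) @ rot("x", -el) @ hit
        out[k] = back[[0, 2]]
    return out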

Figure 4: Twenty angles generated from the five frontalized facial key points.

Feature Generation: An important criterion in building a good feature set for a face is that the produced features vary little for a particular person under different lighting conditions, expressions, and occlusions. Our method for facial key point generation and frontalization aids in generating features with such properties. We generate two sets of features from the frontalized facial key points. The first set is composed of the pairwise relative distances between all five facial key points, and the second set comprises twenty angles, as shown in Figure 4. The combination of these two sets makes a feature vector of length thirty, where the first ten are the pairwise relative distances and the last twenty are the angles. In addition to these thirty features, we leverage pixel information of the faces as used in Eigenface. Each face is stored as a 35 × 35 image, resulting in 1225 pixels per face. We perform max-min normalization of standardized pixel values to reduce the noise in the images due to color intensity variation. The previously extracted thirty features using frontalized key points are concatenated with these normalized pixel values, forming a vector of length 1255 for each face. Finally, we apply principal component analysis on the concatenated feature vectors and select the number of principal components by plotting and observing the relationship between the relative magnitude of the eigenvalues and the number of components.
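A sketch of the feature construction under stated assumptions: the ten pairwise distances are normalized by the largest one, the twenty angles (whose exact definition follows Figure 4) are approximated here by the directed orientation of each of the ten key-point segments, and the pixel part follows the standardize-then-max-min-normalize step described above; all names are illustrative.

import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA

def geometric_features(pts):
    # pts: 5x2 frontalized key points -> 10 relative distances + 20 angles.
    pairs = list(combinations(range(5), 2))
    dists = np.array([np.linalg.norm(pts[i] - pts[j]) for i, j in pairs])
    dists = dists / dists.max()                    # pairwise relative distances
    angles = []
    for i, j in pairs:                             # stand-in for the 20 angles of Figure 4
        v = pts[j] - pts[i]
        angles.append(np.arctan2(v[1], v[0]))      # orientation of segment i -> j
        angles.append(np.arctan2(-v[1], -v[0]))    # orientation of segment j -> i
    return np.concatenate([dists, np.array(angles)])   # length 30

def face_vector(pts, face_img_35x35):
    # Concatenate the 30 geometric features with 1225 normalized pixel values.
    px = face_img_35x35.astype(float).ravel()
    px = (px - px.mean()) / (px.std() + 1e-8)              # standardize
    px = (px - px.min()) / (px.max() - px.min() + 1e-8)    # max-min normalize
    return np.concatenate([geometric_features(pts), px])   # length 1255

# pca = PCA(n_components=...).fit(all_face_vectors)  # pick the number of components
# from the eigenvalue-vs-components plot as described above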

4.1.2 Extraction of Named Entities, ℰ_κ, and their Representations

We combine the outputs of a number of named entity extractors, including LingPipe [2], OpenNLP [4], and Stanford NER [29], to identify entities within the textual contents of the documents in 𝒟. Although we extracted all standard entity types, including persons, organizations, and locations, this document scopes down the analysis to person entities only, especially because the images are explained using detected human faces. We represent the named entities using vector space modeling of {𝒟, ℰ_κ, R_DE}, which is a subset of the knowledge base 4-partite graph κ. The weight of entity e ∈ ℰ_κ in document d ∈ 𝒟 is calculated as:

W(e,d) = \frac{\left(1 + \log(tf_{e,d})\right) \cdot \log\!\left(\frac{|\mathcal{D}|}{df_e}\right)}{\sqrt{\sum_{e' \in \mathcal{E}_d^{\kappa}} \left[\left(1 + \log(tf_{e',d})\right) \cdot \log\!\left(\frac{|\mathcal{D}|}{df_{e'}}\right)\right]^2}}    (1)

where tf_{e,d} is the frequency of entity e in document d, df_e is the number of documents containing a connection with entity e, and ℰ_d^κ is the set of entities that are connected to document d. Equation 1 is a variant of tf-idf modeling with cosine normalization. The Wikipedia corpus contains descriptions of different sizes. In general, longer descriptions have higher term frequencies because many terms are repeated. The cosine normalization helps lessen the impact of description size in the modeling.
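A small sketch of the weighting in Equation 1 over sparse per-document entity counts; the dictionary-based data layout is an assumption for illustration.

import math

def entity_weights(tf_d, df, n_docs):
    # tf_d: {entity: raw count in document d}; df: {entity: document frequency}.
    raw = {e: (1.0 + math.log(c)) * math.log(n_docs / df[e])
           for e, c in tf_d.items() if c > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))      # cosine normalization
    return {e: (w / norm if norm > 0 else 0.0) for e, w in raw.items()}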

4.2 Context Modeling

The context of a face f is a ranked list of named entities ordered by relevance to f and expressed as a probability distribution over the set of entities ℰ_κ. The contexts of the faces are leveraged to generate chains of contextual documents in Section 4.3. Using Bayes' rule, the probability of an entity e ∈ ℰ_κ given the face f can be expressed as

P(e|f) \propto P(e) \times P(f|e)    (2)

where P(e) is the prior probability of e and P(f|e) is the likelihood. Let f = {f_1, f_2, ..., f_L} be the feature representation of the face f obtained by the method described in Section 4.1.1, where L is the length of the feature vector of f. With an assumption of independence between the features, we can rewrite Equation 2 as

P(e|f) \propto P(e) \times \prod_{l=1}^{L} P(f_l|e)    (3)

The person entity e can appear in multiple documents of 𝒟. Let 𝒟_e ⊆ 𝒟 be the set of documents that contain entity e. Each document d ∈ 𝒟_e may in turn contain a number of faces (as expressed by R_DI of κ). Let ℱ_d be the set of faces in document d. The relationships between the face features and the entities in documents can be computed from the entity weights W(e,d) and the face-feature weights in the documents. The likelihood of the lth feature of a face given an entity e can be computed as

P(f_l|e) = \left[\frac{\sum_{d \in \mathcal{D}_e} P(f_l|d) \times W(e,d)}{\sum_{d \in \mathcal{D}_e} W(e,d)}\right]^{f_l}    (4)

where P(f_l|d), calculated by Equation 5, is the probability of face feature f_l given document d.

P(f_l|d) = \frac{\sum_{\phi \in \mathcal{F}_d} \phi_l}{\sum_{\phi \in \mathcal{F}_d} \sum_{l'=1}^{L} \phi_{l'}}    (5)

where ϕ_l is the lth feature of a face ϕ ∈ ℱ_d. Substituting this expression for P(f_l|e) into Equation 3, we get

P(e|f) \propto P(e) \times \prod_{l=1}^{L} \left[\frac{\sum_{d \in \mathcal{D}_e} P(f_l|d) \times W(e,d)}{\sum_{d \in \mathcal{D}_e} W(e,d)}\right]^{f_l}    (6)

Taking the logarithm of both sides of Equation 6, we obtain the following equation.

\log(P(e|f)) \propto \log(P(e)) - \sum_{l=1}^{L} f_l \times \log\!\left(\sum_{d \in \mathcal{D}_e} W(e,d)\right) + \sum_{l=1}^{L} f_l \times \log\!\left(\sum_{d \in \mathcal{D}_e} P(f_l|d) \times W(e,d)\right)    (7)

We generate a context for each face in ℱ using Equation 7. In practice, we do not keep the full probability distributions; rather, we keep at most L_C entities with the highest probabilities as the context of a face.
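A sketch of the context generation of Equation 7, assuming the entity weights W(e, d), the per-document feature distribution P(f_l | d) of Equation 5, and the prior P(e) are precomputed and stored in the dictionaries shown; the names are illustrative.

import math

def face_context(face_vec, entities, docs_of, W, P_feat_doc, prior, L_C=80):
    # Return the L_C entities with the highest log P(e | face) under Equation 7.
    scores = {}
    for e in entities:
        docs = docs_of[e]                                 # documents containing entity e
        w_sum = sum(W[(e, d)] for d in docs)
        if w_sum <= 0:
            continue
        score = math.log(prior[e])
        for l, f_l in enumerate(face_vec):
            num = sum(P_feat_doc[d][l] * W[(e, d)] for d in docs)
            if num > 0:
                score += f_l * (math.log(num) - math.log(w_sum))
        scores[e] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:L_C]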

4.3 Automatic Summarization as Stories

The core storyboarding task, as described in Section 3, focuses on forming a chain of {article, face-set} pairs using the news articles in 𝒜 and the face contexts generated from the knowledge base κ. This part requires extracting person entities from 𝒜 to form ℰ_𝒜, discovering the faces relevant to each news article based on the similarity between the article and the contexts of images in the knowledge base, and designing a path-finding algorithm over a similarity network of news articles where discovered paths are constrained by text coherence, context similarity, and progression of time in the story.

ℰ_𝒜 for the news articles is formed in the same way we formed ℰ_κ for the knowledge base (as described in Section 4.1.2). In reality, news articles may contain images, but many of those images are repeated from past coverage to provide a visual context, and many news articles do not contain any image at all. Since the knowledge base covers a broad range of topics, face images can be reproduced from the images of the knowledge base. In our set of news articles 𝒜, we assume that the articles contain no images, and we reproduce relevant faces for each article of a story using the knowledge base. We order the faces for each news article a ∈ 𝒜 based on the relevance between a and all faces. We compute this relevance using a context matching score between a news article a and the context C(f) ⊆ ℰ_κ of a face f:

\beta(f,a) = \sum_{e \in \mathcal{E}_a^{\mathcal{A}} \cap C(f)} V(e,a) \times \left(|C(f)| - R(e,f)\right)    (8)

where V is a function similar to W defined in Equation 1 and |C(f)| ≤ L_C. The only difference is that V weights are computed over the news articles of 𝒜 instead of the documents in the knowledge base 𝒟. R(e,f) is the rank of e in the list of entities C(f) of face f. That is, we have two kinds of information associated with each news article: one is the weighted list of entities extracted from the text of the article, and the other is the set of most relevant faces (or contexts) from the knowledge base. Both types are leveraged in a heuristic search algorithm to build a path of {article, face-set} pairs between two articles {a_s, a_t} ⊂ 𝒜. We use a variation of the A* search algorithm that uses Soergel distance as the heuristic; Soergel distance is an admissible heuristic for A* search. The Soergel distance between two articles a and a' is calculated using the following formula.

SrgDist(a,a') = \frac{\sum_{e \in \mathcal{E}_a^{\mathcal{A}} \cup \mathcal{E}_{a'}^{\mathcal{A}}} |V(e,a) - V(e,a')|}{\sum_{e \in \mathcal{E}_a^{\mathcal{A}} \cup \mathcal{E}_{a'}^{\mathcal{A}}} \max\left(V(e,a), V(e,a')\right)}    (9)

Equation 9 is also used to compute distance between the combined contexts of the relevant faces of two articles.
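A sketch of the Soergel distance of Equation 9 over sparse entity-weight dictionaries; the same function can be applied to the combined face-context weights of two articles.

def soergel_distance(v1, v2):
    # v1, v2: {entity: V(e, a)} weight maps for two articles.
    entities = set(v1) | set(v2)
    num = sum(abs(v1.get(e, 0.0) - v2.get(e, 0.0)) for e in entities)
    den = sum(max(v1.get(e, 0.0), v2.get(e, 0.0)) for e in entities)
    return num / den if den > 0 else 0.0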

Our heuristic algorithm maintains the following properties during exploration.

1.
The complete search space, i.e., the network of articles, is not precomputed. Instead, neighboring articles are generated on the fly during the search by looking up a precomputed ball tree of the article dataset to find the b nearest neighbors.
2.
Any two consecutive articles during the search must maintain a maximum allowable distance 𝜃.
3.
The face contexts of one article, combined from a certain number of the most relevant faces retrieved using Equation 8, cannot be more than τ distant from the face contexts of a neighboring article.
4.
The search must have a progression over time, i.e., Timestamp(a_i) ≤ Timestamp(a_{i+1}) for two consecutive articles a_i and a_{i+1}.

Notice that all these constraints can be applied during the candidate evaluation phase of any heuristic search algorithm, as illustrated in the sketch below. We generate the b nearest neighbors based on the text content of the articles and rank the candidate articles for exploration based on their contexts. b is also the branching factor of the heuristic search algorithm. Empirical studies with different b, θ, and τ values are shown in Section 6.
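The sketch below shows how these constraints can be enforced during candidate evaluation in a simplified best-first variant of the search; it reuses soergel_distance from the sketch above, and the helpers neighbors (backed by the precomputed ball tree), article_vec, context_vec, and timestamp are assumed placeholders rather than the authors' implementation.

import heapq
from itertools import count

def storyboard_path(a_start, a_end, theta, tau, b,
                    neighbors, article_vec, context_vec, timestamp):
    # Returns a chain of article IDs from a_start to a_end, or None if no chain exists.
    tie = count()                                   # tie-breaker for the priority queue
    start_h = soergel_distance(article_vec(a_start), article_vec(a_end))
    frontier = [(start_h, next(tie), [a_start])]
    visited = set()
    while frontier:
        _, _, path = heapq.heappop(frontier)
        current = path[-1]
        if current == a_end:
            return path
        if current in visited:
            continue
        visited.add(current)
        for cand in neighbors(current, b):          # b nearest neighbors by text content
            if timestamp(cand) < timestamp(current):
                continue                            # the story must progress in time
            if soergel_distance(article_vec(current), article_vec(cand)) > theta:
                continue                            # text coherence constraint (theta)
            if soergel_distance(context_vec(current), context_vec(cand)) > tau:
                continue                            # face-context overlap constraint (tau)
            h = soergel_distance(article_vec(cand), article_vec(a_end))
            heapq.heappush(frontier, (h, next(tie), path + [cand]))
    return None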

5 Complexity Analysis

We estimate the scalability of the Storyboarding framework by analyzing the time complexities of the major components of the pipeline. The angle predictor for facial key point frontalization learns a linear model from N_s synthetic observations of L_s features and predicts the azimuth and elevation angles for each face in ℱ, incurring a total complexity of 𝒪(L_s² × N_s + L_s × |ℱ|). Frontalization of the five key points of a face using Algorithm 1 takes constant time. The feature generation step involves Eigenface computation on the feature matrix and takes 𝒪(L_feature² × |ℱ| + L_feature³) time, where L_feature is the length of each feature vector. Generation of contexts for all faces costs 𝒪(L × |ℱ| × N_d × |ℰ_κ|), where N_d is the average number of documents associated with each entity. The complexities described so far can be considered a one-time cost for a knowledge base. The constrained heuristic search to generate a story runs in 𝒪(b^L_sol) time, where L_sol is the length of the solution path. There is an additional cost of 𝒪(|𝒜| × log(|𝒜|)) for generating nearest neighbors for all articles using the ball tree, but this is a one-time cost for a news article dataset 𝒜 prepared for a major event.

6 Experimental Results

The specific questions we seek to answer in this section are:

1.
Which method is more capable of detecting facial key points for this particular knowledge base data set? (Section 6.1)
2.
Does frontalization of key points result in better face recognition accuracy? (Section 6.2)
3.
Which learning mechanism should we use to attain greater performance in predicting the proper rotation angle for frontalization? (Section 6.3)
4.
How good are the contexts generated for the face images? (Section 6.4)
5.
Do the stories generated by the framework provide meaningful summarization of events? (Section 6.5)
6.
What is the impact of face context on the quality of the stories? (Section 6.6)
7.
How do the search parameters control the characteristics of the stories? (Section 6.7)
8.
How does the runtime behavior of the search algorithm get affected by its parameters? (Section 6.8)

We collected all the Wikipedia pages to build our knowledge base, and accumulated the news articles published in The New York Times on a particular event to generate the stories from. Table 1 summarizes the knowledge base (κ) and news article (𝒜) data sets used in this work.



Table 1: Description of the data sets

Total number of pages in Wikipedia: 15,018,369
Number of selected pages in κ: 118,290
Number of images in κ: 524,928
Number of faces in κ: 100,928
Number of person entities in κ: 384,457
Number of news articles in 𝒜: 1,028
Number of person entities in 𝒜: 8,277

6.1 Facial Key Point Selection

Figure 5: Comparison of two facial key point detection techniques.

As described in Section 4.1, the use of key facial points to generate features later provides higher-quality face contexts. Detecting an appropriate number of key points is challenging because of the different poses faces can have. We applied two methods for facial key point generation: Boosted Regression with Markov Networks (BoRMaN) [33] and a Deep Convolutional Network Cascade (CNN) method. The former discovers twenty facial key points and the latter finds five. Our observation is that the BoRMaN method performs well on faces taken in a controlled environment, e.g., photos taken in laboratories. When we used faces detected from Wikipedia images, the BoRMaN method was able to detect facial points properly for 70% of the faces. For this experiment, we randomly picked 300 faces from our database and manually checked whether the detected points were in the vicinity of the expected pixels. With the same 300 faces, the accuracy of the CNN method in detecting five facial key points was 99%. Figure 5 shows two examples where the five key points are detected correctly by CNN but the twenty points detected by the BoRMaN method are cluttered in one region of the face.

6.2 Impact of Frontalization on Recognition



Figure 6: Comparison of face recognition accuracies using different combinations of Eigenface with and without Frontalization and mirror images.


Although Storyboarding targets a different problem than face recognition, we evaluate the proposed frontalization technique on a face recognition task to study its impact. In most of the face recognition literature, ground-truth faces of each person are taken in different poses under the same environment (e.g., illumination). However, faces detected from the Wikipedia images are not annotated and are limited in number and pose. In addition to the original face, we included the mirror image of each face to make sure that we have at least two poses of the same face. The experiment in this subsection compares face recognition accuracies with and without frontalization and mirror faces. We used a subset of 293 face images of 9 persons from the Labeled Faces in the Wild (LFW) dataset [16], which has annotated faces whose images were not taken in a controlled environment. Figure 6 compares face recognition accuracies at different training and test ratios with different combinations of techniques. The figure shows that Eigenface with frontalization and mirror faces gives better accuracy than any other combination. The standard deviation of the accuracies was 2%-5%.

6.3 Prediction of Frontalization Angles

Our frontalization technique relies on a generalized linear model based classifier to compute the azimuth and elevation angles of a face in an original image. To compare the linear model based classifier with a few alternative mechanisms, we created a synthetic face model and rotated it at different angles to create different poses. That is, the ground-truth angles are known for all the poses, and an error can be computed for the prediction of those angles. Figure 7 shows a comparison of the mean square errors of three methods for predicting rotation angles: Generalized Linear Model (GLM), Regression Tree, and Ensemble Regression Tree. Figure 7(a) is for sideways rotations and Figure 7(b) shows the errors for elevations of the faces. In both cases, GLM has lower errors than the other methods at most of the angles. In this experiment, the sideways angles were varied from −45° to +45° and the elevations were varied from −15° to +15°.



Figure 7: Mean square errors of (a) azimuth angle predictors and (b) elevation angle predictors.


6.4 Quality of Face-Contexts

We evaluate the quality of the context of a face by comparing the context with Google image search results. Although Google search has a different objective than ours, we are interested in checking whether some of the entities in the context of a face can be discovered by uploading the face image to Google image search. Since the Google image search API limits the number of queries and the process is time consuming, we randomly picked 300 faces and, for each face, counted how many entities of its generated context were found on the first page of results returned by Google after uploading the face. Figure 8 shows the percentage of faces for different numbers of entities overlapping between the context and the Google search results. As expected, the percentage of faces goes down when we look for more common entities. The plot shows that almost 75% of the faces have at least five entities in common with the corresponding Google search result when we keep the context size at 80. These 80 entities are selected from over 384K total entities available in Wikipedia.



Figure 8: Resemblance between a face context and a relevant Google image search.


In an additional experiment to evaluate the quality of the generated contexts against ground-truth information, we sampled 21 faces from ℱ and manually attached the most appropriate person entities to each of them with the help of human experts. We then compared these ground-truth contexts with our automatically generated contexts. Figure 9 shows the accuracy of the generated contexts for four context generation mechanisms. Eigenface combined with both frontalization and mirroring provides the best average accuracy across different context sizes.



Figure 9: Comparison of context generation methods for human annotated test data. Eigenface with Frontalization and mirror faces performs better than others.


The two evaluations above are quantitative, but we believe that the qualitative improvement from the frontalization and mirroring approach is much greater and difficult to measure. To illustrate this, we provide sample contexts of two faces in Figure 10. Relevant entities (to the best of our knowledge) are highlighted in bold text in the figure. Eigenface with frontalization and mirror faces provides the best contexts. Later, in Section 6.6, we show that the use of contexts in storyboarding produces better stories. Hence, it is important to generate high-quality contexts using frontalization and mirror faces so that the generated stories are more coherent.



Figure 10: Example context comparison for different methods. Each column depicts the contexts of two persons for a method. The last column shows the outcome of the best method of Figure 9.




Table 2: Stories using Boston Marathon Bombing data. (URLs of the original news articles are provided with the article IDs in blue and can be reached by clicking on the IDs.)

Story 1: NY286 (2014/04/15) → NY370 (2014/10/24) → NY334 (2014/11/12) → NY609 (2015/03/02) → NY677 (2015/04/10)
Explanation: This story associates the Boston bombers’ involvement in the Waltham triple murder. The story includes the trial phase.

Story 2: NY158 (2014/10/16) → NY379 (2014/10/16) → NY206 (2014/10/28) → NY280 (2014/10/28)
Explanation: The former governor of Massachusetts testifies for Tsarnaev’s friend Robel Phillipos. Phillipos was found guilty of making false statements to authorities.

Story 3: NY435 (2014/05/02) → NY317 (2014/06/18) → NY340 (2014/08/14) → NY525 (2015/02/06)
Explanation: Tsarnaev’s lawyers appeal to move the trial location to Washington, delaying the trial date. The story shows that the request was denied.

Story 4: NY121 (2014/04/22) → NY273 (2014/07/10) → NY177 (2014/08/20) → NY338 (2014/09/27)
Explanation: This story highlights the trials of three friends of Tsarnaev. All three were accused of obstructing justice by lying and destroying evidence.




6.5 Examples of Generated Stories using Boston Marathon Bombing Data

The New York Times returns 1,028 articles for the query “Boston Marathon Bombing”. The Storyboarding framework discovered a number of sub-events that provide a clear mental model of the branches of the Boston Marathon Bombing tragedy and the happenings afterwards. Table 2 lists a few of the stories discovered by the Storyboarding framework. The first story of Table 2 is illustrated in Figure 1 on a storyboard using term clouds and faces.

The story of Figure 1 describes a connection between the Boston bombers and the Waltham triple murder. The story moves forward to the penalty phase. The term cloud of the storyboard highlights a person named Todashev, who was linked to the triple murder, along with the bombers Tamerlan and Tsarnaev. Each related face list, automatically selected from the knowledge base by the Storyboarding framework, captures the faces of relevant people very well. For example, the first face of the second article of the story is Todashev, whose face is found repeatedly in the consecutive articles. The rest of the faces in the story are of the Boston bombers Tamerlan and Tsarnaev, police, and rescue crews.

The third sub-event of Table 2 summarizes Tsarnaev’s lawyers’ appeal to move the trial location to Washington, which delayed the trial date. The fourth story describes the trials of three friends of Tsarnaev who were accused of obstructing justice, lying, and destroying evidence.



Figure 11: Impact of search parameters on characteristics of stories.




Table 3: Sample stories generated by storyboarding with and without context. (URLs of the original news articles are provided with the article IDs in blue and can be reached by clicking on the IDs.)

Pair 1
Story with context: [NY435 → NY317 → NY175 → NY427 → NY642 → NY552 → NY525]. The story describes the request for delay and change of trial location.
Story without context: [NY435 → NY201 → NY525]. The intermediate article digresses from the focus and brings the investigation in Russia into consideration.

Pair 2
Story with context: [NY178 → NY370 → NY334 → NY609]. The story connects the triple murder with current cases.
Story without context: [NY178 → NY505 → NY458 → NY609]. The intermediate articles are about trial announcements.

Pair 3
Story with context: [NY265 → NY129 → NY279]. The story focuses on the trial of a friend of Tsarnaev and the jury selection for the suspected friends.
Story without context: [NY265 → NY129 → NY124 → NY279]. One of the intermediate articles is off-topic and is about a citation of winning a video game at the trial of the accused Boston bomber’s friend.

Pair 4
Story with context: [NY158 → NY379 → NY206 → NY280]. This story describes the former Governor’s testimony for Tsarnaev’s friend Robel Phillipos.
Story without context: [NY158 → NY280]. The testimony of the former Governor does not show up in the story without context.






Figure 12: Runtime characteristics of Storyboarding’s heuristic path finding algorithm.


6.6 Impact of the use of Context in Story Generation

While the inclusion of image context in the core heuristic path-finding part of the Storyboarding process imposes an additional constraint, the benefit of this constraint becomes evident when we compare the stories side by side. For a fair comparison, we make sure that both methods use the same parameters (e.g., the same θ, branching factor b, and start-end pairs). Table 3 shows four pairs of sample stories generated with and without face contexts.

Our observation is that Storyboarding generates more coherent stories when the context is used, which is to be expected because the context overlap between consecutive articles reinforces the constraint that each story must be woven around a certain theme. For all start-end pairs of articles in Table 3, the stories with context are more coherent than the stories without context. From the table, we observe that each story without the use of context contains some off-topic articles that may be relevant but that break the flow of the theme. In some cases, the same story is generated with and without the use of context; those stories are not reported here.

6.7 Storyboarding Characteristics in Terms of User-settable Parameters

Two important user-settable parameters in Storyboarding are the maximum allowable distance θ and the branching factor (number of nearest neighbors) b. Figure 11 shows the impact of θ and b on the statistical significance, average length, and number of stories. To calculate the statistical significance (p-value), we randomly pick b documents from the entire candidate pool and check whether the documents picked satisfy the distance threshold θ, iterating the test 5,000 times. We repeat this process for every junction article of a discovered story. The overall p-value of a story is calculated by multiplying the p-values of every document of the story except the last one. Figure 11 (left) shows that the significance decreases (i.e., the p-value increases) with higher values of θ and b. This is an expected outcome since higher θ values imply less stringent overlap of content between consecutive articles, and a less stringent constraint may result in stories with loose connections between consecutive articles. A similar argument explains the plot in Figure 11 (middle): increasing the θ value and branching factor b leads to shorter stories with loosely connected neighbors. The curves for branching factors 20 and 35 are exceptions, giving even shorter stories than larger branching factors until θ = 0.75. This exception is explained by the right plot, where we see that there were not enough stories for those two branching factors until θ = 0.75. For the other branching factors, the number of stories follows a similar upward trend with increasing θ.
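A sketch of the Monte Carlo significance test described above; the interpretation that a random draw of b candidates "satisfies" the threshold if at least one falls within θ of the junction article is an assumption, and pool, dist, and the story representation are illustrative placeholders.

import random

def junction_p_value(junction, pool, b, theta, dist, trials=5000):
    # Fraction of random b-sized draws containing an article within theta of the junction.
    hits = 0
    for _ in range(trials):
        sample = random.sample(pool, b)
        if any(dist(junction, a) <= theta for a in sample):
            hits += 1
    return hits / trials

def story_p_value(story, pool, b, theta, dist):
    # Product of junction p-values over every article of the story except the last one.
    p = 1.0
    for article in story[:-1]:
        p *= junction_p_value(article, pool, b, theta, dist)
    return p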

6.8 Runtime Characteristics of Story Finding Algorithm

In this subsection we describe the runtime of the core Storyboarding heuristic path-finding algorithm for different parameters, e.g., the distance threshold θ, the context threshold τ, and the branching factor b. Figure 12 (left) shows that an increase in b generally leads to higher execution time, since more candidate neighbors are explored from each node. Results in this plot are shown for different θ. As observed in Figure 12 (left), the curves may not increase monotonically because a higher b value also lowers the average length of the stories, as shown in Figure 11 (middle). Figure 12 (middle) shows that the runtime increases with the maximum allowed distance θ but remains uninfluenced by the maximum allowed distance between face contexts, τ. This is because τ only determines the order in which neighbors are explored during the search process. The rightmost plot also shows no impact of τ on the average runtime for varying b. The runtimes are averaged over all values of θ.

7 Conclusions

Our storyboarding framework complements conventional summarization techniques by weaving through textual and imagery data and presenting sub-events as chains of documents. Our directions for future work are twofold. The first direction is to incorporate multiple news archives into the knowledge base, which will provide the ability to connect one major event with another. Second, we will systematically study the trade-off between the standalone use of text content and the use of image context as additional information in the story-building process. This study will allow an analyst to explore different alternative stories smoothly by varying textual and imagery coherence requirements.

8 References

[1]     T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. PAMI, 28(12):2037–2041, 2006.

[2]     Alias-i. LingPipe 4.1.0. Accessed: May 8, 2015, http://alias-i.com/lingpipe/.

[3]     R. Alonso and H. Li. Model-guided information discovery for intelligence analysis. In CIKM ’05, pages 269–270, 2005.

[4]     Apache Software Foundation. OpenNLP. Accessed: May 8, 2015, http://opennlp.apache.org.

[5]     E. Bier, S. Card, and J. Bodnar. Principles and tools for collaborative entity-based intelligence analysis. TVCG, 16(2):178–191, 2010.

[6]     S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In VLDB ’98, volume 98, pages 606–617, 1998.

[7]     G. Chin, O. A. Kuchar, P. D. Whitney, M. Powers, and K. E. Johnson. Graph-based comparisons of scenarios in intelligence analysis. In SMC ’04, volume 4, 2004.

[8]     F. Das-Neves, E. A. Fox, and X. Yu. Connecting topics in document collections with stepping stones and pathways. In CIKM ’05, pages 91–98, 2005.

[9]     C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In KDD ’04, pages 118–127, 2004.

[10]     L. Fang, A. D. Sarma, C. Yu, and P. Bohannon. Rex: explaining relationships between entity pairs. Proc. VLDB Endow., 5(3):241–252, 2011.

[11]     J. Gersh, B. Lewis, J. Montemayor, C. Piatko, and R. Turner. Supporting insight-based information exploration in intelligence analysis. Commun. ACM, 49, 2006.

[12]     J. Gung and J. Kalita. Summarization of historical articles using temporal event clustering. In HLT-NAACL ’12, pages 631–635, 2012.

[13]     K. Heath, N. Gelfand, M. Ovsjanikov, M. Aanjaneya, and L. J. Guibas. Image webs: Computing and exploiting connectivity in image collections. In CVPR ’10, pages 3432–3439, 2010.

[14]     M. S. Hossain, P. Butler, A. P. Boedihardjo, and N. Ramakrishnan. Storytelling in entity networks to support intelligence analysts. In KDD ’12, pages 1375–1383, 2012.

[15]     M. S. Hossain, J. Gresock, Y. Edmonds, R. Helm, M. Potts, and N. Ramakrishnan. Connecting the dots between pubmed abstracts. PloS one, 7(1):e29509, 2012.

[16]     G. Huang, M. Mattar, H. Lee, and E. G. Learned-miller. Learning to align from scratch. In NIPS ’12, pages 764–772. 2012.

[17]     H. Kang, C. Plaisant, B. Lee, and B. B. Bederson. Netlens: Iterative exploration of content-actor network data. Info. Vis., 6(1):18–31, 2007.

[18]     H. Khurana, J. Basney, M. Bakht, M. Freemon, V. Welch, and R. Butler. Palantir: a framework for collaborative incident response and investigation. In IDtrust, 2009.

[19]     D. Kumar, N. Ramakrishnan, R. F. Helm, and M. Potts. Algorithms for storytelling. In KDD ’06, 2006.

[20]     S. Laxman, P. Sastry, and K. Unnikrishnan. Discovering frequent episodes and learning hidden markov models: A formal connection. KDE, 17(11):1505–1517, 2005.

[21]     H.-J. Lee, W.-S. Lee, and J.-H. Chung. Face recognition using fisherface algorithm and elastic graph matching. In ICIP ’01, volume 1, pages 998–1001, 2001.

[22]     P. Li, J. Ma, and S. Gao. Learning to summarize web image and text mutually. In ICMR ’12, page 28, 2012.

[23]     P. Li, Y. Wang, W. Gao, and J. Jiang. Generating aspect-oriented multi-document summarization with event-aspect model. In EMNLP ’11, pages 1137–1146, 2011.

[24]     A. S. Maiya, J. P. Thompson, F. Loaiza-Lemos, and R. M. Rolfe. Exploratory analysis of highly heterogeneous document collections. In KDD ’13, pages 1375–1383, 2013.

[25]     PNNL. Pacific Northwest National Laboratory, IN-SPIRE visual document analysis. Last accessed: May 26, 2011, http://in-spire.pnl.gov.

[26]     G.-J. Qi, C. Aggarwal, and T. Huang. Towards semantic knowledge propagation from text corpus to web images. In WWW ’11, pages 297–306, 2011.

[27]     B. Rafkind, M. Lee, S.-F. Chang, and H. Yu. Exploring text and image features to classify images in bioscience literature. In BioNLP ’06, pages 73–80, 2006.

[28]     D. Shahaf and C. Guestrin. Connecting the dots between news articles. In KDD ’10, pages 623–632, 2010.

[29]     Stanford Natural Language Processing Group. Stanford NER. Accessed: May 8, 2015, http://nlp.stanford.edu/software/CRF-NER.shtml.

[30]     Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR ’13, pages 3476–3483, 2013.

[31]     M. Turk and A. Pentland. Eigenfaces for recognition. Cognitive Neuroscience, 3(1):71–86, 1991.

[32]     University of Maryland. Global Terrorism Database. Accessed: May 8, 2015, http://www.start.umd.edu/gtd/.

[33]     M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In CVPR ’10, pages 2729–2736, 2010.

[34]     P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR ’01, volume 1, pages I–511, 2001.

[35]     M. Wu. Investigations on event-based summarization. In COLING ’06, pages 37–42, 2006.

[36]     J. Xu and T.-C. Lu. Seeing the big picture from microblogs: Harnessing social signals for visual event summarization. In IUI ’15, pages 62–66, 2015.

[37]     Q. Zhao, P. Mitra, and B. Chen. Temporal and information flow based event detection from social text streams. In AAAI ’07, volume 7, pages 1501–1506, 2007.