Goal Event Detection in Soccer Videos via Collaborative Multimodal Analysis

1* and Mandava Rajeswari2

1Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia

2School of Computer Sciences, Universiti Sains Malaysia, 11800 Minden, Penang, Malaysia

ABSTRACT

Keywords: indexing, webcasting-text

*Corresponding Author

repositories has spurred interest in automatic indexing and retrieval techniques, especially those that cater for content-based semantics

restricting the domain being addressed is, to some extent, imperative in order to bridge the semantic gap between low-level features and

INTRODUCTION

Technological advances have greatly enhanced broadcast, capture, transfer and storage of digital video (Tjondronegoro et al

sports has been used to extract important semantic concepts such as tennis serves and rallies (Huang et al et al

et al

posterity logging (Assfalg et al

running time and sparseness of event occurrences further complicate matters, where traditional

RELATED WORKS

A great body of literature has been dedicated to event or highlight detection in soccer, as well

et al

Jinjun et al

et al

is carried out using supervised learning algorithms, which discover the audiovisual patterns

patterns are not detected due to feature patterns being less prominent during event occurrences, et al

et al

1 et al

algorithms face the challenge of recognizing event patterns from the majority of non-event

As for supervised learning algorithms, their robustness may be questionable since for some

CONTRIBUTIONS

directly extracted from the video, an external textual resource was utilized to initiate event

considerations were solely made within each video itself, and without relying on any pre-

as the issue of the huge and asymmetric search space, are solved by utilizing the minute-by-

detailed and reliable annotations of a match’s progression, two crucial cues (namely, the

1

as by Changsheng et al et al

this study analyzed the visual and aural information only from the particular video under

All the audiovisual considerations and assumptions are uncomplicated and therefore able to

FRAMEWORK FOR GOAL EVENT DETECTION

Video Pre-processing

Shot Boundary Detection

et al

V into m shots, represented as $V = \{S_1, \ldots, S_i, \ldots, S_m\}$

far-views or close-up views

1. Dominant Hue Detection

peakidx

this range are detected, an immediate close-up view label can be assigned since it highly

2. peak peakidx idx was

determined with the optimal value for saturation

valueregion, morphological image processing and connected components analysis are applied (Halin et al

3. indicate close-up views, whereas smaller objects indicate far views

ratio alone, which, as suggested in Halin et al

et alclose-up view

as either a close-up or far view based on the majority voting of all the frame labels within
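The shot-level labeling by majority vote can be sketched as follows. This is a minimal illustration, not the authors' implementation: the label strings `"far"` and `"close-up"`, the function name `classify_shot`, and the tie-breaking rule are all assumptions introduced here.

```python
from collections import Counter


def classify_shot(frame_labels):
    """Label a shot by majority vote over its per-frame labels.

    frame_labels: iterable of "far" / "close-up" strings (hypothetical names).
    """
    counts = Counter(frame_labels)
    # Assumption: a tie is resolved in favour of the far view.
    return "close-up" if counts["close-up"] > counts["far"] else "far"


print(classify_shot(["far", "far", "close-up"]))        # far
print(classify_shot(["close-up", "close-up", "far"]))   # close-up
```

Voting over all frames makes the shot label tolerant of a few misclassified frames inside an otherwise consistent shot.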

Textual Cues Utilization

source, namely, the event name and its minute time-stamp

Goal Event Keyword Matching and Time Stamp Extraction

$$G = \{\text{goal!},\ \text{goal by},\ \text{scored},\ \text{scores},\ \text{own goal},\ \text{convert}\} \quad (1)$$

g, that has itime-stamp of each of the i

These can be written as a set $T^g = \{t_i^g\}$, where i

Then, for each $i$, the goal event search is initiated within the one-minute segment of each $t_i^g$
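Keyword matching and time-stamp extraction can be sketched as below. The keyword tuple mirrors the set G of Equation (1); everything else is an assumption made for illustration, in particular the webcast line layout (a minute stamp such as `56'` followed by commentary), the sample lines, and the function name `extract_goal_minutes`.

```python
import re

# Goal keyword set G, as listed in Equation (1).
GOAL_KEYWORDS = ("goal!", "goal by", "scored", "scores", "own goal", "convert")

# Assumed line layout: minute stamp, apostrophe, then the commentary text.
MINUTE_LINE = re.compile(r"^\s*(\d{1,3})'\s*(.*)$")


def extract_goal_minutes(webcast_lines):
    """Return the minute time-stamps of entries that contain a goal keyword."""
    minutes = []
    for line in webcast_lines:
        match = MINUTE_LINE.match(line)
        if match is None:
            continue  # skip lines without a minute stamp
        minute, text = int(match.group(1)), match.group(2).lower()
        if any(keyword in text for keyword in GOAL_KEYWORDS):
            minutes.append(minute)
    return minutes


# Invented sample entries, not taken from any real webcast.
sample = [
    "12' Corner kick taken by the home side.",
    "56' Goal! A low shot into the bottom corner.",
    "77' Substitution for the visitors.",
    "86' The striker scores from close range.",
]
print(extract_goal_minutes(sample))  # [56, 86]
```

Plain substring matching is used here for brevity; it also matches inflected forms such as "converted", which may or may not be desirable for a given text source.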

Text-Video Synchronization and Event Search Localization

The time-stamp $t_i^g$

mapping to the corresponding video frames can be erroneous due to the misalignment with the $t_i^g$ and its corresponding video

$t_{ref}$ and $f_{ref}$

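The values $t_{ref}$ and $f_{ref}$ can then be used to localize the event search to the one-minute segment. With $t_i^g$ being the minute time-stamp of the goal event, the beginning ($f_{i,begin}^g$) and ending ($f_{i,end}^g$) frames are

$$f_{i,begin}^g = f_{ref} + fr \cdot (t_i^g - t_{ref})$$

$$f_{i,end}^g = f_{i,begin}^g + 60 \cdot fr$$

where $fr$ is the video frame rate. Note that for $f_{i,begin}^g$, the time-stamp $t_i^g$ (after being converted to seconds) is first reduced by one minute, since an event stamped at minute $t_i^g$ occurs between minutes $t_i^g - 1$ and $t_i^g$. For $f_{i,end}^g$, $60 \cdot fr$ frames are added so that the segment ends one minute after $f_{i,begin}^g$. The one-minute segment therefore spans the frames $[f_{i,begin}^g, f_{i,end}^g]$.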
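The localization of the one-minute segment can be illustrated with a short sketch. It rests on two assumptions that should be checked against the original paper: that a minute stamp t covers match time from (t-1):00 to (t-1):59, and that the reference time is given in seconds of match clock. The function name and the example numbers are hypothetical.

```python
def localize_segment(t_goal_min, t_ref_s, f_ref, fr):
    """Return (f_begin, f_end) of the one-minute segment for a goal time-stamp.

    t_goal_min: minute time-stamp from the webcast text (e.g. 56)
    t_ref_s:    match-clock time of the reference point, in seconds
    f_ref:      video frame index at which that reference time is shown
    fr:         video frame rate, in frames per second
    """
    # Assumption: minute stamp t means the event lies in [(t-1)*60, t*60) seconds.
    t_begin_s = (t_goal_min - 1) * 60
    f_begin = f_ref + round(fr * (t_begin_s - t_ref_s))
    # The segment ends one minute (60 * fr frames) after it begins.
    f_end = f_begin + round(fr * 60)
    return f_begin, f_end


# Reference: match clock 02:00 is visible at frame 4500 of a 25 fps video.
print(localize_segment(56, 120, 4500, 25))  # (84000, 85500)
```

Anchoring on a reference frame rather than on the start of the file absorbs pre-match footage of unknown length, which is the misalignment the text refers to.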

Candidate Shortlist Generation

goal segments within gi

broadcasters (including the footage used in this paper), three generic visual-related properties

These properties were exploited to decompose the one-minute segment into a shortlist of

1. The camera transitions from a far-view to a close-up view;

2. Close-up views during goals normally last at least 6 seconds;

has already been localized to the one-minute eventful segment, detecting other events is very

the n-number of candidate segments is generated from gi

$$C_i^g = \{c_{ik}^g\}, \quad \text{for } k = 1, \ldots, n$$

where $C_i^g$ is the set containing the shortlisted candidates, and $c_{ik}^g$ is the $k$-th candidate segment within $C_i^g$.
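A shortlist built from the two visual properties that are legible in this text (a far-view followed by a close-up, with the close-up lasting at least 6 seconds) might look like the sketch below. It is illustrative only: the shot tuple layout, the function name, and the choice to let a candidate span the far-view shot plus the following close-up are assumptions, and any further property used by the authors is not modelled.

```python
def shortlist_candidates(shots, min_closeup_s=6.0):
    """Return candidate (start_s, end_s) spans inside a one-minute segment.

    shots: list of (label, start_s, end_s) tuples in temporal order, where
           label is "far" or "close-up" (hypothetical representation).
    """
    candidates = []
    for prev, cur in zip(shots, shots[1:]):
        prev_label, prev_start, _ = prev
        cur_label, cur_start, cur_end = cur
        is_transition = prev_label == "far" and cur_label == "close-up"
        long_enough = (cur_end - cur_start) >= min_closeup_s
        if is_transition and long_enough:
            # Assumption: a candidate covers the far view and its close-up.
            candidates.append((prev_start, cur_end))
    return candidates


# Invented shot list for one localized segment (times in seconds).
shots = [
    ("far", 0, 14),
    ("close-up", 14, 17),   # too short, rejected
    ("far", 17, 30),
    ("close-up", 30, 41),   # 11 s close-up after a far view, kept
    ("far", 41, 60),
]
print(shortlist_candidates(shots))  # [(17, 41)]
```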

Candidate Ranking

At this stage, we have obtained the candidate segments $c_{ik}^g$, where one of them is the actual

needed from each $c_{ik}^g$

pitch, or the fundamental frequency (f0) of an audio signal, is reliable for detecting excited human

is called shrp.m

on the other hand, was chosen as it managed to accurately capture the average measurement

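The resulting per-candidate f0 measure is denoted $f_{ik}^g$.

The rule applied here is that the candidate containing the goal event will show raised commentator f0 values across its audio frames, leading to a high $f_{ik}^g$. The candidate $c^*$ with the maximum $f_{ik}^g$ is therefore selected.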
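The ranking rule can be sketched as follows, assuming the f0 values of each candidate have already been extracted per audio frame by some pitch tracker. Using the arithmetic mean as the per-candidate score is an assumption based on the reference to an average measurement; the function name, candidate names, and f0 values are invented for illustration.

```python
from statistics import mean


def rank_candidates(f0_by_candidate):
    """Score each candidate by its mean f0 and return (best, scores).

    f0_by_candidate: dict mapping a candidate name to a list of f0 values
                     in Hz, one per audio frame (hypothetical layout).
    """
    # Candidates without any voiced frame are skipped.
    scores = {name: mean(values)
              for name, values in f0_by_candidate.items() if values}
    best = max(scores, key=scores.get)
    return best, scores


# Invented f0 tracks for three candidates of one time-stamp.
f0_tracks = {
    "c1": [260.0, 270.0, 265.0],
    "c2": [300.0, 310.0, 290.0],
    "c3": [275.0, 285.0],
}
best, scores = rank_candidates(f0_tracks)
print(best, scores[best])  # c2 300.0
```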

EXPERIMENTAL RESULTS AND DISCUSSION

Note that for the following sub-sections, including Candidate Shortlist Generation, the evaluation criteria used are precision and recall

Sub-section Candidate Shortlist Generation to cater for each of these contexts, which will be further explained in detail within each of the

subsets from different matches were used to demonstrate the robustness of the algorithm across Precision and recall

true positives, false positives and false negatives are explained, supposing that the positive class being predicted is a far-view

True positive: a shot classified as far-view when the actual class is indeed a far-view;

False positive: a shot classified as far-view when the actual class is a close-up view;

False negative: a shot classified as close-up view when the actual class is a far-view.

$$\text{Precision} = \frac{\#\,\text{true positives}}{\#\,\text{true positives} + \#\,\text{false positives}} \quad (5)$$

$$\text{Recall} = \frac{\#\,\text{true positives}}{\#\,\text{true positives} + \#\,\text{false negatives}} \quad (6)$$
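As a quick worked illustration of the two measures, the sketch below computes precision and recall from raw counts. The counts are made up for the example and are not results from the paper; the function name is likewise hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts: 90 far views found correctly, 10 close-ups mislabeled
# as far views, and 5 far views mislabeled as close-ups.
p, r = precision_recall(tp=90, fp=10, fn=5)
print(f"precision={p:.4f} recall={r:.4f}")  # precision=0.9000 recall=0.9474
```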

The results are encouraging where very high recalls

Shot Type # of shots Precision

Close-up view

Average 98.27% 96.27%

Average 91.65% 96.49%

Candidate Shortlist Generation gikc

precision and recall are

Relevant refers to the number of candidate segments generated that actually Retrieved refers to the total number of candidates generated based on the

$$\text{Precision} = \frac{\#(\text{relevant} \cap \text{retrieved})}{\#\,\text{retrieved}} \quad (7)$$

$$\text{Recall} = \frac{\#(\text{relevant} \cap \text{retrieved})}{\#\,\text{relevant}} \quad (8)$$

can be observed that the Average Number of Candidates per Shortlist

Sub-section Candidate Ranking), the actual segment can still be retrieved without the need

recallcases since it is mandatory that an actual goal event segment be present within each of the

Candidate Ranking

$n$ represents the number of candidate segments $c_{ik}^g$ for the $i$-th time-stamp. The $f_{ik}^g$ for each $k$-th candidate ($k = 1, \ldots, n$) is recorded in the sub-columns of column 5, where the numeric boldface values indicate the maximum $f_{ik}^g$ of that time-stamp, which is the top-

X

shown in 1

numbergT N

( g g

ik inf f )

11gt 285.55gt = 56 290.08gt = 86 271.64

2 11gt 301.55

3 5

1gt 264.59gt 1 275.99gt 273.13gt = 77 267.05

5gt = 91 293.08

4 11gt 305.79

5 1gt 284.33gt = 95 281.00

6 1gt 299.02gt 277.35

7 1gt 290.87gt 286.89

8 11gt 295.47

9

1gt 279.46,gt 1 283.63gt 274.09gt 278.12

10 5

1gt = 18 298.86gt 301.02gt 279.22gt = 51 300.38

5gt = 68 298.66

11 1gt = 56 281.59gt 2 295.39

COMPARISON

et al

replay shot, and the replay must directly follow the close-up shot;

et al

et al

Note that the approach proposed in this paper only considers two shot labels: the far and close-up views

were able to obtain

12 5

1gt 281.84gt 272.70gt 273.82gt = 56 306.16

5gt 280.70

131gt = 57 300.00gt 270.98gt 293.71 X

and depend on viewers’ preferences (Changsheng et al

5

matches shown in 1

number TruthPrecision

5

CT

Proposed

10

CT

5

55

7Proposed

5

12

CT

5

5 85

Proposed5

CONCLUSION

distinctly identify event occurrences and to localize the video search space to only relevant and

REFERENCES

Online, simultaneous shot boundary detection and key frame extraction for sports videos using rank tracing.

Computer Vision and Image Understanding, 92,

Sport news images

IEEE Transactions on Multimedia, 10,


Journal of Information Science and Engineering, 24,


Unsupervised soccer video abstraction based on pitch, dominant color and camera motion analysis.

IEEE Transactions on Image Processing, 12,

Soccer video summarization using enhanced logo detection.

sports video.

Expert Systems with Applications, 36,

Sports highlight detection from keyword sequences using HMM

IEEE Transactions on Circuits and Systems for Video Technology, 14,

Hierarchical temporal association mining for video event detection in video databases.

IEEE Signal Processing Magazine, 23,

Audio-visual football video analysis, from structure detection to attention analysis.


A decision tree-based multimodal data mining framework for soccer goal detection.

The authoring metaphor to machine understanding of multimedia

Time interval maximum entropy based event indexing in soccer video.

Live Match

Content-based video indexing for sports applications using multi-modal approach

ACM Transactions on Multimedia Computing, Communications, and Applications, 4,

UEFA Champions League, Match Season 2011.

Algorithms and system for segmentation and structure analysis in soccer video.

Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio.

IEEE Signal Processing Magazine, 17

Goal event detection in broadcast soccer videos by combining heuristic rules with unsupervised fuzzy c-means algorithm.

