Goal Event Detection in Soccer Videos via Collaborative Multimodal Analysis
1* and Mandava Rajeswari2
1Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia
2School of Computer Sciences, Universiti Sains Malaysia, 11800 Minden, Penang, Malaysia
ABSTRACT
Keywords: indexing, webcasting-text
*Corresponding Author
repositories has spurred interest in automatic indexing and retrieval techniques, especially those that cater for content-based semantics
restricting the domain being addressed is, to some extent, imperative in order to bridge the semantic gap between low-level features and
INTRODUCTION
Technological advances have greatly enhanced broadcast, capture, transfer and storage of digital video (Tjondronegoro et al
sports has been used to extract important semantic concepts such as tennis serves and rallies (Huang et al et al
et al
posterity logging (Assfalg et al
running time and sparseness of event occurrences further complicate matters, where traditional
RELATED WORKS
A great body of literature has been dedicated to events or highlights detection in soccer, as well
et al
Jinjun et al
et al
is carried out using supervised learning algorithms, which discovers the audiovisual patterns
patterns are not detected due to feature patterns being less prominent during event occurrences, et al
et al
1 et al
algorithms face the challenge of recognizing event patterns from the majority of non-event
As for supervised learning algorithms, their robustness may be questionable since for some
CONTRIBUTIONS
directly extracted from the video, an external textual resource was utilized to initiate event
considerations were solely made within each video itself, and without relying on any pre-
as the issue of the huge and asymmetric search space, are solved by utilizing the minute-by-
detailed and reliable annotations of a match’s progression, two crucial cues (namely, the
1
as by Changsheng et al et al
this study analyzed the visual and aural information only from the particular video under
All the audiovisual considerations and assumptions are uncomplicated and therefore able to
FRAMEWORK FOR GOAL EVENT DETECTION
Video Pre-processing
Shot Boundary Detection
et al
V into m shots, represented as $V = \{S_1, \ldots, S_i, \ldots, S_m\}$
far-views or close-up views
1. Dominant Hue Detection
peakidx
this range are detected, an immediate close-up view label can be assigned since it highly
2. peak peakidx idx was
determined with the optimal value for saturation
valueregion, morphological image processing and connected components analysis are applied (Halin et al
3. indicate close-up views, whereas smaller objects indicate far views
et al et al et al ratio alone, which as suggested in Halin et al
et al close-up view
as either a close-up or far view based on the majority voting of all the frame labels within
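The frame-level and shot-level steps above can be illustrated with a minimal sketch. It assumes hue values in degrees (0 to 360) have already been extracted per pixel, and that every frame already carries a "far" or "close-up" label. The function names, the bin count and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def dominant_hue_peak(hues, bins=36, hue_max=360.0):
    """Return (peak bin index, peak bin centre) of a hue histogram.

    `hues` is any iterable of hue values in [0, hue_max].
    The bin count is a hypothetical choice for illustration.
    """
    counts = [0] * bins
    for h in hues:
        # Clamp so that h == hue_max falls into the last bin.
        idx = min(int(h * bins / hue_max), bins - 1)
        counts[idx] += 1
    peak_idx = max(range(bins), key=counts.__getitem__)
    width = hue_max / bins
    return peak_idx, (peak_idx + 0.5) * width

def label_shot(frame_labels):
    """Label a shot by majority vote over its frame labels.

    Ties resolve to the label encountered first (an assumption;
    the source does not say how ties are handled).
    """
    return Counter(frame_labels).most_common(1)[0][0]

if __name__ == "__main__":
    print(dominant_hue_peak([95, 97, 99, 300]))
    print(label_shot(["far", "far", "close-up"]))
```

A frame whose histogram peak falls outside the expected pitch-hue range would be labelled close-up immediately, as the text describes; the range itself is not recoverable from this section and is therefore left out of the sketch.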
Textual Cues Utilization
source, namely, the event name and its minute time-stamp
Goal Event Keyword Matching and Time Stamp Extraction
$$G = \{\text{goal!}, \text{goal by}, \text{scored}, \text{scores}, \text{own goal}, \text{convert}\} \qquad (1)$$
g, that has itime-stamp of each of the i
These can be written as a set $T^{g} = \{t_i^{g}\}$, where i
Then, for each i, the goal event search is initiated within the one-minute segment of each $t_i^{g}$
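As a rough illustration of the keyword matching and time-stamp extraction, the sketch below scans minute-by-minute commentary lines for the keywords of Equation (1) and collects the matching minute time-stamps. The line layout (`<minute>' <text>`), the regular expression and the function name are assumptions made for the example; the actual webcasting-text format is not given in this section.

```python
import re

# Keyword set G from Equation (1), lower-cased for matching.
GOAL_KEYWORDS = ("goal!", "goal by", "scored", "scores", "own goal", "convert")

# Hypothetical line layout: "23' Some commentary", optionally "90+2' ...".
LINE_PATTERN = re.compile(r"^\s*(\d+)(?:\+\d+)?'\s*(.*)$")

def extract_goal_minutes(webcast_lines):
    """Return the minute time-stamps of lines that contain a goal keyword."""
    minutes = []
    for line in webcast_lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # line does not follow the assumed layout
        minute = int(match.group(1))
        text = match.group(2).lower()
        if any(keyword in text for keyword in GOAL_KEYWORDS):
            minutes.append(minute)
    return minutes

if __name__ == "__main__":
    sample = [
        "12' Foul by Smith.",
        "23' Goal! Jones scores from close range.",
        "90+2' Own goal by Brown.",
    ]
    print(extract_goal_minutes(sample))
```

Substring matching is used deliberately so that "convert" also matches "converts" and "converted"; a stricter tokenised match would be an equally valid design.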
Text-Video Synchronization and Event Search Localization
The time-stamp $t_i^{g}$
mapping to the corresponding video frames can be erroneous due to the misalignment with the $t_i^{g}$ and its corresponding video
$t^{ref}$ and $f^{ref}$
The values $t^{ref}$ and $f^{ref}$ can then be used to localize the event search to the one-minute $t_i^{g}$ being the minute time-stamp of the goal event, the beginning
($f_{i,begin}^{g}$) and ending ($f_{i,end}^{g}$)
$$f_{i,begin}^{g} = f^{ref} - fr \times (t^{ref} - t_i^{g})$$
$$f_{i,end}^{g} = f_{i,begin}^{g} + (60 \times fr)$$
where fr is the video frame rate and
Note that for $f_{i,begin}^{g}$, the time-stamp $t_i^{g}$ (after being converted to seconds) is subtracted by 1
for $f_{i,end}^{g}$
one-minute after $f_{i,begin}^{g}$
so that the one-minute segment spans the frames from $f_{i,begin}^{g}$ to $f_{i,end}^{g}$
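The localization step can be sketched as follows, under stated assumptions: the reference time is expressed in seconds of match clock, the reference frame is the video frame at which that clock value is observed, and the goal minute is decremented by one before conversion to seconds (a goal stamped at minute 10 is taken to occur between 9:00 and 10:00). These readings, along with the function and parameter names, are interpretations of the fragmentary text rather than the authors' confirmed formulation.

```python
def localize_goal_segment(goal_minute, t_ref_s, f_ref, fps):
    """Return (begin_frame, end_frame) of the one-minute search segment.

    goal_minute: minute time-stamp of the goal from the webcasting text
    t_ref_s:     reference match-clock time in seconds (assumed unit)
    f_ref:       video frame index at which t_ref_s is observed
    fps:         video frame rate in frames per second
    """
    # Assumption: the stamped minute marks the END of the eventful minute.
    t_begin_s = (goal_minute - 1) * 60
    f_begin = f_ref + round(fps * (t_begin_s - t_ref_s))
    # The segment ends one minute (60 * fps frames) after it begins.
    f_end = f_begin + round(fps * 60)
    return f_begin, f_end

if __name__ == "__main__":
    # Match clock 0:00 seen at frame 1000, 25 fps video, goal at minute 10.
    print(localize_goal_segment(10, 0, 1000, 25))
```

Anchoring every time-stamp to a single reference pair keeps the mapping robust to pre-match footage of arbitrary length, which is the misalignment problem the text refers to.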
Candidate Shortlist Generation
goal segments within gi
broadcasters (including the footage used in this paper), three generic visual-related properties
These properties were exploited to decompose the one-minute segment into a shortlist of
1. The camera transitions from a far-view to a close-up view
2. Close-up views during goals normally last at least 6-seconds;
has already been localized to the one-minute eventful segment, detecting other events is very
the n-number of candidate segments is generated from gi
$$C_i^{g} = \{c_{ik}^{g}\} \quad \text{for } k = 1, \ldots, n$$
where $C_i^{g}$ is the set containing the shortlisted candidates, and $c_{ik}^{g}$ is the k-th candidate segment within $C_i^{g}$
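A minimal sketch of the shortlist generation is given below. It uses only the two properties that survive in this section (a far-view to close-up transition, and a close-up lasting at least 6 seconds); the third property is not recoverable here and is omitted. The `Shot` structure, the choice to let a candidate span the far-view shot plus the following close-up, and all names are illustrative assumptions.

```python
from typing import List, NamedTuple, Tuple

class Shot(NamedTuple):
    label: str   # "far" or "close-up"
    start: int   # first frame of the shot
    end: int     # frame just past the last frame (exclusive)

def shortlist_candidates(shots: List[Shot], fps: float,
                         min_closeup_s: float = 6.0) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs for candidate segments."""
    candidates = []
    for prev, cur in zip(shots, shots[1:]):
        is_transition = prev.label == "far" and cur.label == "close-up"
        closeup_seconds = (cur.end - cur.start) / fps
        if is_transition and closeup_seconds >= min_closeup_s:
            # Assumed extent: far-view shot plus the close-up that follows.
            candidates.append((prev.start, cur.end))
    return candidates

if __name__ == "__main__":
    shots = [
        Shot("far", 0, 250),
        Shot("close-up", 250, 300),   # 2 s at 25 fps: too short
        Shot("far", 300, 600),
        Shot("close-up", 600, 800),   # 8 s at 25 fps: qualifies
    ]
    print(shortlist_candidates(shots, fps=25))
```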
Candidate Ranking
At this stage, we have obtained the candidate segments $c_{ik}^{g}$, where one of them is the actual
needed from each $c_{ik}^{g}$
pitch, or the fundamental frequency (f0) of an audio signal, is reliable for detecting excited human
is called shrp.m
on the other hand, was chosen as it managed to accurately capture the average measurement
ff
g
ikf
The rule being applied here is that the candidate with the goal event will cause commentator f0 values across audio frames, leading
to high $f_{ik}^{g}$ c*, with the maximum $f_{ik}^{g}$
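The ranking rule can be sketched as below, assuming the per-frame f0 values of each candidate have already been estimated by a pitch tracker (the text mentions shrp.m; the tracking itself is not reproduced here). Averaging f0 over audio frames and taking the maximum follows the description above; the data layout and function name are hypothetical.

```python
from statistics import mean

def rank_candidates(f0_by_candidate):
    """Rank candidates by mean f0, highest first.

    f0_by_candidate maps a candidate id to its per-audio-frame f0
    values in Hz. Candidates with no f0 values are skipped.
    """
    scores = {cid: mean(values)
              for cid, values in f0_by_candidate.items() if values}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    ranked = rank_candidates({
        "c1": [250.0, 270.0],
        "c2": [300.0, 310.0],
        "c3": [],
    })
    top_candidate, top_score = ranked[0]
    print(top_candidate, top_score)
```

The first element of the returned list plays the role of c*, the candidate taken to contain the goal.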
EXPERIMENTAL RESULTS AND DISCUSSION
Note that for the following sub-sections of and Candidate Shortlist Generation, the evaluation criteria used are precision and recall
Sub-section Candidate Shortlist Generation to cater for each of these contexts, which will be further explained in detail within each of the
subsets from different matches were used to demonstrate the robustness of the algorithm across Precision and recall
true positives, false positives and false negatives are explained, supposing that the positive class being predicted is a far-view
True positive: a far-view is predicted when the actual class is indeed a far-view;
False positive: a far-view is predicted when the actual class is a close-up view;
False negative: a close-up view is predicted when the actual class is a far-view
$$\text{Precision} = \frac{\#\,\text{true positives}}{\#\,\text{true positives} + \#\,\text{false positives}} \qquad (5)$$
$$\text{Recall} = \frac{\#\,\text{true positives}}{\#\,\text{true positives} + \#\,\text{false negatives}} \qquad (6)$$
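For concreteness, Equations (5) and (6) can be computed as in the short sketch below. The counts in the usage example are made up for illustration and are not results from the paper; returning 0.0 for an empty denominator is a convention chosen here, not something the source specifies.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Compute (precision, recall) from raw counts, per Equations (5) and (6)."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    # Convention (assumption): report 0.0 when a denominator is zero.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

if __name__ == "__main__":
    # Illustrative counts only: 90 far-views correctly predicted,
    # 10 close-ups mislabelled as far, 30 far-views mislabelled as close-up.
    print(precision_recall(90, 10, 30))
```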
The results are encouraging where very high recalls
Shot Type # of shots Precision
Close-up view
Average 98.27% 96.27%
Average 91.65% 96.49%
Candidate Shortlist Generation gikc
precision and recall are
Relevant refers to the number of candidate segments generated that actually Retrieved refers to the total number of candidates generated based on the
$$\text{Precision} = \frac{\#(\text{relevant} \cap \text{retrieved})}{\#\,\text{retrieved}} \qquad (7)$$
$$\text{Recall} = \frac{\#(\text{relevant} \cap \text{retrieved})}{\#\,\text{relevant}} \qquad (8)$$
can be observed that the Average Number of Candidates per Shortlist
Sub-section Candidate Ranking), the actual segment can still be retrieved without the need
recallcases since it is mandatory that an actual goal event segment be present within each of the
Candidate Ranking
Average Number of Candidates per Shortlist, Precision
Candidate Ranking
n represents the number of candidate segments $c_{ik}^{g}$
i-th $f_{ik}^{g}$ for each k-th (k = 1, ..., n) is recorded in the sub-columns of column 5, where
the numeric boldface values indicate the maximum $f_{ik}^{g}$ of that time-stamp, which is the top-
X
shown in 1
numbergT N
( g g
ik inf f )
11gt 285.55gt = 56 290.08gt = 86 271.64
2 11gt 301.55
3 5
1gt 264.59gt 1 275.99gt 273.13gt = 77 267.05
5gt = 91 293.08
4 11gt 305.79
5 1gt 284.33gt = 95 281.00
6 1gt 299.02gt 277.35
7 1gt 290.87gt 286.89
8 11gt 295.47
9
1gt 279.46,gt 1 283.63gt 274.09gt 278.12
10 5
1gt = 18 298.86gt 301.02gt 279.22gt = 51 300.38
5gt = 68 298.66
11 1gt = 56 281.59gt 2 295.39
COMPARISON
et al
replay shot, and the replay must directly follow the close-up shot;
et al
et al
Note that the approach proposed in this paper only considers two shot labels; the far and close-up replay
were able to obtain
12 5
1gt 281.84gt 272.70gt 273.82gt = 56 306.16
5gt 280.70
131gt = 57 300.00gt 270.98gt 293.71 X
and depend on viewers’ preferences (Changsheng et al
5
matches shown in 1
number TruthPrecision
5
CT
Proposed
10
CT
5
55
7Proposed
5
12
CT
5
5 85
Proposed5
CONCLUSION
distinctly identify event occurrences and to localize the video search space to only relevant and
REFERENCES
Online, simultaneous shot boundary detection and key frame extraction for sports videos using rank tracing.
Computer Vision and Image Understanding, 92,
Sport news images
IEEE Transactions on Multimedia, 10,
IEEE Transactions on Multimedia, 10,
Journal of Information Science and Engineering, 24,
IEEE Transactions on Multimedia, 8,
Unsupervised soccer video abstraction based on pitch, dominant color and camera motion analysis.
IEEE Transactions on Image Processing, 12,
Soccer video summarization using enhanced logo detection.
sports video.
Expert Systems with Applications, 36,
Sports highlight detection from keyword sequences using HMM
IEEE Transactions on Circuits and Systems for Video Technology, 14,
Hierarchical temporal association mining for video event detection in video databases.
IEEE Signal Processing Magazine, 23,
Audio-visual football video analysis, from structure detection to attention analysis.
IEEE Transactions on Circuits and Systems for Video Technology, 15,
A decision tree-based multimodal data mining framework for soccer goal detection.
The authoring metaphor to machine understanding of multimedia
Time interval maximum entropy based event indexing in soccer video.
Live Match
Content-based video indexing for sports applications using multi-modal approach
ACM Transactions on Multimedia Computing, Communications, and Applications, 4,
UEFA Champions League, Match Season 2011.
Algorithms and system for segmentation and structure analysis in soccer video.
Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio.
IEEE Signal Processing Magazine, IEEE, 17
Goal event detection in broadcast soccer videos by combining heuristic rules with unsupervised fuzzy c-means algorithm.
IEEE Transactions on Circuits and Systems for Video Technology, 17,