
Perception of differences in naturalistic dynamic scenes, and a V1-based model

Michelle P. S. To, Department of Psychology, Lancaster University, Lancaster, UK

Iain D. Gilchrist, School of Experimental Psychology, University of Bristol, Bristol, UK

David J. Tolhurst, Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, UK

Citation: To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2015). Perception of differences in naturalistic dynamic scenes, and a V1-based model. Journal of Vision, 15(1):19, 1–13. http://www.journalofvision.org/content/15/1/19, doi:10.1167/15.1.19. Received August 18, 2014; published January 16, 2015.

We investigate whether a computational model of V1 can predict how observers rate perceptual differences between paired movie clips of natural scenes. Observers viewed 198 pairs of movie clips, rating how different the two clips appeared to them on a magnitude scale. Sixty-six of the movie pairs were naturalistic, and those remaining were low-pass or high-pass spatially filtered versions of those originals. We examined three ways of comparing a movie pair. The Spatial Model compared corresponding frames between each movie pairwise, combining those differences using Minkowski summation. The Temporal Model compared successive frames within each movie, summed those differences for each movie, and then compared the overall differences between the paired movies. The Ordered-Temporal Model combined elements from both models and yielded the single strongest predictions of observers' ratings. We modeled naturalistic sustained and transient impulse functions, and compared frames directly with no temporal filtering. Overall, modeling naturalistic temporal filtering improved the models' performance; in particular, the predictions of the ratings for low-pass spatially filtered movies were much improved by employing a transient impulse function. The correlations between model predictions and observers' ratings rose from 0.507 without temporal filtering to 0.759 (p = 0.01%) when realistic impulses were included. The sustained impulse function and the Spatial Model carried more weight in ratings for normal and high-pass movies, whereas the transient impulse function with the Ordered-Temporal Model was most important for spatially low-pass movies. This is consistent with models in which high spatial frequency channels with sustained responses primarily code for spatial details in movies, while low spatial frequency channels with transient responses code for dynamic events.

Introduction

Our perception of the world is a product of the patterns of neural activation across the sensory system. In vision, there has been extensive research examining the pattern of activity of V1 neurons to different types of visual stimuli, from simplistic gratings to more complex natural images. Several computational models have been proposed to simulate how the visual system responds to static natural-image stimuli. Although the coding of motion and temporal change by single neurons has been well studied, it is only relatively recently that the perception of naturalistic movies has been the subject of psychophysical investigation (e.g., Hasson, Nir, Levy, Fuhrmann, & Malach, 2004; Hirose, Kennedy, & Tatler, 2010; Itti, Dhayale, & Pighin, 2003; Smith & Mital, 2013; Troscianko, Meese, & Hinde, 2012; Wang & Li, 2007; Watson, 1998).

Previously, we have shown that a V1-based Visual Difference Predictor (VDP) model, derived from one used to explain the contrast detection thresholds of pairs of sine-wave gratings (Watson & Solomon, 1997), can generate adequate predictions of how observers perceive differences between pairs of static natural scenes (To, Baddeley, Troscianko, & Tolhurst, 2011; To, Lovell, Troscianko, & Tolhurst, 2010; Tolhurst et al., 2010). We presented observers with pairs of photographically derived still images and asked them to rate the perceived difference between the two images. We then built different models that compared the paired images in a head-to-head fashion, and found that they were moderately successful in predicting observers' ratings for a large variety of forms of image difference.


The models were most successful when they included physiologically realistic mechanisms such as nonspecific suppression or contrast normalization (Heeger, 1992), surround suppression (Blakemore & Tobin, 1972), and Gabor receptive fields that are elongated and whose bandwidth changes with best spatial frequency (Tolhurst & Thompson, 1981). A model based on complex cells was more successful than one based on simple cells. Here we ask whether such models can be extended to the perception of natural-image movie stimuli.

These visual discrimination models can generate decent predictions of how well the visual system can discriminate between static images. However, the visual environment is obviously in constant flux, and natural scenes are rarely still. It is therefore important to investigate whether such modeling can be extended to dynamic movie-clip pairs. Modeling the task of discriminating between two naturalistic movie clips needs extra consideration over the task of discriminating two static images.

First, since movie clips are composed of multiple frames, observers must first process all the frames within each movie separately and must then compare some percept or memory of the frames between the two movies. While comparing static natural scenes requires only one head-to-head computational comparison, in the case of comparing movies that are composed of numerous frames, multiple comparisons need to be made and many comparison rules are possible. It is possible that observers might compare the frames between movies frame by frame; this is suggested by Watson (1998) and Watson, Hu, and McGowan (2001) in their biologically inspired model for evaluating the degree of perceived distortion caused by lossy compression of video streams. Observers might also compare the frames within each movie clip separately before comparing the two clips, perhaps by coding object movement within the movies, as has been suggested for enhanced measures of video distortion (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). It is important to identify the best rule to describe how movie frames across movies are compared.

Second, when constructing a V1-based computational model of visual discrimination of dynamic images, one must also obviously consider the temporal sensitivity of the visual system. One hypothesis has it that, in the early stages of the visual system, there are two parallel pathways. The transient or magnocellular (M-cell) pathway acts primarily on low spatial frequencies, while the sustained or parvocellular (P-cell) pathway acts primarily on middle and high spatial frequencies (Derrington & Lennie, 1984; Gouras, 1968; Hess & Snowden, 1992; Horiguchi, Nakadomari, Misaki, & Wandell, 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). The two pathways have been proposed to convey different perceptual information about temporal and spatial structure (Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973).

In the experiment described here, we presented observers with pairs of stimuli derived from achromatic movies of natural scenes and asked them to rate how different these clips appeared to them. We have examined different derivations of our V1-based VDP model for static images, to accommodate observers' magnitude estimation ratings of dynamic natural scenes. We examine the different contributions of the transient low spatial-frequency channels compared to the sustained high spatial-frequency channels. We also examine different possible strategies for breaking the comparison of two movies into multiple head-to-head comparisons between pairs of frames. Some of the results have been reported briefly (To, Gilchrist, & Tolhurst, 2014; To et al., 2012).

Methods

Observers viewed monochrome movie clips presented on a 19-in. Sony CRT display at 57 cm. The movie frames were 240 pixels square, subtending 12° square in the center of the screen. Apart from that central square, the rest of the 800 × 600 pixels of the display were mid-gray (88 cd/m²). In the times between movie clips, the whole screen was maintained at that same gray. Presentation was through a ViSaGe system (Cambridge Research Systems, Rochester, UK) so that nonlinearities in the display's luminance output were corrected without loss of bit resolution. The display frame rate was 120 fps (frames per second).

In a single trial, the observers viewed two slightly different movie clips successively, and they were asked to give a numerical magnitude rating of how large they perceived the differences between the clips to be. Each clip usually lasted 1.25 s, and there was a gap of 80 ms between the two clips in a trial.

Experimental procedure

There were 198 movie clip pairs (see below), and they were presented once each to each of seven observers. In addition, a standard movie pair, described below, was presented after every ten test trials; the perceived magnitude difference of this pair was deemed to be "20". Observers were instructed to rate how different pairs of movies appeared to them. They were asked to base their ratings on their overall general visual experience and were not given any specific instructions on what aspects of the videos (e.g., spatial, temporal, and/or spatio-temporal) they should rely on. They were told that, if they perceived the difference between a test movie pair to be less than, equal to, or greater than that of the standard pair, their rating should obviously be less than, equal to, or greater than 20, respectively. They were to use a ratio scale, so that if, for instance, a given movie pair seemed to have a difference twice as great as that of the standard pair, they would assign a value twice as large to that pair (i.e., 40). No upper limit was set, so that observers could rate large differences as highly as they saw fit. Observers were also told that sometimes movie pairs might be identical, in which case they should set the rating to zero (in fact, all the movie pairs did differ to some extent).

Before an experiment, each observer underwent a training session in which they were asked to rate 50 pairs of movie clips containing various types of differences that could be presented to them later in the experiment. All movie clips (apart from the standard movie pair) used in the demonstration or training phases were made from three original movies that were different from the seven used in the testing phase proper.

The testing phase was divided into three blocks of 66 image pairs. Each block lasted about 15 min and started with the presentation of the standard movie pair, which was subsequently presented after every 10 trials to remind the observers of the standard difference of 20. The image presentation sequence was randomized differently for each observer.

Data collation and analysis

For each observer, the 198 magnitude ratings were normalized by dividing by that observer's average rating. Then, for each stimulus pair, the normalized ratings of the seven observers were averaged together. These 198 averages of normalized ratings were then multiplied by a single scalar value to bring their grand average up to the same value as the grand average of all the ratings given by all the observers in the whole experiment. Our experience (To, Lovell, Troscianko, & Tolhurst, 2010) is that, generally, averages of normalized ratings have a lower coefficient of variation than do averages of raw ratings.
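This collation reduces to three array operations. A minimal numpy sketch, assuming the ratings arrive as a 7 × 198 array (the function and argument names are ours, not the authors'):

```python
import numpy as np

def collate_ratings(raw):
    """Average normalized difference ratings across observers.

    raw: (n_observers, n_pairs) array of magnitude ratings,
    here 7 observers x 198 movie pairs.
    """
    raw = np.asarray(raw, dtype=float)
    # 1. Normalize each observer's ratings by that observer's mean.
    normalized = raw / raw.mean(axis=1, keepdims=True)
    # 2. Average the normalized ratings across observers, pair by pair.
    averaged = normalized.mean(axis=0)
    # 3. Rescale so the grand average matches that of the raw ratings.
    return averaged * (raw.mean() / averaged.mean())
```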

Observers

The experiment tested seven observers, recruited from the student and postdoctoral researcher populations at the University of Cambridge, UK. They all gave informed consent and had normal vision after prescription correction, as verified using a Landolt C acuity chart and the Ishihara color test (10th edition).

Construction of stimuli

Original movies

The monochrome test stimuli were constructed from six original color video sequences, each lasting 10 s. Three of the stimulus movies were of "environmental movement", e.g., grass blowing or water rippling or splashing. One was of a duck swimming on a pond. The other two were of people's hands: one person was peeling a potato, and the other was demonstrating sign language for the deaf. A seventh video sequence was used to make a standard video pair.

The video sequences were originally taken at 200 fps with a Pulnix TMC-6740 CCD camera (JAI-Pulnix A/S, Copenhagen, Denmark; Lovell, Gilchrist, Tolhurst, & Troscianko, 2009). Each frame was saved as a separate uncompressed image, 10 bits in each of three color planes, on a Bayer matrix of 640 × 480 pixels. Each frame image was converted to a floating-point 640 × 480 pixel RGB image file, with values corrected for any nonlinearities in the luminance-versus-output relation. Each image was then reduced to 320 × 240 pixels by taking alternate rows and columns (the native pixelation of R and B in the Bayer matrix) and was converted to monochrome by averaging together the three color planes. Finally, the images were cropped to 240 × 240 pixels. Some very bright highlights were removed by clipping the brightest image pixels to a lower value.

Stimulus movie clips

From each of the 10-s, 200-fps movies, we took two nonoverlapping sequences of 300 frames; these will be referred to as clips A and B. For each of the six testing movies, the 300-frame clips A and B were each subject to six temporal manipulations, shown schematically in Figure 1 and sketched as frame-index sequences in the code after the list below.

A. Alternate frames were simply discarded to leave 150 frames, to be displayed at 120 fps (Figure 1A). These had natural movement. These clips were the references for pairwise rating comparison with the modified clips below.

B. A glitch or pause was inserted midway, by repeating one frame image near the middle of the movie for a number of frames before jumping back to the original linear sequence of frames (Figure 1B). The duration of the glitch varied from 83–660 ms between the six movies and between clips A and B. After discarding alternate frames, the modified clips still lasted 150 frames and still began and ended on the same frame images.

C. The movie was temporally coarse-quantized to make it judder throughout, by displaying every nth frame n times (Figure 1C). The degree of temporal coarse quantization (effectively reducing the video clip from 120 fps to 6–25 fps) varied between stimuli, but all clips still lasted 150 frames and still began and ended on the same frame images.

D. The speed of the middle part of the movie was increased by skipping frames (Figure 1D). The degree of skipping (a factor of 1.2–3.7) and the duration (from 200–500 ms) of the speeded part varied between clips and movies. After discarding alternate frames, the speeded movies still comprised 150 frames; they began at the same point as the natural movement clips, but they ended further through the original 10-s video sequence.

E. The speed of the middle part of the movie could be reduced by replicating frames in the middle of the clip (Figure 1E). The degree of replication (speed reduced by a factor of 0.25–0.8) and the duration (from 300–660 ms) of the slowed part varied between clips and movies. After discarding alternate frames, the slowed movies still comprised 150 frames overall; they began at the same point as the natural movement clips, but they ended at an earlier point in the original sequence.

F. The speed of the middle part of the movie could be increased as in Figure 1D, except that, once normal speed was resumed, frames were added only until the movie reached the same point as in the natural movement controls; thus these speeded-short movies lasted less than 150 frames (Figure 1F).
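All of these manipulations can be expressed as index sequences into the 300 frames of a source clip. The sketch below, under our own naming, generates indices for methods A through C; methods D through F resample the indices of the middle section in the same spirit. The exact glitch durations and quantization factors varied per clip, as described above.

```python
import numpy as np

def natural_indices(n_src=300):
    # (A) discard alternate frames: 150 frames, shown at 120 fps
    return np.arange(0, n_src, 2)

def glitch_indices(idx, dur):
    # (B) hold one mid-sequence frame for `dur` frames, then rejoin the
    # original timeline, so the clip starts and ends on the same frames
    mid = len(idx) // 2
    return np.concatenate([idx[:mid], np.repeat(idx[mid], dur), idx[mid + dur:]])

def quantized_indices(idx, n):
    # (C) temporal coarse quantization: show every nth frame n times
    return np.repeat(idx[::n], n)[:len(idx)]
```

For example, glitch_indices(natural_indices(), 35) inserts a 35-frame (roughly 291-ms) pause, like that used in the standard pair described later.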

The magnitudes and durations of the manipulations varied between movies and between clips A and B. Of the five manipulations, the first four (B through E) left the modified clips with 150 frames. However, in the case of the last (F), there was a change in the total number of frames; these stimuli were excluded from most of our modeling analysis (see Supplementary Materials). For a given original video sequence, the frames in all the derived video clips were scaled by a single maximum, so that the brightest single pixel in the whole set was 255. In summary, for each original video sequence, we had two clips and five modified variants of each clip.

In the experiment, each clip was paired against its five variants, and the natural clip A was also paired against natural clip B. This resulted in 11 pairs of clips per video sequence, for a total of 66 pairs in the complete experiment. Some examples of movie clip pairs are given in the Supplementary Materials.

Spatial frequency filtering

In addition to the temporal manipulations, each of the 66 pairs of video clips was replicated with (a) low-pass spatial filtering and (b) high-pass spatial filtering. The frames of each clip (natural and modified) were filtered with very steep filters in the spatial frequency domain (Figure 2A, B). The low-pass filter cut off at 1.08 c/deg, while the high-pass filter cut in at 2.17 c/deg but also included 0 c/deg, retaining the average luminance of that frame. After filtering, the frames were scaled by the same factor used for the unfiltered clips. Overall, there were thus 3 × 66 = 198 movie-clip pairs used in the experiment.
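A brick-wall version of such filtering can be sketched in the frequency domain as below. We assume the 240-pixel frame subtends 12°, i.e., 20 pixels/°; the paper's filters may have had a slightly soft edge, and the function names are ours.

```python
import numpy as np

def spatial_filter(frame, cutoff_cpd, pix_per_deg=20.0, lowpass=True):
    """Steep low-/high-pass spatial filter; cutoffs of 1.08 c/deg (low
    pass) and 2.17 c/deg (high pass) correspond to Figure 2A, B."""
    n = frame.shape[0]
    f = np.fft.fftfreq(n, d=1.0 / pix_per_deg)   # cycles/deg along each axis
    fx, fy = np.meshgrid(f, f)
    radius = np.hypot(fx, fy)
    if lowpass:
        mask = radius <= cutoff_cpd
    else:
        mask = radius >= cutoff_cpd
        mask[0, 0] = True                        # keep 0 c/deg: mean luminance
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * mask))
```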

Some examples of single frames from spatially filtered and unfiltered movie clips are shown in Figure 2C through E.

Spatio-temporal blending

A fuzzy border (30 pixels, or 1.5°, wide) was applied around each edge of each square frame image to blend it spatially into the gray background. The 30 pixels covered half a Gaussian with a spread of 10 pixels. The onset and offset of the movie clip were ramped up and down in contrast to blend the clip temporally into the background interstimulus gray. The contrast ramps were raised cosines lasting 25 frames (208 ms).
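Both windows are simple to construct from the quoted parameters. A sketch, under our own naming; the windows would multiply the contrast of each frame (i.e., the frame minus the background gray) before the gray is added back:

```python
import numpy as np

def blend_windows(n_pix=240, border=30, sigma=10, n_frames=150, ramp=25):
    # Spatial: half-Gaussian fuzzy border, 30 pixels with spread 10 pixels
    edge = np.ones(n_pix)
    fall = np.exp(-0.5 * ((np.arange(border) - border) / sigma) ** 2)
    edge[:border], edge[-border:] = fall, fall[::-1]
    spatial = np.outer(edge, edge)
    # Temporal: raised-cosine contrast ramps over the first/last 25 frames
    temporal = np.ones(n_frames)
    up = 0.5 * (1.0 - np.cos(np.pi * np.arange(ramp) / ramp))
    temporal[:ramp], temporal[-ramp:] = up, up[::-1]
    return spatial, temporal
```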

Standard stimulus pair

In addition to the 198 test pairs, a standard pair was generated from a seventh video sequence, of waves on water. The two clips in this pair differed by a noticeable but not extreme glitch or pause (291 ms) midway in one of the pair. The magnitude of this difference was defined as 20 in the ratings experiment. Observers had to rely on this 20 difference to generate their ratings (see above).

This standard movie pair served as a common anchor, and its purpose was to help observers use the same reference point. Its role would be especially important at the beginning of the experiments, when observers generated an internal rating scale. This would prevent a situation in which, for example, an observer might rate some pairs using some maximum allowed value, only to come across subsequent pairs that they perceive to be even more different. This standard pair was made from a different movie clip from the test stimuli and was repeatedly presented to the observer throughout the experiments, in the demonstration, training, and testing phases. Observers were instructed that their ratings of the subjective difference between any other movie pair must be based on this standard pair, using a ratio scale, even when the test pairs differed along different stimulus dimensions from the standard pair. Our previous discrimination experiments with static natural images relied on a similar use of ratings based on a standard pair (e.g., To, Lovell, Troscianko, & Tolhurst, 2008; To et al., 2010).

Figure 1. Schematics of the sequences of frames used to make movie clips. Panel A shows natural movement: alternate frames are taken from the original clip and are played back at about half speed. Panels B–F show various ways (described in the text) of distorting the natural movement. Note that method F results in movie clips with fewer frames than the other five methods.

While our choice of the particular standard pair and the particular standard difference may have affected the magnitudes of the observers' difference ratings to all other stimuli, it is unlikely that a different choice would have modified the rank order across the experiment.

Visual Difference Predictor (VDP) modeling of perceived differences

Comparing a single pair of frames

We have been developing a model of V1 simple and complex cells in order to help explain perceived differences between pairs of static naturalistic images (To et al., 2011; To et al., 2010; Tolhurst et al., 2010). This model is inspired by the cortex transform of Watson (1987), as developed by Watson and Solomon (1997). The model computes how millions of stylized V1 neurons would respond to each of the images under comparison; the responses of the neurons are then compared pairwise to give millions of tiny difference cues, which are combined into a single number by Minkowski summation. This final value is a prediction of an observer's magnitude rating of the difference between those two images (Tolhurst et al., 2010). For present purposes, we are interested only in the monochrome (luminance contrast) version of the model.

Briefly, each of the two images under comparison is convolved with 60 Gabor-type simple-cell receptive field templates: six optimal orientations, five optimal spatial frequencies, and odd- and even-symmetry (each at thousands of different locations within the scene). These linear interim responses in luminance are converted to contrast responses by dividing by the local mean luminance (Peli, 1990; Tadmor & Tolhurst, 1994, 2000). Complex-cell responses are obtained by taking the rms of the responses of paired odd- and even-symmetric simple cells; To et al. (2010) found that a complex-cell model was a better predictor than a simple-cell model of ratings for static images. The complex-cell responses are weighted to match the way that a typical observer's contrast sensitivity depends upon spatial frequency. The response-versus-contrast function becomes sigmoidal when the interim complex-cell responses are divided by the sum of two normalizing terms (Foley, 1994; Tolhurst et al., 2010). First, there is nonspecific suppression (Heeger, 1992), where the response of each neuron is divided by the sum of the responses of all the neurons whose receptive fields are centered on the same point. Second, there is orientation- and spatial-frequency-specific suppression by neurons whose receptive fields surround the field of the neuron in question (Blakemore & Tobin, 1972; Meese, 2004). The details, assumptions, and parameters are given in To et al. (2010) and Tolhurst et al. (2010).
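The pipeline for one pair of frames can be summarized in code. The sketch below is heavily simplified: the Gabor bank, the exponents, and the normalization pools are crude stand-ins, and the contrast-sensitivity weighting is omitted, so this is not the fitted model of To et al. (2010) and Tolhurst et al. (2010), only its skeleton.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor(half, f, theta, phase):
    """Gabor receptive field: frequency f in cycles/pixel, orientation theta."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x ** 2 + y ** 2) * (2.0 * f) ** 2)   # illustrative envelope
    return env * np.cos(2.0 * np.pi * f * xr + phase)

def vdp_pair(img1, img2, p=2.4, m=4.0):
    """Perceived difference between two frames, as a single scalar.
    p (response exponent) and m (Minkowski exponent) are placeholders."""
    freqs = [0.02, 0.04, 0.08, 0.16, 0.32]              # five best frequencies
    thetas = np.arange(6) * np.pi / 6                   # six orientations

    def responses(img):
        local_mean = gaussian_filter(img, 8.0) + 1e-6
        chans = []
        for f in freqs:
            for th in thetas:
                # odd- and even-symmetric simple cells, luminance -> contrast
                odd = convolve(img, gabor(15, f, th, np.pi / 2)) / local_mean
                even = convolve(img, gabor(15, f, th, 0.0)) / local_mean
                chans.append(np.hypot(odd, even))       # complex-cell energy
        chans = np.array(chans)
        # crude stand-in for nonspecific plus surround suppression
        pool = chans.sum(axis=0, keepdims=True)
        return chans ** p / (1.0 + pool)

    diff = np.abs(responses(img1) - responses(img2))
    return (diff ** m).sum() ** (1.0 / m)               # Minkowski summation
```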

Figure 2. (A) The low-pass spatial filter applied to the raw movie clips to make low-pass movies. (B) The high-pass spatial filter; note that it does retain 0 c/deg, the average brightness level. (C–E) Single frames taken from three of the movie families, showing, from left to right, the raw movie, the low-pass version, and the high-pass version.


Comparing all the frames in a pair of movie clips

Such VDP modeling is confined to extracting a single predicted difference value from comparing two images. Here we elaborate the VDP to compare two movie clips, each of which comprises 150 possibly different frames, or brief static images. We have investigated three different methods of breaking movies into static image pairs for application of the VDP model. These methods tap different spatio-temporal aspects of any differences between the paired movies and are shown schematically in Figure 3.

The Spatial Model: Stepwise between-movie comparison (Figure 3A). Each of the 150 frames in one movie is compared with the corresponding frame of the other movie, giving 150 VDP outputs, which are then combined by Minkowski summation into a single number (Watson, 1998; Watson et al., 2001). This is one prediction of how different the two movie clips would seem to be. The VDP could show differences for two reasons. First, if the movies are essentially of the same scene but are moving temporally at a different rate, the compared frames will get out of synchrony. Second, at an extreme, the VDP would show differences even if the two movie clips were actually static but had different spatial content. This method of comparing frames will be sensitive to differences in spatial structure as well as to differences in dynamics, and is a measure of overall difference in spatio-temporal structure.

The Temporal Model: Stepwise within-movie comparison (Figure 3B). Here the two movies are first analyzed separately. Each frame in a movie clip is compared with the successive frame in the same movie; with 150 frames, this gives 149 VDP comparison values, which are combined to a single value by Minkowski summation. This is a measure of the overall dynamic content of the movie clip and is calculated for each of the paired movies separately. The difference between the values for the two clips is a measure of the difference in overall dynamic content. It would be uninfluenced by any spatial differences in scene content.

The Ordered-Temporal Model: Hybrid two-step movie comparison (Figure 3C). The previous measure gives overall dynamic structure but may not, for instance, illustrate whether particular dynamic events occur at different times within the two movie clips. Here, each frame in a movie clip is compared directly with the succeeding frame (as before), but this dynamic frame difference is now compared with the dynamic difference between the two equivalent frames in the other movie clip. So, within each movie, the nth frame is compared to the (n+1)th frame, and then these two differences (the nth difference from each movie) are compared to generate a between-movie difference. The 149 between-movie differences are combined, again by Minkowski summation. This gives a measure of the perceived difference in ordered dynamic content, irrespective of any spatial scene content differences. This model has the advantage of being sensitive to dynamic changes occurring both within each movie and between the two.

Figure 3. Schematics showing the three methods we used to compare movie clips, based on the idea that we should compare pairs of movie frames directly, either between the movies (A) or within movies (B, C). See details in the text.
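Given any scalar frame-pair comparison such as the vdp_pair sketch above, the three rules differ only in which frame pairs they feed it and where the Minkowski summation sits. A sketch follows; here we treat each within-movie comparison as a scalar VDP output, which is one reading of the hybrid rule.

```python
import numpy as np

def minkowski(values, m=4.0):
    v = np.abs(np.asarray(values, dtype=float))
    return (v ** m).sum() ** (1.0 / m)

def spatial_model(A, B, vdp):
    # compare corresponding frames between movies: 150 VDP outputs
    return minkowski([vdp(a, b) for a, b in zip(A, B)])

def temporal_model(A, B, vdp):
    # sum successive-frame differences within each movie, then compare
    def dynamic(M):
        return minkowski([vdp(M[i], M[i + 1]) for i in range(len(M) - 1)])
    return abs(dynamic(A) - dynamic(B))

def ordered_temporal_model(A, B, vdp):
    # compare the nth within-movie frame difference across the two movies
    d = [abs(vdp(A[i], A[i + 1]) - vdp(B[i], B[i + 1]))
         for i in range(len(A) - 1)]
    return minkowski(d)
```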

Modeling realistic impulse functions

Direct VDP comparison of paired frames is not a realistic model, since the human visual system is almost certainly unable to code each 8-ms frame as a separate event. As a result, it is necessary to model human dynamic sensitivity. Thus, before performing the paired VDP comparisons, we convolved the movie clips with two different impulse functions (Figure 4A), representing "sustained" (P-cell; red curve) or "transient" (M-cell; blue curve) channels. The impulse functions have shape and duration that match the temporal-frequency sensitivity functions (Figure 4B) inferred for the two kinds of channel by Kulikowski and Tolhurst (1973). The three methods for comparing movie frames were performed separately on the movies convolved with the sustained and transient impulse functions, giving six potential predictors of human performance. Note that, in the absence of realistic temporal filtering, the analysis would effectively be with a very brief sustained impulse function lasting no more than 8 ms (compared to the 100-ms duration of the realistic sustained impulse; Figure 4A).
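This temporal filtering amounts to a causal convolution of each pixel's time course with the impulse response, sampled at the 8.3-ms frame interval, before any frame-pair comparison. The impulse shapes below are purely illustrative stand-ins (a monophasic "sustained" and a biphasic "transient" response); the paper's functions were matched to Kulikowski and Tolhurst (1973), as shown in Figure 4.

```python
import numpy as np

def temporally_filter(movie, impulse):
    """movie: (n_frames, h, w); impulse: 1-D response sampled at 120 fps."""
    n = len(impulse)
    # replicate the first frame backwards in time so the output stays causal
    padded = np.concatenate([np.repeat(movie[:1], n - 1, axis=0), movie])
    out = np.empty(movie.shape, dtype=float)
    for t in range(movie.shape[0]):
        out[t] = np.tensordot(impulse[::-1], padded[t:t + n], axes=1)
    return out

t = np.arange(0, 0.12, 1.0 / 120.0)            # ~100-ms support, 120-fps samples
sustained = t * np.exp(-t / 0.02)              # monophasic (illustrative only)
sustained /= sustained.sum()
transient = np.diff(sustained, prepend=0.0)    # biphasic (illustrative only)
```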

Experimental results

Reliability of the observers' ratings

Seven naïve observers were presented with 198 pairs of monochrome movies and were asked to rate how different they perceived the two movie clips in each pair to be. We have previously shown in To et al. (2010) that observers' ratings for static natural images are reliable. To verify the robustness of the difference ratings for movie pairs in the present experiment, we considered the correlation coefficient between the ratings given by any one observer and those given by each other observer; this value ranged from 0.36 to 0.62 (mean 0.50).

In our previous visual discrimination studies (To et al., 2010), we have scaled and averaged the difference ratings of all seven observers to each stimulus pair (see Methods) and have used the averages to compare with various model predictions. Although this procedure might remove any potential between-observer differences in strategy, it was used to average out within-observer variability. As in our previous experiments and analyses, averaging together the results of observers produced a reliable dataset for modeling.

When observers' averaged ratings for unfiltered, high-pass, and low-pass movies were compared (see Figure 5), we found that the ratings for filtered movies were generally lower than those for unfiltered movies. This would seem understandable, as unfiltered movies contain more details.

Observers' ratings versus model predictions

The average of the normalized ratings for each movie pair was compared with the predictions obtained from a variety of different models of V1 complex-cell processing. The correlation coefficients between the ratings and predictions are presented in Table 1. The different versions of the model all rely ultimately on a Visual Difference Predictor (To et al., 2010; Tolhurst et al., 2010), which calculates the perceived difference between a pair of frames. Given that the movie clips are composed of 150 frames each, there are a number of ways of choosing paired frames to compare a pair of movie clips (see Methods). The Spatial Model takes each frame in one movie and compares it with the equivalent frame in the second movie, and is sensitive to differences in spatial content and dynamic events in the movie clips. The Temporal Model compares successive frames within a movie clip to assess the overall dynamic content of a clip. The Ordered-Temporal Model is similar, but it compares the timings of dynamic events between movie clips.

Figure 4. (A) The impulse functions used to convolve the movie clips: sustained (red) and transient (blue). These impulse functions have the temporal-frequency filter curves shown in (B) and are like those proposed by Kulikowski and Tolhurst (1973) for sustained and transient channels.

Figure 5. The averaged ratings given by observers for the spatially filtered movies are plotted against the ratings for the raw, unfiltered versions of the movies: low-pass filtered (blue) and high-pass filtered (red).


Without temporal filtering

We first used the VDP to compare pairs of frames as they were actually presented on the display; this presumes, unrealistically, that an observer could treat each 8-ms frame as a separate event and that the human contrast sensitivity function (CSF) is flat out to about 60 Hz. Overall, the best predictions by just one model were generated by the Ordered-Temporal Model (r = 0.438). The difference predictions were slightly improved when the three models were combined into a multiple linear regression model, and the correlation coefficient increased to r = 0.507. When considering the accuracy of the models' predictions in relation to the different types of movies presented (i.e., spatially unfiltered, low-pass, and high-pass), the predictions were consistently superior for the high-pass movies, and the poorest predictions were for difference ratings for the low-pass movies. More specifically, in the case of the Multilinear Model, the correlation values between predictions and measured ratings for unfiltered, low-pass, and high-pass movies were 0.608, 0.392, and 0.704, respectively.

Realistic temporal filtering

The predictions discussed so far were generated by models that compare the movie pairs frame by frame, without taking into account the temporal properties of V1 neurons. Now we examine whether the inclusion of sustained and/or transient temporal filters (Figure 4) improves the models' performance at predicting perceived differences between two movie clips.

The Multilinear Model based only on the sustained impulse function showed little improvement, either overall (r = 0.509 now vs. 0.507 originally) or separately for the three different kinds of spatially filtered movie clips; the predictions given by the different models for the different spatially filtered movie versions were also little improved. In particular, the fit for low-pass movies remained poor (r = 0.408 vs. r = 0.392). This is hardly surprising, given that low spatial frequencies are believed to be detected primarily by transient channels rather than sustained channels (Kulikowski & Tolhurst, 1973). However, it is worth noting that, for the high-pass spatially filtered movies, the realistic sustained impulse improves the predictions of the Spatial Model (r increases from 0.336 to 0.582) but lessens the predictions of the Ordered-Temporal Model (r decreases from 0.584 to 0.488). This would seem to be consistent with the idea that higher spatial frequencies are predominantly processed by sustained channels (Kulikowski & Tolhurst, 1973).

As expected, when the modeling was based on the transient impulse function alone, with its substantially different form (Figure 4), the nature of the correlations changed. In particular, it generated much stronger fits for the low-pass spatially filtered movies (r = 0.636 vs. r = 0.408 in the previous case). However, the fits for the high-pass movies were noticeably weaker (r = 0.675 vs. r = 0.749). When taking all the movies into account, the improvement of the predictions for the low-pass movies meant that the overall correlation for the multilinear fit improved from 0.509 to 0.598. It therefore becomes clear that the sustained and transient filters operate quite specifically to improve the predictions for only one type of movie: high-pass movies in the case of sustained impulse functions, and low-pass movies in the case of transient impulse functions.

Models                           Transient and/or   All movies   Unfiltered   Low-pass   High-pass
                                 sustained          (n = 162)    (n = 54)     (n = 54)   (n = 54)
--------------------------------------------------------------------------------------------------
Spatial                          –                  0.288        0.338        0.274      0.336
Temporal                         –                  0.357        0.409        0.102      0.546
Ordered-Temporal                 –                  0.438        0.505        0.266      0.584
Multilinear (three parameters)   –                  0.507        0.608        0.392      0.704
Spatial                          S                  0.396        0.425        0.306      0.582
Temporal                         S                  0.299        0.292        0.241      0.414
Ordered-Temporal                 S                  0.415        0.503        0.370      0.488
Multilinear (three parameters)   S                  0.509        0.631        0.408      0.749
Spatial                          T                  0.453        0.391        0.444      0.565
Temporal                         T                  0.454        0.518        0.481      0.425
Ordered-Temporal                 T                  0.342        0.311        0.301      0.469
Multilinear (three parameters)   T                  0.598        0.627        0.636      0.675
Multilinear (six parameters)     T and S            0.759        0.788        0.702      0.806

Table 1. The correlation coefficients between the observers' ratings (average of 7 per stimulus) and various V1-based models. The correlations for the spatially unfiltered, the low-pass, and the high-pass filtered subsets are shown separately, as well as the fits for the entire set of 162 movie pairs. Correlations are shown for raw movies and for temporally filtered (sustained, S, and transient, T) models. The final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function.


Finally, the predictions of the V1-based model were substantially improved by incorporating both the transient and sustained filters, that is, a multiple linear fit with six parameters (three ways of comparing frames, for each of the sustained and transient temporal filters). Figure 6 plots the observers' averaged ratings against the six-parameter model predictions. The correlation coefficient between the predictions generated by this six-parameter multilinear model and observers' measured ratings was 0.759 for the full set of 162 movie pairs, a very substantial improvement over our starting value of 0.507 (p = 0.0001, n = 162). The correlations for the six-parameter fits also improved separately for the spatially unfiltered (black symbols), the low-pass (blue), and the high-pass movies (red). The greatest improvement from our starting point without temporal filtering was in the multilinear fit for the low-pass spatially filtered movies (r increases from 0.392 to 0.702).
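The final fit is an ordinary least-squares regression of the 162 averaged ratings on the six model outputs plus an intercept. A minimal sketch (array names are ours):

```python
import numpy as np

def multilinear_fit(predictors, ratings):
    """predictors: (162, 6) array of the three comparison models under
    sustained and transient filtering; ratings: (162,) averaged ratings."""
    X = np.column_stack([np.ones(len(ratings)), predictors])
    beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    fitted = X @ beta
    r = np.corrcoef(fitted, ratings)[0, 1]   # 0.759 reported in the text
    return beta, r
```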

Discussion

We have investigated whether it is possible to model how human observers rate the differences between two dynamic naturalistic scenes. Observers were presented with pairs of movie clips that differed on various dimensions, such as glitches and speed, and were asked to rate how different the clips appeared to them. We then compared the predictions from a variety of V1-based computational models with the measured ratings. All the models were based on multiple comparisons of pairs of static frames, but the models differed in how pairs of frames were chosen for comparison. The pairwise VDP comparison is the same as we have reported for static photographs of natural scenes (To et al., 2010), but augmented by prior convolution in time of the movie clips with separate, realistic sustained and transient impulse functions (Kulikowski & Tolhurst, 1973).

The use of magnitude estimation ratings in visual psychophysical experiments might be considered too subjective and hence not very reliable. We have previously shown that ratings are reliable (To et al., 2010): repeated ratings within subjects are highly correlated/reproducible (even when the stimulus set was 900 image pairs), and the correlation for ratings between observers is generally good. We have chosen this method in this and previous studies for three reasons. First, when observers generate ratings of how different pairs of stimuli appear to them, they have the freedom to rate the pairs as they please, regardless of which features were changed. The ratings therefore allow us to measure discrimination across, and independently of, feature dimensions, separately and in combination (To et al., 2008; To et al., 2011). Second, unlike threshold measurements, ratings can be collected quickly. We are interested in studying vision with naturalistic images and naturalistic tasks, and it is a common criticism of such study that natural images are so varied that we must always study very many (198 movie pairs in this study). Using staircase threshold techniques for so many stimuli would be impractically slow. Third, ratings are not confined to threshold stimuli, and we consider it necessary to be able to measure visual performance with suprathreshold stimuli as well. Overall, averaging the ratings of several observers together has provided us with data sets that are certainly good enough to allow us to distinguish between different V1-based models of visual coding (To et al., 2010; To et al., 2011). Although our study is not aimed toward the invention of image-quality metrics, we have essentially borrowed our ratings technique from that applied field.

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.

We have investigated a number of different measures for comparing the amount and timing of dynamic events within a pair of movies. Our Spatial Model computed the differences between corresponding frames in each clip in the movie pair and then combined the values using Minkowski summation. This is the method used by Watson (1998) and Watson et al. (2001) in their model for evaluating the degree of perceived distortion in compressed video streams. It seems particularly appropriate in such a context, where the compressed video would have the same fundamental spatial and synchronous temporal structure as the uncompressed reference. In a sense, this measure imagines that an observer views and memorizes the first movie clip and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated, and the greater the measured difference. Others have suggested that video distortion metrics would be enhanced by using additional measures, where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus, our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward- and a rightward-moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatio-temporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to map specifically onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip that has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass spatial filtering or to high-pass filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways), with their different spatial frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames, for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial information as well as temporal, and the ordered-temporal/transient factor, handling the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.

Our VDP is intended to be a model of low-level visual processing, as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), the transient and sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002) and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high spatial frequency information will be biased towards sustained processing, while low spatial frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for the latter case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section, because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials we discuss this further: by truncating the longer clip of the pair to the same length as the shorter one, we were able to calculate the predictive measures. The resulting six-parameter multilinear regression for all stimuli had a less good fit than that for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral-movement-sensing elements in our model, either separately or in addition to the temporal impulse functions, and a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project, and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on EP/E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69. doi:10.1109/TIP.2012.2210727

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640. doi:10.1126/science.1089506

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19. http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2.

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280. doi:10.1016/j.neuroimage.2009.03.078

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Proceedings of SPIE, 5200, 64–78.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306. doi:10.1098/rspb.1992.0162

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14. http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37.

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943. http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8.

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402. doi:10.1146/annurev.ne.16.030193.002101

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798. doi:10.1523/JNEUROSCI.4044-06.2006

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372. doi:10.1038/nrn2619

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808. doi:10.1093/cercor/bhr078

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755. doi:10.1016/j.visres.2004.03.003

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446. doi:10.1038/380442a0

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24. http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16.

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308. doi:10.1098/rspb.2008.0692

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372. doi:10.1098/rspb.2010.1888

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 12

To M P S Gilchrist I D amp Tolhurst D J (2014)Transient vs Sustained Modeling temporal im-pulse functions in visual discrimination of dynamicnatural images Perception 43 478

To M P S Lovell P G Troscianko T Gilchrist ID Benton C P amp Tolhurst D J (2012)Modeling human discrimination of moving imagesPerception 41(10) 1270ndash1271

To M P S Lovell P G Troscianko T amp TolhurstD J (2010) Perception of suprathreshold natu-ralistic changes in colored natural images Journalof Vision 10(4)12 1ndash22 httpwwwjournalofvisionorgcontent10412 doi10116710412 [PubMed] [Article]

Tolhurst D J (1973) Separate channels for theanalysis of the shape and the movement of movingvisual stimulus Journal of Physiology 231(3) 385ndash402

Tolhurst D J (1975) Sustained and transient channelsin human vision Vision Research 15 1151ndash1155

Tolhurst D J amp Thompson I D (1981) On thevariety of spatial frequency selectivities shown byneurons in area 17 of the cat Proceedings of theRoyal Society B Biological Sciences 213(1191)183ndash199

Tolhurst D J To M P S Chirimuuta MTroscianko T Chua P Y amp Lovell P G (2010)Magnitude of perceived change in natural imagesmay be linearly proportional to differences inneuronal firing rates Seeing Perceiving 23(4) 349ndash372

Troscianko T Meese T S amp Hinde S (2012)

Perception while watching movies Effects ofphysical screen size and scene type Iperception3(7) 414ndash425 doi101068i0475aap

Vidyasagar T R Kulikowski J J Lipnicki D Mamp Dreher B (2002) Convergence of parvocellularand magnocellular information channels in theprimary visual cortex of the macaque EuropeanJournal of Neuroscience 16(5) 945ndash956

Vu P V amp Chandler D M (2014) Vis3 Analgorithm for video quality assessment via analysisof spatial and spatiotemporal slices Online sup-plement The Journal of Electronic Imaging 23(1)013016

Wang Z amp Li Q (2007) Video quality assessmentusing a statistical model of human visual speedperception Journal of the Optical Society ofAmerica A Optics Image Science and Vision24(12) B61ndashB69

Watson A B (1987) Efficiency of a model humanimage code Journal of the Optical Society ofAmerica A Optics Image Science and Vision4(12) 2401ndash2417

Watson A B (1998) Toward a perceptual videoquality metric Human Vision Visual Processingand Digital Display VIII 3299 139ndash147

Watson A B Hu J amp McGowan J F III (2001)Digital video quality metric based on human visionJournal of Electronic Imaging 10 20ndash29

Watson A B amp Solomon J A (1997) Model ofvisual contrast gain control and pattern maskingJournal of the Optical Society of America A OpticsImage Science and Vision 14(9) 2379ndash2391

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 13

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • Experimental results
  • f04
  • f05
  • t01
  • Discussion
  • f06
  • Adelson1
  • Blakemore1
  • Borji1
  • Derrington1
  • Foley1
  • Gouras1
  • Hasson1
  • Heeger1
  • Hess1
  • Hirose1
  • Horiguchi1
  • Itti1
  • Johnston1
  • Keesey1
  • Kulikowski1
  • Lovell1
  • Meese1
  • Merigan1
  • Nassi1
  • Nassi2
  • Ninomiya1
  • Peli1
  • Perrone1
  • Sawatari1
  • Seshadrinathan1
  • Simoncelli1
  • Smith1
  • Tadmor1
  • Tadmor2
  • To1
  • To2
  • To3
  • To4
  • To5
  • Tolhurst1
  • Tolhurst2
  • Tolhurst3
  • Tolhurst4
  • Troscianko1
  • Vidyasagar1
  • Vu1
  • Wang1
  • Watson1
  • Watson2
  • Watson3
  • Watson4

The models were most successful when they included physiologically realistic mechanisms such as nonspecific suppression or contrast normalization (Heeger, 1992), surround suppression (Blakemore & Tobin, 1972), and Gabor receptive fields that are elongated and whose bandwidth changes with best spatial frequency (Tolhurst & Thompson, 1981). A model based on complex cells was more successful than one based on simple cells. Here we ask whether such models can be extended to the perception of natural-image movie stimuli.

These visual discrimination models can generate decent predictions of how well the visual system can discriminate between static images. However, the visual environment is obviously in constant flux, and natural scenes are rarely still. It is therefore important to investigate whether such modeling can be extended to dynamic movie-clip pairs. Modeling the task of discriminating between two naturalistic movie clips needs extra consideration over the task of discriminating two static images.

First, since movie clips are composed of multiple frames, observers must process all the frames within each movie separately and must then compare some percept or memory of the frames between the two movies. While comparing static natural scenes requires only one head-to-head computational comparison, in the case of comparing movies that are composed of numerous frames, multiple comparisons need to be made, and many comparison rules are possible. It is possible that observers might compare the frames between movies frame by frame; this is suggested by Watson (1998) and Watson, Hu, and McGowan (2001) in their biologically inspired model for evaluating the degree of perceived distortion caused by lossy compression of video streams. Observers might also compare the frames within each movie clip separately before comparing the two clips, perhaps by coding object movement within the movies, as has been suggested for enhanced measures of video distortion (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). It is important to identify the best rule to describe how movie frames across movies are compared.

Second, when constructing a V1-based computational model of visual discrimination of dynamic images, one must also obviously consider the temporal sensitivity of the visual system. One hypothesis has it that, in the early stages of the visual system, there are two parallel pathways. The transient or magnocellular (M-cell) pathway acts primarily on low spatial frequencies, while the sustained or parvocellular (P-cell) pathway acts primarily on middle and high spatial frequencies (Derrington & Lennie, 1984; Gouras, 1968; Hess & Snowden, 1992; Horiguchi, Nakadomari, Misaki, & Wandell, 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). The two pathways have been proposed to convey different perceptual information about temporal and spatial structure (Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973).

In the experiment described here, we presented observers with pairs of stimuli derived from achromatic movies of natural scenes and asked them to rate how different these clips appeared to them. We have examined different derivations of our V1-based VDP model for static images to accommodate observers' magnitude estimation ratings of dynamic natural scenes. We examine the different contributions of the transient low spatial-frequency channels compared to the sustained high spatial-frequency channels. We also examine different possible strategies for breaking the comparison of two movies into multiple head-to-head comparisons between pairs of frames. Some of the results have been reported briefly (To, Gilchrist, & Tolhurst, 2014; To et al., 2012).

Methods

Observers viewed monochrome movie clips presented on a 19-in. Sony CRT display at 57 cm. The movie frames were 240 pixels square, subtending 12° square in the center of the screen. Apart from that central square, the rest of the 800 × 600 pixels of the display were mid-gray (88 cd/m²). In the times between movie clips, the whole screen was maintained at that same gray. Presentation was through a ViSaGe system (Cambridge Research Systems, Rochester, UK) so that nonlinearities in the display's luminance output were corrected without loss of bit resolution. The display frame rate was 120 fps (frames per second).

In a single trial, the observers viewed two slightly different movie clips successively, and they were asked to give a numerical magnitude rating of how large they perceived the differences between the clips to be. Each clip usually lasted 1.25 s, and there was a gap of 80 ms between the two clips in a trial.

Experimental procedure

There were 198 movie clip pairs (see below), and they were presented once each to each of seven observers. In addition, a standard movie pair, described below, was presented after every ten test trials; the perceived magnitude difference of this pair was deemed to be "20." Observers were instructed to rate how different pairs of movies appeared to them. They were asked to base their ratings on their overall general visual experience and were not given any specific instructions on what aspects of the videos (e.g., spatial, temporal, and/or spatio-temporal) they should rely on. They were told that if they perceived the difference between a test movie pair to be less, equal, or greater than the standard pair, their rating should obviously be less, equal, or greater than 20, respectively. They were to use a ratio scale, so that if, for instance, a given movie pair seemed to have a difference twice as great as that of the standard pair, they would assign a value twice as large to that pair (i.e., 40). No upper limit was set, so that observers could rate large differences as highly as they saw fit. Observers were also told that sometimes movie pairs might be identical, in which case they should set the rating to zero (in fact, all the movie pairs did differ to some extent).

Before an experiment, each observer underwent a training session, when they were asked to rate 50 pairs of movie clips containing various types of differences that could be presented to them later on in the experiment. All movie clips (apart from the standard movie pair) used in demonstration or training phases were made from three original movies that were different from the seven used in the testing phase proper.

The testing phase was divided into three blocks of 66 image pairs. Each block lasted about 15 min and started with the presentation of the standard movie pair, which was subsequently presented after every 10 trials to remind the observers of the standard difference of 20. The image presentation sequence was randomized differently for each observer.

Data collation and analysis

For each observer, their 198 magnitude ratings were normalized by dividing by that observer's average rating. Then, for each stimulus pair, the normalized ratings of the seven observers were averaged together. These 198 averages of normalized ratings were then multiplied by a single scalar value to bring their grand average up to the same value as the grand average of all the ratings given by all the observers in the whole experiment. Our experience (To, Lovell, Troscianko, & Tolhurst, 2010) is that, generally, averages of normalized ratings have a lower coefficient of variation than do averages of raw ratings.
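For concreteness, this collation procedure can be written out as a short sketch. This is our illustration, not the authors' code; the array name and function name are assumed, with observers along the first axis and the 198 stimulus pairs along the second.

```python
import numpy as np

def collate_ratings(ratings: np.ndarray) -> np.ndarray:
    """Average normalized ratings across observers, rescaled so that
    the grand average matches that of the raw ratings."""
    per_observer_mean = ratings.mean(axis=1, keepdims=True)
    normalized = ratings / per_observer_mean   # each observer's mean rating -> 1.0
    averaged = normalized.mean(axis=0)         # one averaged rating per movie pair
    grand_average = ratings.mean()             # grand average of all raw ratings
    return averaged * (grand_average / averaged.mean())
```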

Observers

The experiment tested seven observers, recruited from the student or postdoctoral researcher populations at the University of Cambridge, UK. They all gave informed consent and had normal vision after prescription correction, as verified using a Landolt C acuity chart and the Ishihara color test (10th Edition).

Construction of stimuli

Original movies

The monochrome test stimuli were constructed from six original color video sequences, each lasting 10 s. Three of the stimulus movies were of "environmental movement," e.g., grass blowing, or water rippling or splashing. One was of a duck swimming on a pond. The other two were of people's hands: one person was peeling a potato, and the other was demonstrating sign language for the deaf. A seventh video sequence was used to make a standard video pair.

The video sequences were originally taken at 200 fpswith a Pulnix TMC-6740 CCD camera (JAI-Pulnix AS Copenhagen Denmark Lovell Gilchrist Tolhurstamp Troscianko 2009) Each frame was saved as aseparate uncompressed image 10 bits in each of threecolor planes on a Bayer matrix of 640 middot 480 pixelsEach frame image was converted to a floating-point 640middot 480 pixel RGB image file with values corrected forany nonlinearities in the luminance versus outputrelation Each image was then reduced to 320 middot 240pixels by taking alternate rows and columns (the nativepixelation of R and B in the Bayer) and was convertedto monochrome by averaging together the three colorplanes Finally the images were cropped to 240 middot 240pixels Some very bright highlights were removed byclipping the brightest image pixels to a lower value
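The pipeline from raw color frame to 240 × 240 monochrome test frame can be summarized in a few lines. This is a hedged sketch: the central-crop offset and the clip level are our assumptions, and `rgb` stands for one linearized 480 × 640 × 3 floating-point frame.

```python
import numpy as np

def preprocess_frame(rgb: np.ndarray, clip_level: float = 250.0) -> np.ndarray:
    small = rgb[::2, ::2, :]             # 240 x 320: take alternate rows and columns
    mono = small.mean(axis=2)            # average the three color planes
    mono = mono[:, 40:280]               # crop to 240 x 240 (central crop assumed)
    return np.minimum(mono, clip_level)  # clip the brightest highlights (level assumed)
```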

Stimulus movie clips

From each of the 10-s, 200-fps movies, we took two nonoverlapping sequences of 300 frames; these will be referred to as clips A and B. For each of the six testing movies, the 300-frame clips A and B were each subject to six temporal manipulations (shown schematically in Figure 1).

A. Alternate frames were simply discarded, to leave 150 frames to be displayed at 120 fps (Figure 1A). These had natural movement. These clips were the references for pairwise rating comparison with the modified clips below.

B. A glitch or pause was inserted midway, by repeating one frame image near the middle of the movie for a number of frames before jumping back to the original linear sequence of frames (Figure 1B). The duration of the glitch was varied from 83 to 660 ms between the six movies and between clips A and B. After discarding alternate frames, the modified clips still lasted 150 frames and still began and ended on the same frame images.

C. The movie was temporally coarse-quantized, to make it judder throughout, by displaying every nth frame n times (Figure 1C). The degree of temporal coarse quantization (effectively reducing the video clip from 120 fps to 6–25 fps) varied between stimuli, but all clips still lasted 150 frames and still began and ended on the same frame images.

D. The speed of the middle part of the movie was increased by skipping frames (Figure 1D). The degree of skipping (a factor of 1.2–3.7) and the duration (from 200–500 ms) of the speeded part varied between clips and movies. After discarding alternate frames, the speeded movies still comprised 150 frames; they began at the same point as the natural movement clips, but they ended further through the original 10-s video sequence.

E. The speed of the middle part of the movie could be reduced by replicating frames in the middle of the clip (Figure 1E). The degree of replication (speed reduced by a factor of 0.25–0.8) and the duration (from 300–660 ms) of the slowed part varied between clips and movies. After discarding alternate frames, the slowed movies still comprised 150 frames overall; they began at the same point as the natural movement clips, but they ended at an earlier point in the original sequence.

F. The speed of the middle part of the movie could be increased as in Figure 1D, except that, once normal speed was resumed, frames were added only until the movie reached the same point as in the natural movement controls; thus these speeded-short movies lasted less than 150 frames (Figure 1F).

The magnitudes and durations of the manipulations varied between movies and between clips A and B. Of the five manipulations, the first four (B through E) left the modified clips with 150 frames. However, in the case of the last (F), there was a change in the total number of frames; these stimuli were excluded from most of our modeling analysis (see Supplementary Materials). For a given original video sequence, the frames in all the derived video clips were scaled by a single maximum, so that the brightest single pixel in the whole set was 255. In summary, for each original video sequence, we had two clips and five modified variants of each clip.

In the experiment, each clip was paired against its five variants, and the natural clip A was also paired against natural clip B. This resulted in 11 pairs of clips per video sequence, for a total of 66 pairs in the complete experiment. Some examples of movie clip pairs are given in the Supplementary Materials.
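The temporal manipulations reduce to simple bookkeeping on frame indices. The sketch below illustrates A through D under assumed placeholder parameters (the real magnitudes and durations varied between movies and clips, and our reading of how the glitch rejoins the original timeline is an interpretation of Figure 1B); `src` indexes into the original 200-fps sequence and may extend beyond the 300-frame window for the speeded clips.

```python
import numpy as np

def natural(src):                               # A: discard alternate frames,
    return src[::2]                             # 300 -> 150 frames shown at 120 fps

def glitch(src, at=150, pause=40):              # B: hold one frame mid-movie, then
    held = np.repeat(src[at], pause)            # rejoin the original timeline so the
    return np.concatenate([src[:at], held, src[at + pause:]])  # last frame is unchanged

def coarse_quantize(src, n=10):                 # C: display every nth frame n times
    return np.repeat(src[::n], n)[:len(src)]

def speed_up(src, clip_len=300, start=120, span=120, factor=2):
    # D: play `span` source frames at `factor`-times speed, then resume normal
    # speed, drawing later source frames so the clip still has clip_len frames.
    fast = src[start:start + span:factor]
    n_tail = clip_len - start - len(fast)
    tail = src[start + span:start + span + n_tail]
    return np.concatenate([src[:start], fast, tail])

# e.g., stimulus = natural(glitch(np.arange(300)))  # a 150-frame glitched clip
```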

Spatial frequency filtering

In addition to the temporal manipulations, each of the 66 pairs of video clips was replicated with (a) low-pass spatial filtering and (b) high-pass spatial filtering. The frames of each clip (natural and modified) were filtered with very steep filters in the spatial frequency domain (Figure 2A, B). The low-pass filter cut off at 1.08 c/°, while the high-pass filter cut in at 2.17 c/° but also included 0 c/°, retaining the average luminance of that frame. After filtering, the frames were scaled by the same factor used for the unfiltered clips. Overall, there were thus 3 × 66, or 198, movie-clip pairs used in the experiment.

Some examples of single frames from spatially filtered and unfiltered movie clips are shown in Figure 2C through E.
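A brick-wall frequency-domain filter is one way to realize such "very steep" filters. The sketch below assumes square frames and a resolution of 20 pixels/°; the hard cutoff and the function name are our simplifications, not the authors' implementation.

```python
import numpy as np

def spatial_filter(frame: np.ndarray, cutoff_cpd: float, kind: str,
                   pix_per_deg: float = 20.0) -> np.ndarray:
    n = frame.shape[0]
    fy = np.fft.fftfreq(n)[:, None] * pix_per_deg      # cycles/deg, vertical
    fx = np.fft.fftfreq(n)[None, :] * pix_per_deg      # cycles/deg, horizontal
    radius = np.hypot(fy, fx)
    if kind == "lowpass":
        mask = radius <= cutoff_cpd
    else:                                              # highpass, keeping 0 c/deg so
        mask = (radius >= cutoff_cpd) | (radius == 0)  # mean luminance is retained
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * mask))

# low = spatial_filter(frame, 1.08, "lowpass"); high = spatial_filter(frame, 2.17, "highpass")
```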

Spatio-temporal blending

A fuzzy border (30 pixels, or 1.5°, wide) was applied around each edge of each square frame image, to blend it spatially into the gray background. The 30 pixels covered half a Gaussian with a spread of 10 pixels. The onset and offset of the movie clip were ramped up and down in contrast, to blend the clip temporally into the background interstimulus gray. The contrast ramps were raised cosines lasting 25 frames (208 ms).
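The blending can be expressed as a single contrast envelope applied to the clip. A minimal sketch, assuming clips stored as height × width × frames arrays; names are ours:

```python
import numpy as np

def blend_into_gray(clip: np.ndarray, gray: float, border: int = 30,
                    sigma: float = 10.0, ramp_frames: int = 25) -> np.ndarray:
    n = clip.shape[0]
    d = np.arange(n, dtype=float)
    edge = np.minimum(d, d[::-1])                      # distance to the nearest edge
    g = np.where(edge < border,
                 np.exp(-0.5 * ((border - edge) / sigma) ** 2), 1.0)
    window = np.outer(g, g)                            # half-Gaussian fuzzy border
    t = np.ones(clip.shape[2])
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(ramp_frames) / ramp_frames))
    t[:ramp_frames] = ramp                             # raised-cosine onset...
    t[-ramp_frames:] = ramp[::-1]                      # ...and offset (25 frames each)
    envelope = window[:, :, None] * t[None, None, :]
    return gray + (clip - gray) * envelope             # blend toward interstimulus gray
```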

Standard stimulus pair

In addition to the 198 test pairs, a standard pair was generated from a seventh video sequence, of waves on water. The two clips in this pair differed by a noticeable but not extreme glitch or pause (291 ms) midway in one of the pair. The magnitude of this difference was defined as 20 in the ratings experiment. Observers had to rely on this 20 difference to generate their ratings (see above).

This standard movie pair served as a common anchor, and its purpose was to facilitate observers using the same reference point. Its role would be especially important at the beginning of the experiments, when observers generated an internal rating scale. This would prevent a situation in which, for example, an observer might rate some pairs using some maximum allowed value, only to come across subsequent pairs that they perceive to be even more different. This standard pair was made from a different movie clip from the test stimuli and was repeatedly presented to the observer throughout the experiments, in the demonstration, training, and testing phases. Observers were instructed that their ratings of the subjective difference between any other movie pair must be based on this standard pair, using a ratio scale, even when the test pairs differed along different stimulus dimensions from the standard pair. Our previous discrimination experiments with static natural images relied on a similar use of ratings based on a standard pair (e.g., To, Lovell, Troscianko, & Tolhurst, 2008; To et al., 2010).

Figure 1. Schematics of the sequences of frames used to make movie clips. Panel A shows natural movement: alternate frames are taken from the original clip and are played back at about half speed. Panels B–F show various ways (described in the text) of distorting the natural movement. Note that method F results in movie clips with fewer frames than the other five methods.

While our choice of the particular standard pair and the particular standard difference may have affected the magnitudes of the observers' difference ratings to all other stimuli, it is unlikely that a different choice might have modified the rank order across the experiment.

Visual Difference Predictor (VDP) modeling of perceived differences

Comparing a single pair of frames

We have been developing a model of V1 simple and complex cells in order to help explain perceived differences between pairs of static naturalistic images (To et al., 2011; To et al., 2010; Tolhurst et al., 2010). This model is inspired by the cortex transform of Watson (1987), as developed by Watson and Solomon (1997). The model computes how millions of stylized V1 neurons would respond to each of the images under comparison; the responses of the neurons are then compared pairwise to give millions of tiny difference cues, which are combined into a single number by Minkowski summation. This final value is a prediction of an observer's magnitude rating of the difference between those two images (Tolhurst et al., 2010). For present purposes, we are interested only in the monochrome (luminance contrast) version of the model.

Briefly, each of the two images under comparison is convolved with 60 Gabor-type simple-cell receptive field templates: six optimal orientations, five optimal spatial frequencies, and odd- and even-symmetry (each at thousands of different locations within the scene). These linear interim responses in luminance are converted to contrast responses by dividing by the local mean luminance (Peli, 1990; Tadmor & Tolhurst, 1994, 2000). Complex-cell responses are obtained by taking the rms of the responses of paired odd- and even-symmetric simple cells; To et al. (2010) found that a complex-cell model was a better predictor than a simple-cell model of ratings for static images. The complex-cell responses are weighted to match the way that a typical observer's contrast sensitivity depends upon spatial frequency. The response-versus-contrast function becomes sigmoidal when the interim complex-cell responses are divided by the sum of two normalizing terms (Foley, 1994; Tolhurst et al., 2010). First, there is nonspecific suppression (Heeger, 1992), where the response of each neuron is divided by the sum of the responses of all the neurons whose receptive fields are centered on the same point. Second, there is orientation- and spatial-frequency-specific suppression by neurons whose receptive fields surround the field of the neuron in question (Blakemore & Tobin, 1972; Meese, 2004). The details, assumptions, and parameters are given in To et al. (2010) and Tolhurst et al. (2010).
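These stages can be caricatured in a few dozen lines. The sketch below is a heavily simplified stand-in for the published model: the Gabor parameters, exponents, and the single normalization pool are illustrative only (the real model uses two distinct suppressive pools, contrast-sensitivity weighting, and local-luminance contrast coding, with parameters given in To et al., 2010, and Tolhurst et al., 2010).

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(sf, theta, phase, size=32):
    """One simple-cell receptive field (spatial frequency in cycles/pixel)."""
    y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * (size / 6) ** 2))
    return envelope * np.cos(2 * np.pi * sf * xr + phase)

def complex_cells(img, sfs=(0.05, 0.1, 0.2), n_orient=6):
    """Energy (rms of an odd/even pair) per frequency-orientation channel."""
    out = []
    for sf in sfs:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            even = fftconvolve(img, gabor(sf, theta, 0.0), mode="same")
            odd = fftconvolve(img, gabor(sf, theta, np.pi / 2), mode="same")
            out.append(np.hypot(even, odd))
    return np.stack(out)                       # channels x rows x cols

def vdp(img1, img2, p=2.4, q=2.0, sigma=1.0, m=3.0):
    r1, r2 = complex_cells(img1), complex_cells(img2)
    # One nonspecific divisive pool stands in here for the model's two
    # suppressive terms (point-wise nonspecific plus tuned surround).
    n1 = r1 ** p / (sigma + r1.sum(axis=0, keepdims=True)) ** q
    n2 = r2 ** p / (sigma + r2.sum(axis=0, keepdims=True)) ** q
    # Minkowski summation of the many small response differences.
    return (np.abs(n1 - n2) ** m).sum() ** (1.0 / m)
```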

Figure 2. (A) The low-pass spatial filter applied to the raw movie clips to make low-pass movies. (B) The high-pass spatial filter; note that it does retain 0 c/°, the average brightness level. (C–E) Single frames taken from three of the movie families, showing, from left to right, the raw movie, the low-pass version, and the high-pass version.

Comparing all the frames in a pair of movie clips

Such VDP modeling is confined to extracting a single predicted difference value from comparing two images. Here we elaborate the VDP to comparing two movie clips, each of which comprises 150 possibly different frames, or brief static images. We have investigated three different methods of breaking movies into static image pairs for application of the VDP model. These methods tap different spatio-temporal aspects of any differences between the paired movies and are shown schematically in Figure 3.

The Spatial Model: Stepwise between-movie comparison (Figure 3A). Each of the 150 frames in one movie is compared with the corresponding frame of the other movie, giving 150 VDP outputs, which are then combined by Minkowski summation into a single number (Watson, 1998; Watson et al., 2001). This is one prediction of how different the two movie clips would seem to be. The VDP could show differences for two reasons. First, if the movies are essentially of the same scene but are moving temporally at a different rate, the compared frames will get out of synchrony. Second, at an extreme, the VDP would show differences even if the two movie clips were actually static but had different spatial content. This method of comparing frames will be sensitive to differences in spatial structure as well as to differences in dynamics, and is a measure of overall difference in spatio-temporal structure.

The Temporal Model: Stepwise within-movie comparison (Figure 3B). Here the two movies are first analyzed separately. Each frame in a movie clip is compared with the successive frame in the same movie; with 150 frames, this gives 149 VDP comparison values, which are combined to a single value by Minkowski summation. This is a measure of the overall dynamic content of the movie clip, and is calculated for each of the paired movies separately. The difference between the values for the two clips is a measure of the difference in overall dynamic content. It would be uninfluenced by any spatial differences in scene content.

The Ordered-Temporal Model: Hybrid two-step movie comparison (Figure 3C). The previous measure gives overall dynamic structure but may not, for instance, illustrate whether particular dynamic events occur at different times within the two movie clips. Here each frame in a movie clip is compared directly with the succeeding frame (as before), but this dynamic frame difference is now compared with the dynamic difference between the two equivalent frames in the other movie clip. So, within each movie, the nth frame is compared to the (n+1)th frame, and then these two differences (the nth difference from each movie) are compared to generate a between-movie difference. The 149 between-movie differences are combined, again by Minkowski summation. This gives a measure of the perceived difference in ordered dynamic content, irrespective of any spatial scene-content differences. This model has the advantage of being sensitive to dynamic changes occurring both within each movie and between the two.

Figure 3. Schematics showing the three methods we used to compare movie clips, based on the idea that we should compare pairs of movie frames directly, either between the movies (A) or within movies (B, C). See details in the text.
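In outline, the three rules can be expressed over any pairwise frame comparator (such as the VDP sketch above). One point to hedge: in this sketch each within-movie change is reduced to a scalar before the between-movie comparison in the Ordered-Temporal rule, which is one reading of the description above; names and the Minkowski exponent are illustrative.

```python
import numpy as np

def minkowski(values, m=3.0):
    values = np.asarray(values, dtype=float)
    return (np.abs(values) ** m).sum() ** (1.0 / m)

def spatial_model(a, b, compare, m=3.0):
    # Between-movie: frame n of A against frame n of B, Minkowski-summed.
    return minkowski([compare(fa, fb) for fa, fb in zip(a, b)], m)

def temporal_model(a, b, compare, m=3.0):
    # Within-movie: summarize each movie's 149 successive-frame differences,
    # then take the difference between the two summary values.
    dyn = [minkowski([compare(mov[i], mov[i + 1]) for i in range(len(mov) - 1)], m)
           for mov in (a, b)]
    return abs(dyn[0] - dyn[1])

def ordered_temporal_model(a, b, compare, m=3.0):
    # Hybrid: the nth within-movie change of A against the nth of B.
    diffs = [abs(compare(a[i], a[i + 1]) - compare(b[i], b[i + 1]))
             for i in range(len(a) - 1)]
    return minkowski(diffs, m)
```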

Modeling realistic impulse functions

Direct VDP comparison of paired frames is not a realistic model, since the human visual system is almost certainly unable to code each 8-ms frame as a separate event. As a result, it is necessary to model human dynamic sensitivity. Thus, before performing the paired VDP comparisons, we convolved the movie clips with two different impulse functions (Figure 4A), representing "sustained" (P-cell; red curve) or "transient" (M-cell; blue curve) channels. The impulse functions have shape and duration that match the temporal-frequency sensitivity functions (Figure 4B) inferred for the two kinds of channel by Kulikowski and Tolhurst (1973). The three methods for comparing movie frames were performed separately on the movies convolved with the sustained and transient impulse functions, giving six potential predictors of human performance. Note that, in the absence of realistic temporal filtering, the analysis would effectively be with a very brief sustained impulse function lasting no more than 8 ms (compared to the 100-ms duration of the realistic sustained impulse; Figure 4A).
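This prefiltering step amounts to convolving each pixel's time course with the channel's impulse function before any frame comparisons. The kernels below are only qualitative stand-ins (a monophasic decay for sustained, a biphasic difference-of-exponentials for transient); the time constants are assumed, not the fitted Kulikowski and Tolhurst (1973) functions.

```python
import numpy as np

def impulse(kind: str, fps: int = 120, duration_s: float = 0.1) -> np.ndarray:
    t = np.arange(int(duration_s * fps)) / fps
    if kind == "sustained":
        kernel = np.exp(-t / 0.03)                 # monophasic, ~100-ms support
    else:
        tau1, tau2 = 0.015, 0.045                  # transient: biphasic kernel
        kernel = np.exp(-t / tau1) / tau1 - np.exp(-t / tau2) / tau2
    return kernel / np.abs(kernel).sum()

def temporal_filter(movie: np.ndarray, kind: str) -> np.ndarray:
    """Causally filter a frames x rows x cols movie along the frame axis."""
    k = impulse(kind)
    return np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="full")[:len(v)], 0, movie)
```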

Experimental results

Reliability of the observers' ratings

Seven naïve observers were presented with 198 pairs of monochrome movies and were asked to rate how different they perceived the two movie clips in each pair to be. We have previously shown, in To et al. (2010), that observers' ratings for static natural images are reliable. To verify the robustness of the difference ratings for movie pairs in the present experiment, we considered the correlation coefficient between the ratings given by any one observer and those given by each other observer; this value ranged from 0.36 to 0.62 (mean 0.50).

In our previous visual discrimination studies (To et al., 2010), we have scaled and averaged the difference ratings of all seven observers to each stimulus pair (see Methods), and have used the averages to compare with various model predictions. Although this procedure might remove any potential between-observer differences in strategy, it was used to average out within-observer variability. As in our previous experiments and analyses, averaging together the results of observers produced a reliable dataset for modeling.

When observers' averaged ratings for unfiltered, high-pass, and low-pass movies were compared (see Figure 5), we found that the ratings for filtered movies were generally lower than those for unfiltered movies. This would seem understandable, as unfiltered movies contain more details.

Observers' ratings versus model predictions

The average of the normalized ratings for each movie pair was compared with the predictions obtained from a variety of different models of V1 complex-cell processing. The correlation coefficients between the ratings and predictions are presented in Table 1. The different versions of the model all rely ultimately on a Visual Difference Predictor (To et al., 2010; Tolhurst et al., 2010), which calculates the perceived difference between a pair of frames. Given that the movie clips are composed of 150 frames each, there are a number of ways of choosing paired frames to compare a pair of movie clips (see Methods). The Spatial Model takes each frame in one movie and compares it with the equivalent frame in the second movie, and is sensitive to differences in spatial content and dynamic events in the movie clips. The Temporal Model compares successive frames within a movie clip to assess the overall dynamic content of a clip. The Ordered-Temporal Model is similar, but it compares the timings of dynamic events between movie clips.

Figure 4. (A) The impulse functions used to convolve the movie clips: sustained (red) and transient (blue). These impulse functions have the temporal-frequency filter curves shown in (B), and are like those proposed by Kulikowski and Tolhurst (1973) for sustained and transient channels.

Figure 5. The averaged ratings given by observers for the spatially filtered movies are plotted against the ratings for the raw, unfiltered versions of the movies: low-pass filtered (blue) and high-pass filtered (red).

Without temporal filtering

We first used the VDP to compare pairs of frames as they were actually presented on the display; this presumes, unrealistically, that an observer could treat each 8-ms frame as a separate event and that the human contrast sensitivity function (CSF) is flat out to about 60 Hz. Overall, the best predictions by just one model were generated by the Ordered-Temporal Model (r = 0.438). The difference predictions were slightly improved when the three models were combined into a multiple linear regression model, and the correlation coefficient increased to r = 0.507. When considering the accuracy of the models' predictions in relation to the different types of movies presented (i.e., spatially unfiltered, low-pass, and high-pass), the predictions were consistently superior for the high-pass movies, and the poorest predictions were of the difference ratings for the low-pass movies. More specifically, in the case of the Multilinear Model, the correlation values between predictions and measured ratings for unfiltered, low-pass, and high-pass movies were 0.608, 0.392, and 0.704, respectively.

Realistic temporal filtering

The predictions discussed so far were generated by models that compare the movie pairs frame by frame, without taking into account the temporal properties of V1 neurons. Now we examine whether the inclusion of sustained and/or transient temporal filters (Figure 4) improves the models' performance at predicting perceived differences between two movie clips.

The Multilinear Model based only on the sustained impulse function caused little improvement, overall (r = 0.509 now vs. 0.507 originally) or separately for the three different kinds of spatially filtered movie clips; the predictions given by the different models separately for the different spatially filtered movie versions were also little improved. In particular, the fit for low-pass movies remained poor (r = 0.408 vs. r = 0.392). This is hardly surprising, given that low spatial frequencies are believed to be detected primarily by transient channels rather than sustained channels (Kulikowski & Tolhurst, 1973). However, it is worth noting that, for the high-pass spatially filtered movies, the realistic sustained impulse improves the predictions of the Spatial Model (r increases from 0.336 to 0.582) but lessens the predictions of the Ordered-Temporal Model (r decreases from 0.584 to 0.488). This would seem to be consistent with the idea that higher spatial frequencies are predominantly processed by sustained channels (Kulikowski & Tolhurst, 1973).

As expected, when the modeling was based on the transient impulse function alone, with its substantially different form (Figure 4), the nature of the correlations changed. In particular, it generated much stronger fits for the low-pass spatially filtered movies (r = 0.636 vs. r = 0.408 in the previous case). However, the fits for high-pass movies were noticeably weaker (r = 0.675 vs. r = 0.749). When taking all the movies into account, the improvement of the predictions for the low-pass movies meant that the overall correlation for the multilinear fit improved from 0.509 to 0.598. It therefore becomes clear that the sustained and transient filters operate quite specifically to improve the predictions for only one type of movie: high-pass movies in the case of sustained impulse functions, and low-pass movies in the case of transient impulse functions.

Models                           Transient and/or   All movies   Unfiltered   Low-pass   High-pass
                                 sustained          (n = 162)    (n = 54)     (n = 54)   (n = 54)

Spatial                          –                  0.288        0.338        0.274      0.336
Temporal                         –                  0.357        0.409        0.102      0.546
Ordered-Temporal                 –                  0.438        0.505        0.266      0.584
Multilinear (three parameters)   –                  0.507        0.608        0.392      0.704
Spatial                          S                  0.396        0.425        0.306      0.582
Temporal                         S                  0.299        0.292        0.241      0.414
Ordered-Temporal                 S                  0.415        0.503        0.370      0.488
Multilinear (three parameters)   S                  0.509        0.631        0.408      0.749
Spatial                          T                  0.453        0.391        0.444      0.565
Temporal                         T                  0.454        0.518        0.481      0.425
Ordered-Temporal                 T                  0.342        0.311        0.301      0.469
Multilinear (three parameters)   T                  0.598        0.627        0.636      0.675
Multilinear (six parameters)     T + S              0.759        0.788        0.702      0.806

Table 1. The correlation coefficients between the observers' ratings (average of 7 per stimulus) and various V1-based models. The correlations for the spatially unfiltered, the low-pass, and the high-pass filtered subsets are shown separately, as well as the fits for the entire set of 162 movie pairs. Correlations are shown for raw movies and for temporally filtered (sustained and transient) models. The final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function.


Finally, the predictions of the V1-based model were substantially improved by incorporating both the transient and sustained filters, that is, in a multiple linear fit with six parameters (three ways of comparing frames, for each of the sustained and transient temporal filters). Figure 6 plots the observers' averaged ratings against the six-parameter model predictions. The correlation coefficient between the predictions generated by this six-parameter multilinear model and observers' measured ratings was 0.759 for the full set of 162 movie pairs, a very substantial improvement over our starting value of 0.507 (p = 0.0001, n = 162). The correlations for the six-parameter fits also improved separately for the spatially unfiltered (black symbols), the low-pass (blue), and the high-pass movies (red). The greatest improvement from our starting point without temporal filtering was in the multilinear fit for the low-pass spatially filtered movies (r increases from 0.392 to 0.702).
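Numerically, the six-parameter fit is ordinary least squares of the averaged ratings on the six model outputs plus an intercept. A sketch, with array names assumed (`predictors` a 162 × 6 array, one column per model/impulse combination, and `ratings` a length-162 vector):

```python
import numpy as np

def multilinear_fit(predictors: np.ndarray, ratings: np.ndarray):
    design = np.column_stack([np.ones(len(ratings)), predictors])  # intercept + 6 terms
    coefs, *_ = np.linalg.lstsq(design, ratings, rcond=None)
    fitted = design @ coefs
    r = np.corrcoef(fitted, ratings)[0, 1]    # e.g., 0.759 for the full fit
    return coefs, r
```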

Discussion

We have investigated whether it is possible to model how human observers rate the differences between two dynamic naturalistic scenes. Observers were presented with pairs of movie clips that differed on various dimensions, such as glitches and speed, and were asked to rate how different the clips appeared to them. We then compared the predictions from a variety of V1-based computational models with the measured ratings. All the models were based on multiple comparisons of pairs of static frames, but the models differed in how pairs of frames were chosen for comparison. The pairwise VDP comparison is the same as we have reported for static photographs of natural scenes (To et al., 2010), but augmented by prior convolution in time of the movie clips with separate, realistic sustained and transient impulse functions (Kulikowski & Tolhurst, 1973).

The use of magnitude estimation ratings in visual psychophysical experiments might be considered too subjective and hence not very reliable. We have previously shown that ratings are reliable (To et al., 2010): repeated ratings within subjects are highly correlated/reproducible (even when the stimulus set was 900 image pairs), and the correlation for ratings between observers is generally good. We have chosen this method in this and previous studies for three reasons. First, when observers generate ratings of how different pairs of stimuli appear to them, they have the freedom to rate the pairs as they please, regardless of which features were changed. The ratings therefore allow us to measure discrimination across and independently of feature dimensions, separately and in combination (To et al., 2008; To et al., 2011). Second, unlike threshold measurements, ratings can be collected quickly. We are interested in studying vision with naturalistic images and naturalistic tasks, and it is a common criticism of such study that natural images are so varied that we must always study very many (198 movie pairs in this study). Using staircase threshold techniques for so many stimuli would be impractically slow. Third, ratings are not confined to threshold stimuli, and we consider it necessary to be able to measure visual performance with suprathreshold stimuli as well. Overall, averaging the ratings of several observers together has provided us with data sets that are certainly good enough to allow us to distinguish between different V1-based models of visual coding (To et al., 2010; To et al., 2011). Although our study is not aimed toward the invention of image quality metrics, we have essentially borrowed our ratings technique from that applied field.

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.

We have investigated a number of different measures for comparing the amount and timing of dynamic events within a pair of movies. Our Spatial Model computed the differences between corresponding frames in each clip in the movie pair and then combined the values using Minkowski summation. This is the method used by Watson (1998) and Watson et al. (2001) in their model for evaluating the degree of perceived distortion in compressed video streams. It seems particularly appropriate in such a context, where the compressed video would have the same fundamental spatial and synchronous temporal structure as the uncompressed reference. In a sense, this measure imagines that an observer views and memorizes the first movie clip and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated, and the greater the measured difference. Others have suggested that video distortion metrics would be enhanced by using additional measures where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward- and a rightward-moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatio-temporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to map specifically onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip that has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass spatial filtering or to high-pass filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways), with their different spatial frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames, for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial information as well as temporal, and the ordered-temporal/transient factor, handling the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.

Our VDP is intended to be a model of low-level visual processing: as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), the transient and sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002), and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high spatial frequency information will be biased towards sustained processing, while low spatial frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for the latter case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section, because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials we discuss this further: by truncating the longer clip of the pair to the same length as the shorter one, we were able to calculate the predictive measures. The resulting six-parameter multilinear regression for all stimuli had a poorer fit than for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral-movement-sensing elements in our model, either separately or in addition to the temporal impulse functions, and a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project, and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69, doi:10.1109/TIP.2012.2210727.

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640, doi:10.1126/science.1089506.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19, http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2. [PubMed] [Article]

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280, doi:10.1016/j.neuroimage.2009.03.078.

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Vision Research, 44(15), 1733–1755.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306, doi:10.1098/rspb.1992.0162.

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14, http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37. [PubMed] [Article]

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943, http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8. [PubMed] [Article]

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402, doi:10.1146/annurev.ne.16.030193.002101.

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798, doi:10.1523/JNEUROSCI.4044-06.2006.

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372, doi:10.1038/nrn2619.

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808, doi:10.1093/cercor/bhr078.

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755, doi:10.1016/j.visres.2004.03.003.

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446, doi:10.1038/380442a0.

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16. [PubMed] [Article]

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308, doi:10.1098/rspb.2008.0692.

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372, doi:10.1098/rspb.2010.1888.

To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2014). Transient vs. sustained: Modeling temporal impulse functions in visual discrimination of dynamic natural images. Perception, 43, 478.

To, M. P. S., Lovell, P. G., Troscianko, T., Gilchrist, I. D., Benton, C. P., & Tolhurst, D. J. (2012). Modeling human discrimination of moving images. Perception, 41(10), 1270–1271.

To, M. P. S., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2010). Perception of suprathreshold naturalistic changes in colored natural images. Journal of Vision, 10(4):12, 1–22, http://www.journalofvision.org/content/10/4/12, doi:10.1167/10.4.12. [PubMed] [Article]

Tolhurst, D. J. (1973). Separate channels for the analysis of the shape and the movement of a moving visual stimulus. Journal of Physiology, 231(3), 385–402.

Tolhurst, D. J. (1975). Sustained and transient channels in human vision. Vision Research, 15, 1151–1155.

Tolhurst, D. J., & Thompson, I. D. (1981). On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proceedings of the Royal Society B: Biological Sciences, 213(1191), 183–199.

Tolhurst, D. J., To, M. P. S., Chirimuuta, M., Troscianko, T., Chua, P. Y., & Lovell, P. G. (2010). Magnitude of perceived change in natural images may be linearly proportional to differences in neuronal firing rates. Seeing and Perceiving, 23(4), 349–372.

Troscianko, T., Meese, T. S., & Hinde, S. (2012). Perception while watching movies: Effects of physical screen size and scene type. i-Perception, 3(7), 414–425, doi:10.1068/i0475aap.

Vidyasagar, T. R., Kulikowski, J. J., Lipnicki, D. M., & Dreher, B. (2002). Convergence of parvocellular and magnocellular information channels in the primary visual cortex of the macaque. European Journal of Neuroscience, 16(5), 945–956.

Vu, P. V., & Chandler, D. M. (2014). ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices [Online supplement]. Journal of Electronic Imaging, 23(1), 013016.

Wang, Z., & Li, Q. (2007). Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24(12), B61–B69.

Watson, A. B. (1987). Efficiency of a model human image code. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 4(12), 2401–2417.

Watson, A. B. (1998). Toward a perceptual video quality metric. Human Vision, Visual Processing, and Digital Display VIII, 3299, 139–147.

Watson, A. B., Hu, J., & McGowan, J. F., III. (2001). Digital video quality metric based on human vision. Journal of Electronic Imaging, 10, 20–29.

Watson, A. B., & Solomon, J. A. (1997). Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 14(9), 2379–2391.

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • Experimental results
  • f04
  • f05
  • t01
  • Discussion
  • f06
  • Adelson1
  • Blakemore1
  • Borji1
  • Derrington1
  • Foley1
  • Gouras1
  • Hasson1
  • Heeger1
  • Hess1
  • Hirose1
  • Horiguchi1
  • Itti1
  • Johnston1
  • Keesey1
  • Kulikowski1
  • Lovell1
  • Meese1
  • Merigan1
  • Nassi1
  • Nassi2
  • Ninomiya1
  • Peli1
  • Perrone1
  • Sawatari1
  • Seshadrinathan1
  • Simoncelli1
  • Smith1
  • Tadmor1
  • Tadmor2
  • To1
  • To2
  • To3
  • To4
  • To5
  • Tolhurst1
  • Tolhurst2
  • Tolhurst3
  • Tolhurst4
  • Troscianko1
  • Vidyasagar1
  • Vu1
  • Wang1
  • Watson1
  • Watson2
  • Watson3
  • Watson4

on what aspects of the videos (e.g., spatial, temporal, and/or spatio-temporal) they should rely on. They were told that if they perceived the difference between a test movie pair to be less than, equal to, or greater than the standard pair, their rating should obviously be less than, equal to, or greater than 20, respectively. They were to use a ratio scale, so that if, for instance, a given movie pair seemed to have a difference twice as great as that of the standard pair, they would assign a value twice as large to that pair (i.e., 40). No upper limit was set, so that observers could rate large differences as highly as they saw fit. Observers were also told that sometimes movie pairs might be identical, in which case they should set the rating to zero (in fact, all the movie pairs did differ to some extent).

Before an experiment, each observer underwent a training session in which they were asked to rate 50 pairs of movie clips containing the various types of differences that could be presented to them later in the experiment. All movie clips (apart from the standard movie pair) used in the demonstration or training phases were made from three original movies that were different from the seven used in the testing phase proper.

The testing phase was divided into three blocks of 66 image pairs. Each block lasted about 15 min and started with the presentation of the standard movie pair, which was subsequently presented after every 10 trials to remind the observers of the standard difference of 20. The image presentation sequence was randomized differently for each observer.

Data collation and analysis

For each observer, their 198 magnitude ratings were normalized by dividing by that observer's average rating. Then, for each stimulus pair, the normalized ratings of the seven observers were averaged together. These 198 averages of normalized ratings were then multiplied by a single scalar value to bring their grand average up to the same value as the grand average of all the ratings given by all the observers in the whole experiment. Our experience (To, Lovell, Troscianko, & Tolhurst, 2010) is that, generally, averages of normalized ratings have a lower coefficient of variation than do averages of raw ratings.
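As a concrete illustration, this collation procedure amounts to a few array operations; the following is a minimal sketch in Python (the random numbers merely stand in for the real 7 × 198 table of ratings):

import numpy as np

# ratings: one row per observer, one column per movie pair (7 x 198).
rng = np.random.default_rng(0)
ratings = rng.gamma(shape=2.0, scale=10.0, size=(7, 198))  # placeholder data

# 1. Normalize each observer by dividing by that observer's mean rating.
normalized = ratings / ratings.mean(axis=1, keepdims=True)

# 2. Average the normalized ratings across observers, per stimulus pair.
pair_means = normalized.mean(axis=0)

# 3. Rescale so the grand average matches that of the raw ratings.
pair_means *= ratings.mean() / pair_means.mean()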

Observers

The experiment tested seven observers recruited from the student or postdoctoral researcher populations at the University of Cambridge, UK. They all gave informed consent and had normal vision after prescription correction, as verified using a Landolt C acuity chart and the Ishihara color test (10th edition).

Construction of stimuli

Original movies

The monochrome test stimuli were constructed from six original color video sequences, each lasting 10 s. Three of the stimulus movies were of "environmental movement," e.g., grass blowing, or water rippling or splashing. One was of a duck swimming on a pond. The other two were of people's hands: one person was peeling a potato, and the other was demonstrating sign language for the deaf. A seventh video sequence was used to make a standard video pair.

The video sequences were originally taken at 200 fps with a Pulnix TMC-6740 CCD camera (JAI-Pulnix A/S, Copenhagen, Denmark; Lovell, Gilchrist, Tolhurst, & Troscianko, 2009). Each frame was saved as a separate uncompressed image, 10 bits in each of three color planes, on a Bayer matrix of 640 × 480 pixels. Each frame image was converted to a floating-point 640 × 480 pixel RGB image file, with values corrected for any nonlinearities in the luminance versus output relation. Each image was then reduced to 320 × 240 pixels by taking alternate rows and columns (the native pixelation of R and B in the Bayer matrix), and was converted to monochrome by averaging together the three color planes. Finally, the images were cropped to 240 × 240 pixels. Some very bright highlights were removed by clipping the brightest image pixels to a lower value.

Stimulus movie clips

From each of the 10-s, 200-fps movies, we took two nonoverlapping sequences of 300 frames; these will be referred to as clips A and B. For each of the six testing movies, the 300-frame clips A and B were each subject to six temporal manipulations (shown schematically in Figure 1):

A. Alternate frames were simply discarded to leave 150 frames, to be displayed at 120 fps (Figure 1A). These had natural movement. These clips were the references for pairwise rating comparison with the modified clips below.

B. A glitch or pause was inserted midway, by repeating one frame image near the middle of the movie for a number of frames before jumping back to the original linear sequence of frames (Figure 1B). The duration of the glitch was varied from 83 to 660 ms between the six movies and between clips A and B. After discarding alternate frames, the modified clips still lasted 150 frames, and still began and ended on the same frame images.

C. The movie was temporally coarse-quantized, to make it judder throughout, by displaying every nth frame n times (Figure 1C). The degree of temporal coarse quantization (effectively reducing the video clip from 120 fps to 6–25 fps) varied between stimuli, but all clips still lasted 150 frames and still began and ended on the same frame images.

D. The speed of the middle part of the movie was increased by skipping frames (Figure 1D). The degree of skipping (a factor of 1.2–3.7) and the duration (from 200–500 ms) of the speeded part varied between clips and movies. After discarding alternate frames, the speeded movies still comprised 150 frames; they began at the same point as the natural movement clips, but they ended further through the original 10-s video sequence.

E. The speed of the middle part of the movie could be reduced by replicating frames in the middle of the clip (Figure 1E). The degree of replication (speed reduced by a factor of 0.25–0.8) and the duration (from 300–660 ms) of the slowed part varied between clips and movies. After discarding alternate frames, the slowed movies still comprised 150 frames overall; they began at the same point as the natural movement clips, but they ended at an earlier point in the original sequence.

F. The speed of the middle part of the movie could be increased as in Figure 1D, except that, once normal speed was resumed, frames were added only until the movie reached the same point as in the natural movement controls; thus, these speeded-short movies lasted less than 150 frames (Figure 1F).

The magnitudes and durations of the manipulations varied between movies and between clips A and B. Of the five manipulations, the first four (B through E) left the modified clips with 150 frames. However, in the case of the last (F), there was a change in the total number of frames; these stimuli were excluded from most of our modeling analysis (see Supplementary Materials). For a given original video sequence, the frames in all the derived video clips were scaled by a single maximum, so that the brightest single pixel in the whole set was 255. In summary, for each original video sequence, we had two clips and five modified variants of each clip.
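To make the frame bookkeeping concrete, manipulations B through E can be written as operations on the list of 300 original frame indices, applied before the alternate-frame decimation. This is a minimal sketch with integer parameters only (the experiment also used fractional speed factors, which require resampling; the specific values per clip are those given above):

import numpy as np

def glitch(idx, at, dur):
    # B: hold one frame for dur frames; playback then resumes in place,
    # so the clip still ends on the same frame image.
    out = idx.copy()
    out[at:at + dur] = idx[at]
    return out

def quantize(idx, n):
    # C: judder by displaying every nth frame n times.
    return np.repeat(idx[::n], n)[:len(idx)]

def speed_up(idx, start, stop, step):
    # D: skip frames in the middle, then resume normal speed; the clip keeps
    # its length by running further into the 10-s source sequence.
    fast = idx[start:stop:step]
    n_rest = len(idx) - start - len(fast)
    rest = np.arange(fast[-1] + 1, fast[-1] + 1 + n_rest)
    return np.concatenate([idx[:start], fast, rest])

def slow_down(idx, start, stop, rep):
    # E: replicate frames in the middle to slow the action; the clip keeps
    # its length but ends at an earlier point in the source sequence.
    slow = np.repeat(idx[start:stop], rep)
    return np.concatenate([idx[:start], slow, idx[stop:]])[:len(idx)]

# Example: a glitch of 40 source frames (about 167 ms at the 120-fps display
# rate), followed by the alternate-frame decimation to 150 frames.
clip = glitch(np.arange(300), at=150, dur=40)[::2]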

In the experiment, each clip was paired against its five variants, and the natural clip A was also paired against natural clip B. This resulted in 11 pairs of clips per video sequence, for a total of 66 pairs in the complete experiment. Some examples of movie clip pairs are given in the Supplementary Materials.

Spatial frequency filtering

In addition to the temporal manipulations, each of the 66 pairs of video clips was replicated with (a) low-pass spatial filtering and (b) high-pass spatial filtering. The frames of each clip (natural and modified) were filtered with very steep filters in the spatial frequency domain (Figure 2A, B). The low-pass filter cut off at 1.08 c/°, while the high-pass filter cut in at 2.17 c/° but also included 0 c/°, retaining the average luminance of that frame. After filtering, the frames were scaled by the same factor used for the unfiltered clips. Overall, there were thus 3 × 66 movie-clip pairs used in the experiment, or 198.

Some examples of single frames from spatially filtered and unfiltered movie clips are shown in Figure 2C through E.
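Steep frequency-domain filtering of this kind can be sketched as follows. Here the filter is hard-edged for brevity (the filters actually used were merely "very steep"), and the pixels-per-degree value is an assumed calibration, not a figure quoted in the text:

import numpy as np

def spatial_filter(frame, pix_per_deg, cutoff, kind="low"):
    # Filter one frame with a hard-edged mask in the 2-D frequency domain.
    h, w = frame.shape
    fy = np.fft.fftfreq(h) * pix_per_deg   # cycles/deg, vertical
    fx = np.fft.fftfreq(w) * pix_per_deg   # cycles/deg, horizontal
    radius = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    if kind == "low":
        mask = radius <= cutoff                     # pass up to the cut-off
    else:
        mask = (radius >= cutoff) | (radius == 0)   # cut-in, keeping the DC term
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * mask))

# e.g., with an assumed calibration of ~19 pixels/deg for the 240 x 240 frames:
# low  = spatial_filter(frame, 19.0, 1.08, kind="low")
# high = spatial_filter(frame, 19.0, 2.17, kind="high")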

Spatio-temporal blending

A fuzzy border (30 pixels, or 1.58°, wide) was applied around each edge of each square frame image to blend it spatially into the gray background. The 30 pixels covered half a Gaussian with a spread of 10 pixels. The onset and offset of the movie clip were ramped up and down in contrast to blend the clip temporally into the background interstimulus gray. The contrast ramps were raised cosines lasting 25 frames (208 ms).
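Both blends are multiplicative windows applied to the contrast of the clip (i.e., to its deviation from the background gray); a minimal sketch under the parameters quoted above:

import numpy as np

def edge_window(size=240, border=30, sigma=10.0):
    # Half-Gaussian fade over the outer 30 pixels of each edge of the frame.
    ramp = np.ones(size)
    d = np.arange(border)                              # distance from the edge
    ramp[:border] = np.exp(-0.5 * ((border - d) / sigma) ** 2)
    ramp[-border:] = ramp[:border][::-1]
    return np.outer(ramp, ramp)                        # separable 2-D window

def contrast_ramp(n_frames=150, ramp_len=25):
    # Raised-cosine onset/offset ramps (25 frames = 208 ms at 120 fps).
    w = np.ones(n_frames)
    t = 0.5 * (1 - np.cos(np.pi * np.arange(ramp_len) / ramp_len))
    w[:ramp_len] = t
    w[-ramp_len:] = t[::-1]
    return w

# clip: n_frames x 240 x 240 array; gray: the background luminance.
# blended = gray + (clip - gray) * edge_window()[None] * contrast_ramp()[:, None, None]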

Standard stimulus pair

In addition to the 198 test pairs, a standard pair was generated from a seventh video sequence, of waves on water. The two clips in this pair differed by a noticeable, but not extreme, glitch or pause (291 ms) midway in one of the pair. The magnitude of this difference was defined as 20 in the ratings experiment. Observers had to rely on this 20 difference to generate their ratings (see above).

Figure 1. Schematics of the sequences of frames used to make movie clips. Panel A shows natural movement: alternate frames are taken from the original clip and are played back at about half speed. Panels B–F show various ways (described in the text) for distorting the natural movement. Note that method F results in movie clips with fewer frames than the other five methods.

This standard movie pair served as a common anchor, and its purpose was to facilitate observers' use of the same reference point. Its role would be especially important at the beginning of the experiments, when observers generated an internal rating scale. This would prevent a situation in which, for example, an observer might rate some pairs using some maximum allowed value, only to come across subsequent pairs that they perceived to be even more different. This standard pair was made from a different movie clip from the test stimuli and was repeatedly presented to the observer throughout the experiments, in the demonstration, training, and testing phases. Observers were instructed that their ratings of the subjective difference between any other movie pair must be based on this standard pair, using a ratio scale, even when the test pairs differed along different stimulus dimensions from the standard pair. Our previous discrimination experiments with static natural images relied on a similar use of ratings based on a standard pair (e.g., To, Lovell, Troscianko, & Tolhurst, 2008; To et al., 2010).

While our choice of the particular standard pair and the particular standard difference may have affected the magnitudes of the observers' difference ratings to all other stimuli, it is unlikely that a different choice would have modified the rank order across the experiment.

Visual Difference Predictor (VDP) modeling of perceived differences

Comparing a single pair of frames

We have been developing a model of V1 simple and complex cells in order to help explain perceived differences between pairs of static naturalistic images (To et al., 2011; To et al., 2010; Tolhurst et al., 2010). This model is inspired by the cortex transform of Watson (1987), as developed by Watson and Solomon (1997). The model computes how millions of stylized V1 neurons would respond to each of the images under comparison; the responses of the neurons are then compared pairwise to give millions of tiny difference cues, which are combined into a single number by Minkowski summation. This final value is a prediction of an observer's magnitude rating of the difference between those two images (Tolhurst et al., 2010). For present purposes, we are interested only in the monochrome (luminance contrast) version of the model.
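For reference, if the i-th of the N paired neurons gives a small difference cue d_i, the Minkowski sum is

\[ D \;=\; \Bigl( \sum_{i=1}^{N} \lvert d_i \rvert^{m} \Bigr)^{1/m}, \]

where the exponent m is a fitted parameter whose value is given in the cited papers (m = 1 would be simple linear summation, and m → ∞ would let the single largest cue dictate the prediction).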

Briefly, each of the two images under comparison is convolved with 60 Gabor-type simple cell receptive field templates: six optimal orientations, five optimal spatial frequencies, and odd- and even-symmetry (each at thousands of different locations within the scene). These linear interim responses in luminance are converted to contrast responses by dividing by the local mean luminance (Peli, 1990; Tadmor & Tolhurst, 1994, 2000). Complex cell responses are obtained by taking the rms of the responses of paired odd- and even-symmetric simple cells; To et al. (2010) found that a complex-cell model was a better predictor than a simple-cell model for ratings for static images. The complex-cell responses are weighted to match the way that a typical observer's contrast sensitivity depends upon spatial frequency. The response versus contrast function becomes sigmoidal when the interim complex cell responses are divided by the sum of two normalizing terms (Foley, 1994; Tolhurst et al., 2010). First, there is nonspecific suppression (Heeger, 1992), where the response of each neuron is divided by the sum of the responses of all the neurons whose receptive fields are centered on the same point. Second, there is orientation- and spatial-frequency-specific suppression by neurons whose receptive fields surround the field of the neuron in question (Blakemore & Tobin, 1972; Meese, 2004). The details, assumptions, and parameters are given in To et al. (2010) and Tolhurst et al. (2010).
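A schematic form of this normalization stage (the symbols below are ours; the exact exponents, constants, and pooling weights are those fitted in To et al., 2010, and Tolhurst et al., 2010) is

\[ R_i \;=\; \frac{E_i^{\,p}}{z^{\,q} \;+\; \alpha \sum_{j \in \mathrm{local}(i)} E_j^{\,q} \;+\; \beta \sum_{k \in \mathrm{surround}(i)} E_k^{\,q}}, \]

where E_i is the contrast-sensitivity-weighted excitation of the i-th complex cell, z governs where the response function accelerates, the first sum is the nonspecific pool of neurons centered on the same point, and the second is the orientation- and spatial-frequency-specific surround pool.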

Figure 2. (A) The low-pass spatial filter applied to the raw movie clips to make low-pass movies. (B) The high-pass spatial filter; note that it does retain 0 c/°, the average brightness level. (C–E) Single frames taken from three of the movie families, showing, from left to right, the raw movie, the low-pass version, and the high-pass version.

Comparing all the frames in a pair of movie clips

Such VDP modeling is confined to extracting a single predicted difference value from comparing two images. Here, we elaborate the VDP to comparing two movie clips, each of which comprises 150 possibly different frames or brief static images. We have investigated three different methods of breaking movies into static image pairs for application of the VDP model. These methods tap different spatio-temporal aspects of any differences between the paired movies and are shown schematically in Figure 3.

The Spatial Model: Stepwise between-movie comparison (Figure 3A). Each of the 150 frames in one movie is compared with the corresponding frame of the other movie, giving 150 VDP outputs, which are then combined by Minkowski summation into a single number (Watson, 1998; Watson et al., 2001). This is one prediction of how different the two movie clips would seem to be. The VDP could show differences for two reasons. First, if the movies are essentially of the same scene but are moving temporally at a different rate, the compared frames will get out of synchrony. Second, at an extreme, the VDP would show differences even if the two movie clips were actually static but had different spatial content. This method of comparing frames will be sensitive to differences in spatial structure as well as to differences in dynamics, and is a measure of overall difference in spatio-temporal structure.

The Temporal Model: Stepwise within-movie comparison. Here, the two movies are first analyzed separately. Each frame in a movie clip is compared with the successive frame in the same movie; with 150 frames, this gives 149 VDP comparison values, which are combined to a single value by Minkowski summation (Figure 3B). This is a measure of the overall dynamic content of the movie clip and is calculated for each of the paired movies separately. The difference between the values for the two clips is a measure of the difference in overall dynamic content. It would be uninfluenced by any spatial differences in scene content.

The Ordered-Temporal Model: Hybrid two-step movie comparison. The previous measure gives overall dynamic structure but may not, for instance, illustrate whether particular dynamic events occur at different times within the two movie clips (Figure 3C). Here, each frame in a movie clip is compared directly with the succeeding frame (as before), but this dynamic frame difference is now compared with the dynamic difference between the two equivalent frames in the other movie clip. So, within each movie, the nth frame is compared to the (n+1)th frame, and then these two differences (the nth difference from each movie) are compared to generate a between-movie difference. The 149 between-movie differences are combined, again by Minkowski summation. This gives a measure of the perceived difference in ordered dynamic content, irrespective of any spatial scene content differences. This model has the advantage of being sensitive to dynamic changes occurring both within each movie and between the two.
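Given any frame-pair comparator, the three schemes differ only in which frame pairs are passed to it and how the outputs are pooled. The sketch below uses vdp() as a stand-in for the full single-frame predictor described above, and reduces each within-movie frame difference to a scalar for brevity (the full model compares patterns of response differences rather than scalars); the Minkowski exponent is illustrative:

import numpy as np

def minkowski(values, m=3.0):
    # Pool many difference cues into a single number.
    return float((np.abs(np.asarray(values)) ** m).sum() ** (1.0 / m))

def spatial_model(a, b, vdp, m=3.0):
    # Compare corresponding frames between the two movies (150 comparisons).
    return minkowski([vdp(fa, fb) for fa, fb in zip(a, b)], m)

def temporal_model(a, b, vdp, m=3.0):
    # Summarize each movie's dynamic content, then compare the summaries.
    def dynamic(clip):
        return minkowski([vdp(clip[i], clip[i + 1]) for i in range(len(clip) - 1)], m)
    return abs(dynamic(a) - dynamic(b))

def ordered_temporal_model(a, b, vdp, m=3.0):
    # Compare the nth successive-frame difference of one movie with the nth
    # of the other, preserving the order in which dynamic events occur.
    da = [vdp(a[i], a[i + 1]) for i in range(len(a) - 1)]
    db = [vdp(b[i], b[i + 1]) for i in range(len(b) - 1)]
    return minkowski(np.array(da) - np.array(db), m)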

Modeling realistic impulse functions

Direct VDP comparison of paired frames is not a realistic model, since the human visual system is almost certainly unable to code each 8-ms frame as a separate event. As a result, it is necessary to model human dynamic sensitivity. Thus, before performing the paired VDP comparisons, we convolved the movie clips with two different impulse functions (Figure 4A), representing "sustained" (P-cell; red curve) or "transient" (M-cell; blue curve) channels. The impulse functions have shape and duration that match the temporal-frequency sensitivity functions (Figure 4B) inferred for the two kinds of channel by Kulikowski and Tolhurst (1973).

Figure 3. Schematics showing the three methods we used to compare movie clips, based on the idea that we should compare pairs of movie frames directly, either between the movies (A) or within movies (B, C). See details in the text.

The three methods for comparing movie frames were performed separately on the movies convolved with the sustained and transient impulse functions, giving six potential predictors of human performance. Note that, in the absence of realistic temporal filtering, the analysis would effectively be with a very brief sustained impulse function, lasting no more than 8 ms (compared to the 100-ms duration of the realistic sustained impulse; Figure 4A).
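In code, this prefiltering is a causal convolution of each pixel's time course with the chosen impulse function, applied before any frame comparisons. The kernel shapes below (a monophasic gamma for sustained; a difference of gammas for transient, which integrates to zero and so gives no response to a static image) are stand-ins for the Kulikowski and Tolhurst (1973) functions, not the published ones:

import numpy as np

def gamma_kernel(t, tau, n):
    # Monophasic gamma-shaped impulse response, normalized to unit area.
    h = (t / tau) ** (n - 1) * np.exp(-t / tau)
    return h / h.sum()

fps = 120.0
t = np.arange(1, 31) / fps          # 30 taps, ~8-250 ms at the display rate

sustained = gamma_kernel(t, tau=0.02, n=4)                        # ~100-ms monophasic
transient = gamma_kernel(t, 0.015, 3) - gamma_kernel(t, 0.03, 5)  # biphasic, zero DC

def temporal_filter(clip, impulse):
    # Causally convolve each pixel's time course (axis 0: frames).
    filt = lambda x: np.convolve(x, impulse)[:len(x)]
    return np.apply_along_axis(filt, 0, clip)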

Experimental results

Reliability of the observers' ratings

Seven naïve observers were presented with 198 pairs of monochrome movies and were asked to rate how different they perceived the two movie clips in each pair to be. We have previously shown in To et al. (2010) that observers' ratings for static natural images are reliable. To verify the robustness of the difference ratings for movie pairs in the present experiment, we considered the correlation coefficient between the ratings given by any one observer and those given by each other observer; this value ranged from 0.36 to 0.62 (mean 0.50).

In our previous visual discrimination studies (To et al., 2010), we have scaled and averaged the difference ratings of all seven observers to each stimulus pair (see Methods) and have used the averages to compare with various model predictions. Although this procedure might remove any potential between-observer differences in strategy, it was used to average out within-observer variability. As in our previous experiments and analyses, averaging together the results of observers produced a reliable dataset for modeling.

When observers' averaged ratings for unfiltered, high-pass, and low-pass movies were compared (see Figure 5), we found that the ratings for filtered movies were generally lower than those for unfiltered movies. This would seem understandable, as unfiltered movies contain more details.

Observers' ratings versus model predictions

The average of the normalized ratings for each movie pair was compared with the predictions obtained from a variety of different models of V1 complex cell processing. The correlation coefficients between the ratings and predictions are presented in Table 1. The different versions of the model all rely ultimately on a Visual Difference Predictor (To et al., 2010; Tolhurst et al., 2010), which calculates the perceived difference between a pair of frames. Given that the movie clips are composed of 150 frames each, there are a number of ways of choosing paired frames to compare a pair of movie clips (see Methods). The Spatial Model takes each frame in one movie and compares it with the equivalent frame in the second movie, and is sensitive to differences in spatial content and dynamic events in the movie clips. The Temporal Model compares successive frames within a movie clip to assess the overall dynamic content of a clip. The Ordered-Temporal Model is similar, but it compares the timings of dynamic events between movie clips.

Figure 4. (A) The impulse functions used to convolve the movie clips: sustained (red) and transient (blue). These impulse functions have the temporal-frequency filter curves shown in (B), and are like those proposed by Kulikowski and Tolhurst (1973) for sustained and transient channels.

Figure 5. The averaged ratings given by observers for the spatially filtered movies are plotted against the ratings for the raw, unfiltered versions of the movies: low-pass filtered (blue) and high-pass filtered (red).

Without temporal filtering

We first used the VDP to compare pairs of frames as they were actually presented on the display; this presumes, unrealistically, that an observer could treat each 8-ms frame as a separate event and that the human contrast sensitivity function (CSF) is flat out to about 60 Hz. Overall, the best predictions by just one model were generated by the Ordered-Temporal Model (r = 0.438). The difference predictions were slightly improved when the three models were combined into a multiple linear regression model, and the correlation coefficient increased to r = 0.507. When considering the accuracy of the models' predictions in relation to the different types of movies presented (i.e., spatially unfiltered, low-pass, and high-pass), the predictions were consistently superior for the high-pass movies, and the poorest predictions were for the difference ratings for the low-pass movies. More specifically, in the case of the Multilinear Model, the correlation values between predictions and measured ratings for unfiltered, low-pass, and high-pass movies were 0.608, 0.392, and 0.704, respectively.

Realistic temporal filtering

The predictions discussed so far were generated by models that compare the movie pairs frame-by-frame, without taking into account the temporal properties of V1 neurons. Now we examine whether the inclusion of sustained and/or transient temporal filters (Figure 4) improves the models' performance at predicting perceived differences between two movie clips.

The Multilinear Model based only on the sustained impulse function caused little improvement, overall (r = 0.509 now vs. 0.507 originally) or separately for the three different kinds of spatially filtered movie clips; the predictions given by the different models separately for the different spatially filtered movie versions were also little improved. In particular, the fit for low-pass movies remained poor (r = 0.408 vs. r = 0.392). This is hardly surprising, given that low spatial frequencies are believed to be detected primarily by transient channels rather than sustained channels (Kulikowski & Tolhurst, 1973). However, it is worth noting that, for the high-pass spatially filtered movies, the realistic sustained impulse improves the predictions of the Spatial Model (r increases from 0.336 to 0.582) but lessens the predictions of the Ordered-Temporal Model (r decreases from 0.584 to 0.488). This would seem to be consistent with the idea that higher spatial frequencies are predominantly processed by sustained channels (Kulikowski & Tolhurst, 1973).

As expected, when the modeling was based on the transient impulse function alone, with its substantially different form (Figure 4), the nature of the correlations changed. In particular, it generated much stronger fits for the low-pass spatially filtered movies (r = 0.636 vs. r = 0.408 in the previous case). However, the fits for the high-pass movies were noticeably weaker (r = 0.675 vs. r = 0.749). When taking all the movies into account, the improvement of the predictions for the low-pass movies meant that the overall correlation for the multilinear fit improved from 0.509 to 0.598. It therefore becomes clear that the sustained and transient filters operate quite specifically to improve the predictions for only one type of movie: high-pass movies in the case of sustained impulse functions, and low-pass movies in the case of transient impulse functions.

Models                           Transient and/or   All movies   Unfiltered   Low-pass   High-pass
                                 sustained          (n = 162)    (n = 54)     (n = 54)   (n = 54)
Spatial                          –                  0.288        0.338        0.274      0.336
Temporal                         –                  0.357        0.409        0.102      0.546
Ordered-Temporal                 –                  0.438        0.505        0.266      0.584
Multilinear (three parameters)   –                  0.507        0.608        0.392      0.704
Spatial                          S                  0.396        0.425        0.306      0.582
Temporal                         S                  0.299        0.292        0.241      0.414
Ordered-Temporal                 S                  0.415        0.503        0.370      0.488
Multilinear (three parameters)   S                  0.509        0.631        0.408      0.749
Spatial                          T                  0.453        0.391        0.444      0.565
Temporal                         T                  0.454        0.518        0.481      0.425
Ordered-Temporal                 T                  0.342        0.311        0.301      0.469
Multilinear (three parameters)   T                  0.598        0.627        0.636      0.675
Multilinear (six parameters)     T + S              0.759        0.788        0.702      0.806

Table 1. The correlation coefficients between the observers' ratings (average of 7 per stimulus) and various V1-based models. The correlations for the spatially unfiltered, the low-pass, and the high-pass filtered subsets are shown separately, as well as the fits for the entire set of 162 movie pairs. Correlations are shown for raw movies and for temporally filtered (sustained, S, and transient, T) models. The final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function.

Finally, the predictions of the V1-based model were substantially improved by incorporating both the transient and sustained filters, that is, a multiple linear fit with six parameters (three ways of comparing frames, for each of the sustained and transient temporal filters). Figure 6 plots the observers' averaged ratings against the six-parameter model predictions. The correlation coefficient between the predictions generated by this six-parameter multilinear model and observers' measured ratings was 0.759 for the full set of 162 movie pairs, a very substantial improvement over our starting value of 0.507 (p = 0.0001, n = 162). The correlations for the six-parameter fits also improved separately for the spatially unfiltered (black symbols), the low-pass (blue), and the high-pass movies (red). The greatest improvement from our starting point without temporal filtering was in the multilinear fit for the low-pass spatially filtered movies (r increases from 0.392 to 0.702).
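This final fit is an ordinary least-squares regression of the 162 mean ratings onto the six model outputs plus an intercept; a minimal sketch (random arrays stand in for the real predictors and ratings):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((162, 6))   # columns: 3 comparison methods x 2 impulse functions
y = rng.random(162)        # averaged normalized ratings

A = np.column_stack([np.ones(len(X)), X])        # intercept + six predictors
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

predicted = A @ coef
r = np.corrcoef(predicted, y)[0, 1]              # the correlation quoted in the text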

Discussion

We have investigated whether it is possible to model how human observers rate the differences between two dynamic naturalistic scenes. Observers were presented with pairs of movie clips that differed on various dimensions, such as glitches and speed, and were asked to rate how different the clips appeared to them. We then compared the predictions from a variety of V1-based computational models with the measured ratings. All the models were based on multiple comparisons of pairs of static frames, but the models differed in how pairs of frames were chosen for comparison. The pairwise VDP comparison is the same as we have reported for static photographs of natural scenes (To et al., 2010), but augmented by prior convolution in time of the movie clips with separate, realistic sustained and transient impulse functions (Kulikowski & Tolhurst, 1973).

The use of magnitude estimation ratings in visual psychophysical experiments might be considered too subjective and hence not very reliable. We have previously shown that ratings are reliable (To et al., 2010): repeated ratings within subjects are highly correlated/reproducible (even when the stimulus set was 900 image pairs), and the correlation for ratings between observers is generally good. We have chosen this method in this and previous studies for three reasons. First, when observers generate ratings of how different pairs of stimuli appear to them, they have the freedom to rate the pairs as they please, regardless of which features were changed. The ratings therefore allow us to measure discrimination across, and independently of, feature dimensions, separately and in combination (To et al., 2008; To et al., 2011). Second, unlike threshold measurements, ratings can be collected quickly. We are interested in studying vision with naturalistic images and naturalistic tasks, and it is a common criticism of such study that natural images are so varied that we must always study very many (198 movie pairs in this study). Using staircase threshold techniques for so many stimuli would be impractically slow. Third, ratings are not confined to threshold stimuli, and we consider it necessary to be able to measure visual performance with suprathreshold stimuli as well. Overall, averaging the ratings of several observers together has provided us with data sets that are certainly good enough to allow us to distinguish between different V1-based models of visual coding (To et al., 2010; To et al., 2011). Although our study is not aimed toward the invention of image quality metrics, we have essentially borrowed our ratings technique from that applied field.

We have investigated a number of different measures for comparing the amount and timing of dynamic events within a pair of movies. Our Spatial Model computed the differences between corresponding frames in each clip in the movie pair, and then combined the values using Minkowski summation. This is the method used by Watson (1998) and Watson et al. (2001) in their model for evaluating the degree of perceived distortion in compressed video streams. It seems particularly appropriate in such a context, where the compressed video would have the same fundamental spatial and synchronous temporal structure as the uncompressed reference. In a sense, this measure imagines that an observer views and memorizes the first movie clip and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated, and the greater the measured difference. Others have suggested that video distortion metrics would be enhanced by using additional measures where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus, our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by the combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward- and a rightward-moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatio-temporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to specifically map onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip, which has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass spatial filtering or to high-pass filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways), with their different spatial frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames, for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial information as well as temporal, and the ordered-temporal/transient factor, handling the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.

Our VDP is intended to be a model of low-level visual processing, as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), that is, the transient and sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002) and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high spatial frequency information will be biased towards sustained processing, while low spatial frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for the latter case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section, because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials, we discuss this further: by truncating the longer clip of the pair to the same length as the shorter one, we were able to calculate the predictive measures. The resulting six-parameter multilinear regression for all stimuli had a poorer fit than for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral-movement-sensing elements in our model, either separately or in addition to the temporal impulse functions, and a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project, and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69, doi:10.1109/TIP.2012.2210727.

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640, doi:10.1126/science.1089506.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19, http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2. [PubMed] [Article]

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280, doi:10.1016/j.neuroimage.2009.03.078.

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Vision Research, 44(15), 1733–1755.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306, doi:10.1098/rspb.1992.0162.

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14, http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37. [PubMed] [Article]

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943, http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8. [PubMed] [Article]

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402, doi:10.1146/annurev.ne.16.030193.002101.

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798, doi:10.1523/JNEUROSCI.4044-06.2006.

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372, doi:10.1038/nrn2619.

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808, doi:10.1093/cercor/bhr078.

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755, doi:10.1016/j.visres.2004.03.003.

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446, doi:10.1038/380442a0.

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16. [PubMed] [Article]

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308, doi:10.1098/rspb.2008.0692.

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372, doi:10.1098/rspb.2010.1888.

To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2014). Transient vs. sustained: Modeling temporal impulse functions in visual discrimination of dynamic natural images. Perception, 43, 478.

To, M. P. S., Lovell, P. G., Troscianko, T., Gilchrist, I. D., Benton, C. P., & Tolhurst, D. J. (2012). Modeling human discrimination of moving images. Perception, 41(10), 1270–1271.

To, M. P. S., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2010). Perception of suprathreshold naturalistic changes in colored natural images. Journal of Vision, 10(4):12, 1–22, http://www.journalofvision.org/content/10/4/12, doi:10.1167/10.4.12. [PubMed] [Article]

Tolhurst, D. J. (1973). Separate channels for the analysis of the shape and the movement of a moving visual stimulus. Journal of Physiology, 231(3), 385–402.

Tolhurst, D. J. (1975). Sustained and transient channels in human vision. Vision Research, 15, 1151–1155.

Tolhurst, D. J., & Thompson, I. D. (1981). On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proceedings of the Royal Society B: Biological Sciences, 213(1191), 183–199.

Tolhurst, D. J., To, M. P. S., Chirimuuta, M., Troscianko, T., Chua, P. Y., & Lovell, P. G. (2010). Magnitude of perceived change in natural images may be linearly proportional to differences in neuronal firing rates. Seeing and Perceiving, 23(4), 349–372.

Troscianko, T., Meese, T. S., & Hinde, S. (2012). Perception while watching movies: Effects of physical screen size and scene type. i-Perception, 3(7), 414–425, doi:10.1068/i0475aap.

Vidyasagar, T. R., Kulikowski, J. J., Lipnicki, D. M., & Dreher, B. (2002). Convergence of parvocellular and magnocellular information channels in the primary visual cortex of the macaque. European Journal of Neuroscience, 16(5), 945–956.

Vu, P. V., & Chandler, D. M. (2014). ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. Online supplement. Journal of Electronic Imaging, 23(1), 013016.

Wang, Z., & Li, Q. (2007). Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24(12), B61–B69.

Watson, A. B. (1987). Efficiency of a model human image code. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 4(12), 2401–2417.

Watson, A. B. (1998). Toward a perceptual video quality metric. Human Vision, Visual Processing, and Digital Display VIII, 3299, 139–147.

Watson, A. B., Hu, J., & McGowan, J. F., III (2001). Digital video quality metric based on human vision. Journal of Electronic Imaging, 10, 20–29.

Watson, A. B., & Solomon, J. A. (1997). Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 14(9), 2379–2391.

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • Experimental results
  • f04
  • f05
  • t01
  • Discussion
  • f06
  • Adelson1
  • Blakemore1
  • Borji1
  • Derrington1
  • Foley1
  • Gouras1
  • Hasson1
  • Heeger1
  • Hess1
  • Hirose1
  • Horiguchi1
  • Itti1
  • Johnston1
  • Keesey1
  • Kulikowski1
  • Lovell1
  • Meese1
  • Merigan1
  • Nassi1
  • Nassi2
  • Ninomiya1
  • Peli1
  • Perrone1
  • Sawatari1
  • Seshadrinathan1
  • Simoncelli1
  • Smith1
  • Tadmor1
  • Tadmor2
  • To1
  • To2
  • To3
  • To4
  • To5
  • Tolhurst1
  • Tolhurst2
  • Tolhurst3
  • Tolhurst4
  • Troscianko1
  • Vidyasagar1
  • Vu1
  • Wang1
  • Watson1
  • Watson2
  • Watson3
  • Watson4

but all clips still lasted 150 frames and still beganand ended on the same frame images

D The speed of the middle part of the movie wasincreased by skipping frames (Figure 1D) Thedegree of skipping (a factor of 12ndash37) and theduration (from 200ndash500 ms) of the speeded partvaried between clips and movies After discardingalternate frames the speeded movies still comprised150 frames they began at the same point as thenatural movement clips but they ended furtherthrough the original 10 s video sequence

E The speed of the middle part of the movie could bereduced by replicating frames in the middle of theclip (Figure 1E) The degree of replication (speedreduced by factor of 025ndash08) and the duration(from 300ndash660 ms) of the slowed part variedbetween clips and movies After discarding alternateframes the slowed movies still comprised 150frames overall they began at the same point as thenatural movement clips but they ended at an earlierpoint in the original sequence

F The speed of the middle part of the movie could beincreased as in Figure 1D except that once normalspeed was resumed frames were added only untilthe movie reached the same point as in the naturalmovement controls thus these speeded-short mov-ies lasted less than 150 frames (Figure 1F)

The magnitudes and durations of the manipulationsvaried between movies and between clips A and B Ofthe five manipulations the first four (B through E) leftthe modified clips with the 150 frames However in thecase of the last (F) there was a change in the totalnumber of frames these stimuli were excluded frommost of our modeling analysis (see Supplementary

Materials) For a given original video sequence theframes in all the derived video clips were scaled by asingle maximum so that the brightest single pixel in thewhole set was 255 In summary for each original videosequence we had two clips and five modified variantsof each clip

In the experiment each clip was paired against itsfive variants and the natural clip A was also pairedagainst natural clip B This resulted in 11 pairs of clipsper video sequence for a total of 66 pairs in thecomplete experiment Some examples of movie clippairs are given in the Supplementary Materials

Spatial frequency filtering

In addition to the temporal manipulations each ofthe 66 pairs of video clips was replicated with (a) low-pass spatial filtering and (b) high-pass spatial filteringThe frames of each clip (natural and modified) werefiltered with very steep filters in the spatial frequencydomain (Figure 2A B) The low-pass filter cut off at108 c8 while the high-pass filter cut-in at 217 c8 butalso included 0 c8 retaining the average luminance ofthat frame After filtering the frames were scaled by thesame factor used for the unfiltered clips Overall therewere thus 3 middot 66 movie-clip pairs used in theexperiment or 198

Some examples of single frames from spatiallyfiltered and unfiltered movie clips are shown in Figure2C through E

Spatio-temporal blending

A fuzzy border (30 pixels or 158 wide) was appliedaround each edge of each square frame image to blendit spatially into the gray background The 30 pixelscovered half a Gaussian with spread of 10 pixels Theonset and offset of the movie clip were ramped up anddown in contrast to blend the clip temporally into thebackground interstimulus gray The contrast rampswere raised cosines lasting 25 frames (208 ms)

Standard stimulus pair

In addition to the 198 test pairs a standard pair wasgenerated from a seventh video sequence of waves onwater The two clips in this pair differed by a noticeablebut not extreme glitch or pause (291 ms) midway in oneof the pair The magnitude of this difference wasdefined as 20 in the ratings experiment Observers hadto rely on this 20 difference to generate their ratings(see above)

This standard movie pair served as a commonanchor and its purpose was to facilitate observers usinga same reference point Its role would be especiallyimportant at the beginning of the experiments when

Figure 1 Schematics of the sequences of frames used to make

movie clips Panel A shows natural movement alternate frames

are taken from the original clip and are played back at about

half speed Panels BndashF show various ways (described in the text)

for distorting the natural movement Note that method F

results in movie clips with fewer frames than the other five

methods

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 4

observers generated an internal rating scale This wouldprevent a situation in which for example an observermight rate some pairs using some maximum allowedvalue only to come across subsequent pairs that theyperceive to be even more different This standard pairwas made from a different movie clip from the teststimuli and was repeatedly presented to the observerthroughout the experiments in the demonstrationtraining and testing phases Observers were instructedthat their ratings of the subjective difference betweenany other movie pair must be based on this standardpair using a ratio scale even when the test pairs differedalong different stimulus dimensions from the standardpair Our previous discrimination experiments withstatic natural images relied on a similar use of ratingsbased on a standard pair (eg To Lovell Trosciankoamp Tolhurst 2008 To et al 2010)

While our choice of the particular standard pair andthe particular standard difference may have affected

the magnitudes of the observersrsquo difference ratings toall other stimuli it is unlikely that a different choicemight have modified the rank order across theexperiment

Visual Difference Predictor (VDP) modeling ofperceived differences

Comparing a single pair of frames

We have been developing a model of V1 simple andcomplex cells in order to help explain perceiveddifferences between pairs of static naturalistic images(To et al 2011 To et al 2010 Tolhurst et al 2010)This model is inspired by the cortex transform ofWatson (1987) as developed by Watson and Solomon(1997) The model computes how millions of stylizedV1 neurons would respond to each of the images undercomparison the responses of the neurons are thencompared pairwise to give millions of tiny differencecues which are combined into a single number byMinkowski summation This final value is a predictionof an observerrsquos magnitude rating of the differencebetween those two images (Tolhurst et al 2010) Forpresent purposes we are interested only in themonochrome (luminance contrast) version of themodel

Briefly each of the two images under comparison isconvolved with 60 Gabor-type simple cell receptivefield templates six optimal orientations five optimalspatial frequencies and odd- and even-symmetry (eachat thousands of different locations within the scene)These linear interim responses in luminance areconverted to contrast responses by dividing by the localmean luminance (Peli 1990 Tadmor amp Tolhurst 19942000) Complex cell responses are obtained by takingthe rms of the responses of paired odd- and even-symmetric simple cells To et al (2010) found that acomplex-cell model was a better predictor than asimple-cell model for ratings for static images Thecomplex-cell responses are weighted to match the waythat a typical observerrsquos contrast sensitivity dependsupon spatial frequency The response versus contrastfunction becomes sigmoidal when the interim complexcell responses are divided by the sum of twonormalizing terms (Foley 1994 Tolhurst et al 2010)First there is nonspecific suppression (Heeger 1992)where the response of each neuron is divided by thesum of the responses of all the neurons whose receptivefields are centered on the same point Second there isorientation- and spatial-frequency-specific suppressionby neurons whose receptive fields surround the field ofthe neuron in question (Blakemore amp Tobin 1972Meese 2004) The details assumptions and parametersare given in To et al (2010) and Tolhurst et al (2010)

Figure 2. (A) The low-pass spatial filter applied to the raw movie clips to make low-pass movies. (B) The high-pass spatial filter; note that it does retain 0 c/° (i.e., the average brightness level). (C–E) Single frames taken from three of the movie families, showing, from left to right, the raw movie, the low-pass version, and the high-pass version.


Comparing all the frames in a pair of movie clips

Such VDP modeling is confined to extracting a single predicted difference value from comparing two images. Here we elaborate the VDP to compare two movie clips, each of which comprises 150 possibly different frames, or brief static images. We have investigated three different methods of breaking movies into static image pairs for application of the VDP model. These methods tap different spatio-temporal aspects of any differences between the paired movies and are shown schematically in Figure 3.

The Spatial Model: Stepwise between-movie comparison (Figure 3A). Each of the 150 frames in one movie is compared with the corresponding frame of the other movie, giving 150 VDP outputs, which are then combined by Minkowski summation into a single number (Watson, 1998; Watson et al., 2001). This is one prediction of how different the two movie clips would seem to be. The VDP could show differences for two reasons. First, if the movies are essentially of the same scene but are moving temporally at a different rate, the compared frames will get out of synchrony. Second, at an extreme, the VDP would show differences even if the two movie clips were actually static but had different spatial content. This method of comparing frames will be sensitive to differences in spatial structure as well as to differences in dynamics, and is a measure of overall difference in spatio-temporal structure.

The Temporal Model: Stepwise within-movie comparison. Here, the two movies are first analyzed separately. Each frame in a movie clip is compared with the successive frame in the same movie; with 150 frames, this gives 149 VDP comparison values, which are combined to a single value by Minkowski summation (Figure 3B). This is a measure of the overall dynamic content of the movie clip and is calculated for each of the paired movies separately. The difference between the values for the two clips is a measure of the difference in overall dynamic content. It would be uninfluenced by any spatial differences in scene content.

The Ordered-Temporal Model: Hybrid two-step movie comparison. The previous measure gives overall dynamic structure but may not, for instance, show whether particular dynamic events occur at different times within the two movie clips (Figure 3C). Here, each frame in a movie clip is compared directly with the succeeding frame (as before), but this dynamic frame difference is now compared with the dynamic difference between the two equivalent frames in the other movie clip. So, within each movie, the nth frame is compared to the (n+1)th frame, and then these two differences (the nth difference from each movie) are compared to generate a between-movie difference. The 149 between-movie differences are combined, again by Minkowski summation. This gives a measure of the perceived difference in ordered dynamic content, irrespective of any spatial scene-content differences. This model has the advantage of being sensitive to dynamic changes occurring both within each movie and between the two.
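In outline, the three schemes reduce to different ways of routing frame pairs through the per-frame VDP. The sketch below uses a trivial stand-in for the VDP and an arbitrary Minkowski exponent; in the ordered-temporal function, the scalar within-movie differences are compared, which is one plausible reading of the two-step comparison described above:

```python
import numpy as np

def vdp(frame_a, frame_b):
    """Stand-in for the full VDP of To et al. (2010): one
    perceived-difference value per frame pair. The real model compares
    simulated V1 responses; mean absolute pixel difference is used here
    only to keep the routing logic visible."""
    return float(np.abs(frame_a - frame_b).mean())

def mink(values, m=3.0):
    """Minkowski summation with an arbitrary placeholder exponent."""
    v = np.asarray(values, dtype=float)
    return float(np.sum(v ** m) ** (1.0 / m))

def spatial_model(a, b, m=3.0):
    # Compare corresponding frames between the two movies: 150 outputs.
    return mink([vdp(fa, fb) for fa, fb in zip(a, b)], m)

def temporal_model(a, b, m=3.0):
    # Summarize each movie's dynamic content, then compare the summaries.
    dynamic = lambda mv: mink([vdp(mv[i], mv[i + 1])
                               for i in range(len(mv) - 1)], m)
    return abs(dynamic(a) - dynamic(b))

def ordered_temporal_model(a, b, m=3.0):
    # Compare the nth within-movie difference across the two movies.
    d = [abs(vdp(a[i], a[i + 1]) - vdp(b[i], b[i + 1]))
         for i in range(len(a) - 1)]
    return mink(d, m)

clip_a = np.random.rand(150, 64, 64)   # stand-in 150-frame clips
clip_b = np.random.rand(150, 64, 64)
print(spatial_model(clip_a, clip_b),
      temporal_model(clip_a, clip_b),
      ordered_temporal_model(clip_a, clip_b))
```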

Modeling realistic impulse functions

Direct VDP comparison of paired frames is not a realistic model, since the human visual system is almost certainly unable to code each 8-ms frame as a separate event. As a result, it is necessary to model human dynamic sensitivity. Thus, before performing the paired VDP comparisons, we convolved the movie clips with two different impulse functions (Figure 4A), representing "sustained" (P-cell, red curve) or "transient" (M-cell, blue curve) channels. The impulse functions have shape and duration that match the temporal-frequency sensitivity functions (Figure 4B) inferred for the two kinds of channel by Kulikowski and Tolhurst (1973). The three methods for comparing movie frames were performed separately on the movies convolved with the sustained and transient impulse functions, giving six potential predictors of human performance. Note that, in the absence of realistic temporal filtering, the analysis would effectively be with a very brief sustained impulse function lasting no more than 8 ms (compared to the 100-ms duration of the realistic sustained impulse; Figure 4A).

Figure 3. Schematics showing the three methods we used to compare movie clips, based on the idea that we should compare pairs of movie frames directly, either between the movies (A) or within movies (B, C). See details in the text.
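As a rough illustration of this prefiltering step, the sketch below convolves a clip along its time axis with a monophasic and a biphasic impulse function. The gamma-function shapes, time constants, and tap counts are placeholders chosen only to echo the roughly 100-ms sustained form and the briefer biphasic transient form; they are not the Kulikowski and Tolhurst (1973) functions themselves:

```python
import numpy as np
from scipy.ndimage import convolve1d

FRAME_MS = 8.0  # frame duration quoted in the text

def sustained_impulse(n_taps=16):
    """Monophasic, low-pass placeholder lasting roughly 100 ms."""
    t = np.arange(n_taps) * FRAME_MS
    h = (t / 30.0) ** 2 * np.exp(-t / 15.0)
    return h / h.sum()  # unit gain at 0 Hz

def transient_impulse(n_taps=16):
    """Biphasic, band-pass placeholder: difference of two gamma-shaped
    lobes, adjusted to zero net area (no response at 0 Hz)."""
    t = np.arange(n_taps) * FRAME_MS
    h = (t / 10.0) ** 2 * np.exp(-t / 5.0) - (t / 16.0) ** 2 * np.exp(-t / 8.0)
    h -= h.mean()  # enforce zero DC gain
    return h / np.abs(h).sum()

def filter_movie(movie, impulse):
    """Convolve a (frames, y, x) clip with a temporal impulse function
    along the time axis, before any frame-pair VDP comparison.
    (Causality and edge handling are glossed over here.)"""
    return convolve1d(movie, impulse, axis=0, mode="nearest")

clip = np.random.rand(150, 64, 64)  # stand-in movie clip
sustained_clip = filter_movie(clip, sustained_impulse())
transient_clip = filter_movie(clip, transient_impulse())
```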

Experimental results

Reliability of the observers' ratings

Seven naïve observers were presented with 198 pairs of monochrome movies and were asked to rate how different they perceived the two movie clips in each pair to be. We have previously shown in To et al. (2010) that observers' ratings for static natural images are reliable. To verify the robustness of the difference ratings for movie pairs in the present experiment, we considered the correlation coefficient between the ratings given by any one observer and those given by each other observer; this value ranged from 0.36 to 0.62 (mean 0.50).

As in our previous visual discrimination studies (To et al., 2010), we scaled and averaged the difference ratings of all seven observers to each stimulus pair (see Methods) and used the averages to compare with various model predictions. Although this procedure might remove any potential between-observer differences in strategy, it was used to average out within-observer variability. As in our previous experiments and analyses, averaging together the results of observers produced a reliable dataset for modeling.
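One plausible reading of that scaling-and-averaging step, sketched purely for illustration: rescale each observer so that their mean rating matches the grand mean, then average across observers. The exact normalization is given in the Methods, so treat this as an assumption-laden stand-in:

```python
import numpy as np

def average_ratings(ratings):
    """ratings: (observers, stimuli) array. Rescale each observer so
    their mean rating matches the grand mean (one plausible
    normalization, not necessarily the one in the Methods), then
    average across observers to damp within-observer variability."""
    ratings = np.asarray(ratings, dtype=float)
    grand = ratings.mean()
    scaled = ratings * (grand / ratings.mean(axis=1, keepdims=True))
    return scaled.mean(axis=0)

rng = np.random.default_rng(1)
ratings = rng.random((7, 198)) * 10   # 7 observers x 198 movie pairs
averaged = average_ratings(ratings)   # one value per movie pair
```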

When observers' averaged ratings for unfiltered, high-pass, and low-pass movies were compared (see Figure 5), we found that the ratings for filtered movies were generally lower than those for unfiltered movies. This would seem understandable, as unfiltered movies contain more details.

Observers' ratings versus model predictions

The average of the normalized ratings for each movie pair was compared with the predictions obtained from a variety of different models of V1 complex-cell processing. The correlation coefficients between the ratings and predictions are presented in Table 1. The different versions of the model all rely ultimately on a Visual Difference Predictor (To et al., 2010; Tolhurst et al., 2010), which calculates the perceived difference between a pair of frames. Given that the movie clips are composed of 150 frames each, there are a number of ways of choosing paired frames to compare a pair of movie clips (see Methods). The Spatial Model takes each frame in one movie and compares it with the equivalent frame in the second movie, and is sensitive to differences in spatial content and dynamic events in the movie clips. The Temporal Model compares successive frames within a movie clip to assess the overall dynamic content of a clip. The Ordered-Temporal Model is similar, but it compares the timings of dynamic events between movie clips.

Figure 4. (A) The impulse functions used to convolve the movie clips: sustained (red) and transient (blue). These impulse functions have the temporal-frequency filter curves shown in (B) and are like those proposed by Kulikowski and Tolhurst (1973) for sustained and transient channels.

Figure 5. The averaged ratings given by observers for the spatially filtered movies are plotted against the ratings for the raw, unfiltered versions of the movies: low-pass filtered (blue) and high-pass filtered (red).


Without temporal filtering

We first used the VDP to compare pairs of frames as they were actually presented on the display; this presumes, unrealistically, that an observer could treat each 8-ms frame as a separate event and that the human contrast sensitivity function (CSF) is flat out to about 60 Hz. Overall, the best predictions by just one model were generated by the Ordered-Temporal Model (r = 0.438). The difference predictions were slightly improved when the three models were combined into a multiple linear regression model, and the correlation coefficient increased to r = 0.507. When considering the accuracy of the models' predictions in relation to the different types of movies presented (i.e., spatially unfiltered, low-pass, and high-pass), the predictions were consistently superior for the high-pass movies, and the poorest predictions were for the difference ratings for the low-pass movies. More specifically, in the case of the Multilinear Model, the correlation values between predictions and measured ratings for unfiltered, low-pass, and high-pass movies were 0.608, 0.392, and 0.704, respectively.

Realistic temporal filtering

The predictions discussed so far were generated by models that compare the movie pairs frame-by-frame, without taking into account the temporal properties of V1 neurons. Now we examine whether the inclusion of sustained and/or transient temporal filters (Figure 4) improves the models' performance at predicting perceived differences between two movie clips.

The Multilinear Model based only on the sustained impulse function produced little improvement, either overall (r = 0.509 now vs. 0.507 originally) or separately for the three different kinds of spatially filtered movie clips; the predictions given by the individual models for the different spatially filtered versions were also little improved. In particular, the fit for low-pass movies remained poor (r = 0.408 vs. r = 0.392). This is hardly surprising, given that low spatial frequencies are believed to be detected primarily by transient channels rather than sustained channels (Kulikowski & Tolhurst, 1973). However, it is worth noting that, for the high-pass spatially filtered movies, the realistic sustained impulse improves the predictions of the Spatial Model (r increases from 0.336 to 0.582) but lessens the predictions of the Ordered-Temporal Model (r decreases from 0.584 to 0.488). This would seem to be consistent with the idea that higher spatial frequencies are predominantly processed by sustained channels (Kulikowski & Tolhurst, 1973).

As expected, when the modeling was based on the transient impulse function alone, with its substantially different form (Figure 4), the nature of the correlations changed. In particular, it generated much stronger fits for the low-pass spatially filtered movies (r = 0.636 vs. r = 0.408 in the previous case). However, the fits for the high-pass movies were noticeably weaker (r = 0.675 vs. r = 0.749). When taking all the movies into account, the improvement of the predictions for the low-pass movies meant that the overall correlation for the multilinear fit improved from 0.509 to 0.598. It therefore becomes clear that the sustained and transient filters operate quite specifically to improve the predictions for only one type of movie: high-pass movies in the case of sustained impulse functions, and low-pass movies in the case of transient impulse functions.

Models                           Transient and/or   All movies   Unfiltered   Low-pass   High-pass
                                 sustained          (n = 162)    (n = 54)     (n = 54)   (n = 54)
Spatial                          –                  0.288        0.338        0.274      0.336
Temporal                         –                  0.357        0.409        0.102      0.546
Ordered-Temporal                 –                  0.438        0.505        0.266      0.584
Multilinear (three parameters)   –                  0.507        0.608        0.392      0.704
Spatial                          S                  0.396        0.425        0.306      0.582
Temporal                         S                  0.299        0.292        0.241      0.414
Ordered-Temporal                 S                  0.415        0.503        0.370      0.488
Multilinear (three parameters)   S                  0.509        0.631        0.408      0.749
Spatial                          T                  0.453        0.391        0.444      0.565
Temporal                         T                  0.454        0.518        0.481      0.425
Ordered-Temporal                 T                  0.342        0.311        0.301      0.469
Multilinear (six parameters)     T & S              0.759        0.788        0.702      0.806

Table 1. The correlation coefficients between the observers' ratings (average of 7 per stimulus) and various V1-based models. The correlations for the spatially unfiltered, the low-pass, and the high-pass filtered subsets are shown separately, as well as the fits for the entire set of 162 movie pairs. Correlations are shown for raw movies and for temporally filtered (sustained, S, and transient, T) models. The final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function.


Finally, the predictions of the V1-based model were substantially improved by incorporating both the transient and sustained filters, that is, a multiple linear fit with six parameters (three ways of comparing frames, for each of the sustained and transient temporal filters). Figure 6 plots the observers' averaged ratings against the six-parameter model predictions. The correlation coefficient between the predictions generated by this six-parameter multilinear model and observers' measured ratings was 0.759 for the full set of 162 movie pairs, a very substantial improvement over our starting value of 0.507 (p = 0.0001, n = 162). The correlations for the six-parameter fits also improved separately for the spatially unfiltered (black symbols), the low-pass (blue), and the high-pass movies (red). The greatest improvement from our starting point without temporal filtering was in the multilinear fit for the low-pass spatially filtered movies (r increases from 0.392 to 0.702).
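The six-parameter fit is ordinary multiple linear regression of the ratings on the six model outputs plus an intercept. A minimal sketch with random stand-in predictors (the real design matrix would hold the Spatial, Temporal, and Ordered-Temporal predictions under each impulse function):

```python
import numpy as np

# X: (162, 6) matrix of model predictions; y: averaged ratings.
# Random stand-ins here, purely to show the fitting mechanics.
rng = np.random.default_rng(0)
X = rng.random((162, 6))
y = rng.random(162)

A = np.column_stack([X, np.ones(len(X))])     # append the intercept term
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # 6 weights + 1 intercept
pred = A @ coef
r = np.corrcoef(pred, y)[0, 1]                # cf. the paper's r = 0.759
```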

Discussion

We have investigated whether it is possible to model how human observers rate the differences between two dynamic naturalistic scenes. Observers were presented with pairs of movie clips that differed along various dimensions, such as glitches and speed, and were asked to rate how different the clips appeared to them. We then compared the predictions from a variety of V1-based computational models with the measured ratings. All the models were based on multiple comparisons of pairs of static frames, but the models differed in how pairs of frames were chosen for comparison. The pairwise VDP comparison is the same as we have reported for static photographs of natural scenes (To et al., 2010), but augmented by prior convolution in time of the movie clips with separate, realistic sustained and transient impulse functions (Kulikowski & Tolhurst, 1973).

The use of magnitude estimation ratings in visual psychophysical experiments might be considered too subjective and hence not very reliable. We have previously shown that ratings are reliable (To et al., 2010): repeated ratings within subjects are highly correlated/reproducible (even when the stimulus set was 900 image pairs), and the correlation for ratings between observers is generally good. We have chosen this method in this and previous studies for three reasons. First, when observers generate ratings of how different pairs of stimuli appear to them, they have the freedom to rate the pairs as they please, regardless of which features were changed. The ratings therefore allow us to measure discrimination across, and independently of, feature dimensions, separately and in combination (To et al., 2008; To et al., 2011). Second, unlike threshold measurements, ratings can be collected quickly. We are interested in studying vision with naturalistic images and naturalistic tasks, and it is a common criticism of such study that natural images are so varied that we must always study very many (198 movie pairs in this study). Using staircase threshold techniques for so many stimuli would be impractically slow. Third, ratings are not confined to threshold stimuli, and we consider it necessary to be able to measure visual performance with suprathreshold stimuli as well. Overall, averaging the ratings of several observers together has provided us with data sets that are certainly good enough to allow us to distinguish between different V1-based models of visual coding (To et al., 2010; To et al., 2011). Although our study is not aimed toward the invention of image-quality metrics, we have essentially borrowed our ratings technique from that applied field.

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.

We have investigated a number of different measures for comparing the amount and timing of dynamic events within a pair of movies. Our Spatial Model computed the differences between corresponding frames in each clip in the movie pair and then combined the values using Minkowski summation. This is the method used by Watson (1998) and Watson et al. (2001) in their model for evaluating the degree of perceived distortion in compressed video streams. It seems particularly appropriate in such a context, where the compressed video would have the same fundamental spatial and synchronous temporal structure as the uncompressed reference. In a sense, this measure imagines that an observer views and memorizes the first movie clip and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated, and the greater the measured difference. Others have suggested that video distortion metrics would be enhanced by using additional measures where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus, our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward- and a rightward-moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatio-temporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to map specifically onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip that has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass spatial filtering or to high-pass filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways), with their different spatial-frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames, for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial as well as temporal information, and the ordered-temporal/transient factor, handling the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.

Our VDP is intended to be a model of low-level visual processing, as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), the transient and sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002) and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high spatial-frequency information will be biased towards sustained processing, while low spatial-frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for the latter case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section, because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials we discuss this further: by truncating the longer clip of each pair to the same length as the shorter one, we were able to calculate the predictive measures. The resulting six-parameter multilinear regression for all stimuli had a less good fit than for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.
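The truncation workaround amounts to no more than trimming the longer clip before the frame-by-frame comparisons begin; a one-function sketch:

```python
def truncate_pair(movie_a, movie_b):
    """Trim the longer clip so both have the length of the shorter one,
    allowing the head-to-head, frame-by-frame comparisons to proceed
    (the workaround described in the Supplementary Materials)."""
    n = min(len(movie_a), len(movie_b))
    return movie_a[:n], movie_b[:n]
```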

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral-movement-sensing elements in our model, either separately or in addition to the temporal impulse functions; a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project, and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on EP/E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69, doi:10.1109/TIP.2012.2210727.

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640, doi:10.1126/science.1089506.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19, http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2. [PubMed] [Article]

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280, doi:10.1016/j.neuroimage.2009.03.078.

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Vision Research, 44(15), 1733–1755.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306, doi:10.1098/rspb.1992.0162.

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14, http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37. [PubMed] [Article]

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943, http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8. [PubMed] [Article]

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402, doi:10.1146/annurev.ne.16.030193.002101.

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798, doi:10.1523/JNEUROSCI.4044-06.2006.

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372, doi:10.1038/nrn2619.

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808, doi:10.1093/cercor/bhr078.

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755, doi:10.1016/j.visres.2004.03.003.

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446, doi:10.1038/380442a0.

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16. [PubMed] [Article]

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308, doi:10.1098/rspb.2008.0692.

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372, doi:10.1098/rspb.2010.1888.

To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2014). Transient vs. sustained: Modeling temporal impulse functions in visual discrimination of dynamic natural images. Perception, 43, 478.

To, M. P. S., Lovell, P. G., Troscianko, T., Gilchrist, I. D., Benton, C. P., & Tolhurst, D. J. (2012). Modeling human discrimination of moving images. Perception, 41(10), 1270–1271.

To, M. P. S., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2010). Perception of suprathreshold naturalistic changes in colored natural images. Journal of Vision, 10(4):12, 1–22, http://www.journalofvision.org/content/10/4/12, doi:10.1167/10.4.12. [PubMed] [Article]

Tolhurst, D. J. (1973). Separate channels for the analysis of the shape and the movement of a moving visual stimulus. Journal of Physiology, 231(3), 385–402.

Tolhurst, D. J. (1975). Sustained and transient channels in human vision. Vision Research, 15, 1151–1155.

Tolhurst, D. J., & Thompson, I. D. (1981). On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proceedings of the Royal Society B: Biological Sciences, 213(1191), 183–199.

Tolhurst, D. J., To, M. P. S., Chirimuuta, M., Troscianko, T., Chua, P. Y., & Lovell, P. G. (2010). Magnitude of perceived change in natural images may be linearly proportional to differences in neuronal firing rates. Seeing and Perceiving, 23(4), 349–372.

Troscianko, T., Meese, T. S., & Hinde, S. (2012). Perception while watching movies: Effects of physical screen size and scene type. i-Perception, 3(7), 414–425, doi:10.1068/i0475aap.

Vidyasagar, T. R., Kulikowski, J. J., Lipnicki, D. M., & Dreher, B. (2002). Convergence of parvocellular and magnocellular information channels in the primary visual cortex of the macaque. European Journal of Neuroscience, 16(5), 945–956.

Vu, P. V., & Chandler, D. M. (2014). ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices [Online supplement]. Journal of Electronic Imaging, 23(1), 013016.

Wang, Z., & Li, Q. (2007). Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24(12), B61–B69.

Watson, A. B. (1987). Efficiency of a model human image code. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 4(12), 2401–2417.

Watson, A. B. (1998). Toward a perceptual video quality metric. Human Vision, Visual Processing, and Digital Display VIII, 3299, 139–147.

Watson, A. B., Hu, J., & McGowan, J. F., III. (2001). Digital video quality metric based on human vision. Journal of Electronic Imaging, 10, 20–29.

Watson, A. B., & Solomon, J. A. (1997). Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 14(9), 2379–2391.



Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 10

(Nassi amp Callaway 2006 Ninomiya SawamuraInoue amp Takada 2011) Even if the transient andsustained pathways do not remain entirely separate upto the highest levels of perception it still remains thatperception of high spatial frequency information willbe biased towards sustained processing while lowspatial frequency information will be biased towardstransient processing with its implication in motionsensing

Our measures of perceived difference rely primarilyon head-to-head comparisons of correspondingframes between movies or of the dynamic events atcorresponding points This is core to models ofperceived distortion in compressed videos (egWatson 1998) It is suitable for the latter case becausethe original and compressed videos are likely to beexactly the same length However this becomes alimitation in a more general comparison of movieclips when the clips might be of unequal length Thebulk of the stimuli in our experiment were indeedpaired movie clips of equal length but 36 out of thetotal 198 did differ in length We have not includedthese 36 in the analyses in the Results section becausewe could not simply perform the head-to-headcomparisons between the paired clips In theSupplementary Materials we discuss this further bytruncating the longer clip of the pair to the samelength as the shorter one we were able to calculate thepredictive measures The resulting six-parametermultilinear regression for all stimuli had a less good fitthan for the subset of equal length videos whichsuggests that a better method of comparing unequallength movie clips could be sought

The inclusion of sustained and transient filtersallowed us to model low-level temporal processingHowever one surprise is that we achieved very good fitswithout explicitly modeling neuronal responses tolateral motion It may be relevant that in a survey ofsaliency models Borji Sihite and Itti (2013) report thatmodels incorporating motion did not perform betterthan the best static models over video datasets Onepossibility is that lateral motion produces somesignature in the combination of the various measuresthat we have calculated Future work could includelateral movement sensing elements in our model eitherseparately or in addition to the temporal impulsefunctions and a number of such models of corticalmotion sensing both in V1 and in MTV5 are available(Adelson amp Bergen 1985 Johnston McOwan ampBuxton 1992 Perrone 2004 Simoncelli amp Heeger1998) However the current model does surprisinglywell without such a mechanism

Keywords computational modeling visual discrimi-nation natural scenes spatial processing temporalprocessing

Acknowledgments

We are pleased to acknowledge that the late TomTroscianko played a key role in the planning of thisproject and that P George Lovell and Chris Bentoncontributed to the collection of the raw movie clipsThe research was funded by grants (EPE0370971 andEPE0373721) from the EPSRCDstl under the JointGrants Scheme The first author was employed onE0370971 The publication of this paper was fundedby the Department of Psychology Lancaster Univer-sity UK

Commercial relationships NoneCorresponding author Michelle P S ToEmail mtolancasteracukAddress Department of Psychology Lancaster Uni-versity Lancaster UK

References

Adelson E H amp Bergen J R (1985) Spatiotemporalenergy models for the perception of motionJournal of the Optical Society of America A OpticsImage Science and Vision 2(2) 284ndash299

Blakemore C amp Tobin E A (1972) Lateralinhibition between orientation detectors in the catrsquosvisual cortex Experimental Brain Research 15(4)439ndash440

Borji A Sihite D N amp Itti L (2013) Quantitativeanalysis of human-model agreement in visualsaliency modeling A comparative study IEEETransactions on Image Processing 22(1) 55ndash69doi101109TIP20122210727

Derrington A M amp Lennie P (1984) Spatial andtemporal contrast sensitivities of neurones in lateralgeniculate nucleus of macaque Journal of Physiol-ogy 357 219ndash240

Foley J M (1994) Human luminance pattern-visionmechanisms Masking experiments require a newmodel Journal of the Optical Society of America AOptics Image Science and Vision 11(6) 1710ndash1719

Gouras P (1968) Identification of cone mechanisms inmonkey ganglion cells Journal of Physiology199(3) 533ndash547

Hasson U Nir Y Levy I Fuhrmann G ampMalach R (2004) Intersubject synchronization ofcortical activity during natural vision Science303(5664) 1634ndash1640 doi101126science1089506

Heeger D J (1992) Normalization of cell responses in

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 11

cat striate cortex Visual Neuroscience 9(2) 181ndash197

Hess R F amp Snowden R J (1992) Temporalproperties of human visual filters Number shapesand spatial covariation Vision Research 32(1) 47ndash59

Hirose Y Kennedy A amp Tatler B W (2010)Perception and memory across viewpoint changesin moving images Journal of Vision 10(4) 2 1ndash19httpwwwjournalofvisionorgcontent1042doi1011671042 [PubMed] [Article]

Horiguchi H Nakadomari S Misaki M ampWandell B A (2009) Two temporal channels inhuman V1 identified using fMRI Neuroimage47(1) 273ndash280 doi101016jneuroimage200903078

Itti L Dhayale N amp Pighin F (2003) Realisticavatar eye and head animation using a neurobio-logical model of visual attention Vision Research44(15) 1733ndash1755

Johnston A McOwan P W amp Buxton H (1992) Acomputational model of the analysis of some first-order and second-order motion patterns by simpleand complex cells Proceedings of the Royal SocietyB Biological Sciences 250(1329) 297ndash306 doi101098rspb19920162

Keesey U T (1972) Flicker and pattern detection Acomparison of thresholds Journal of the OpticalSociety of America 62(3) 446ndash448

Kulikowski J J amp Tolhurst D J (1973) Psycho-physical evidence for sustained and transientdetectors in human vision Journal of Physiology232(1) 149ndash162

Lovell P G Gilchrist I D Tolhurst D J ampTroscianko T (2009) Search for gross illumina-tion discrepancies in images of natural objectsJournal of Vision 9(1)37 1ndash14 httpwwwjournalofvisionorgcontent9137 doi1011679137 [PubMed] [Article]

Meese T S (2004) Area summation and maskingJournal of Vision 4(10)8 930ndash943 httpwwwjournalofvisionorgcontent4108 doi1011674108 [PubMed] [Article]

Merigan W H amp Maunsell J H (1993) Howparallel are the primate visual pathways AnnualReview of Neuroscience 16 369ndash402 doi101146annurevne16030193002101

Nassi J J amp Callaway E M (2006) Multiple circuitsrelaying primate parallel visual pathways to themiddle temporal area Journal of Neuroscience26(49) 12789ndash12798 doi101523JNEUROSCI4044-062006

Nassi J J amp Callaway E M (2009) Parallelprocessing strategies of the primate visual systemNature Reviews Neuroscience 10(5) 360ndash372 doi101038nrn2619

Ninomiya T Sawamura H Inoue K amp Takada M(2011) Differential architecture of multisynapticgeniculo-cortical pathways to V4 and MT CerebralCortex 21(12) 2797ndash2808 doi101093cercorbhr078

Peli E (1990) Contrast in complex images Journal ofthe Optical Society of America A Optics ImageScience and Vision 7(10) 2032ndash2040

Perrone J A (2004) A visual motion sensor based onthe properties of V1 and MT neurons VisionResearch 44(15) 1733ndash1755 doi101016jvisres200403003

Sawatari A amp Callaway E M (1996) Convergenceof magno- and parvocellular pathways in layer 4Bof macaque primary visual cortex Nature380(6573) 442ndash446 doi101038380442a0

Seshadrinathan K amp Bovik A (2010) Motion tunedspatio-temporal quality assessment of naturalvideos IEEE Transactions on Image Processing19(2) 335ndash350

Simoncelli E P amp Heeger D J (1998) A model ofneuronal responses in visual area MT VisionResearch 38(5) 743ndash761

Smith T J amp Mital P K (2013) Attentionalsynchrony and the influence of viewing task ongaze behavior in static and dynamic scenes Journalof Vision 13(8)16 1ndash24 httpwwwjournalofvisionorgcontent13816 doi10116713816 [PubMed] [Article]

Tadmor Y amp Tolhurst D J (1994) Discriminationof changes in the second-order statistics of naturaland synthetic images Vision Research 34(4) 541ndash554

Tadmor Y amp Tolhurst D J (2000) Calculating thecontrasts that retinal ganglion cells and LGNneurones encounter in natural scenes VisionResearch 40(22) 3145ndash3157

To M Lovell P G Troscianko T amp Tolhurst D J(2008) Summation of perceptual cues in naturalvisual scenes Proceedings of the Royal Society BBiological Sciences 275(1649) 2299ndash2308 doi101098rspb20080692

To M P S Baddeley R J Troscianko T ampTolhurst D J (2011) A general rule for sensorycue summation Evidence from photographicmusical phonetic and cross-modal stimuli Pro-ceedings of the Royal Society B Biological Sciences278(1710) 1365ndash1372 doi101098rspb20101888

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 12

To M P S Gilchrist I D amp Tolhurst D J (2014)Transient vs Sustained Modeling temporal im-pulse functions in visual discrimination of dynamicnatural images Perception 43 478

To M P S Lovell P G Troscianko T Gilchrist ID Benton C P amp Tolhurst D J (2012)Modeling human discrimination of moving imagesPerception 41(10) 1270ndash1271

To M P S Lovell P G Troscianko T amp TolhurstD J (2010) Perception of suprathreshold natu-ralistic changes in colored natural images Journalof Vision 10(4)12 1ndash22 httpwwwjournalofvisionorgcontent10412 doi10116710412 [PubMed] [Article]

Tolhurst D J (1973) Separate channels for theanalysis of the shape and the movement of movingvisual stimulus Journal of Physiology 231(3) 385ndash402

Tolhurst D J (1975) Sustained and transient channelsin human vision Vision Research 15 1151ndash1155

Tolhurst D J amp Thompson I D (1981) On thevariety of spatial frequency selectivities shown byneurons in area 17 of the cat Proceedings of theRoyal Society B Biological Sciences 213(1191)183ndash199

Tolhurst D J To M P S Chirimuuta MTroscianko T Chua P Y amp Lovell P G (2010)Magnitude of perceived change in natural imagesmay be linearly proportional to differences inneuronal firing rates Seeing Perceiving 23(4) 349ndash372

Troscianko T Meese T S amp Hinde S (2012)

Perception while watching movies Effects ofphysical screen size and scene type Iperception3(7) 414ndash425 doi101068i0475aap

Vidyasagar T R Kulikowski J J Lipnicki D Mamp Dreher B (2002) Convergence of parvocellularand magnocellular information channels in theprimary visual cortex of the macaque EuropeanJournal of Neuroscience 16(5) 945ndash956

Vu P V amp Chandler D M (2014) Vis3 Analgorithm for video quality assessment via analysisof spatial and spatiotemporal slices Online sup-plement The Journal of Electronic Imaging 23(1)013016

Wang Z amp Li Q (2007) Video quality assessmentusing a statistical model of human visual speedperception Journal of the Optical Society ofAmerica A Optics Image Science and Vision24(12) B61ndashB69

Watson A B (1987) Efficiency of a model humanimage code Journal of the Optical Society ofAmerica A Optics Image Science and Vision4(12) 2401ndash2417

Watson A B (1998) Toward a perceptual videoquality metric Human Vision Visual Processingand Digital Display VIII 3299 139ndash147

Watson A B Hu J amp McGowan J F III (2001)Digital video quality metric based on human visionJournal of Electronic Imaging 10 20ndash29

Watson A B amp Solomon J A (1997) Model ofvisual contrast gain control and pattern maskingJournal of the Optical Society of America A OpticsImage Science and Vision 14(9) 2379ndash2391

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 13

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • Experimental results
  • f04
  • f05
  • t01
  • Discussion
  • f06
  • Adelson1
  • Blakemore1
  • Borji1
  • Derrington1
  • Foley1
  • Gouras1
  • Hasson1
  • Heeger1
  • Hess1
  • Hirose1
  • Horiguchi1
  • Itti1
  • Johnston1
  • Keesey1
  • Kulikowski1
  • Lovell1
  • Meese1
  • Merigan1
  • Nassi1
  • Nassi2
  • Ninomiya1
  • Peli1
  • Perrone1
  • Sawatari1
  • Seshadrinathan1
  • Simoncelli1
  • Smith1
  • Tadmor1
  • Tadmor2
  • To1
  • To2
  • To3
  • To4
  • To5
  • Tolhurst1
  • Tolhurst2
  • Tolhurst3
  • Tolhurst4
  • Troscianko1
  • Vidyasagar1
  • Vu1
  • Wang1
  • Watson1
  • Watson2
  • Watson3
  • Watson4

Comparing all the frames in a pair of movie clips

Such VDP modeling is confined to extracting a single predicted difference value from comparing two images. Here, we elaborate the VDP to compare two movie clips, each of which comprises 150 possibly different frames, or brief static images. We have investigated three different methods of breaking movies into static image pairs for application of the VDP model. These methods tap different spatio-temporal aspects of any differences between the paired movies and are shown schematically in Figure 3.

The Spatial Model: Stepwise between-movie comparison (Figure 3A). Each of the 150 frames in one movie is compared with the corresponding frame of the other movie, giving 150 VDP outputs, which are then combined by Minkowski summation into a single number (Watson, 1998; Watson et al., 2001). This is one prediction of how different the two movie clips would seem to be. The VDP could show differences for two reasons. First, if the movies are essentially of the same scene but are moving temporally at a different rate, the compared frames will get out of synchrony. Second, at an extreme, the VDP would show differences even if the two movie clips were actually static but had different spatial content. This method of comparing frames will be sensitive to differences in spatial structure as well as to differences in dynamics, and is a measure of overall difference in spatio-temporal structure.

The Temporal Model: Stepwise within-movie comparison. Here, the two movies are first analyzed separately. Each frame in a movie clip is compared with the successive frame in the same movie; with 150 frames, this gives 149 VDP comparison values, which are combined to a single value by Minkowski summation (Figure 3B). This is a measure of the overall dynamic content of the movie clip and is calculated for each of the paired movies separately. The difference between the values for the two clips is a measure of the difference in overall dynamic content. It would be uninfluenced by any spatial differences in scene content.

The Ordered-Temporal Model: Hybrid two-step movie comparison. The previous measure gives overall dynamic structure but may not, for instance, illustrate whether particular dynamic events occur at different times within the two movie clips (Figure 3C). Here, each frame in a movie clip is compared directly with the succeeding frame (as before), but this dynamic frame difference is now compared with the dynamic difference between the two equivalent frames in the other movie clip. So, within each movie, the nth frame is compared to the (n+1)th frame, and then these two differences (the nth difference from each movie) are compared to generate a between-movie difference. The 149 between-movie differences are combined, again by Minkowski summation. This gives a measure of the perceived difference in ordered dynamic content, irrespective of any spatial scene content differences. This model has the advantage of being sensitive to dynamic changes occurring both within each movie and between the two.
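
To make the three schemes concrete, here is a minimal Python sketch. The per-frame-pair difference predictor vdp, the Minkowski exponent value, and the treatment of the within-movie differences as scalars in the Ordered-Temporal Model are illustrative assumptions, not the authors' exact implementation.

    import numpy as np

    def minkowski(diffs, m=2.8):
        """Combine many elementary differences into one value; the
        exponent m is a free parameter here, not taken from the paper."""
        d = np.asarray(diffs, dtype=float)
        return float((d ** m).sum() ** (1.0 / m))

    def spatial_model(movie_a, movie_b, vdp, m=2.8):
        """Figure 3A: compare corresponding frames between the movies."""
        return minkowski([vdp(fa, fb) for fa, fb in zip(movie_a, movie_b)], m)

    def temporal_model(movie_a, movie_b, vdp, m=2.8):
        """Figure 3B: summarize dynamic content within each movie,
        then take the difference of the two summary values."""
        dyn_a = minkowski([vdp(movie_a[i], movie_a[i + 1])
                           for i in range(len(movie_a) - 1)], m)
        dyn_b = minkowski([vdp(movie_b[i], movie_b[i + 1])
                           for i in range(len(movie_b) - 1)], m)
        return abs(dyn_a - dyn_b)

    def ordered_temporal_model(movie_a, movie_b, vdp, m=2.8):
        """Figure 3C: compare the nth within-movie frame difference in
        one movie with the nth within-movie difference in the other."""
        n = min(len(movie_a), len(movie_b)) - 1
        steps = [abs(vdp(movie_a[i], movie_a[i + 1]) -
                     vdp(movie_b[i], movie_b[i + 1])) for i in range(n)]
        return minkowski(steps, m)

With 150-frame clips, these yield 150, 149, and 149 elementary comparisons respectively, as described in the text.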

Modeling realistic impulse functions

Direct VDP comparison of paired frames is not a realistic model, since the human visual system is almost certainly unable to code each 8 ms frame as a separate event. As a result, it is necessary to model human dynamic sensitivity. Thus, before performing the paired VDP comparisons, we convolved the movie clips with two different impulse functions (Figure 4A), representing "sustained" (P-cell, red curve) or "transient" (M-cell, blue curve) channels. The impulse functions have a shape and duration that match the temporal-frequency sensitivity functions (Figure 4B) inferred for the two kinds of channel by Kulikowski and Tolhurst (1973). The three methods for comparing movie frames were performed separately on the movies convolved with the sustained and transient impulse functions, giving six potential predictors of human performance. Note that, in the absence of realistic temporal filtering, the analysis would effectively be with a very brief sustained impulse function lasting no more than 8 ms (compared to the 100 ms duration of the realistic sustained impulse; Figure 4A).

Figure 3. Schematics showing the three methods we used to compare movie clips, based on the idea that we should compare pairs of movie frames directly, either between the movies (A) or within movies (B, C). See details in the text.
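
A sketch of this temporal prefiltering step, assuming illustrative gamma-shaped kernels; the actual impulse functions were matched to Kulikowski and Tolhurst (1973), so the shapes and time constants below are placeholders.

    import numpy as np

    def gamma_kernel(t, tau, n):
        """Gamma-shaped temporal kernel, normalized to unit area."""
        h = (t / tau) ** (n - 1) * np.exp(-t / tau)
        return h / h.sum()

    dt = 0.008                          # one sample per 8 ms frame
    t = np.arange(dt, 0.3, dt)          # ~300 ms of support
    sustained = gamma_kernel(t, tau=0.02, n=5)    # monophasic, ~100 ms
    transient = gamma_kernel(t, tau=0.008, n=3) - gamma_kernel(t, tau=0.02, n=5)
    # the transient kernel is biphasic and integrates to ~0, so steady
    # input cancels and only temporal change survives

    def prefilter(movie, kernel):
        """Causally convolve a (frames, height, width) clip along time."""
        return np.apply_along_axis(
            lambda x: np.convolve(x, kernel)[:len(x)], 0, movie)

Applying the three comparison functions to the sustained- and transient-filtered versions of each clip gives the six predictors.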

Experimental results

Reliability of the observersrsquo ratings

Seven naïve observers were presented with 198 pairs of monochrome movies and were asked to rate how different they perceived the two movie clips in each pair to be. We have previously shown in To et al. (2010) that observers' ratings for static natural images are reliable. To verify the robustness of the difference ratings for movie pairs in the present experiment, we considered the correlation coefficient between the ratings given by any one observer and those given by each other observer; this value ranged from 0.36 to 0.62 (mean 0.50).

In our previous visual discrimination studies (To et al., 2010), we have scaled and averaged the difference ratings of all seven observers to each stimulus pair (see Methods) and have used the averages to compare with various model predictions. Although this procedure might remove any potential between-observer differences in strategy, it was used to average out within-observer variability. As in our previous experiments and analyses, averaging together the results of observers produced reliable data sets for modeling.
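
A minimal sketch of this scaling-and-averaging step and of the between-observer correlation check; the normalization used here (equating each observer's mean rating) is an assumption, and the ratings array is a placeholder.

    import numpy as np

    rng = np.random.default_rng(0)
    ratings = rng.random((7, 198)) * 40   # placeholder: 7 observers x 198 pairs

    # equate each observer's mean rating, then average across observers
    scaled = ratings / ratings.mean(axis=1, keepdims=True)
    mean_rating = scaled.mean(axis=0)     # one averaged rating per movie pair

    # between-observer reliability over all 21 observer pairs
    r = np.corrcoef(ratings)              # 7 x 7 correlation matrix
    pairs = r[np.triu_indices_from(r, k=1)]
    print(pairs.min(), pairs.mean(), pairs.max())   # 0.36-0.62 in the study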

When observers' averaged ratings for unfiltered, high-pass, and low-pass movies were compared (see Figure 5), we found that the ratings for filtered movies were generally lower than those for unfiltered movies. This would seem understandable, as unfiltered movies contain more details.

Observersrsquo ratings versus model predictions

The average of the normalized ratings for each movie pair was compared with the predictions obtained from a variety of different models of V1 complex cell processing. The correlation coefficients between the ratings and predictions are presented in Table 1. The different versions of the model all rely ultimately on a Visual Difference Predictor (To et al., 2010; Tolhurst et al., 2010), which calculates the perceived difference between a pair of frames. Given that the movie clips are composed of 150 frames each, there are a number of ways of choosing paired frames to compare a pair of movie clips (see Methods). The Spatial Model takes each frame in one movie and compares it with the equivalent frame in the second movie, and is sensitive to differences in spatial content and dynamic events in the movie clips. The Temporal Model compares successive frames within a movie clip to assess the overall dynamic content of a clip. The Ordered-Temporal Model is similar, but it compares the timings of dynamic events between movie clips.

Figure 4. (A) The impulse functions used to convolve the movie clips: Sustained (red) and Transient (blue). These impulse functions have the temporal-frequency filter curves shown in (B) and are like those proposed by Kulikowski and Tolhurst (1973) for sustained and transient channels.

Figure 5. The averaged ratings given by observers for the spatially filtered movies are plotted against the ratings for the raw, unfiltered versions of the movies: low-pass filtered (blue) and high-pass filtered (red).

Without temporal filtering

We first used the VDP to compare pairs of frames as they were actually presented on the display; this presumes, unrealistically, that an observer could treat each 8 ms frame as a separate event and that the human contrast sensitivity function (CSF) is flat out to about 60 Hz. Overall, the best predictions by just one model were generated by the Ordered-Temporal Model (r = 0.438). The difference predictions were slightly improved when the three models were combined into a multiple linear regression model, and the correlation coefficient increased to r = 0.507. When considering the accuracy of the models' predictions in relation to the different types of movies presented (i.e., spatially unfiltered, low-pass, and high-pass), the predictions were consistently superior for the high-pass movies, and the poorest predictions were for the difference ratings for the low-pass movies. More specifically, in the case of the Multilinear Model, the correlation values between predictions and measured ratings for unfiltered, low-pass, and high-pass were 0.608, 0.392, and 0.704, respectively.
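
The combination of model predictors by multiple linear regression can be sketched as follows; multilinear_r is a hypothetical helper, and the predictor arrays would come from the model sketches in the Methods.

    import numpy as np

    def multilinear_r(predictors, ratings):
        """Least-squares fit of ratings from the given predictors plus
        an intercept; returns the correlation of fitted with measured."""
        X = np.column_stack(list(predictors) + [np.ones(len(ratings))])
        beta, *_ = np.linalg.lstsq(X, ratings, rcond=None)
        return float(np.corrcoef(X @ beta, ratings)[0, 1])

    # three-parameter fit without temporal filtering, e.g.:
    # multilinear_r([spatial, temporal, ordered], mean_rating)  # ~0.507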

Realistic temporal filtering

The predictions discussed so far were generated by models that compare the movie pairs frame-by-frame without taking into account the temporal properties of V1 neurons. Now we examine whether the inclusion of sustained and/or transient temporal filters (Figure 4) improves the models' performance at predicting perceived differences between two movie clips.

The Multilinear Model based only on the sustained impulse function caused little improvement, either overall (r = 0.509 now vs. 0.507 originally) or separately for the three different kinds of spatially filtered movie clips; the predictions given by the different models separately for the different spatially filtered movie versions were also little improved. In particular, the fit for low-pass movies remained poor (r = 0.408 vs. r = 0.392). This is hardly surprising, given that low spatial frequencies are believed to be detected primarily by transient channels rather than sustained channels (Kulikowski & Tolhurst, 1973). However, it is worth noting that, for the high-pass spatially filtered movies, the realistic sustained impulse improves the predictions of the Spatial Model (r increases from 0.336 to 0.582) but lessens the predictions of the Ordered-Temporal Model (r decreases from 0.584 to 0.488). This would seem to be consistent with the idea that higher spatial frequencies are predominantly processed by sustained channels (Kulikowski & Tolhurst, 1973).

As expected, when the modeling was based on the transient impulse function alone, with its substantially different form (Figure 4), the nature of the correlations changed. In particular, it generated much stronger fits for the low-pass spatially filtered movies (r = 0.636 vs. r = 0.408 in the previous case). However, the fits for high-pass were noticeably weaker (r = 0.675 vs. r = 0.749). When taking all the movies into account, the improvement of the predictions for the low-pass movies meant that the overall correlation for the multilinear fit improved from 0.509 to 0.598. It therefore becomes clear that the sustained and transient filters operate quite specifically to improve the predictions for only one type of movie: high-pass movies in the case of sustained impulse functions, and low-pass movies in the case of transient impulse functions.

Models                          Transient and/or   All movies   Unfiltered   Low-pass   High-pass
                                sustained          (n = 162)    (n = 54)     (n = 54)   (n = 54)

Spatial                         –                  0.288        0.338        0.274      0.336
Temporal                        –                  0.357        0.409        0.102      0.546
Ordered-Temporal                –                  0.438        0.505        0.266      0.584
Multilinear (three parameters)  –                  0.507        0.608        0.392      0.704
Spatial                         S                  0.396        0.425        0.306      0.582
Temporal                        S                  0.299        0.292        0.241      0.414
Ordered-Temporal                S                  0.415        0.503        0.370      0.488
Multilinear (three parameters)  S                  0.509        0.631        0.408      0.749
Spatial                         T                  0.453        0.391        0.444      0.565
Temporal                        T                  0.454        0.518        0.481      0.425
Ordered-Temporal                T                  0.342        0.311        0.301      0.469
Multilinear (three parameters)  T                  0.598        0.627        0.636      0.675
Multilinear (six parameters)    T+S                0.759        0.788        0.702      0.806

Table 1. The correlation coefficients between the observers' ratings (average of 7 per stimulus) and various V1-based models. The correlations for the spatially unfiltered, the low-pass, and the high-pass filtered subsets are shown separately, as well as the fits for the entire set of 162 movie pairs. Correlations are shown for raw movies and for temporally filtered models (S = sustained, T = transient). The final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function.

Finally, the predictions of the V1-based model were substantially improved by incorporating both the transient and sustained filters, that is, a multiple linear fit with six parameters (three ways of comparing frames, for each of the sustained and transient temporal filters). Figure 6 plots the observers' averaged ratings against the six-parameter model predictions. The correlation coefficient between the predictions generated by this six-parameter multilinear model and observers' measured ratings was 0.759 for the full set of 162 movie pairs, a very substantial improvement over our starting value of 0.507 (p = 0.0001, n = 162). The correlations for the six-parameter fits also improved separately for the spatially unfiltered (black symbols), the low-pass (blue), and the high-pass movies (red). The greatest improvement from our starting point without temporal filtering was in the multilinear fit for the low-pass spatially filtered movies (r increases from 0.392 to 0.702).
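
Assembling this six-parameter model amounts to crossing the two prefilters with the three comparison schemes; the sketch below reuses the hypothetical helpers from earlier and illustrates the bookkeeping only.

    # movies_a, movies_b: matched lists of (frames, height, width) clips
    predictors = []
    for kernel in (sustained, transient):
        fa = [prefilter(a, kernel) for a in movies_a]
        fb = [prefilter(b, kernel) for b in movies_b]
        for model in (spatial_model, temporal_model, ordered_temporal_model):
            predictors.append(np.array([model(a, b, vdp)
                                        for a, b in zip(fa, fb)]))

    r_six = multilinear_r(predictors, mean_rating)  # 0.759 reported above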

Discussion

We have investigated whether it is possible to model how human observers rate the differences between two dynamic naturalistic scenes. Observers were presented with pairs of movie clips that differed on various dimensions, such as glitches and speed, and were asked to rate how different the clips appeared to them. We then compared the predictions from a variety of V1-based computational models with the measured ratings. All the models were based on multiple comparisons of pairs of static frames, but the models differed in how pairs of frames were chosen for comparison. The pairwise VDP comparison is the same as we have reported for static photographs of natural scenes (To et al., 2010), but augmented by prior convolution in time of the movie clips with separate, realistic sustained and transient impulse functions (Kulikowski & Tolhurst, 1973).

The use of magnitude estimation ratings in visual psychophysical experiments might be considered too subjective, and hence not very reliable. We have previously shown that ratings are reliable (To et al., 2010): repeated ratings within subjects are highly correlated and reproducible (even when the stimulus set was 900 image pairs), and the correlation for ratings between observers is generally good. We have chosen this method in this and previous studies for three reasons. First, when observers generate ratings of how different pairs of stimuli appear to them, they have the freedom to rate the pairs as they please, regardless of which features were changed. The ratings therefore allow us to measure discrimination across, and independently of, feature dimensions, separately and in combination (To et al., 2008; To et al., 2011). Second, unlike threshold measurements, ratings can be collected quickly. We are interested in studying vision with naturalistic images and naturalistic tasks, and it is a common criticism of such studies that natural images are so varied that we must always study very many (198 movie pairs in this study). Using staircase threshold techniques for so many stimuli would be impractically slow. Third, ratings are not confined to threshold stimuli, and we consider it necessary to be able to measure visual performance with suprathreshold stimuli as well. Overall, averaging the ratings of several observers together has provided us with data sets that are certainly good enough to allow us to distinguish between different V1-based models of visual coding (To et al., 2010; To et al., 2011). Although our study is not aimed toward the invention of image quality metrics, we have essentially borrowed our ratings technique from that applied field.

We have investigated a number of different measures for comparing the amount and timing of dynamic events within a pair of movies. Our Spatial Model computed the differences between corresponding frames in each clip in the movie pair, and then combined the values using Minkowski summation. This is the method used by Watson (1998) and Watson et al. (2001) in their model for evaluating the degree of perceived distortion in compressed video streams. It seems particularly appropriate in such a context, where the compressed video would have the same fundamental spatial and synchronous temporal structure as the uncompressed reference. In a sense, this measure imagines that an observer views and memorizes the first movie clip and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated, and the greater the measured difference. Others have suggested that video distortion metrics would be enhanced by using additional measures where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus, our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by the combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward- and a rightward-moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatio-temporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to map specifically onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip that has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass spatial filtering or to high-pass filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways) with their different spatial frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames, for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial information as well as temporal, and the ordered-temporal/transient factor, handling the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.
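
With the predictor ordering of the earlier assembly sketch (index 0 = spatial/sustained, index 5 = ordered-temporal/transient), the reduced two-factor regression is a one-liner:

    r_two = multilinear_r([predictors[0], predictors[5]], mean_rating)  # ~0.733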

Our VDP is intended to be a model of low-level visual processing, as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), the Transient and Sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002), and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high spatial frequency information will be biased towards sustained processing, while low spatial frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for that case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section, because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials, we discuss this further: by truncating the longer clip of the pair to the same length as the shorter one, we were able to calculate the predictive measures. The resulting six-parameter multilinear regression for all stimuli had a less good fit than for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.
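
The truncation used for those 36 pairs is simple; a sketch with the hypothetical helpers from above:

    # cut the longer clip to the shorter one's frame count, then compare
    n = min(len(clip_a), len(clip_b))
    d_spatial = spatial_model(clip_a[:n], clip_b[:n], vdp)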

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral-movement-sensing elements in our model, either separately or in addition to the temporal impulse functions, and a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.
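
For completeness, here is a minimal sketch of the kind of motion-energy mechanism such future work might add, in the spirit of Adelson and Bergen (1985): a quadrature pair of space-time filters tuned to one drift direction, squared and summed. All parameters are arbitrary illustrations, not values from any of the cited models.

    import numpy as np

    def motion_energy(movie, fx=0.05, ft=0.1):
        """Energy of one direction-tuned space-time channel for a clip
        of shape (frames, height, width); fx and ft are the preferred
        spatial (cycles/pixel) and temporal (cycles/frame) frequencies."""
        T, _, W = movie.shape
        tt, xx = np.meshgrid(np.arange(T), np.arange(W), indexing="ij")
        phase = 2 * np.pi * (fx * xx - ft * tt)      # rightward drift
        env = np.exp(-((xx - W / 2) ** 2) / (2 * (W / 6) ** 2)
                     - ((tt - T / 2) ** 2) / (2 * (T / 6) ** 2))
        even, odd = env * np.cos(phase), env * np.sin(phase)
        st = movie.mean(axis=1)                      # (T, W) space-time slice
        return float((st * even).sum() ** 2 + (st * odd).sum() ** 2)

    # opponent energy, motion_energy(clip, ft=0.1) - motion_energy(clip, ft=-0.1),
    # distinguishes rightward from leftward drift, which pure frame-difference
    # measures cannot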

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project, and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on EP/E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69. doi:10.1109/TIP.2012.2210727

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640. doi:10.1126/science.1089506

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19, http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2. [PubMed] [Article]

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280. doi:10.1016/j.neuroimage.2009.03.078

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Vision Research, 44(15), 1733–1755.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306. doi:10.1098/rspb.1992.0162

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14, http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37. [PubMed] [Article]

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943, http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8. [PubMed] [Article]

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402. doi:10.1146/annurev.ne.16.030193.002101

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798. doi:10.1523/JNEUROSCI.4044-06.2006

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372. doi:10.1038/nrn2619

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808. doi:10.1093/cercor/bhr078

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755. doi:10.1016/j.visres.2004.03.003

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446. doi:10.1038/380442a0

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16. [PubMed] [Article]

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308. doi:10.1098/rspb.2008.0692

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372. doi:10.1098/rspb.2010.1888

To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2014). Transient vs. sustained: Modeling temporal impulse functions in visual discrimination of dynamic natural images. Perception, 43, 478.

To, M. P. S., Lovell, P. G., Troscianko, T., Gilchrist, I. D., Benton, C. P., & Tolhurst, D. J. (2012). Modeling human discrimination of moving images. Perception, 41(10), 1270–1271.

To, M. P. S., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2010). Perception of suprathreshold naturalistic changes in colored natural images. Journal of Vision, 10(4):12, 1–22, http://www.journalofvision.org/content/10/4/12, doi:10.1167/10.4.12. [PubMed] [Article]

Tolhurst, D. J. (1973). Separate channels for the analysis of the shape and the movement of a moving visual stimulus. Journal of Physiology, 231(3), 385–402.

Tolhurst, D. J. (1975). Sustained and transient channels in human vision. Vision Research, 15, 1151–1155.

Tolhurst, D. J., & Thompson, I. D. (1981). On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proceedings of the Royal Society B: Biological Sciences, 213(1191), 183–199.

Tolhurst, D. J., To, M. P. S., Chirimuuta, M., Troscianko, T., Chua, P. Y., & Lovell, P. G. (2010). Magnitude of perceived change in natural images may be linearly proportional to differences in neuronal firing rates. Seeing and Perceiving, 23(4), 349–372.

Troscianko, T., Meese, T. S., & Hinde, S. (2012). Perception while watching movies: Effects of physical screen size and scene type. i-Perception, 3(7), 414–425. doi:10.1068/i0475aap

Vidyasagar, T. R., Kulikowski, J. J., Lipnicki, D. M., & Dreher, B. (2002). Convergence of parvocellular and magnocellular information channels in the primary visual cortex of the macaque. European Journal of Neuroscience, 16(5), 945–956.

Vu, P. V., & Chandler, D. M. (2014). ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. Journal of Electronic Imaging, 23(1), 013016.

Wang, Z., & Li, Q. (2007). Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 24(12), B61–B69.

Watson, A. B. (1987). Efficiency of a model human image code. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 4(12), 2401–2417.

Watson, A. B. (1998). Toward a perceptual video quality metric. Human Vision, Visual Processing, and Digital Display VIII, 3299, 139–147.

Watson, A. B., Hu, J., & McGowan, J. F., III. (2001). Digital video quality metric based on human vision. Journal of Electronic Imaging, 10, 20–29.

Watson, A. B., & Solomon, J. A. (1997). Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 14(9), 2379–2391.


were performed separately on the movies convolvedwith the sustained and transient impulse functionsgiving six potential predictors of human performanceNote that in the absence of realistic temporal filteringthe analysis would effectively be with a very briefsustained impulse function lasting no more than 8 ms(compared to the 100 ms duration of the realisticsustained impulse Figure 4A)

Experimental results

Reliability of the observersrsquo ratings

Seven naıve observers were presented with 198 pairsof monochrome movies and were asked to rate howdifferent they perceived the two movie clips in each pairto be We have previously shown in To et al (2010) thatobserversrsquo ratings for static natural images are reliableTo verify the robustness of the difference ratings formovies pairs in the present experiment we consideredthe correlation coefficient between the ratings given byany one observer with those given by each otherobserver this value ranged from 036 to 062 (mean050)

In our previous visual discrimination studies (To etal 2010) we have scaled and averaged the differenceratings of all seven observers to each stimulus pair (seeMethods) and have used the averages to compare withvarious model predictions Although this procedure

might remove any potential between-observer differ-ences in strategy it was used to average out within-observer variability As in our previous experimentsand analyses averaging together the results of observ-ers produced a reliable datasets for modeling

When observersrsquo averaged ratings for unfilteredhigh-pass and low-pass movies were compared (seeFigure 5) we found that the ratings for filtered movieswere generally lower than those for unfiltered moviesThis would seem understandable as unfiltered moviescontain more details

Observersrsquo ratings versus model predictions

The average of the normalized ratings for each moviepair was compared with the predictions obtained froma variety of different models of V1 complex cellprocessing The correlation coefficients between theratings and predictions are presented in Table 1 Thedifferent versions of the model all rely ultimately on aVisual Difference Predictor (To et al 2010 Tolhurst etal 2010) which calculates the perceived differencebetween a pair of frames Given that the movie clips arecomposed of 150 frames each there are a number ofways of choosing paired frames to compare a pair ofmovie clips (see Methods) The Spatial Model takeseach frame in one movie and compares it with theequivalent frame in the second movie and is sensitiveto differences in spatial content and dynamic events inthe movie clips The Temporal Model comparessuccessive frames within a movie clip to assess theoverall dynamic content of a clip The Ordered-Temporal Model is similar but it compares the timingsof dynamic events between movie clips

Figure 4 (A) The impulse functions used to convolve the movie

clips Sustained (red) and Transient (blue) These impulse

functions have the temporal-frequency filter curves shown in

(B) and are like those proposed by Kulikowski and Tolhurst

(1973) for sustained and transient channels

Figure 5 The averaged ratings given by observers for the

spatially filtered movies are plotted against the ratings for the

raw unfiltered versions of the movies Low-pass filtered (blue)

and high-pass filtered (red)

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 7

Without temporal filtering

We first used the VDP to compare pairs of frames asthey were actually presented on the display thispresumes unrealistically that an observer could treateach 8 ms frame as a separate event and that the humancontrast sensitivity function (CSF) is flat out to about60 Hz Overall the best predictions by just one modelwere generated by the Ordered-Temporal Model (r frac140438) The difference predictions were slightly im-proved when the three models were combined into amultiple linear regression model and the correlationcoefficient increased to rfrac14 0507 When considering theaccuracy of the modelsrsquo predictions in relation to thedifferent types of movies presented (ie spatiallyunfiltered low-pass and high-pass) the predictionswere consistently superior for the high-pass moviesand poorest predictions were for difference ratings forthe low-pass movies More specifically in the case ofthe Multilinear Model the correlation values betweenpredictions and measured ratings for unfiltered low-pass and high-pass were 0608 0392 and 0704respectively

Realistic temporal filtering

The predictions discussed so far were generated bymodels that compare the movies pairs frame-by-framewithout taking into account the temporal properties ofV1 neurons Now we examine whether the inclusion ofsustained orand transient temporal filters (Figure 4)improves the modelsrsquo performance at predicting per-ceived differences between two movie clips

The Multilinear Model based only on the sustainedimpulse function caused little improvement overall (rfrac14

0509 now vs 0507 originally) or separately to thethree different kinds of spatially filtered movie clipsand the predictions given by the different modelsseparately for the different spatially filtered movieversions were also little improved In particular the fitfor low-pass movies remained poor (rfrac14 0408 vs rfrac140392) This is hardly surprising given that low spatialfrequencies are believed to be detected primarily bytransient channels rather than sustained channels(Kulikowski amp Tolhurst 1973) However it is worthnoting that for the high-pass spatially filtered moviesthe realistic sustained impulse improves the predictionsof the Spatial Model (r increases from 0336 to 0582)but lessens the predictions of the Ordered-TemporalModel (r decreases from 0584 to 0488) This wouldseem to be consistent with the idea that higher spatialfrequencies are predominantly processed by sustainedchannels (Kulikowski amp Tolhurst 1973)

As expected, when the modeling was based on the transient impulse function alone, with its substantially different form (Figure 4), the nature of the correlations changed. In particular, it generated much stronger fits for the low-pass spatially filtered movies (r = 0.636 vs. r = 0.408 in the previous case). However, the fits for high-pass movies were noticeably weaker (r = 0.675 vs. r = 0.749). When taking all the movies into account, the improvement of the predictions for the low-pass movies meant that the overall correlation for the multilinear fit improved from 0.509 to 0.598. It therefore becomes clear that the sustained and transient filters operate quite specifically to improve the predictions for only one type of movie: high-pass movies in the case of sustained impulse functions, and low-pass movies in the case of transient impulse functions.

Models                            Transient and/or   All movies   Unfiltered   Low-pass   High-pass
                                  sustained          (n = 162)    (n = 54)     (n = 54)   (n = 54)
Spatial                           –                  0.288        0.338        0.274      0.336
Temporal                          –                  0.357        0.409        0.102      0.546
Ordered-Temporal                  –                  0.438        0.505        0.266      0.584
Multilinear (three parameters)    –                  0.507        0.608        0.392      0.704
Spatial                           S                  0.396        0.425        0.306      0.582
Temporal                          S                  0.299        0.292        0.241      0.414
Ordered-Temporal                  S                  0.415        0.503        0.370      0.488
Multilinear (three parameters)    S                  0.509        0.631        0.408      0.749
Spatial                           T                  0.453        0.391        0.444      0.565
Temporal                          T                  0.454        0.518        0.481      0.425
Ordered-Temporal                  T                  0.342        0.311        0.301      0.469
Multilinear (three parameters)    T                  0.598        0.627        0.636      0.675
Multilinear (six parameters)      T & S              0.759        0.788        0.702      0.806

Table 1. The correlation coefficients between the observers' ratings (average of 7 per stimulus) and various V1-based models. The correlations for the spatially unfiltered, the low-pass, and the high-pass filtered subsets are shown separately, as well as the fits for the entire set of 162 movie pairs. Correlations are shown for raw movies and for temporally filtered (sustained and transient) models. The final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function.


Finally, the predictions of the V1-based model were substantially improved by incorporating both the transient and sustained filters, that is, a multiple linear fit with six parameters (three ways of comparing frames for each of the sustained and transient temporal filters). Figure 6 plots the observers' averaged ratings against the six-parameter model predictions. The correlation coefficient between the predictions generated by this six-parameter multilinear model and observers' measured ratings was 0.759 for the full set of 162 movie pairs, a very substantial improvement over our starting value of 0.507 (p = 0.0001, n = 162). The correlations for the six-parameter fits also improved separately for the spatially unfiltered movies (black symbols), the low-pass movies (blue), and the high-pass movies (red). The greatest improvement from our starting point without temporal filtering was in the multilinear fit for the low-pass spatially filtered movies (r increases from 0.392 to 0.702).

Discussion

We have investigated whether it is possible to model how human observers rate the differences between two dynamic naturalistic scenes. Observers were presented with pairs of movie clips that differed on various dimensions, such as glitches and speed, and were asked to rate how different the clips appeared to them. We then compared the predictions from a variety of V1-based computational models with the measured ratings. All the models were based on multiple comparisons of pairs of static frames, but the models differed in how pairs of frames were chosen for comparison. The pairwise VDP comparison is the same as we have reported for static photographs of natural scenes (To et al., 2010), but augmented by prior convolution in time of the movie clips with separate, realistic sustained and transient impulse functions (Kulikowski & Tolhurst, 1973).

The use of magnitude estimation ratings in visual psychophysical experiments might be considered too subjective, and hence not very reliable. We have previously shown that ratings are reliable (To et al., 2010): repeated ratings within subjects are highly correlated/reproducible (even when the stimulus set was 900 image pairs), and the correlation for ratings between observers is generally good. We have chosen this method in this and previous studies for three reasons. First, when observers generate ratings of how different pairs of stimuli appear to them, they have the freedom to rate the pairs as they please, regardless of which features were changed. The ratings therefore allow us to measure discrimination across and independently of feature dimensions, separately and in combination (To et al., 2008; To et al., 2011). Second, unlike threshold measurements, ratings can be collected quickly. We are interested in studying vision with naturalistic images and naturalistic tasks, and it is a common criticism of such study that natural images are so varied that we must always study very many (198 movie pairs in this study); using staircase threshold techniques for so many stimuli would be impractically slow. Third, ratings are not confined to threshold stimuli, and we consider it necessary to be able to measure visual performance with suprathreshold stimuli as well. Overall, averaging the ratings of several observers together has provided us with data sets that are certainly good enough to allow us to distinguish between different V1-based models of visual coding (To et al., 2010; To et al., 2011). Although our study is not aimed toward the invention of image-quality metrics, we have essentially borrowed our ratings technique from that applied field.

We have investigated a number of different measures for comparing the amount and timing of dynamic events within a pair of movies. Our Spatial Model computed the differences between corresponding frames in each clip in the movie pair, and then combined the values using Minkowski summation. This is the method used by Watson (1998) and Watson et al. (2001) in their model for evaluating the degree of perceived distortion in compressed video streams. It seems particularly appropriate in such a context, where the compressed video would have the same fundamental spatial and synchronous temporal structure as the uncompressed reference.

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.


In a sense, this measure imagines that an observer views and memorizes the first movie clip, and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated, and the greater the measured difference. Others have suggested that video distortion metrics would be enhanced by using additional measures where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus, our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by the combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward- and a rightward-moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatiotemporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to map specifically onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip that has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass or to high-pass spatial filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways), with their different spatial-frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial information as well as temporal, and the ordered-temporal/transient factor, which handles the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.
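As a minimal illustration of this exploratory step, the collinearity between the six measures can be checked with a correlation matrix before fitting the reduced two-term regression. All arrays and the two column indices below are placeholders, not our data.

```python
import numpy as np

# Placeholder matrix of the six measures (three frame-comparison
# schemes x two impulse functions) for 162 movie pairs.
X6 = np.random.rand(162, 6)
y = np.random.rand(162)

# Collinearity check: pairwise correlations between the six measures.
print(np.round(np.corrcoef(X6, rowvar=False), 2))

# Reduced model with only two columns, e.g., spatial/sustained and
# ordered-temporal/transient (column indices are assumptions).
X2 = X6[:, [0, 5]]
A = np.column_stack([np.ones(len(X2)), X2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r = np.corrcoef(A @ coef, y)[0, 1]
print(f"two-term fit r = {r:.3f}")
```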

Our VDP is intended to be a model of low-level visual processing: as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), the transient and sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002), and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high-spatial-frequency information will be biased towards sustained processing, while low-spatial-frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for the latter case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section, because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials we discuss this further: by truncating the longer clip of the pair to the same length as the shorter one, we were able to calculate the predictive measures. The resulting six-parameter multilinear regression for all stimuli had a less good fit than that for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.
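The truncation work-around can be sketched in one step, assuming the head-to-head comparison functions from the earlier sketch.

```python
def compare_unequal(movie_a, movie_b, model, **kwargs):
    # Truncate the longer clip to the length of the shorter one, then
    # run any of the head-to-head models defined in the earlier sketch.
    n = min(len(movie_a), len(movie_b))
    return model(movie_a[:n], movie_b[:n], **kwargs)

# e.g., compare_unequal(clip_a, clip_b, spatial_model, m=3.0)
```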

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral-movement-sensing elements in our model, either separately or in addition to the temporal impulse functions; a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project, and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on EP/E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69. doi:10.1109/TIP.2012.2210727

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640. doi:10.1126/science.1089506

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19. http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2 [PubMed] [Article]

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280. doi:10.1016/j.neuroimage.2009.03.078

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Vision Research, 44(15), 1733–1755.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306. doi:10.1098/rspb.1992.0162

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14. http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37 [PubMed] [Article]

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943. http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8 [PubMed] [Article]

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402. doi:10.1146/annurev.ne.16.030193.002101

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798. doi:10.1523/JNEUROSCI.4044-06.2006

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372. doi:10.1038/nrn2619

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808. doi:10.1093/cercor/bhr078

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755. doi:10.1016/j.visres.2004.03.003

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446. doi:10.1038/380442a0

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24. http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16 [PubMed] [Article]

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308. doi:10.1098/rspb.2008.0692

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372. doi:10.1098/rspb.2010.1888

To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2014). Transient vs. sustained: Modeling temporal impulse functions in visual discrimination of dynamic natural images. Perception, 43, 478.

To, M. P. S., Lovell, P. G., Troscianko, T., Gilchrist, I. D., Benton, C. P., & Tolhurst, D. J. (2012). Modeling human discrimination of moving images. Perception, 41(10), 1270–1271.

To, M. P. S., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2010). Perception of suprathreshold naturalistic changes in colored natural images. Journal of Vision, 10(4):12, 1–22. http://www.journalofvision.org/content/10/4/12, doi:10.1167/10.4.12 [PubMed] [Article]

Tolhurst, D. J. (1973). Separate channels for the analysis of the shape and the movement of a moving visual stimulus. Journal of Physiology, 231(3), 385–402.

Tolhurst, D. J. (1975). Sustained and transient channels in human vision. Vision Research, 15, 1151–1155.

Tolhurst, D. J., & Thompson, I. D. (1981). On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proceedings of the Royal Society B: Biological Sciences, 213(1191), 183–199.

Tolhurst, D. J., To, M. P. S., Chirimuuta, M., Troscianko, T., Chua, P. Y., & Lovell, P. G. (2010). Magnitude of perceived change in natural images may be linearly proportional to differences in neuronal firing rates. Seeing and Perceiving, 23(4), 349–372.

Troscianko, T., Meese, T. S., & Hinde, S. (2012). Perception while watching movies: Effects of physical screen size and scene type. i-Perception, 3(7), 414–425. doi:10.1068/i0475aap

Vidyasagar, T. R., Kulikowski, J. J., Lipnicki, D. M., & Dreher, B. (2002). Convergence of parvocellular and magnocellular information channels in the primary visual cortex of the macaque. European Journal of Neuroscience, 16(5), 945–956.

Vu, P. V., & Chandler, D. M. (2014). ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. Journal of Electronic Imaging, 23(1), 013016.

Wang, Z., & Li, Q. (2007). Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24(12), B61–B69.

Watson, A. B. (1987). Efficiency of a model human image code. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 4(12), 2401–2417.

Watson, A. B. (1998). Toward a perceptual video quality metric. Human Vision, Visual Processing, and Digital Display VIII, 3299, 139–147.

Watson, A. B., Hu, J., & McGowan, J. F., III. (2001). Digital video quality metric based on human vision. Journal of Electronic Imaging, 10, 20–29.

Watson, A. B., & Solomon, J. A. (1997). Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 14(9), 2379–2391.

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • Experimental results
  • f04
  • f05
  • t01
  • Discussion
  • f06
  • Adelson1
  • Blakemore1
  • Borji1
  • Derrington1
  • Foley1
  • Gouras1
  • Hasson1
  • Heeger1
  • Hess1
  • Hirose1
  • Horiguchi1
  • Itti1
  • Johnston1
  • Keesey1
  • Kulikowski1
  • Lovell1
  • Meese1
  • Merigan1
  • Nassi1
  • Nassi2
  • Ninomiya1
  • Peli1
  • Perrone1
  • Sawatari1
  • Seshadrinathan1
  • Simoncelli1
  • Smith1
  • Tadmor1
  • Tadmor2
  • To1
  • To2
  • To3
  • To4
  • To5
  • Tolhurst1
  • Tolhurst2
  • Tolhurst3
  • Tolhurst4
  • Troscianko1
  • Vidyasagar1
  • Vu1
  • Wang1
  • Watson1
  • Watson2
  • Watson3
  • Watson4

Without temporal filtering

We first used the VDP to compare pairs of frames asthey were actually presented on the display thispresumes unrealistically that an observer could treateach 8 ms frame as a separate event and that the humancontrast sensitivity function (CSF) is flat out to about60 Hz Overall the best predictions by just one modelwere generated by the Ordered-Temporal Model (r frac140438) The difference predictions were slightly im-proved when the three models were combined into amultiple linear regression model and the correlationcoefficient increased to rfrac14 0507 When considering theaccuracy of the modelsrsquo predictions in relation to thedifferent types of movies presented (ie spatiallyunfiltered low-pass and high-pass) the predictionswere consistently superior for the high-pass moviesand poorest predictions were for difference ratings forthe low-pass movies More specifically in the case ofthe Multilinear Model the correlation values betweenpredictions and measured ratings for unfiltered low-pass and high-pass were 0608 0392 and 0704respectively

Realistic temporal filtering

The predictions discussed so far were generated bymodels that compare the movies pairs frame-by-framewithout taking into account the temporal properties ofV1 neurons Now we examine whether the inclusion ofsustained orand transient temporal filters (Figure 4)improves the modelsrsquo performance at predicting per-ceived differences between two movie clips

The Multilinear Model based only on the sustainedimpulse function caused little improvement overall (rfrac14

0509 now vs 0507 originally) or separately to thethree different kinds of spatially filtered movie clipsand the predictions given by the different modelsseparately for the different spatially filtered movieversions were also little improved In particular the fitfor low-pass movies remained poor (rfrac14 0408 vs rfrac140392) This is hardly surprising given that low spatialfrequencies are believed to be detected primarily bytransient channels rather than sustained channels(Kulikowski amp Tolhurst 1973) However it is worthnoting that for the high-pass spatially filtered moviesthe realistic sustained impulse improves the predictionsof the Spatial Model (r increases from 0336 to 0582)but lessens the predictions of the Ordered-TemporalModel (r decreases from 0584 to 0488) This wouldseem to be consistent with the idea that higher spatialfrequencies are predominantly processed by sustainedchannels (Kulikowski amp Tolhurst 1973)

As expected when the modeling was based on thetransient impulse function alone with its substantiallydifferent form (Figure 4) the nature of the correlationschanged In particular it generated much stronger fitsfor low-pass spatially filtered (rfrac14 0636 vs rfrac14 0408 inprevious case) However the fits for high-pass werenoticeably weaker (rfrac140675 vs rfrac140749) When takingall the movies into account the improvement of thepredictions for the low-pass movies meant that theoverall correlation for the multilinear fit improved from0509 to 0598 It therefore becomes clear that thesustained and transient filters operate quite specificallyto improve the predictions for only one type of moviehigh-pass movies in the case of sustained impulsefunctions and low-pass movies in the case of transientimpulse functions

Models

Transient andor

sustained

All movies

(n frac14 162)

Unfiltered

(n frac14 54)

Low-pass

(n frac14 54)

High-pass

(n frac14 54)

Spatial ndash 0288 0338 0274 0336

Temporal ndash 0357 0409 0102 0546

Ordered-Temporal ndash 0438 0505 0266 0584

Multilinear (three parameters) ndash 0507 0608 0392 0704

Spatial S 0396 0425 0306 0582

Temporal S 0299 0292 0241 0414

Ordered-Temporal S 0415 0503 0370 0488

Multilinear (three parameters) S 0509 0631 0408 0749

Spatial T 0453 0391 0444 0565

Temporal T 0454 0518 0481 0425

Ordered-Temporal T 0342 0311 0301 0469

Multilinear (three parameters) T 0598 0627 0636 0675

Multilinear (six parameters) TS 0759 0788 0702 0806

Table1 The correlation coefficients between the observersrsquo ratings (average of 7 per stimulus) and various V1-based models Thecorrelations for the spatially unfiltered for the low-pass and the high-pass filtered subsets are shown separately as well as the fits forthe entire set of 162 movie pairs Correlations are shown for raw movies and for temporally-filtered (sustained and transient) modelsThe final six-parameter multilinear fit is a prediction of ratings from the three models and the two kinds of impulse function

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 8

Finally the predictions of the V1 based model weresubstantially improved by incorporating both thetransient and sustained filters that is a multiple linearfit with six parameters (three ways for comparingframes for each of the sustained and transienttemporal filters) Figure 6 plots the observersrsquo averagedratings against the six-parameter model predictionsThe correlation coefficients between the predictionsgenerated by this six-parameter multilinear model andobserversrsquo measured ratings was 0759 for the full set of162 movie pairs a very substantial improvement overour starting value of 0507 (P frac14 00001 n frac14 162) Thecorrelations for the six parameter fits also improvedseparately for the spatially unfiltered (black symbols)the low-pass movies (blue) and the high-pass movies(red) The greatest improvement from our startingpoint without temporal filtering was in the multilinearfit for the low-pass spatially filtered movies (r increasesfrom 0392 to 0702)

Discussion

We have investigated whether it is possible to modelhow human observers rate the differences between twodynamic naturalistic scenes Observers were presentedwith pairs of movie clips that differed on variousdimensions such as glitches and speed and were askedto rate how different the clips appeared to them Wethen compared the predictions from a variety of V1-

based computational models with the measured ratingsAll the models were based on multiple comparisons ofpairs of static frames but the models differed in howpairs of frames were chosen for comparison Thepairwise VDP comparison is the same as we havereported for static photographs of natural scenes (To etal 2010) but augmented by prior convolution in timeof the movie clips with separate realistic sustained andtransient impulse functions (Kulikowski amp Tolhurst1973)

The use of magnitude estimation ratings in visualpsychophysical experiments might be considered toosubjective and hence not very reliable We havepreviously shown that ratings are reliable (To et al2010) repeated ratings within subjects are highlycorrelatedreproducible (even when the stimulus setwas 900 image pairs) and the correlation for ratingsbetween observers is generally good We have chosenthis method in this and previous studies for threereasons First when observers generate ratings of howdifferent pairs of stimuli appear to them they have thefreedom to rate the pairs as they please regardless ofwhich features were changed The ratings thereforeallow us to measure discrimination across and inde-pendently of feature dimensions separately and incombination (To et al 2008 To et al 2011) Secondunlike threshold measurements ratings can be collectedquickly We are interested in studying vision withnaturalistic images and naturalistic tasks and it is acommon criticism of such study that natural images areso varied that we must always study very many (198movie pairs in this study) Using staircase thresholdtechniques for so many stimuli would be impracticallyslow Third ratings are not confined to thresholdstimuli and we consider it necessary to be able tomeasure visual performance with suprathreshold stim-uli as well Overall averaging the ratings of severalobservers together has provided us with data sets thatare certainly good enough to allow us to distinguishbetween different V1-based models of visual coding (Toet al 2010 To et al 2011) Although our study is notaimed toward the invention of image quality metricswe have essentially borrowed our ratings techniquefrom that applied field

We have investigated a number of different measuresfor comparing the amount and timing of dynamicevents in within a pair of movies Our Spatial Modelcomputed the differences between correspondingframes in each clip in the movie pair and then combinedthe values using Minkowski summation This is themethod used by Watson (1998) and Watson et al(2001) in their model for evaluating the degree ofperceived distortion in compressed video streams Itseems particularly appropriate in such a context wherethe compressed video would have the same funda-mental spatial and synchronous temporal structure as

Figure 6 The averaged ratings given by observers for the 162

movie pairs consisting of exactly 150 frames are plotted against

our most comprehensive spatiotemporal VDP model raw

movies (black) low-pass filtered (blue) and high-pass filtered

(red) The model is a multilinear regression with six model terms

and an intercept (a seventh parameter) The six-model terms

are for the three methods of comparing frames each with

sustained and transient impulse functions The correlation

coefficient with seven parameters is 0759 See the Supple-

mentary Materials for more details of this regression

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 9

the uncompressed reference In a sense this measureimagines that an observer views and memorizes the firstmovie clip and can then replay the memory while thesecond clip is played In our experiments it will besensitive to any differences in the spatial content of themovies but also to temporal differences caused by lossof synchrony between otherwise identical movies Thegreater the temporal difference or the longer thedifference lasts the more the single frames will becomedecorrelated and the greater the measured differenceOthers have suggested that the video distortion metricswould be enhanced by using additional measureswhere observers evaluate the dynamic content withineach movie clip separately (eg Seshadrinathan ampBovik 2010 Vu amp Chandler 2014 Wang amp Li 2007)We investigated a Temporal Model which combinedthe differences between successive frames within eachmovie and then compared the summary valuesbetween the movies While this measure might sum-marize the degree of dynamic activity within eachmovie clip it will be insensitive to differences in spatialcontent and it is likely to be insensitive to whether twomovie clips have the same overall dynamic content buta different order of events Thus our Ordered-Temporal Model is a hybrid model that includedelements from both the Spatial and Temporal Modelsthis should be particularly enhanced with a transientimpulse function which itself highlights dynamicchange

It is easy to imagine particular kinds of differencebetween movie clips that one or other of our metricswill fail to detect but we feel that most differences willbe shown up by combined use of several metrics Somemay still escape since we have not explicitly modeledthe coding of lateral motion Might our modeling missthe rather obvious difference between a leftward andrightward moving sine-wave grating or a leftright flipin the spatial content of the scene Natural movementin natural scenes is less likely to be so regularlystereotyped as a single moving grating More interest-ingly a literal head-to-head comparison of movieframes will detect almost any spatio-temporal differ-ence but it may be that some kinds of difference arenot perceived by the observer even though a visualcortex model suggests that they should be visibleperhaps because such differences are not of anysurvival benefit (see To et al 2010) Our differentspatial or temporal metrics aim to measure thedifferences between movie clips they are not aimed tospecifically map onto the different kinds of decisionthat the observers actually make Observers oftenreported that they noticed that the lsquolsquorhythmrsquorsquo of one clipwas distorted or that the steady or repetitive movementwithin one clip suddenly speeded up This implies adecision based primarily upon interpreting just one ofthe clips rather than on any direct comparison Our

model can only sense the rhythm of one clip or asudden change within it by comparing it with thecontrol clip that has a steady rhythm and steadycontinuous movement

As well as naturalistic movie clips we alsopresented clips that were subject either to low-passspatial filtering or to high-pass filtering In general thesustained filter was particularly useful at strengtheningpredictions for the movies containing only high spatialfrequencies On the other hand the transient filterimproved predictions for the low-pass spatially filteredmovies These results seem to be consistent with thelsquolsquoclassicalrsquorsquo view of sustained and transient channels(aka the parvocellular vs magnocellular processingpathways) with their different spatial frequency ranges(Hess amp Snowden 1992 Horiguchi et al 2009Keesey 1972 Kulikowski amp Tolhurst 1973 Tolhurst1973 1975) In order to get good predictions ofobserversrsquo ratings it was necessary to have a multi-linear regression that included both sustained andtransient contributions

Although a multiple linear regression fit (Figure 6)with all six candidatesrsquo measures (three ways forcomparing frames for each of the sustained andtransient temporal filters) yielded strong predictions ofobserversrsquo ratings (rfrac14 0759 nfrac14 162) it is necessary topoint out that there are significant correlations betweenthe six measures Using a stepwise linear regression asan exploratory tool to determine the relative weight ofeach we found that of the six factors only two standout as being particularly important (see SupplementaryMaterials) the spatialsustained factor which issensitive to spatial information as well as temporal andthe ordered-temporaltransient factor handling thespecific sequences of dynamic events A multilinearregression with only these two measures gave predic-tions that were almost as strong as with all six factors (rfrac14 0733 Supplementary Materials) Again this findingseems to be consistent with the classical view ofsustained and transient channels that they conveydifferent perceptual information about spatial structureand dynamic content

Our VDP is intended to be a model of low-levelvisual processing as realistic a model of V1 processingas we can deduce from the vast neurophysiologicalsingle-neuron literature The M-cell and P-cell streams(Derrington amp Lennie 1984 Gouras 1968)mdashtheTransient and Sustained candidatesmdashenter V1 sepa-rately and at first sight travel through V1 separatelyto reach different extrastriate visual areas (Merigan ampMaunsell 1993 Nassi amp Callaway 2009) However itis undoubtedly true that there is some convergence ofM-cell and P-cell input onto the same V1 neurons(Sawatari amp Callaway 1996 Vidyasagar KulikowskiLipnicki amp Dreher 2002) and that extrastriate areasreceive convergent input perhaps via different routes

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 10

(Nassi amp Callaway 2006 Ninomiya SawamuraInoue amp Takada 2011) Even if the transient andsustained pathways do not remain entirely separate upto the highest levels of perception it still remains thatperception of high spatial frequency information willbe biased towards sustained processing while lowspatial frequency information will be biased towardstransient processing with its implication in motionsensing

Our measures of perceived difference rely primarilyon head-to-head comparisons of correspondingframes between movies or of the dynamic events atcorresponding points This is core to models ofperceived distortion in compressed videos (egWatson 1998) It is suitable for the latter case becausethe original and compressed videos are likely to beexactly the same length However this becomes alimitation in a more general comparison of movieclips when the clips might be of unequal length Thebulk of the stimuli in our experiment were indeedpaired movie clips of equal length but 36 out of thetotal 198 did differ in length We have not includedthese 36 in the analyses in the Results section becausewe could not simply perform the head-to-headcomparisons between the paired clips In theSupplementary Materials we discuss this further bytruncating the longer clip of the pair to the samelength as the shorter one we were able to calculate thepredictive measures The resulting six-parametermultilinear regression for all stimuli had a less good fitthan for the subset of equal length videos whichsuggests that a better method of comparing unequallength movie clips could be sought

The inclusion of sustained and transient filtersallowed us to model low-level temporal processingHowever one surprise is that we achieved very good fitswithout explicitly modeling neuronal responses tolateral motion It may be relevant that in a survey ofsaliency models Borji Sihite and Itti (2013) report thatmodels incorporating motion did not perform betterthan the best static models over video datasets Onepossibility is that lateral motion produces somesignature in the combination of the various measuresthat we have calculated Future work could includelateral movement sensing elements in our model eitherseparately or in addition to the temporal impulsefunctions and a number of such models of corticalmotion sensing both in V1 and in MTV5 are available(Adelson amp Bergen 1985 Johnston McOwan ampBuxton 1992 Perrone 2004 Simoncelli amp Heeger1998) However the current model does surprisinglywell without such a mechanism

Keywords computational modeling visual discrimi-nation natural scenes spatial processing temporalprocessing

Acknowledgments

We are pleased to acknowledge that the late TomTroscianko played a key role in the planning of thisproject and that P George Lovell and Chris Bentoncontributed to the collection of the raw movie clipsThe research was funded by grants (EPE0370971 andEPE0373721) from the EPSRCDstl under the JointGrants Scheme The first author was employed onE0370971 The publication of this paper was fundedby the Department of Psychology Lancaster Univer-sity UK

Commercial relationships NoneCorresponding author Michelle P S ToEmail mtolancasteracukAddress Department of Psychology Lancaster Uni-versity Lancaster UK

References

Adelson E H amp Bergen J R (1985) Spatiotemporalenergy models for the perception of motionJournal of the Optical Society of America A OpticsImage Science and Vision 2(2) 284ndash299

Blakemore C amp Tobin E A (1972) Lateralinhibition between orientation detectors in the catrsquosvisual cortex Experimental Brain Research 15(4)439ndash440

Borji A Sihite D N amp Itti L (2013) Quantitativeanalysis of human-model agreement in visualsaliency modeling A comparative study IEEETransactions on Image Processing 22(1) 55ndash69doi101109TIP20122210727

Derrington A M amp Lennie P (1984) Spatial andtemporal contrast sensitivities of neurones in lateralgeniculate nucleus of macaque Journal of Physiol-ogy 357 219ndash240

Foley J M (1994) Human luminance pattern-visionmechanisms Masking experiments require a newmodel Journal of the Optical Society of America AOptics Image Science and Vision 11(6) 1710ndash1719

Gouras P (1968) Identification of cone mechanisms inmonkey ganglion cells Journal of Physiology199(3) 533ndash547

Hasson U Nir Y Levy I Fuhrmann G ampMalach R (2004) Intersubject synchronization ofcortical activity during natural vision Science303(5664) 1634ndash1640 doi101126science1089506

Heeger D J (1992) Normalization of cell responses in

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 11

cat striate cortex Visual Neuroscience 9(2) 181ndash197

Hess R F amp Snowden R J (1992) Temporalproperties of human visual filters Number shapesand spatial covariation Vision Research 32(1) 47ndash59

Hirose Y Kennedy A amp Tatler B W (2010)Perception and memory across viewpoint changesin moving images Journal of Vision 10(4) 2 1ndash19httpwwwjournalofvisionorgcontent1042doi1011671042 [PubMed] [Article]

Horiguchi H Nakadomari S Misaki M ampWandell B A (2009) Two temporal channels inhuman V1 identified using fMRI Neuroimage47(1) 273ndash280 doi101016jneuroimage200903078

Itti L Dhayale N amp Pighin F (2003) Realisticavatar eye and head animation using a neurobio-logical model of visual attention Vision Research44(15) 1733ndash1755

Johnston A McOwan P W amp Buxton H (1992) Acomputational model of the analysis of some first-order and second-order motion patterns by simpleand complex cells Proceedings of the Royal SocietyB Biological Sciences 250(1329) 297ndash306 doi101098rspb19920162

Keesey U T (1972) Flicker and pattern detection Acomparison of thresholds Journal of the OpticalSociety of America 62(3) 446ndash448

Kulikowski J J amp Tolhurst D J (1973) Psycho-physical evidence for sustained and transientdetectors in human vision Journal of Physiology232(1) 149ndash162

Lovell P G Gilchrist I D Tolhurst D J ampTroscianko T (2009) Search for gross illumina-tion discrepancies in images of natural objectsJournal of Vision 9(1)37 1ndash14 httpwwwjournalofvisionorgcontent9137 doi1011679137 [PubMed] [Article]

Meese T S (2004) Area summation and maskingJournal of Vision 4(10)8 930ndash943 httpwwwjournalofvisionorgcontent4108 doi1011674108 [PubMed] [Article]

Merigan W H amp Maunsell J H (1993) Howparallel are the primate visual pathways AnnualReview of Neuroscience 16 369ndash402 doi101146annurevne16030193002101

Nassi J J amp Callaway E M (2006) Multiple circuitsrelaying primate parallel visual pathways to themiddle temporal area Journal of Neuroscience26(49) 12789ndash12798 doi101523JNEUROSCI4044-062006

Nassi J J amp Callaway E M (2009) Parallelprocessing strategies of the primate visual systemNature Reviews Neuroscience 10(5) 360ndash372 doi101038nrn2619

Ninomiya T Sawamura H Inoue K amp Takada M(2011) Differential architecture of multisynapticgeniculo-cortical pathways to V4 and MT CerebralCortex 21(12) 2797ndash2808 doi101093cercorbhr078

Peli E (1990) Contrast in complex images Journal ofthe Optical Society of America A Optics ImageScience and Vision 7(10) 2032ndash2040

Perrone J A (2004) A visual motion sensor based onthe properties of V1 and MT neurons VisionResearch 44(15) 1733ndash1755 doi101016jvisres200403003

Sawatari A amp Callaway E M (1996) Convergenceof magno- and parvocellular pathways in layer 4Bof macaque primary visual cortex Nature380(6573) 442ndash446 doi101038380442a0

Seshadrinathan K amp Bovik A (2010) Motion tunedspatio-temporal quality assessment of naturalvideos IEEE Transactions on Image Processing19(2) 335ndash350

Simoncelli E P amp Heeger D J (1998) A model ofneuronal responses in visual area MT VisionResearch 38(5) 743ndash761

Smith T J amp Mital P K (2013) Attentionalsynchrony and the influence of viewing task ongaze behavior in static and dynamic scenes Journalof Vision 13(8)16 1ndash24 httpwwwjournalofvisionorgcontent13816 doi10116713816 [PubMed] [Article]

Tadmor Y amp Tolhurst D J (1994) Discriminationof changes in the second-order statistics of naturaland synthetic images Vision Research 34(4) 541ndash554

Tadmor Y amp Tolhurst D J (2000) Calculating thecontrasts that retinal ganglion cells and LGNneurones encounter in natural scenes VisionResearch 40(22) 3145ndash3157

To M Lovell P G Troscianko T amp Tolhurst D J(2008) Summation of perceptual cues in naturalvisual scenes Proceedings of the Royal Society BBiological Sciences 275(1649) 2299ndash2308 doi101098rspb20080692

To M P S Baddeley R J Troscianko T ampTolhurst D J (2011) A general rule for sensorycue summation Evidence from photographicmusical phonetic and cross-modal stimuli Pro-ceedings of the Royal Society B Biological Sciences278(1710) 1365ndash1372 doi101098rspb20101888

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 12

To M P S Gilchrist I D amp Tolhurst D J (2014)Transient vs Sustained Modeling temporal im-pulse functions in visual discrimination of dynamicnatural images Perception 43 478

To M P S Lovell P G Troscianko T Gilchrist ID Benton C P amp Tolhurst D J (2012)Modeling human discrimination of moving imagesPerception 41(10) 1270ndash1271

To M P S Lovell P G Troscianko T amp TolhurstD J (2010) Perception of suprathreshold natu-ralistic changes in colored natural images Journalof Vision 10(4)12 1ndash22 httpwwwjournalofvisionorgcontent10412 doi10116710412 [PubMed] [Article]

Tolhurst D J (1973) Separate channels for theanalysis of the shape and the movement of movingvisual stimulus Journal of Physiology 231(3) 385ndash402

Tolhurst D J (1975) Sustained and transient channelsin human vision Vision Research 15 1151ndash1155

Tolhurst D J amp Thompson I D (1981) On thevariety of spatial frequency selectivities shown byneurons in area 17 of the cat Proceedings of theRoyal Society B Biological Sciences 213(1191)183ndash199

Tolhurst D J To M P S Chirimuuta MTroscianko T Chua P Y amp Lovell P G (2010)Magnitude of perceived change in natural imagesmay be linearly proportional to differences inneuronal firing rates Seeing Perceiving 23(4) 349ndash372

Troscianko T Meese T S amp Hinde S (2012)

Perception while watching movies Effects ofphysical screen size and scene type Iperception3(7) 414ndash425 doi101068i0475aap

Vidyasagar T R Kulikowski J J Lipnicki D Mamp Dreher B (2002) Convergence of parvocellularand magnocellular information channels in theprimary visual cortex of the macaque EuropeanJournal of Neuroscience 16(5) 945ndash956

Vu P V amp Chandler D M (2014) Vis3 Analgorithm for video quality assessment via analysisof spatial and spatiotemporal slices Online sup-plement The Journal of Electronic Imaging 23(1)013016

Wang Z amp Li Q (2007) Video quality assessmentusing a statistical model of human visual speedperception Journal of the Optical Society ofAmerica A Optics Image Science and Vision24(12) B61ndashB69

Watson A B (1987) Efficiency of a model humanimage code Journal of the Optical Society ofAmerica A Optics Image Science and Vision4(12) 2401ndash2417

Watson A B (1998) Toward a perceptual videoquality metric Human Vision Visual Processingand Digital Display VIII 3299 139ndash147

Watson A B Hu J amp McGowan J F III (2001)Digital video quality metric based on human visionJournal of Electronic Imaging 10 20ndash29

Watson A B amp Solomon J A (1997) Model ofvisual contrast gain control and pattern maskingJournal of the Optical Society of America A OpticsImage Science and Vision 14(9) 2379ndash2391

Journal of Vision (2015) 15(1)19 1ndash13 To Gilchrist amp Tolhurst 13

  • Introduction
  • Methods
  • f01
  • f02
  • f03
  • Experimental results
  • f04
  • f05
  • t01
  • Discussion
  • f06
  • Adelson1
  • Blakemore1
  • Borji1
  • Derrington1
  • Foley1
  • Gouras1
  • Hasson1
  • Heeger1
  • Hess1
  • Hirose1
  • Horiguchi1
  • Itti1
  • Johnston1
  • Keesey1
  • Kulikowski1
  • Lovell1
  • Meese1
  • Merigan1
  • Nassi1
  • Nassi2
  • Ninomiya1
  • Peli1
  • Perrone1
  • Sawatari1
  • Seshadrinathan1
  • Simoncelli1
  • Smith1
  • Tadmor1
  • Tadmor2
  • To1
  • To2
  • To3
  • To4
  • To5
  • Tolhurst1
  • Tolhurst2
  • Tolhurst3
  • Tolhurst4
  • Troscianko1
  • Vidyasagar1
  • Vu1
  • Wang1
  • Watson1
  • Watson2
  • Watson3
  • Watson4

Finally the predictions of the V1 based model weresubstantially improved by incorporating both thetransient and sustained filters that is a multiple linearfit with six parameters (three ways for comparingframes for each of the sustained and transienttemporal filters) Figure 6 plots the observersrsquo averagedratings against the six-parameter model predictionsThe correlation coefficients between the predictionsgenerated by this six-parameter multilinear model andobserversrsquo measured ratings was 0759 for the full set of162 movie pairs a very substantial improvement overour starting value of 0507 (P frac14 00001 n frac14 162) Thecorrelations for the six parameter fits also improvedseparately for the spatially unfiltered (black symbols)the low-pass movies (blue) and the high-pass movies(red) The greatest improvement from our startingpoint without temporal filtering was in the multilinearfit for the low-pass spatially filtered movies (r increasesfrom 0392 to 0702)

Discussion

We have investigated whether it is possible to modelhow human observers rate the differences between twodynamic naturalistic scenes Observers were presentedwith pairs of movie clips that differed on variousdimensions such as glitches and speed and were askedto rate how different the clips appeared to them Wethen compared the predictions from a variety of V1-

based computational models with the measured ratingsAll the models were based on multiple comparisons ofpairs of static frames but the models differed in howpairs of frames were chosen for comparison Thepairwise VDP comparison is the same as we havereported for static photographs of natural scenes (To etal 2010) but augmented by prior convolution in timeof the movie clips with separate realistic sustained andtransient impulse functions (Kulikowski amp Tolhurst1973)

The use of magnitude estimation ratings in visualpsychophysical experiments might be considered toosubjective and hence not very reliable We havepreviously shown that ratings are reliable (To et al2010) repeated ratings within subjects are highlycorrelatedreproducible (even when the stimulus setwas 900 image pairs) and the correlation for ratingsbetween observers is generally good We have chosenthis method in this and previous studies for threereasons First when observers generate ratings of howdifferent pairs of stimuli appear to them they have thefreedom to rate the pairs as they please regardless ofwhich features were changed The ratings thereforeallow us to measure discrimination across and inde-pendently of feature dimensions separately and incombination (To et al 2008 To et al 2011) Secondunlike threshold measurements ratings can be collectedquickly We are interested in studying vision withnaturalistic images and naturalistic tasks and it is acommon criticism of such study that natural images areso varied that we must always study very many (198movie pairs in this study) Using staircase thresholdtechniques for so many stimuli would be impracticallyslow Third ratings are not confined to thresholdstimuli and we consider it necessary to be able tomeasure visual performance with suprathreshold stim-uli as well Overall averaging the ratings of severalobservers together has provided us with data sets thatare certainly good enough to allow us to distinguishbetween different V1-based models of visual coding (Toet al 2010 To et al 2011) Although our study is notaimed toward the invention of image quality metricswe have essentially borrowed our ratings techniquefrom that applied field

We have investigated a number of different measuresfor comparing the amount and timing of dynamicevents in within a pair of movies Our Spatial Modelcomputed the differences between correspondingframes in each clip in the movie pair and then combinedthe values using Minkowski summation This is themethod used by Watson (1998) and Watson et al(2001) in their model for evaluating the degree ofperceived distortion in compressed video streams Itseems particularly appropriate in such a context wherethe compressed video would have the same funda-mental spatial and synchronous temporal structure as

Figure 6. The averaged ratings given by observers for the 162 movie pairs consisting of exactly 150 frames are plotted against our most comprehensive spatiotemporal VDP model: raw movies (black), low-pass filtered (blue), and high-pass filtered (red). The model is a multilinear regression with six model terms and an intercept (a seventh parameter). The six model terms are for the three methods of comparing frames, each with sustained and transient impulse functions. The correlation coefficient with seven parameters is 0.759. See the Supplementary Materials for more details of this regression.


In a sense, this measure imagines that an observer views and memorizes the first movie clip and can then replay the memory while the second clip is played. In our experiments, it will be sensitive to any differences in the spatial content of the movies, but also to temporal differences caused by loss of synchrony between otherwise identical movies. The greater the temporal difference, or the longer the difference lasts, the more the single frames will become decorrelated and the greater the measured difference. Others have suggested that the video distortion metrics would be enhanced by using additional measures where observers evaluate the dynamic content within each movie clip separately (e.g., Seshadrinathan & Bovik, 2010; Vu & Chandler, 2014; Wang & Li, 2007). We investigated a Temporal Model, which combined the differences between successive frames within each movie and then compared the summary values between the movies. While this measure might summarize the degree of dynamic activity within each movie clip, it will be insensitive to differences in spatial content, and it is likely to be insensitive to whether two movie clips have the same overall dynamic content but a different order of events. Thus, our Ordered-Temporal Model is a hybrid model that included elements from both the Spatial and Temporal Models; this should be particularly enhanced with a transient impulse function, which itself highlights dynamic change.
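The sketch below illustrates one plausible reading of the three comparison schemes, assuming a single-frame difference function (here a stand-in root-mean-square pixel difference rather than the full V1-based VDP) and an illustrative Minkowski exponent; all function names and parameter values are assumptions.

```python
import numpy as np

def minkowski(values, m=3.0):
    """Minkowski summation of a set of non-negative difference values."""
    v = np.asarray(values, dtype=float)
    return float(np.sum(v ** m) ** (1.0 / m))

def vdp(frame_a, frame_b):
    """Stand-in single-frame comparison (RMS pixel difference)."""
    return float(np.sqrt(np.mean((frame_a - frame_b) ** 2)))

def spatial_model(movie_a, movie_b, m=3.0):
    # Compare corresponding frames across the pair, then pool over time.
    return minkowski([vdp(a, b) for a, b in zip(movie_a, movie_b)], m)

def temporal_model(movie_a, movie_b, m=3.0):
    # Summarize the dynamic activity within each clip, then compare summaries.
    act_a = minkowski([vdp(movie_a[i], movie_a[i + 1])
                       for i in range(len(movie_a) - 1)], m)
    act_b = minkowski([vdp(movie_b[i], movie_b[i + 1])
                       for i in range(len(movie_b) - 1)], m)
    return abs(act_a - act_b)

def ordered_temporal_model(movie_a, movie_b, m=3.0):
    # Hybrid: compare successive-frame differences at corresponding points in
    # time, so clips with the same events in a different order still differ.
    n = min(len(movie_a), len(movie_b)) - 1
    return minkowski([abs(vdp(movie_a[i], movie_a[i + 1]) -
                          vdp(movie_b[i], movie_b[i + 1]))
                      for i in range(n)], m)
```

In use, each scheme would be run twice per movie pair, once on the sustained-filtered clips and once on the transient-filtered clips, yielding the six candidate measures discussed below.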

It is easy to imagine particular kinds of difference between movie clips that one or other of our metrics will fail to detect, but we feel that most differences will be shown up by combined use of several metrics. Some may still escape, since we have not explicitly modeled the coding of lateral motion. Might our modeling miss the rather obvious difference between a leftward and a rightward moving sine-wave grating, or a left/right flip in the spatial content of the scene? Natural movement in natural scenes is less likely to be so regularly stereotyped as a single moving grating. More interestingly, a literal head-to-head comparison of movie frames will detect almost any spatiotemporal difference, but it may be that some kinds of difference are not perceived by the observer even though a visual cortex model suggests that they should be visible, perhaps because such differences are not of any survival benefit (see To et al., 2010). Our different spatial or temporal metrics aim to measure the differences between movie clips; they are not aimed to map specifically onto the different kinds of decision that the observers actually make. Observers often reported that they noticed that the "rhythm" of one clip was distorted, or that the steady or repetitive movement within one clip suddenly speeded up. This implies a decision based primarily upon interpreting just one of the clips, rather than on any direct comparison. Our model can only sense the rhythm of one clip, or a sudden change within it, by comparing it with the control clip that has a steady rhythm and steady, continuous movement.

As well as naturalistic movie clips, we also presented clips that were subject either to low-pass spatial filtering or to high-pass filtering. In general, the sustained filter was particularly useful at strengthening predictions for the movies containing only high spatial frequencies. On the other hand, the transient filter improved predictions for the low-pass spatially filtered movies. These results seem to be consistent with the "classical" view of sustained and transient channels (a.k.a. the parvocellular vs. magnocellular processing pathways) with their different spatial frequency ranges (Hess & Snowden, 1992; Horiguchi et al., 2009; Keesey, 1972; Kulikowski & Tolhurst, 1973; Tolhurst, 1973, 1975). In order to get good predictions of observers' ratings, it was necessary to have a multilinear regression that included both sustained and transient contributions.

Although a multiple linear regression fit (Figure 6) with all six candidate measures (three ways of comparing frames for each of the sustained and transient temporal filters) yielded strong predictions of observers' ratings (r = 0.759, n = 162), it is necessary to point out that there are significant correlations between the six measures. Using a stepwise linear regression as an exploratory tool to determine the relative weight of each, we found that, of the six factors, only two stand out as being particularly important (see Supplementary Materials): the spatial/sustained factor, which is sensitive to spatial as well as temporal information, and the ordered-temporal/transient factor, handling the specific sequences of dynamic events. A multilinear regression with only these two measures gave predictions that were almost as strong as with all six factors (r = 0.733; Supplementary Materials). Again, this finding seems to be consistent with the classical view of sustained and transient channels: that they convey different perceptual information about spatial structure and dynamic content.
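For concreteness, a minimal sketch of this kind of regression analysis follows, assuming the six measures have been assembled into a matrix with one row per movie pair; the function name, the column ordering, and the use of ordinary least squares are assumptions for illustration.

```python
import numpy as np

def fit_multilinear(measures, ratings):
    """measures: (n_pairs, k) model terms; ratings: (n_pairs,) mean ratings.
    Returns the regression coefficients (plus intercept) and the correlation
    between the fitted predictions and the ratings."""
    X = np.column_stack([measures, np.ones(len(ratings))])  # intercept column
    coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    r = np.corrcoef(X @ coef, ratings)[0, 1]
    return coef, r

# All six terms: the paper reports r = 0.759 for this fit.
# coef6, r6 = fit_multilinear(measures, ratings)
# Spatial/sustained and ordered-temporal/transient terms only (the column
# indices here are assumptions): the comparable two-term fit gave r = 0.733.
# coef2, r2 = fit_multilinear(measures[:, [0, 5]], ratings)
```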

Our VDP is intended to be a model of low-level visual processing, as realistic a model of V1 processing as we can deduce from the vast neurophysiological single-neuron literature. The M-cell and P-cell streams (Derrington & Lennie, 1984; Gouras, 1968), the transient and sustained candidates, enter V1 separately and, at first sight, travel through V1 separately to reach different extrastriate visual areas (Merigan & Maunsell, 1993; Nassi & Callaway, 2009). However, it is undoubtedly true that there is some convergence of M-cell and P-cell input onto the same V1 neurons (Sawatari & Callaway, 1996; Vidyasagar, Kulikowski, Lipnicki, & Dreher, 2002) and that extrastriate areas receive convergent input, perhaps via different routes (Nassi & Callaway, 2006; Ninomiya, Sawamura, Inoue, & Takada, 2011). Even if the transient and sustained pathways do not remain entirely separate up to the highest levels of perception, it still remains that perception of high spatial frequency information will be biased towards sustained processing, while low spatial frequency information will be biased towards transient processing, with its implication in motion sensing.

Our measures of perceived difference rely primarily on head-to-head comparisons of corresponding frames between movies, or of the dynamic events at corresponding points. This is core to models of perceived distortion in compressed videos (e.g., Watson, 1998). It is suitable for that case because the original and compressed videos are likely to be exactly the same length. However, this becomes a limitation in a more general comparison of movie clips, when the clips might be of unequal length. The bulk of the stimuli in our experiment were indeed paired movie clips of equal length, but 36 out of the total 198 did differ in length. We have not included these 36 in the analyses in the Results section because we could not simply perform the head-to-head comparisons between the paired clips. In the Supplementary Materials we discuss this further: by truncating the longer clip of the pair to the same length as the shorter one, we were able to calculate the predictive measures (see the sketch below). The resulting six-parameter multilinear regression for all stimuli had a poorer fit than for the subset of equal-length videos, which suggests that a better method of comparing unequal-length movie clips could be sought.
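A minimal sketch of that truncation step, assuming each clip is an array of frames (the function name is ours):

```python
def truncate_pair(movie_a, movie_b):
    """Trim the longer clip of an unequal-length pair to the shorter clip's
    frame count, so the frame-by-frame measures can still be computed."""
    n = min(len(movie_a), len(movie_b))
    return movie_a[:n], movie_b[:n]
```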

The inclusion of sustained and transient filters allowed us to model low-level temporal processing. However, one surprise is that we achieved very good fits without explicitly modeling neuronal responses to lateral motion. It may be relevant that, in a survey of saliency models, Borji, Sihite, and Itti (2013) report that models incorporating motion did not perform better than the best static models over video datasets. One possibility is that lateral motion produces some signature in the combination of the various measures that we have calculated. Future work could include lateral movement-sensing elements in our model, either separately from or in addition to the temporal impulse functions; a number of such models of cortical motion sensing, both in V1 and in MT/V5, are available (Adelson & Bergen, 1985; Johnston, McOwan, & Buxton, 1992; Perrone, 2004; Simoncelli & Heeger, 1998). However, the current model does surprisingly well without such a mechanism.

Keywords: computational modeling, visual discrimination, natural scenes, spatial processing, temporal processing

Acknowledgments

We are pleased to acknowledge that the late Tom Troscianko played a key role in the planning of this project and that P. George Lovell and Chris Benton contributed to the collection of the raw movie clips. The research was funded by grants (EP/E037097/1 and EP/E037372/1) from the EPSRC/Dstl under the Joint Grants Scheme. The first author was employed on E037097/1. The publication of this paper was funded by the Department of Psychology, Lancaster University, UK.

Commercial relationships: None.
Corresponding author: Michelle P. S. To.
Email: m.to@lancaster.ac.uk.
Address: Department of Psychology, Lancaster University, Lancaster, UK.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2(2), 284–299.

Blakemore, C., & Tobin, E. A. (1972). Lateral inhibition between orientation detectors in the cat's visual cortex. Experimental Brain Research, 15(4), 439–440.

Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1), 55–69. doi:10.1109/TIP.2012.2210727

Derrington, A. M., & Lennie, P. (1984). Spatial and temporal contrast sensitivities of neurones in lateral geniculate nucleus of macaque. Journal of Physiology, 357, 219–240.

Foley, J. M. (1994). Human luminance pattern-vision mechanisms: Masking experiments require a new model. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 11(6), 1710–1719.

Gouras, P. (1968). Identification of cone mechanisms in monkey ganglion cells. Journal of Physiology, 199(3), 533–547.

Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640. doi:10.1126/science.1089506

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9(2), 181–197.

Hess, R. F., & Snowden, R. J. (1992). Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, 32(1), 47–59.

Hirose, Y., Kennedy, A., & Tatler, B. W. (2010). Perception and memory across viewpoint changes in moving images. Journal of Vision, 10(4):2, 1–19, http://www.journalofvision.org/content/10/4/2, doi:10.1167/10.4.2.

Horiguchi, H., Nakadomari, S., Misaki, M., & Wandell, B. A. (2009). Two temporal channels in human V1 identified using fMRI. Neuroimage, 47(1), 273–280. doi:10.1016/j.neuroimage.2009.03.078

Itti, L., Dhayale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. Vision Research, 44(15), 1733–1755.

Johnston, A., McOwan, P. W., & Buxton, H. (1992). A computational model of the analysis of some first-order and second-order motion patterns by simple and complex cells. Proceedings of the Royal Society B: Biological Sciences, 250(1329), 297–306. doi:10.1098/rspb.1992.0162

Keesey, U. T. (1972). Flicker and pattern detection: A comparison of thresholds. Journal of the Optical Society of America, 62(3), 446–448.

Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232(1), 149–162.

Lovell, P. G., Gilchrist, I. D., Tolhurst, D. J., & Troscianko, T. (2009). Search for gross illumination discrepancies in images of natural objects. Journal of Vision, 9(1):37, 1–14, http://www.journalofvision.org/content/9/1/37, doi:10.1167/9.1.37.

Meese, T. S. (2004). Area summation and masking. Journal of Vision, 4(10):8, 930–943, http://www.journalofvision.org/content/4/10/8, doi:10.1167/4.10.8.

Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? Annual Review of Neuroscience, 16, 369–402. doi:10.1146/annurev.ne.16.030193.002101

Nassi, J. J., & Callaway, E. M. (2006). Multiple circuits relaying primate parallel visual pathways to the middle temporal area. Journal of Neuroscience, 26(49), 12789–12798. doi:10.1523/JNEUROSCI.4044-06.2006

Nassi, J. J., & Callaway, E. M. (2009). Parallel processing strategies of the primate visual system. Nature Reviews Neuroscience, 10(5), 360–372. doi:10.1038/nrn2619

Ninomiya, T., Sawamura, H., Inoue, K., & Takada, M. (2011). Differential architecture of multisynaptic geniculo-cortical pathways to V4 and MT. Cerebral Cortex, 21(12), 2797–2808. doi:10.1093/cercor/bhr078

Peli, E. (1990). Contrast in complex images. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 7(10), 2032–2040.

Perrone, J. A. (2004). A visual motion sensor based on the properties of V1 and MT neurons. Vision Research, 44(15), 1733–1755. doi:10.1016/j.visres.2004.03.003

Sawatari, A., & Callaway, E. M. (1996). Convergence of magno- and parvocellular pathways in layer 4B of macaque primary visual cortex. Nature, 380(6573), 442–446. doi:10.1038/380442a0

Seshadrinathan, K., & Bovik, A. (2010). Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing, 19(2), 335–350.

Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38(5), 743–761.

Smith, T. J., & Mital, P. K. (2013). Attentional synchrony and the influence of viewing task on gaze behavior in static and dynamic scenes. Journal of Vision, 13(8):16, 1–24, http://www.journalofvision.org/content/13/8/16, doi:10.1167/13.8.16.

Tadmor, Y., & Tolhurst, D. J. (1994). Discrimination of changes in the second-order statistics of natural and synthetic images. Vision Research, 34(4), 541–554.

Tadmor, Y., & Tolhurst, D. J. (2000). Calculating the contrasts that retinal ganglion cells and LGN neurones encounter in natural scenes. Vision Research, 40(22), 3145–3157.

To, M., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2008). Summation of perceptual cues in natural visual scenes. Proceedings of the Royal Society B: Biological Sciences, 275(1649), 2299–2308. doi:10.1098/rspb.2008.0692

To, M. P. S., Baddeley, R. J., Troscianko, T., & Tolhurst, D. J. (2011). A general rule for sensory cue summation: Evidence from photographic, musical, phonetic and cross-modal stimuli. Proceedings of the Royal Society B: Biological Sciences, 278(1710), 1365–1372. doi:10.1098/rspb.2010.1888

To, M. P. S., Gilchrist, I. D., & Tolhurst, D. J. (2014). Transient vs. sustained: Modeling temporal impulse functions in visual discrimination of dynamic natural images. Perception, 43, 478.

To, M. P. S., Lovell, P. G., Troscianko, T., Gilchrist, I. D., Benton, C. P., & Tolhurst, D. J. (2012). Modeling human discrimination of moving images. Perception, 41(10), 1270–1271.

To, M. P. S., Lovell, P. G., Troscianko, T., & Tolhurst, D. J. (2010). Perception of suprathreshold naturalistic changes in colored natural images. Journal of Vision, 10(4):12, 1–22, http://www.journalofvision.org/content/10/4/12, doi:10.1167/10.4.12.

Tolhurst, D. J. (1973). Separate channels for the analysis of the shape and the movement of a moving visual stimulus. Journal of Physiology, 231(3), 385–402.

Tolhurst, D. J. (1975). Sustained and transient channels in human vision. Vision Research, 15, 1151–1155.

Tolhurst, D. J., & Thompson, I. D. (1981). On the variety of spatial frequency selectivities shown by neurons in area 17 of the cat. Proceedings of the Royal Society B: Biological Sciences, 213(1191), 183–199.

Tolhurst, D. J., To, M. P. S., Chirimuuta, M., Troscianko, T., Chua, P. Y., & Lovell, P. G. (2010). Magnitude of perceived change in natural images may be linearly proportional to differences in neuronal firing rates. Seeing and Perceiving, 23(4), 349–372.

Troscianko, T., Meese, T. S., & Hinde, S. (2012). Perception while watching movies: Effects of physical screen size and scene type. i-Perception, 3(7), 414–425. doi:10.1068/i0475aap

Vidyasagar, T. R., Kulikowski, J. J., Lipnicki, D. M., & Dreher, B. (2002). Convergence of parvocellular and magnocellular information channels in the primary visual cortex of the macaque. European Journal of Neuroscience, 16(5), 945–956.

Vu, P. V., & Chandler, D. M. (2014). ViS3: An algorithm for video quality assessment via analysis of spatial and spatiotemporal slices. Journal of Electronic Imaging, 23(1), 013016.

Wang, Z., & Li, Q. (2007). Video quality assessment using a statistical model of human visual speed perception. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 24(12), B61–B69.

Watson, A. B. (1987). Efficiency of a model human image code. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 4(12), 2401–2417.

Watson, A. B. (1998). Toward a perceptual video quality metric. Human Vision, Visual Processing, and Digital Display VIII, 3299, 139–147.

Watson, A. B., Hu, J., & McGowan, J. F., III. (2001). Digital video quality metric based on human vision. Journal of Electronic Imaging, 10, 20–29.

Watson, A. B., & Solomon, J. A. (1997). Model of visual contrast gain control and pattern masking. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 14(9), 2379–2391.

