Towards Image Data Hiding via Facial Stego Synthesis With Generative Model

Li Dong1,2, Jie Wang1,2, Rangding Wang1,2, Yuanman Li3 and Weiwei Sun4

1 Faculty of Electrical Engineering and Computer Science, Ningbo University, Zhejiang, China, 315211
2 Southeast Digital Economic Development Institute, Zhejiang, China, 324000
3 Shenzhen University, Guangdong, China, 518061
4 Alibaba Group, Zhejiang, China, 310052

Abstract
Stego synthesis-based data hiding aims to directly produce a plausible natural image to convey a secret message. However, most existing works neglect the possible communication degradations and forensic actions that commonly occur in practice. In this paper, we devise a generative adversarial network (GAN)-based framework to synthesize facial stego images. The framework consists of four components: a generator, an extractor, a discriminator and a forensic network. Specifically, the generator is deployed to generate a realistic facial stego image from the secret message and key, while the extractor aims at extracting the secret message from the stego image with the provided secret key. To combat forensics, we explicitly integrate a forensic network into the proposed framework, which is responsible for guiding the update of the generator. Three degradation layers are further incorporated, enforcing the generator to characterize the communication degradations. Experimental results demonstrate that the proposed framework can accurately extract the secret message and effectively resist forensic detection and certain degradations, while attaining realistic facial stego images.

Keywords
data hiding, stego synthesis, generative adversarial network

1. Introduction

Data hiding aims to embed a secret message into a cover signal without incurring the awareness of an adversary. It is widely used in many applications, e.g., covert communication [1] and multimedia data protection [2, 3]. The primitive ad-hoc Least-Significant Bit (LSB) method replaces the bit in the least significant bit-plane of each pixel with a secret bit, while modern data hiding methods attempt to eliminate the traces of the data hiding action and improve the steganographic capacity. For example, content-adaptive steganography [1] designs a sophisticated distortion function according to prior knowledge and uses Syndrome-Trellis coding to embed the secret message. Recently, neural network-based data hiding has become one of the active research directions. Baluja [4] employed convolutional neural networks to hide an entire secret image inside a cover image in an end-to-end fashion. The work SSGAN [5] attempted to exploit GAN to synthesize a cover image that is more suitable for subsequent steganographic data embedding. ASDL-GAN [6] integrated content-adaptive steganography and GAN, in which the generator was able to produce the modification probability maps.

International Workshop on Safety & Security of Deep Learning, 21st-26th August, 2021
[email protected] (L. Dong); [email protected] (J. Wang); [email protected] (R. Wang); [email protected] (Y. Li); [email protected] (W. Sun)

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

The methods HayersGAN [7], HiDDeN [8] and SteganoGAN [9] all designed encoder-decoder-like frameworks based on GAN, which can automatically learn suitable areas for embedding the secret bitstream message.

In recent years, adversarial examples for neural networks have met data hiding, continuously drawing extensive attention from the community. Some studies, e.g., [10, 11], found that adding slight perturbations to the input data can paralyze the prediction capability of learning-based classifiers. As the opponent of data hiding, steganalysis aims to expose the data hiding in a stego signal and usually involves machine-learning classifiers. Therefore, it is possible for data hiding methods to bypass steganalysis by borrowing strategies from the work on adversarial examples. Tang et al. [12] presented the Adversarial Embedding (ADV-EMB) method, which adjusts the modification cost of image elements according to the gradients back-propagated from the target steganalytic neural network. The constructed adversarial stego can effectively fool the steganalytic network, revealing the vulnerability of deep learning-based steganalyzers.

Note that all the aforementioned data hiding techniques are based on cover modification; their common characteristic is that they cannot avoid modifying the given cover image, which inevitably leaves artifacts exposed to steganalysis. On the contrary, stego synthesis-based data hiding, e.g., [13, 14], synthesizes the stego image directly from the secret message, which poses more challenges for steganalysis.

Under this concept, traditional methods tried to produce stego images based on hand-crafted designs. Although the capacity was relatively high, they were limited to synthesizing patterned images, such as textures and fingerprints. As an alternative, some methods [15, 16] use GAN to synthesize stego images with rich semantics, e.g., faces and food. However, the accuracy of message extraction was unsatisfactory under image degradations. Moreover, the synthesized stego images can be easily identified by a well-trained forensic detector. It is thus urgent to further improve the robustness of message extraction and the anti-forensic capability of stego synthesis-based data hiding methods.

In this work, we propose a Facial Stego Image Synthesis method for data hiding with GAN, termed FSIS-GAN. Unlike cover modification-based data hiding methods, FSIS-GAN does not require a cover image to be provided beforehand. Compared with existing stego synthesis-based methods, FSIS-GAN can not only synthesize realistic facial stego images, but also achieve superior performance in terms of robustness and anti-forensic capability. Experimental results on a public facial dataset validate these merits of the proposed method. The main contributions of this work can be summarized as follows.

• We explicitly consider the image degradation during covert communication and integrate multiple degradation layers into the framework. This boosts the robustness of message extraction.

• We incorporate a forensic network when training FSIS-GAN. By exploiting the gradients from this forensic network, the stego images produced by the learned generator can effectively fool the forensic network.

• We explicitly adopt a secret key in the data hiding procedure of FSIS-GAN, which further improves the reliability of the secret message extraction.

The rest of this paper is organized as follows. Section 2 briefly reviews the related work on stego synthesis-based data hiding. Section 3 describes the proposed FSIS-GAN, including the network architecture and loss functions. Section 4 presents the experimental results, and conclusions are drawn in Section 5.

2. Stego Synthesis-based Data Hiding

The majority of data hiding methods involve modification of a given cover image.

Figure 1: Overview of the proposed FSIS-GAN framework. The secret message m and secret key k are fed to the generator G to produce the synthesized stego image S; S passes through the degradation layers to the extractor E, which outputs the extracted message m′. The discriminator D compares S with a genuine image I (adversarial training), and the forensic network Fθ drives the anti-forensics objective.

However, such cover modification leaves embedding traces that can be detected. To resist detection by a steganalyzer, stego synthesis-based data hiding methods directly produce the stego image from the given secret message. As an early attempt, Wu et al. [13] proposed a texture synthesis-based method, which selectively distributes source patches of the original texture image onto the synthesized stego image; message hiding and extraction depend on the choice of source patches. Motivated by fingerprint biometrics, Li et al. [14] proposed to use a hologram phase constructed from the secret message to synthesize a fingerprint stego image. The hologram phase consists of two components: the first, a spiral phase, encodes the secret message into two-dimensional points with different polarities, and the second, a continuous phase, is used to synthesize the fingerprint image. It is worth noting that conventional stego image synthesis-based methods can only synthesize patterned stego images such as textures; lacking rich semantics, this limits their practical applications.

Instead, Hu et al. [15] suggested using the generator of a GAN to synthesize a facial stego image from the secret message. Meanwhile, the secret message can be extracted from the stego image by a corresponding extractor network. Similarly, Zhang et al. [16] exploited GAN to generate stego images with different semantic labels, which improves the robustness of data extraction but significantly sacrifices the steganographic capacity. The main advantage of these GAN-based works is that they can synthesize stego images with rich semantics. However, we shall note that the synthesized stego images can still be easily identified by well-trained forensic networks. In addition, a favorable trade-off between capacity and extraction accuracy has yet to be achieved.

3. Facial Image Data Hiding via Generative Stego Synthesis

In this section, we first give an overview of the proposed FSIS-GAN framework and then introduce each component of the framework, accompanied by a thorough discussion of the loss functions, network structures and training procedure.

3.1. Overview of FSIS-GAN

The proposed FSIS-GAN framework is illustrated in Figure 1. In general, it is an end-to-end framework consisting of three parts, where each part is designed to achieve a specific goal. First, the part of facial stego image synthesis and message extraction contains a generator G, an extractor E and the degradation layers N. The generator G is deployed to convert the secret message, along with the secret key, into a facial stego image. The degradation layers N are used to simulate common image degradations within the communication channel. The extractor E is learned to recover the secret message from the degraded stego image. Second, there is a discriminator D in the adversarial training part, which aims at distinguishing genuine data samples from the ones produced by the generator G. Third, a well-trained existing forensic network F_θ (parameterized by θ) is introduced in the anti-forensics part, which can distinguish genuine from synthesized facial stego images. Note that this target forensic network is treated as a fixed adversary, and its parameters are always frozen.

3.2. Stego Image Synthesis and Message Extraction

The part of facial stego image synthesis and message extraction achieves two functionalities. First, by using the generator G, one can convert the given secret message into a facial stego image. Second, the extractor E is responsible for extracting the secret message from the input stego image. Furthermore, a secret key is introduced to ensure the reliability of the communication and a high diversity of the generated facial stego images. Generally, the generator G and the extractor E aim to learn

two mappings, i.e., mapping the given secret message into a stego image, and vice versa. More formally, let m ∈ {0, 1}^{l_m} and k ∈ {0, 1}^{l_k} be the binary secret message and the secret key, respectively. The generator G is intended to learn the first mapping, transforming the message m along with the secret key k into a stego image:

S = G(m,k), (1)

where S denotes the synthesized facial stego image of shape C × H × W. To recover the secret message, we next introduce the extractor E. Considering that the facial stego image S may be degraded during transmission, the second mapping should map the degraded stego image, along with the secret key k, back to the secret message, which can be expressed as

m′ = E(N(S), k),    (2)

where N(·) models the image degradation process and N(S) is the degraded stego image. Here, m′ ∈ (0, 1)^{l_m} denotes the extracted secret message. Note that the extracted message m′ should be (approximately) equal to the original secret message m, so that an error-correcting mechanism can be employed to fully correct the erroneous bits.

To measure the distortion between the original secret message m and the extracted message m′, we use the cross-entropy loss to compute the message extraction loss L_E, which is given by

L_E(m, m′) = −(1/l_m) Σ_{i=1}^{l_m} [ m_i log(m′_i) + (1 − m_i) log(1 − m′_i) ],    (3)

where m_i and m′_i are the i-th elements of m and m′, respectively.

Note that our proposed FSIS-GAN framework explicitly receives a secret key as an input, which is designed to satisfy Kerckhoffs' principle. Even if the extractor network E is completely exposed to an attacker, the secret message m can be recovered only when the receiver possesses both the secret key k and the facial stego image S. It is worth emphasizing that most existing GAN-based methods, e.g., [15, 16], do not involve a secret key. Further notice that, as an input to the extractor E, the secret key k has a much smaller dimension than the facial stego image S; the extractor E thus tends to discard the secret key because it carries much less information. To mitigate this issue, we propose to also feed a randomly generated incorrect secret key k̂ ∈ {0, 1}^{l_k}, with k̂ ≠ k, during the training stage. Besides minimizing the difference between the extracted and original message under the correct key, we maximize this difference when the incorrect key is applied. Mathematically, this loss term, the inverse loss L_Ê, can be expressed as the negative cross-entropy:

L_Ê(m, m̂′) = (1/l_m) Σ_{i=1}^{l_m} [ m_i log(m̂′_i) + (1 − m_i) log(1 − m̂′_i) ],    (4)

where m̂′_i is the i-th element of the message m̂′ = E(N(S), k̂) extracted with the incorrect key k̂.
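For concreteness, the two losses above can be written down directly in PyTorch, the framework used for the experiments in Section 4.1. The following is a minimal sketch, not the authors' code: the helper names, the eps stabilizer inside the logarithms, and the rejection sampling of the wrong key are our own assumptions.

```python
import torch

def extraction_loss(m, m_pred, eps=1e-8):
    # Eq. (3): cross-entropy between the original bits m in {0,1} and the
    # extracted probabilities m_pred in (0,1), averaged over the bits.
    return -(m * torch.log(m_pred + eps)
             + (1 - m) * torch.log(1 - m_pred + eps)).mean()

def inverse_loss(m, m_pred_wrong_key, eps=1e-8):
    # Eq. (4): negative cross-entropy, i.e. the extraction error is
    # maximized when an incorrect secret key is fed to the extractor.
    return (m * torch.log(m_pred_wrong_key + eps)
            + (1 - m) * torch.log(1 - m_pred_wrong_key + eps)).mean()

def random_wrong_key(k):
    # Draw a random binary key guaranteed to differ from the correct key k.
    while True:
        k_hat = torch.randint(0, 2, k.shape, device=k.device).to(k.dtype)
        if not torch.equal(k_hat, k):
            return k_hat
```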

Enhancing robustness with degradation layers: In a practical communication channel, the synthesized stego image S is often degraded while being transmitted to the receiver. The data hiding system therefore requires a certain robustness to ensure the accuracy of message extraction. In this work, we take three representative degradations into account: noise pollution, blurring, and compression. For noise pollution, we consider one of the most widely-used noise models, Gaussian noise. For blurring, Gaussian blurring is used. For compression, JPEG image compression is employed, which is extensively used for reducing the transmission bandwidth. In experiments, we implement these three types of degradation as neural network layers N that degrade the stego image, with one layer per degradation type. The Gaussian noise layer (GNL) adds Gaussian noise to the facial stego image S, and the Gaussian blurring layer (GBL) blurs S. For JPEG compression, considering that the quantization operation is non-differentiable, we approximate it with a differentiable polynomial function; this differentiable approximation follows the work HiDDeN [8].

3.3. Adversarial Training Part

As aforementioned, the hand-crafted stego synthesis-based data hiding methods [13, 14] could only synthesize patterned images such as textures and fingerprints, limiting their practical applications. Synthesizing a natural image with rich semantics is a challenging task. However, this problem can be alleviated with the guidance of adversarial training. In this part, the purpose of the discriminator D is to conduct adversarial training with the generator G and improve the plausibility of the synthesized facial stego images.

More specifically, let I be a genuine facial image sample of shape C × H × W from a publicly available facial image dataset. The discriminator D estimates how likely a given image sample was synthesized by the generator G, while the generator G attempts to fool the discriminator D. Through such adversarial training, the generator G is encouraged to synthesize much more realistic facial stego images. As a variant of GAN, the network structure and loss function of BEGAN [17] provide a good reference for improving training stability; we therefore employ the adversarial training loss used in BEGAN. Mathematically, the adversarial loss L_adv for the generator G is

L_adv(D(S), S) = (1/(CHW)) |D(S) − S|,    (5)

where the output D(S) has the same shape as the facial stego image S. The adversarial loss L_D for the discriminator D is

L_D(I, S) = (1/(CHW)) [ |D(I) − I| − h_t · |D(S) − S| ],    (6)

where h_t controls the discrimination ability of D in the t-th training step so as to equilibrate the adversarial training. It is updated as

h_{t+1} = h_t + (λ/(CHW)) [ γ |D(I) − I| − |D(S) − S| ].    (7)

Here the parameter λ is the learning rate of the update, and γ is a hyper-parameter that controls the diversity of the synthesized facial images. The quality and diversity of the facial stego images can be freely adjusted by tuning γ.
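A compact sketch of Eqs. (5)-(7) is given below, assuming the auto-encoder discriminator D of BEGAN. The clamping of the balance term to [0, 1] follows the BEGAN paper, while the default value of lam is our own assumption.

```python
import torch

def began_losses(D, I, S, h_t, gamma=0.7, lam=1e-3):
    # Per-pixel L1 reconstruction errors of the auto-encoder discriminator,
    # i.e. (1/CHW)|D(x) - x| as used in Eqs. (5)-(7).
    # (S is typically detached, S.detach(), when optimizing D so that
    # discriminator updates do not back-propagate into the generator.)
    rec_real = (D(I) - I).abs().mean()
    rec_fake = (D(S) - S).abs().mean()
    loss_adv = rec_fake                                           # Eq. (5)
    loss_D = rec_real - h_t * rec_fake                            # Eq. (6)
    h_next = h_t + lam * (gamma * rec_real - rec_fake).item()     # Eq. (7)
    h_next = min(max(h_next, 0.0), 1.0)   # keep the balance term bounded
    return loss_adv, loss_D, h_next
```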

3.4. Anti-forensics Part

Recall that no explicit cover image is involved in stego synthesis-based data hiding methods. This merit allows such data hiding methods to effectively resist conventional steganalysis. However, as pointed out in [15], a well-trained forensic network can readily distinguish a synthesized stego image from a genuine one, even when the synthesized stego image shows no perceptual differences to an observer.

Although F_θ is an expert in such a detection task, studies [10, 11] have shown that deep neural network-based classifiers are vulnerable to adversarial examples. Inspired by this, we propose to apply strategies for crafting adversarial examples to evade the stego detection network, as a way of realizing anti-forensics. In the FSIS-GAN framework, we consider a white-box scenario, i.e., we assume full knowledge of the target forensic network. The target forensic network F_θ is trained with genuine images from a publicly available facial dataset and with synthesized images produced by BEGAN [17]. We then integrate the well-trained F_θ into the FSIS-GAN framework, where F_θ receives the synthesized facial stego image S and outputs its confidence. The gradients back-propagated through F_θ are used to update the parameters of the generator G. To measure how well forensic detection is resisted, we define the anti-forensic loss L_Fθ as the cross-entropy between the output of F_θ and the target genuine-image label:

L_Fθ(S) = − log(1 − F_θ(S)),    (8)

where F_θ(S) ∈ (0, 1) is the confidence output by F_θ. Clearly, a decrease of L_Fθ indicates an increased probability of S being identified as a genuine image by F_θ.
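Eq. (8) translates into a one-line objective. The sketch below assumes F_theta returns its confidence that the input is synthesized, and the eps term is added only for numerical stability; neither detail is taken from the paper.

```python
import torch

def anti_forensic_loss(F_theta, S, eps=1e-8):
    # Eq. (8): F_theta(S) in (0,1) is the frozen forensic network's
    # confidence that S is synthesized; driving this loss down pushes
    # the generator toward images labelled as genuine.
    return -torch.log(1.0 - F_theta(S) + eps).mean()
```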

3.5. Network Structure and Training Strategy

The network architectures of the generator G and the extractor E are shown in Figure 2. For the generator G, the secret key vector k is first concatenated with the secret message vector m and then fed to the subsequent layers. G applies two fully-connected (FC) layers and three convtranspose (ConvT) layers to produce the facial stego image S. In particular, after each FC or ConvT layer, we apply batch normalization (BN) [18] and the ReLU activation function to the intermediate features. In experiments, we found that feeding the raw binary vectors m and k (composed of 0s and 1s) directly as input was unsuitable and made the adversarial training loss diverge. To solve this issue, additional BN layers were added so that normalization is carried out inside the network. Experimental results show that this trick greatly alleviates the divergence problem.
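To make the description concrete, a rough PyTorch sketch of such a generator is given below. The layer widths, the 8 × 8 starting resolution and the final Tanh are illustrative assumptions; only the overall layout (concatenate m and k, two FC layers, three ConvT layers with BN and ReLU, 3 × 64 × 64 output) follows the text.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    # Sketch of the generator described in Sec. 3.5; not the authors'
    # exact configuration.
    def __init__(self, l_m=100, l_k=100):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(l_m + l_k, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 256 * 8 * 8), nn.BatchNorm1d(256 * 8 * 8), nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, m, k):
        x = self.fc(torch.cat([m, k], dim=1))       # (B, l_m + l_k) -> features
        return self.deconv(x.view(-1, 256, 8, 8))   # -> (B, 3, 64, 64) stego S
```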

Figure 2: Network structure of the generator G (a) and the extractor E (b). "Concat", "FC", "ConvT", "BN", "Conv" denote the concatenation, fully-connected layer, convtranspose layer, batch norm, and convolution layer, respectively.

For the extractor E, we shall fuse the secret key vector k and the facial stego image S in a way such that the extractor E does not neglect the information provided by the secret key. To this end, the extractor E first applies an FC layer to the secret key to form an intermediate matrix of shape 1 × W × H. Then, the facial stego image S and the intermediate matrix are concatenated, and the fused tensor is fed to four convolutional (Conv) layers. Finally, the extractor E applies an FC layer and the Sigmoid activation function to produce the message vector m′ (or m̂′) of size 1 × l_m.
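A matching sketch of the extractor is shown below; the channel widths and strides are assumptions, and only the key-to-plane expansion, the concatenation with S, the four Conv layers and the FC + Sigmoid head follow the description above.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    # Sketch of the extractor described in Sec. 3.5; not the authors'
    # exact configuration.
    def __init__(self, l_m=100, l_k=100, H=64, W=64):
        super().__init__()
        self.H, self.W = H, W
        self.key_fc = nn.Linear(l_k, H * W)
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 3, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, 3, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        self.head = nn.Linear(128 * (H // 16) * (W // 16), l_m)

    def forward(self, S, k):
        key_plane = self.key_fc(k).view(-1, 1, self.H, self.W)   # key -> 1 x H x W
        x = self.conv(torch.cat([S, key_plane], dim=1))          # fuse with image
        return torch.sigmoid(self.head(x.flatten(1)))            # probabilities m'
```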

For the discriminator D, we adopt the auto-encoder-like structure from BEGAN [17]. For the target forensic network F, we use Ye-Net [19], which is a widely-used steganalytic network.

The training process of the proposed FSIS-GAN framework iteratively optimizes the loss function of each network, except for the well-trained forensic network F_θ. We apply the extraction loss L_E and the adversarial loss L_D as the loss functions of the extractor E and the discriminator D, respectively. The total loss L_G for the generator G is a fusion of the four losses introduced above:

L_G = L_adv + α (L_E + L_Ê) + β L_Fθ,    (9)

where L_adv is the adversarial loss for G, L_Ê is the inverse loss, and L_Fθ is the anti-forensic loss. The hyper-parameters α and β control the relative importance among the four losses.
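In code, Eq. (9) is a simple weighted sum. The helper below is only a sketch, with the individual loss terms assumed to be computed as in the earlier snippets.

```python
def generator_total_loss(loss_adv, loss_e, loss_e_inv, loss_f,
                         alpha=0.1, beta=0.1):
    # Eq. (9): weighted fusion of the four generator losses;
    # alpha = beta = 0.1 are the values reported in Sec. 4.1.
    return loss_adv + alpha * (loss_e + loss_e_inv) + beta * loss_f
```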

4. Experimental Results

In this section, we first introduce the experimental setup. Then, to verify the robustness of the proposed FSIS-GAN, it is evaluated both with and without image degradation. Finally, the anti-forensic capability of FSIS-GAN is validated.

4.1. Experimental Setup

Our experiments are conducted on the CelebA dataset [20], where the facial region is identified and cropped. All images are resized to 3 × 64 × 64. The following three metrics are used for evaluation:

• Fréchet Inception Distance (FID) [21], a widely-used perceptual image quality metric for synthesized images and a de facto metric for assessing the quality of images created by GAN generators. A lower FID score indicates better consistency with human perception of natural images.

• Accuracy of message extraction (ACC), computed as ACC = L_Ext / L, where L_Ext is the number of correctly extracted bits and L is the length of the secret message m.

• Probability of missed detection (PMD), calculated as PMD = FN / (FN + TP), where FN (False Negative) is the ratio of synthesized facial images misclassified as genuine ones and TP (True Positive) is the ratio of synthesized facial images correctly detected. A larger PMD indicates a stronger ability to resist the forensic network. (A small code sketch of these metrics follows this list.)

The proposed FSIS-GAN framework is implemented in PyTorch and trained on four NVIDIA GTX 1080Ti GPUs with 11 GB memory. The number of training epochs is set to 400 with a mini-batch size of 64. We use Adam [22] as the optimizer with a learning rate of 2 × 10⁻⁴. The hyper-parameters α and β in (9) are empirically set to 0.1 after a number of trials. The parameter γ in (7) is set to 0.7, which is expected to produce reasonably diverse facial stego images. The competing method is the most closely related work [15]. We implemented this work ourselves because there is no publicly available code; with some tweaking and fine-tuning, the obtained results were comparable to those originally reported in [15]. For a fair comparison, the lengths of the secret message l_m and the secret key l_k are both set to 100, so that the payload is identical to that of [15].

Figure 3: Comparison of exemplar synthesized stego images. Top: Hu et al. [15]; Bottom: proposed FSIS-GAN-WD.

4.2. Performance Without Degradations

Notice that the competing method [15] does not consider image degradations. To verify the effectiveness of the proposed method under the same setting and make a fair comparison, in this subsection we evaluate the performance without the degradation layers N; the facial stego image S is transmitted to the extractor E without any degradation. To avoid confusion, this variant of our proposed method is termed FSIS-GAN-WD (WD abbreviates Without Degradations). We first compare the visual quality of the facial stego images with the competing method [15]. As can be seen from Figure 3, the proposed FSIS-GAN-WD synthesizes more realistic facial stego images than Hu et al. [15]. Upon closer inspection, one can notice that the stego images produced by FSIS-GAN-WD are more vivid and have more correct semantic structures; it is difficult for an ordinary human observer to notice the inauthenticity of the facial stego images synthesized by FSIS-GAN-WD. In contrast, the stego images generated by Hu et al. [15] are typically blurry and severely distorted, which would readily draw attention from a forensic analyst. For the FID evaluation, we use 10,000 pairs of genuine images and synthesized facial stego images to compute the FID score. The FID score of FSIS-GAN-WD is 23.20, which is much smaller than the 32.07 of Hu et al. [15].

Then, we evaluate the extraction accuracy for the case without degradation. The results are tabulated in Table 1. To demonstrate the impact of the inverse loss L_Ê on the extraction accuracy, an ablation experiment is also conducted by excluding the inverse loss during training; this ablated version is denoted as FSIS-GAN-WD (ex L_Ê).

Table 1
Comparison of message extraction accuracy (%) for the case of no communication degradations. Here, k and k̂ denote the correct and incorrect secret key, respectively. FSIS-GAN-WD is a variant of the proposed method that excludes the degradation layers, and FSIS-GAN-WD (ex L_Ê) represents FSIS-GAN-WD trained without the inverse loss L_Ê.

Scheme    | Hu et al. [15] | FSIS-GAN-WD        | FSIS-GAN-WD (ex L_Ê)
          |                | with k  | with k̂  | with k  | with k̂
Accuracy  | 85.23          | 98.76   | 71.50   | 99.41   | 97.01

From Table 1, one can draw the following conclusions. First, the extraction accuracy of FSIS-GAN-WD with the correct secret key k is 98.76%, which dramatically outperforms the 85.23% of the competing method [15]. Second, by comparing FSIS-GAN-WD and FSIS-GAN-WD (ex L_Ê), one can see that the extraction accuracy of FSIS-GAN-WD with the correct secret key k is slightly inferior to that of FSIS-GAN-WD (ex L_Ê), suggesting that the introduced inverse loss marginally harms the extraction accuracy under the correct key. However, for the case of an incorrect key k̂, the inverse loss L_Ê significantly reduces the extraction accuracy from 97.01% to 71.50%, while the accuracy under the correct key remains almost unchanged. This phenomenon means that the secret key has no real effect if the inverse loss is excluded: FSIS-GAN-WD (ex L_Ê) still attains a rather high extraction accuracy (> 97%) even with the incorrect key k̂. In short, without the inverse loss L_Ê, the variant FSIS-GAN-WD (ex L_Ê) violates Kerckhoffs' principle.

4.3. Performance With Degradations

In this subsection, we test the robustness of the proposed framework under certain image degradations. The degradation type and level are given as prior knowledge; this scenario is common in practice because one can obtain such knowledge by probing the communication channel. Thus, the degradation layers N and their associated parameters can be fixed during the training stage. Specifically, in our experiments, the standard deviation σ1 of the Gaussian noise layer (GNL) is set to 0.2. The kernel width d and the standard deviation σ2 of the Gaussian blurring layer (GBL) are set to 3 and 1, respectively. The differentiable JPEG compression layer (JCL) is implemented as suggested by the work HiDDeN [8]. For ease of reference, this variant is termed FSIS-GAN-FD (FD abbreviates Fixed Degradation) in the sequel.

Firstly, the stego images synthesized by FSIS-GAN-FD are shown in Figure 4. One can observe that some speckle noise emerges in the generated stego images, which is clearly visible in the highlighted regions in Figure 4(b). Quantitatively, the FID score of FSIS-GAN-FD is 41.40, which is inferior to that of FSIS-GAN-WD (23.20) and Hu et al. [15] (32.07). Nevertheless, the stego images produced by FSIS-GAN-FD are still intuitively more realistic than those of Hu et al. [15].

Secondly, in Table 2, we report the extraction accuracy under fixed degradations. Not surprisingly, the extraction accuracy of Hu et al. [15] and FSIS-GAN-WD degrades greatly, which can be attributed to overlooking the issue of degradation-resistant message extraction.

Figure 4: Comparison of synthesized facial stego images: (a) images produced by FSIS-GAN-WD; (b) images produced by FSIS-GAN-FD. With the introduction of degradation layers, minor speckle noise emerges (highlighted with red rectangles).

Table 2
Comparison of message extraction accuracy (%) under various degradation conditions. The bold value and the value marked with an asterisk (*) denote the highest extraction accuracy with the correct secret key k and the lowest extraction accuracy with the incorrect secret key k̂, respectively.

Scheme           | Hu et al. [15] | FSIS-GAN-WD        | FSIS-GAN-FD
                 |                | with k  | with k̂  | with k  | with k̂
W/o degradation  | 85.23          | 98.76   | 71.50*  | 98.22   | 72.08
Fixed GNL        | 52.72          | 59.78   | 56.23*  | 95.58   | 72.74
Fixed GBL        | 69.68          | 57.52   | 54.68*  | 98.58   | 73.78
Fixed JCL        | 65.33          | 61.38   | 58.00*  | 98.46   | 72.67

In contrast, FSIS-GAN-FD exhibits quite promising results: under all three types of degradation layers, its extraction accuracy typically exceeds 94% (though lower than that of FSIS-GAN-WD, which is specifically designed for the non-degradation scenario). The results verify that, for the case of known degradations, the proposed framework can learn to effectively resist the fixed degradations by employing the fixed degradation layers during training.

Finally, to illustrate how the robustness of message extraction changes under different degradation levels, we test different degradation types with a variety of degradation levels. Due to space limits, we only report the JPEG compression degradation in Figure 5. As can be seen, with decreasing quality factor (QF), the extraction accuracy generally decreases. Although the JCL adopted from HiDDeN [8] can handle the non-differentiable JPEG compression, it cannot perfectly reproduce the JPEG compression artifacts. Nevertheless, FSIS-GAN-FD still achieves superior robustness compared with the other two schemes.

4.4. Performance of Anti-forensics

Recall that, since no cover image is involved in the data hiding, our method has relatively good undetectability when exposed to a steganalyzer.

Figure 5: Comparison of the message extraction accuracy (%) of Hu et al. [15], FSIS-GAN-WD and FSIS-GAN-FD under various levels of JPEG compression degradation (quality factors from 95 down to 5).

However, as pointed out in [15], a well-trained forensic network can effectively identify a synthesized image. To address this issue, we explicitly consider the anti-forensics scenario and introduce the anti-forensic loss L_Fθ.

To demonstrate the influence of the anti-forensic loss L_Fθ, we conduct an ablation experiment by excluding this loss term; the resulting variant is termed FSIS-GAN (ex L_Fθ). As a concrete example, we employ the well-trained forensic network Ye-Net [19] F_θ to detect 3000 facial stego images produced by the different methods and record the probability of missed detection (PMD). The PMDs of Hu et al. [15], FSIS-GAN (ex L_Fθ), and FSIS-GAN are 3.23%, 8.84%, and 89.91%, respectively. As clearly shown, for FSIS-GAN (ex L_Fθ), although the facial stego images look natural to humans, they are easily exposed to the forensic network, with a PMD below 10%. In contrast, by introducing the anti-forensic loss term, the PMD of FSIS-GAN reaches 89.91%. This means the proposed FSIS-GAN can effectively bypass the existing forensic network, retaining a good anti-forensic capability.

5. Conclusion

In this work, we proposed a stego synthesis-based data hiding method using a generative neural network, explicitly considering image degradation and anti-forensic needs. Specifically, the generator synthesizes a facial stego image from the given secret message and secret key, and the extractor recovers the secret message with the secret key. Through adversarial training with the discriminator, the generator can produce realistic facial stego images. Degradation layers are introduced during training, which significantly enhance the robustness of message extraction. A forensic network is also incorporated during training, in response to possible adversarial forensic analysis in the communication channel. Experimental results verified that our approach generates more natural facial stego images, while retaining higher message extraction accuracy and good anti-forensic ability.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61901237, in part by the Open Project Program of the State Key Laboratory of CAD&CG, Zhejiang University under Grant A2006, and in part by the Ningbo Natural Science Foundation under Grant 2019A610103. Thanks to the Southeast Digital Economic Development Institute for supporting the computing facility.

References

[1] V. Sedighi, R. Cogranne, J. Fridrich, Content-adaptive steganography by minimizing statistical detectability, IEEE Trans. Inf. Forensics Security 11 (2015) 221–234.

[2] J. Zhou, W. Sun, L. Dong, X. Liu, O. C. Au, Y. Y. Tang, Secure reversible image data hiding over encrypted domain via key modulation, IEEE Trans. Circuits Syst. Video Technol. 26 (2015) 441–452.

[3] L. Dong, J. Zhou, W. Sun, D. Yan, R. Wang, First steps toward concealing the traces left by reversible image data hiding, IEEE Trans. Circuits Syst. II, Exp. Briefs 67 (2020) 951–955.

[4] S. Baluja, Hiding images within images, IEEE Trans. Pattern Anal. Mach. Intell. 42 (2020) 1685–1697.

[5] H. Shi, J. Dong, W. Wang, Y. Qian, X. Zhang, SSGAN: Secure steganography based on generative adversarial networks, in: Pacific Rim Conference on Multimedia, 2017, pp. 534–544.

[6] W. Tang, S. Tan, B. Li, J. Huang, Automatic steganographic distortion learning using a generative adversarial network, IEEE Signal Process. Lett. 24 (2017) 1547–1551.

[7] J. Hayes, G. Danezis, Generating steganographic images via adversarial training, in: Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1954–1963.

[8] J. Zhu, R. Kaplan, J. Johnson, F. Li, HiDDeN: Hiding data with deep networks, in: Proc. Eur. Conf. Comput. Vis., 2018, pp. 657–672.

[9] K. A. Zhang, A. Cuesta-Infante, L. Xu, K. Veeramachaneni, SteganoGAN: High capacity image steganography with GANs, arXiv preprint arXiv:1901.03892 (2019).

[10] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2013).

[11] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).

[12] W. Tang, B. Li, S. Tan, M. Barni, J. Huang, CNN-based adversarial embedding for image steganography, IEEE Trans. Inf. Forensics Security 14 (2019) 2074–2087.

[13] K. Wu, C. Wang, Steganography using reversible texture synthesis, IEEE Trans. Image Process. 24 (2014) 130–139.

[14] S. Li, X. Zhang, Toward construction-based data hiding: From secrets to fingerprint images, IEEE Trans. Image Process. 28 (2018) 1482–1497.

[15] D. Hu, L. Wang, W. Jiang, S. Zheng, B. Li, A novel image steganography method via deep convolutional generative adversarial networks, IEEE Access 6 (2018) 38303–38314.

[16] Z. Zhang, G. Fu, R. Ni, J. Liu, X. Yang, A generative method for steganography by cover synthesis with auxiliary semantics, Tsinghua Science and Technology 25 (2020) 516–527.

[17] D. Berthelot, T. Schumm, L. Metz, BEGAN: Boundary equilibrium generative adversarial networks, arXiv preprint arXiv:1703.10717 (2017).

[18] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).

[19] J. Ye, J. Ni, Y. Yi, Deep learning hierarchical representations for image steganalysis, IEEE Trans. Inf. Forensics Security 12 (2017) 2545–2557.

[20] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proc. IEEE Int. Conf. Comput. Vis., 2015.

[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in: Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6629–6640.

[22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).